Commit Graph

315 Commits

Author SHA1 Message Date
Kir Kolyshkin
714c91e9f7 Simplify cgroup path handing in v2 via unified API
This unties the Gordian Knot of using GetPaths in cgroupv2 code.

The problem is, the current code uses GetPaths for three kinds of things:

1. Get all the paths to cgroup v1 controllers to save its state (see
   (*linuxContainer).currentState(), (*LinuxFactory).loadState()
   methods).

2. Get all the paths to cgroup v1 controllers to have the setns process
    enter the proper cgroups in `(*setnsProcess).start()`.

3. Get the path to a specific controller (for example,
   `m.GetPaths()["devices"]`).

Now, for cgroup v2 instead of a set of per-controller paths, we have only
one single unified path, and a dedicated function `GetUnifiedPath()` to get it.

This discrepancy between v1 and v2 cgroupManager API leads to the
following problems with the code:

 - multiple if/else code blocks that have to treat v1 and v2 separately;

 - backward-compatible GetPaths() methods in v2 controllers;

 -  - repeated writing of the PID into the same cgroup for v2;

Overall, it's hard to write the right code with all this, and the code
that is written is kinda hard to follow.

The solution is to slightly change the API to do the 3 things outlined
above in the same manner for v1 and v2:

1. Use `GetPaths()` for state saving and setns process cgroups entering.

2. Introduce and use Path(subsys string) to obtain a path to a
   subsystem. For v2, the argument is ignored and the unified path is
   returned.

This commit converts all the controllers to the new API, and modifies
all the users to use it.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 12:04:06 -07:00
Kir Kolyshkin
63854b0ea8 newSetnsProcess: reuse state.CgroupPaths
c.cgroupManager.GetPaths() are called twice here: once in currentState()
and then in newSetnsProcess(). Reuse the result of the first call, which
is stored into state.CgroupPaths.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 10:05:59 -07:00
Kir Kolyshkin
9a3e632625 notify: simplify usage
Instead of passing the whole map of paths, pass the path to the memory
controller which these functions actually require.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 10:05:58 -07:00
lifubang
657407ff23 fix runc events error in cgroup v2
Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-05-07 22:18:46 +08:00
Ted Yu
db29dce076 Close fd in case fd.Write() returns error
Signed-off-by: Ted Yu <yuzhihong@gmail.com>
2020-05-02 20:06:08 -07:00
Mrunal Patel
634e51b52c Merge pull request #2335 from kolyshkin/cgroupv2-cpt
Fix cgroupv2 checkpoint/restore
2020-04-24 08:47:36 -07:00
Mrunal Patel
c420a3ec7f Merge pull request #2324 from kolyshkin/criu-freezer
libcontainer: fix Checkpoint wrt cgroupv2
2020-04-23 19:24:38 -07:00
Kir Kolyshkin
9280e3566d checkpoint/restore: fix cgroupv2 handling
In case of cgroupv2 unified hierarchy, the /sys/fs/cgroup mount
is the real mount with fstype of cgroup2 (rather than a set of
external bind mounts like for cgroupv1).

So, we should not add it to the list of "external bind mounts"
on both checkpoint and restore.

Without this fix, checkpoint integration tests fail on cgroup v2.

Also, same is true for cgroup v1 + cgroupns.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-22 11:26:43 -07:00
Kir Kolyshkin
af6b9e7fa9 nit: do not use syscall package
In many places (not all of them though) we can use `unix.`
instead of `syscall.` as these are indentical.

In particular, x/sys/unix defines:

```go
type Signal = syscall.Signal
type Errno = syscall.Errno
type SysProcAttr = syscall.SysProcAttr

const ENODEV      = syscall.Errno(0x13)
```

and unix.Exec() calls syscall.Exec().

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-18 16:16:49 -07:00
Kir Kolyshkin
b3a481eb77 libcontainer: fix Checkpoint wrt cgroupv2
Commit 9a0184b10f meant to enable using cgroup v2 freezer
for criu >= 3.14, but it looks like it is doing something else
instead.

The logic here is:

 - for cgroup v1, set FreezeCgroup, if available
 - for cgroup v2, only set it for criu >= 3.14
 - do not use GetPaths() in case v2 is used
   (this method is obsoleted for v2 and will be removed)

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-17 16:17:00 -07:00
Ted Yu
7a978e354a Defer netns.Close() after error check
Signed-off-by: Ted Yu <yuzhihong@gmail.com>
2020-04-15 18:33:20 -07:00
Ted Yu
21d7bb95eb Close criuServer so that even if CRIU crashes or unexpectedly exits, runc will not hang
Signed-off-by: Ted Yu <yuzhihong@gmail.com>
2020-04-03 15:27:27 -07:00
Kir Kolyshkin
b2272b2cba libcontainer: use errors.Is() and errors.As()
Make use of errors.Is() and errors.As() where appropriate to check
the underlying error. The biggest motivation is to simplify the code.

The feature requires go 1.13 but since merging #2256 we are already
not supporting go 1.12 (which is an unsupported release anyway).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-02 20:34:01 -07:00
Kir Kolyshkin
c39f87a47a Revert "Merge pull request #2280 from kolyshkin/errors-unwrap"
Using errors.Unwrap() is not the best thing to do, since it returns
nil in case of an error which was not wrapped. More to say,
errors package provides more elegant ways to check for underlying
errors, such as errors.As() and errors.Is().

This reverts commit f8e138855d, reversing
changes made to 6ca9d8e6da.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-02 19:41:11 -07:00
Michael Crosby
f8e138855d Merge pull request #2280 from kolyshkin/errors-unwrap
Use errors.Unwrap() where possible
2020-04-02 14:39:06 -04:00
Michael Crosby
6ca9d8e6da Merge pull request #2283 from tedyu/runc-path-in-prefix
isPathInPrefixList return value should be reverted
2020-04-02 14:09:49 -04:00
Michael Crosby
b26e4f27c1 Merge pull request #2284 from tedyu/criu-svr-close
Avoid double close of criuServer
2020-04-02 14:07:35 -04:00
Mrunal Patel
e3e26cafe9 Merge pull request #2276 from kolyshkin/criu-v2
cgroupv2: don't use GetCgroupMounts for criu c/r
2020-04-01 17:36:24 -07:00
Ted Yu
49896ab0f4 Avoid double close of criuServer
Signed-off-by: Ted Yu <yuzhihong@gmail.com>
2020-04-01 16:15:23 -07:00
Ted Yu
d02fc48422 isPathInPrefixList return value should be reverted
Signed-off-by: Ted Yu <yuzhihong@gmail.com>
2020-04-01 15:45:31 -07:00
Kir Kolyshkin
8d7977ee6e libct/isPaused: don't use GetPaths from v2 code
Using GetPaths from cgroupv2 unified hierarchy code is deprecated
and this function will (hopefully) be removed.

Use GetUnifiedPath() for v2 case.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-31 20:24:28 -07:00
Kir Kolyshkin
12e156f076 libct.isPaused: use errors.Unwrap
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-31 20:07:04 -07:00
Kir Kolyshkin
fc840f199f cgroupv2: don't use GetCgroupMounts for criu c/r
When performing checkpoint or restore of cgroupv2 unified hierarchy,
there is no need to call getCgroupMounts() / cgroups.GetCgroupMounts()
as there's only a single mount in there.

This eliminates the last internal (i.e. runc) use case of
cgroups.GetCgroupMounts() for v2 unified. Unfortunately, there
are external ones (e.g. moby/moby) so we can't yet let it
return an error.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-31 17:05:11 -07:00
Michael Crosby
9ec5b03e5a Merge pull request #2259 from adrianreber/v2-test
Add minimal cgroup2 checkpoint/restore support
2020-03-31 15:01:18 -04:00
Yulia Nedyalkova
2abc6a3605 Actually check for syscall.ENODEV when checking if a container is paused
It turns out that ioutil.Readfile wraps the error in a *os.PathError.
Since we cannot guarantee compilation with golang >= v1.13, we are
manually unwrapping the error.

Signed-off-by: Kieron Browne <kbrowne@pivotal.io>
2020-03-31 15:52:20 +01:00
Adrian Reber
9a0184b10f cgroup2: use CRIU's new freezer v2 support
The newest CRIU version supports freezer v2 and this tells runc
to use it if new enough or fall back to non-freezer based process
freezing on cgroup v2 system.

Signed-off-by: Adrian Reber <areber@redhat.com>
2020-03-31 16:36:35 +02:00
Michael Crosby
88474967d3 Merge pull request #1974 from openSUSE/unreachable-code
Remove unreachable code paths
2020-03-16 13:56:05 -04:00
Mrunal Patel
981dbef514 Merge pull request #2226 from avagin/runsc-restore-cmd-wait
restore: fix a race condition in process.Wait()
2020-03-15 18:48:16 -07:00
Sascha Grunert
b477a159db Remove unreachable code paths
Signed-off-by: Sascha Grunert <sgrunert@suse.com>
2020-03-12 09:13:03 +01:00
Pradyumna Agrawal
5b2b138d24 Synchronize the call to linuxContainer.Signal()
linuxContainer.Signal() can race with another call to say Destroy()
which clears the container's initProcess. This can cause a nil pointer
dereference in Signal().

This patch will synchronize Signal() and Destroy() by grabbing the
container's mutex as part of the Signal() call.

Signed-off-by: Pradyumna Agrawal <pradyumnaa@vmware.com>
2020-03-09 11:15:22 -07:00
Andrei Vagin
269ea385a4 restore: fix a race condition in process.Wait()
Adrian reported that the checkpoint test stated failing:
=== RUN   TestCheckpoint
--- FAIL: TestCheckpoint (0.38s)
    checkpoint_test.go:297: Did not restore the pipe correctly:

The problem here is when we start exec.Cmd, we don't call its wait
method. This means that we don't wait cmd.goroutines ans so we don't
know when all data will be read from process pipes.

Signed-off-by: Andrei Vagin <avagin@gmail.com>
2020-02-10 10:21:08 -08:00
Aleksa Sarai
f6fb7a0338 merge branch 'pr-2133'
Julia Nedialkova (1):
  Handle ENODEV when accessing the freezer.state file

LGTMs: @crosbymichael @cyphar
Closes #2133
2020-01-17 02:07:19 +11:00
Jordan Liggitt
8541d9cf3d Fix race checking for process exit and waiting for exec fifo
Signed-off-by: Jordan Liggitt <liggitt@google.com>
2019-12-18 18:48:18 +00:00
Radostin Stoyanov
a610a84821 criu: Ensure other users cannot read c/r files
No checkpoint files should be readable by
anyone else but the user creating it.

Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
2019-10-17 07:49:38 +01:00
Radostin Stoyanov
f017e0f9e1 checkpoint: Set descriptors.json file mode to 0600
Prevent unprivileged users from being able to read descriptors.json

Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
2019-10-12 19:29:44 +01:00
Julia Nedialkova
e63b797f38 Handle ENODEV when accessing the freezer.state file
...when checking if a container is paused

Signed-off-by: Julia Nedialkova <julianedialkova@hotmail.com>
2019-09-27 17:02:56 +03:00
Michael Crosby
331692baa7 Only allow proc mount if it is procfs
Fixes #2128

This allows proc to be bind mounted for host and rootless namespace usecases but
it removes the ability to mount over the top of proc with a directory.

```bash
> sudo docker run --rm  apparmor
docker: Error response from daemon: OCI runtime create failed:
container_linux.go:346: starting container process caused "process_linux.go:449:
container init caused \"rootfs_linux.go:58: mounting
\\\"/var/lib/docker/volumes/aae28ea068c33d60e64d1a75916cf3ec2dc3634f97571854c9ed30c8401460c1/_data\\\"
to rootfs
\\\"/var/lib/docker/overlay2/a6be5ae911bf19f8eecb23a295dec85be9a8ee8da66e9fb55b47c841d1e381b7/merged\\\"
at \\\"/proc\\\" caused
\\\"\\\\\\\"/var/lib/docker/overlay2/a6be5ae911bf19f8eecb23a295dec85be9a8ee8da66e9fb55b47c841d1e381b7/merged/proc\\\\\\\"
cannot be mounted because it is not of type proc\\\"\"": unknown.

> sudo docker run --rm -v /proc:/proc apparmor

docker-default (enforce)        root     18989  0.9  0.0   1288     4 ?
Ss   16:47   0:00 sleep 20
```

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2019-09-24 11:00:18 -04:00
Giuseppe Scrivano
1932917b71 libcontainer: add initial support for cgroups v2
allow to set what subsystems are used by
libcontainer/cgroups/fs.Manager.

subsystemsUnified is used on a system running with cgroups v2 unified
mode.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2019-09-05 13:02:25 +02:00
Georgi Sabev
a146081828 Write logs to stderr by default
Minor refactoring to use the filePair struct for both init sock and log pipe

Co-authored-by: Julia Nedialkova <julianedialkova@hotmail.com>
Signed-off-by: Georgi Sabev <georgethebeatle@gmail.com>
2019-04-24 15:18:14 +03:00
Georgi Sabev
ba3cabf932 Improve nsexec logging
* Simplify logging function
* Logs contain __FUNCTION__:__LINE__
* Bail uses write_log

Co-authored-by: Julia Nedialkova <julianedialkova@hotmail.com>
Co-authored-by: Danail Branekov <danailster@gmail.com>
Signed-off-by: Georgi Sabev <georgethebeatle@gmail.com>
2019-04-22 17:53:52 +03:00
Danail Branekov
c486e3c406 Address comments in PR 1861
Refactor configuring logging into a reusable component
so that it can be nicely used in both main() and init process init()

Co-authored-by: Georgi Sabev <georgethebeatle@gmail.com>
Co-authored-by: Giuseppe Capizzi <gcapizzi@pivotal.io>
Co-authored-by: Claudia Beresford <cberesford@pivotal.io>
Signed-off-by: Danail Branekov <danailster@gmail.com>
2019-04-04 14:57:28 +03:00
Marco Vedovati
9a599f62fb Support for logging from children processes
Add support for children processes logging (including nsexec).
A pipe is used to send logs from children to parent in JSON.
The JSON format used is the same used by logrus JSON formatted,
i.e. children process can use standard logrus APIs.

Signed-off-by: Marco Vedovati <mvedovati@suse.com>
2019-04-04 14:53:23 +03:00
Mrunal Patel
2b18fe1d88 Merge pull request #1984 from cyphar/memfd-cleanups
nsenter: cloned_binary: "memfd" cleanups
2019-03-07 10:18:33 -08:00
Michael Crosby
f739110263 Merge pull request #1968 from adrianreber/podman
Create bind mount mountpoints during restore
2019-03-04 11:37:07 -06:00
Aleksa Sarai
af9da0a450 nsenter: cloned_binary: use the runc statedir for O_TMPFILE
Writing a file to tmpfs actually incurs a memcg penalty, and thus the
benefit of being able to disable memfd_create(2) with
_LIBCONTAINER_DISABLE_MEMFD_CLONE is fairly minimal -- though it should
be noted that quite a few distributions don't use tmpfs for /tmp (and
instead have it as a regular directory or subvolume of the host
filesystem).

Since runc must have write access to the state directory anyway (and the
state directory is usually not on a tmpfs) we can use that instead of
/tmp -- avoiding potential memcg costs with no real downside.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2019-03-01 23:28:51 +11:00
Adrian Reber
9edb5494bb Use vendored in CRIU Go bindings
This makes use of the vendored in Go bindings and removes the copy of
the CRIU RPC interface definition. runc now relies on go-criu for RPC
definition and hopefully more CRIU functions can be used in the future
from the CRIU Go bindings.

Signed-off-by: Adrian Reber <areber@redhat.com>
2019-02-14 18:20:02 +01:00
Adrian Reber
7354546cc8 Create mountpoints also on restore
runc creates all missing mountpoints when it starts a container, this
commit also creates those mountpoints during restore. Now it is possible
to restore a container using the same, but newly created rootfs just as
during container start.

Signed-off-by: Adrian Reber <areber@redhat.com>
2019-02-08 15:59:51 +01:00
Adrian Reber
e157963054 Enable CRIU configuration files
CRIU 3.11 introduces configuration files:

https://criu.org/Configuration_files
https://lisas.de/~adrian/posts/2018-Nov-08-criu-configuration-files.html

This enables the user to influence CRIU's behaviour without code changes
if using new CRIU features or if the user wants to enable certain CRIU
behaviour without always specifying certain options.

With this it is possible to write 'tcp-established' to the configuration
file:

$ echo tcp-established > /etc/criu/runc.conf

and from now on all checkpoints will preserve the state of established
TCP connections. This removes the need to always use

$ runc checkpoint --tcp-stablished

If the goal is to always checkpoint with '--tcp-established'

It also adds the possibility for unexpected CRIU behaviour if the user
created a configuration file at some point in time and forgets about it.

As a result of the discussion in https://github.com/opencontainers/runc/pull/1933
it is now also possible to define a CRIU configuration file for each
container with the annotation 'org.criu.config'.

If 'org.criu.config' does not exist, runc will tell CRIU to use
'/etc/criu/runc.conf' if it exists.

If 'org.criu.config' is set to an empty string (''), runc will tell CRIU
to not use any runc specific configuration file at all.

If 'org.criu.config' is set to a non-empty string, runc will use that
value as an additional configuration file for CRIU.

With the annotation the user can decide to use the default configuration
file ('/etc/criu/runc.conf'), none or a container specific configuration
file.

Signed-off-by: Adrian Reber <areber@redhat.com>
2018-12-21 07:42:12 +01:00
Michael Crosby
96ec2177ae Merge pull request #1943 from giuseppe/allow-to-signal-paused-containers
kill: allow to signal paused containers
2018-12-03 16:55:13 -05:00
Ace-Tang
dce70cdff5 cr: get pid from criu notify when restore
when restore container from a checkpoint directory, we should get
pid from criu notify, since c.initProcess has not been created.

Signed-off-by: Ace-Tang <aceapril@126.com>
2018-12-03 13:31:20 +08:00