Commit Graph

284 Commits

Author SHA1 Message Date
Mrunal Patel
1079288bef Merge pull request #2902 from liusdu/checkpoint
checkpoint: resolve symlink for external bind mount
2021-06-24 22:52:02 -04:00
Mrunal Patel
245fe2b678 Merge pull request #3029 from liusdu/work
checkpoint: set default work-dir to image-path
2021-06-24 22:44:48 -04:00
Kir Kolyshkin
a7cfb23b88 *: stop using pkg/errors
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-22 16:09:47 -07:00
Kir Kolyshkin
56e478046a *: ignore errorlint warnings about unix.* errors
Errors from unix.* are always bare and thus can be used directly.

Add //nolint:errorlint annotation to ignore errors such as these:

libcontainer/system/xattrs_linux.go:18:7: comparing with == will fail on wrapped errors. Use errors.Is to check for a specific error (errorlint)
	case errno == unix.ERANGE:
	     ^
libcontainer/container_linux.go:1259:9: comparing with != will fail on wrapped errors. Use errors.Is to check for a specific error (errorlint)
					if e != unix.EINVAL {
					   ^
libcontainer/rootfs_linux.go:919:7: comparing with != will fail on wrapped errors. Use errors.Is to check for a specific error (errorlint)
			if err != unix.EINVAL && err != unix.EPERM {
			   ^
libcontainer/rootfs_linux.go:1002:4: switch on an error will fail on wrapped errors. Use errors.Is to check for specific errors (errorlint)
			switch err {
			^

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-22 16:09:47 -07:00
Kir Kolyshkin
7be93a66b9 *: fmt.Errorf: use %w when appropriate
This should result in no change when the error is printed, but make the
errors returned unwrappable, meaning errors.As and errors.Is will work.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-22 16:09:47 -07:00
Kir Kolyshkin
36aefad45d libct: wrap unix.Mount/Unmount errors
Errors returned by unix are bare. In some cases it's impossible to find
out what went wrong because there's is not enough context.

Add a mountError type (mostly copy-pasted from github.com/moby/sys/mount),
and mount/unmount helpers. Use these where appropriate, and convert error
checks to use errors.Is.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-22 16:09:37 -07:00
Kir Kolyshkin
627a06ad92 Replace fmt.Errorf w/o %-style to errors.New
Using fmt.Errorf for errors that do not have %-style formatting
directives is an overkill. Switch to errors.New.

Found by

	git grep fmt.Errorf | grep -v ^vendor | grep -v '%'

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-22 11:42:07 -07:00
Liu Hua
85aabe233e C/R: let criu use its default if --work-path is not set
Now runc puts dump/restore logs in c.root defaultly, which will be deleted
when container exits. So if checkpinting/restoring failed, we can not get
these logs and analyze why.

This patch lets criu use its default if --work-path is not set:
 - Use WorkDirectory found in criu's configfile.
 - Use ImageDirectory.

Signed-off-by: Liu Hua <weldonliu@tencent.com>
2021-06-16 20:47:25 +08:00
Adrian Reber
535f25c44f Allow restoring with a different LSM profile
Restoring an SELinux enabled container with Podman will result in
a container with the exactly same SELinux process labels as during
checkpointing. CRIU takes care of all the process labels.

Restoring multiple copies of a checkpointed container will result in all
containers having the same SELinux process labels, which might be
undesired.

When looking at Pods all container in a Pod share the process label
of the infrastructure container. To restore a container into and
existing Pod it is necessary to tell CRIU to restore the container
with the infrastructure container process label.

CRIU supports setting different process labels using --lsm-profile for a
long time and this just passes the process label information from runc
to CRIU.

Unfortunately CRIU has a bug as no one was using the --lsm-profile
option so this changes requires the upcoming CRIU version 3.16.

Signed-off-by: Adrian Reber <areber@redhat.com>
2021-06-07 18:05:24 +02:00
Kir Kolyshkin
e6048715e4 Use gofumpt to format code
gofumpt (mvdan.cc/gofumpt) is a fork of gofmt with stricter rules.

Brought to you by

	git ls-files \*.go | grep -v ^vendor/ | xargs gofumpt -s -w

Looking at the diff, all these changes make sense.

Also, replace gofmt with gofumpt in golangci.yml.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-01 12:17:27 -07:00
Sebastiaan van Stijn
b45fbd43b8 errcheck: libcontainer
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-05-20 14:19:26 +02:00
Aleksa Sarai
0ca91f44f1 rootfs: add mount destination validation
Because the target of a mount is inside a container (which may be a
volume that is shared with another container), there exists a race
condition where the target of the mount may change to a path containing
a symlink after we have sanitised the path -- resulting in us
inadvertently mounting the path outside of the container.

This is not immediately useful because we are in a mount namespace with
MS_SLAVE mount propagation applied to "/", so we cannot mount on top of
host paths in the host namespace. However, if any subsequent mountpoints
in the configuration use a subdirectory of that host path as a source,
those subsequent mounts will use an attacker-controlled source path
(resolved within the host rootfs) -- allowing the bind-mounting of "/"
into the container.

While arguably configuration issues like this are not entirely within
runc's threat model, within the context of Kubernetes (and possibly
other container managers that provide semi-arbitrary container creation
privileges to untrusted users) this is a legitimate issue. Since we
cannot block mounting from the host into the container, we need to block
the first stage of this attack (mounting onto a path outside the
container).

The long-term plan to solve this would be to migrate to libpathrs, but
as a stop-gap we implement libpathrs-like path verification through
readlink(/proc/self/fd/$n) and then do mount operations through the
procfd once it's been verified to be inside the container. The target
could move after we've checked it, but if it is inside the container
then we can assume that it is safe for the same reason that libpathrs
operations would be safe.

A slight wrinkle is the "copyup" functionality we provide for tmpfs,
which is the only case where we want to do a mount on the host
filesystem. To facilitate this, I split out the copy-up functionality
entirely so that the logic isn't interspersed with the regular tmpfs
logic. In addition, all dependencies on m.Destination being overwritten
have been removed since that pattern was just begging to be a source of
more mount-target bugs (we do still have to modify m.Destination for
tmpfs-copyup but we only do it temporarily).

Fixes: CVE-2021-30465
Reported-by: Etienne Champetier <champetier.etienne@gmail.com>
Co-authored-by: Noah Meyerhans <nmeyerha@amazon.com>
Reviewed-by: Samuel Karp <skarp@amazon.com>
Reviewed-by: Kir Kolyshkin <kolyshkin@gmail.com> (@kolyshkin)
Reviewed-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2021-05-19 16:58:35 +10:00
Kir Kolyshkin
3f65946756 libct/cg: make Set accept configs.Resources
A cgroup manager's Set method sets cgroup resources, but historically
it was accepting configs.Cgroups.

Refactor it to accept resources only. This is an improvement from the
API point of view, as the method can not change cgroup configuration
(such as path to the cgroup etc), it can only set (modify) its
resources/limits.

This also lays the foundation for complicated resource updates, as now
Set has two sets of resources -- the one that was previously specified
during cgroup manager creation (or the previous Set), and the one passed
in the argument, so it could deduce the difference between these. This
is a long term goal though.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-04-29 15:24:19 -07:00
Liu Hua
da22625f69 checkpoint: resolve symlink for external bind mount
runc resolves symlink before doing bind mount. So
we should save original path while formatting CriuReq for
checkpoint.

Signed-off-by: Liu Hua <weldonliu@tencent.com>
2021-04-25 09:50:00 +08:00
Kir Kolyshkin
ff692f289b Fix cgroup2 mount for rootless case
In case of rootless, cgroup2 mount is not possible (see [1] for more
details), so since commit 9c81440fb5 runc bind-mounts the whole
/sys/fs/cgroup into container.

Problem is, if cgroupns is enabled, /sys/fs/cgroup inside the container
is supposed to show the cgroup files for this cgroup, not the root one.

The fix is to pass through and use the cgroup path in case cgroup2
mount failed, cgroupns is enabled, and the path is non-empty.

Surely this requires the /sys/fs/cgroup mount in the spec, so modify
runc spec --rootless to keep it.

Before:

	$ ./runc run aaa
	# find /sys/fs/cgroup/ -type d
	/sys/fs/cgroup
	/sys/fs/cgroup/user.slice
	/sys/fs/cgroup/user.slice/user-1000.slice
	/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service
	...
	# ls -l /sys/fs/cgroup/cgroup.controllers
	-r--r--r--    1 nobody   nogroup          0 Feb 24 02:22 /sys/fs/cgroup/cgroup.controllers
	# wc -w /sys/fs/cgroup/cgroup.procs
	142 /sys/fs/cgroup/cgroup.procs
	# cat /sys/fs/cgroup/memory.current
	cat: can't open '/sys/fs/cgroup/memory.current': No such file or directory

After:

	# find /sys/fs/cgroup/ -type d
	/sys/fs/cgroup/
	# ls -l /sys/fs/cgroup/cgroup.controllers
	-r--r--r--    1 root     root             0 Feb 24 02:43 /sys/fs/cgroup/cgroup.controllers
	# wc -w /sys/fs/cgroup/cgroup.procs
	2 /sys/fs/cgroup/cgroup.procs
	# cat /sys/fs/cgroup/memory.current
	577536

[1] https://github.com/opencontainers/runc/issues/2158

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-04-20 12:35:40 -07:00
Kir Kolyshkin
deb8a8dd77 libct/newInitConfig: nit
Move the initialization of Console* fields as they are unconditional.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-04-20 12:30:02 -07:00
Akihiro Suda
c453f1a523 Merge pull request #2881 from kolyshkin/test-rand-cg
tests/int: some refactoring, fix a flake
2021-04-06 13:27:19 +09:00
Kir Kolyshkin
2dd62b3d42 libct/checkCriuFeatures: rm excessive debug
1. Remove printing criu args as now they are *always swrk 3.
2. Remove duplicated "feature check says" debug.

Before:

> DEBU[0000] Using CRIU with following args: [swrk 3]
> DEBU[0000] Using CRIU in FEATURE_CHECK mode
> DEBU[0000] Feature check says: type:FEATURE_CHECK success:true features:<mem_track:false lazy_pages:true >
> DEBU[0000] Feature check says: mem_track:false lazy_pages:true

After:

> DEBU[0000] Using CRIU in FEATURE_CHECK mode
> DEBU[0000] Feature check says: mem_track:false lazy_pages:true

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-04-01 12:57:41 -07:00
Kir Kolyshkin
bed4d89f57 Merge pull request #2807 from kolyshkin/google-golang-protobuf
go.mod, libct: switch to google.golang.org/protobuf
2021-03-31 20:34:16 -07:00
Kir Kolyshkin
b3be2b0b4f libct: close execFifo after start
Apparently, the parent never closes execFifo fd. Not a problem for runc
per se, but can be an issue for a user of libcontainer.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-03-30 19:58:09 -07:00
Kir Kolyshkin
201d60c51d runc run/start/exec: fix init log forwarding race
Sometimes debug.bats test cases are failing like this:

> not ok 27 global --debug to --log --log-format 'json'
> # (in test file tests/integration/debug.bats, line 77)
> #   `[[ "${output}" == *"child process in init()"* ]]' failed

It happens more when writing to disk.

This issue is caused by the fact that runc spawns log forwarding goroutine
(ForwardLogs) but does not wait for it to finish, resulting in missing
debug lines from nsexec.

ForwardLogs itself, though, never finishes, because it reads from a
reading side of a pipe which writing side is not closed. This is
especially true in case of runc create, which spawns runc init and
exits; meanwhile runc init waits on exec fifo for arbitrarily long
time before doing execve.

So, to fix the failure described above, we need to:

 1. Make runc create/run/exec wait for ForwardLogs to finish;

 2. Make runc init close its log pipe file descriptor (i.e.
    the one which value is passed in _LIBCONTAINER_LOGPIPE
    environment variable).

This is exactly what this commit does:

 1. Amend ForwardLogs to return a channel, and wait for it in start().

 2. In runc init, save the log fd and close it as late as possible.

PS I have to admit I still do not understand why an explicit close of
log pipe fd is required in e.g. (*linuxSetnsInit).Init, right before
the execve which (thanks to CLOEXEC) closes the fd anyway.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-03-25 19:18:55 -07:00
Akihiro Suda
0ae1475066 Merge pull request #2798 from adrianreber/2021-02-08-nested-bind.mounts
Correctly restore containers with nested bind mounts
2021-03-17 12:47:21 +09:00
Kir Kolyshkin
97f2e351a8 go.mod, libct: bump go-criu to v5, use google.golang.org/protobuf
Switch from github.com/golang/protobuf (which appears to be obsoleted)
to google.golang.org/protobuf (which appears to be a replacement).

This needs a bump to go-criu v5.

[v2: fix debug print in criuSwrk]
[v3: switch to go-criu v5]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-03-04 08:59:55 -08:00
Kir Kolyshkin
db025aba75 libct: criuSwrk: only iterate over CriuOpts if debug is set
In case log level is less than debug, this code does nothing,
so let's add a condition and skip it entirely.

Add a test case to make sure this code path is hit.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-03-04 08:58:42 -08:00
Kir Kolyshkin
38b2dd391d runc exec: report possible OOM kill
An exec may fail due to memory shortage (cgroup memory limits being too
tight), and an error message provided in this case is clueless:

> $ sudo ../runc/runc exec xx56 top
> ERRO[0000] exec failed: container_linux.go:367: starting container process caused: read init-p: connection reset by peer

Same as the previous commit for run/start, check the OOM kill counter
and report an OOM kill.

The differences from run are

1. The container is already running and OOM kill counter might not be
   zero.  This is why we have to read the counter before exec and after
   it failed.

2. An unrelated OOM kill event might occur in parallel with our exec
   (and I see no way to find out which process was killed, except to
   parse kernel logs which seems excessive and not very reliable).
   This is why we report _possible_ OOM kill.

With this commit, the error message looks like:

> ERRO[0000] exec failed: container_linux.go:367: starting container process caused: process_linux.go:105: possibly OOM-killed caused: read init-p: connection reset by peer

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-02-23 16:16:33 -08:00
Adrian Reber
705b6cc723 Re-create mountpoints during restore
runc was already re-creating mountpoints before calling CRIU to restore
a container. But mountpoints inside a bind mount were not re-created.

During initial container creation runc will mount bind mounts and then
create the necessary mountpoints for further mounts inside those bind
mounts.

If, for example, one of the bind mounts is a tmpfs and empty before
restore, CRIU will fail re-mounting all mounts because the mountpoints
in the bind mounted tmpfs no longer exist.

CRIU expects all mount points to exist as during checkpointing.

This changes runc to mount bind mounts after mountpoint creation to
ensure nested bind mounts have their mountpoints created before CRIU
does the restore.

Signed-off-by: Adrian Reber <areber@redhat.com>
2021-02-10 08:55:57 +01:00
Kir Kolyshkin
72f463891d libct: add TODO about os.ErrProcessDone
This is a new variable added by go 1.16 so we'll have to wait
until 1.16 is minimally supported version, thus TODO for now.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-01-06 14:57:39 -08:00
Kir Kolyshkin
d7df30181b libct: suppress bogus "unable to terminate" warnings
While working on a test case for [1], I got the following warning:

> level=warning msg="unable to terminate initProcess" error="exit status 1"

Obviously, the warning is bogus since the initProcess is terminated.

This is happening because terminate() can return errors from either
Kill() or Wait(), and the latter returns an error if the process has
not finished successfully (i.e. exit status is not 0 or it was killed).

Check for a particular error type and filter out those errors.

[1] https://github.com/opencontainers/runc/issues/2683

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-01-06 14:57:39 -08:00
Kir Kolyshkin
6437086ef5 libct/addCriu*Mount: fix gosimple warning
> libcontainer/container_linux.go:768:2: S1017: should replace this `if` statement with an unconditional `strings.TrimPrefix` (gosimple)
> 	if strings.HasPrefix(mountDest, c.config.Rootfs) {
> 	^
> libcontainer/container_linux.go:1150:2: S1017: should replace this `if` statement with an unconditional `strings.TrimPrefix` (gosimple)
> 	if strings.HasPrefix(mountDest, c.config.Rootfs) {
> 	^

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-12-03 10:24:27 -08:00
Kir Kolyshkin
d0b5954826 libct/checkCriuFeatures: fix gosimple linter warning
> libcontainer/container_linux.go:683:2: S1021: should merge variable declaration with assignment on next line (gosimple)
> 	var t criurpc.CriuReqType
> 	^

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-12-03 10:24:27 -08:00
Kir Kolyshkin
11680cd2c7 libct: fix "unused" linter warning
Commit 4415446c32 introduces this function which is never used.
Remove it.

This fixes

> libcontainer/container_linux.go:1813:26: func `(*linuxContainer).deleteState` is unused (unused)

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-12-03 10:24:27 -08:00
Akihiro Suda
fb59e6f726 Merge pull request #2583 from adrianreber/join-more-namespaces
restore: tell CRIU to use existing namespaces
2020-11-10 00:19:40 +09:00
Amim Knabben
978fa6e906 Fixing some lint issues
Signed-off-by: Amim Knabben <amim.knabben@gmail.com>
2020-10-06 14:44:14 -04:00
Akihiro Suda
890cc2aa60 Merge pull request #2612 from thaJeztah/concat
use string-concatenation instead of sprintf for simple cases
2020-10-01 01:56:28 +09:00
Sebastiaan van Stijn
8bf216728c use string-concatenation instead of sprintf for simple cases
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-09-30 10:51:59 +02:00
Kir Kolyshkin
a4d5e8a27b libcontainer/ignoreTerminateError: ignore SIGKILL
When we call terminate(), we kill the process, and wait
returns the error indicating the process was killed.
This is exactly what we expect here, so there is no reason
to treat it as an error.

Before this patch, when a container with invalid cgroup parameters is
started:

> WARN[0000] unable to terminate initProcess               error="signal: killed"
> ERRO[0000] container_linux.go:366: starting container process caused: process_linux.go:495: container init caused: process_linux.go:458: setting cgroup config for procHooks process caused: failed to write "555": open /sys/fs/cgroup/blkio/user.slice/xx33/blkio.weight: permission denied

After:

> ERRO[0000] container_linux.go:366: starting container process caused: process_linux.go:495: container init caused: process_linux.go:458: setting cgroup config for procHooks process caused: failed to write "555": open /sys/fs/cgroup/blkio/user.slice/xx33/blkio.weight: permission denied

I.e. the useless warning is gone.

NOTE this breaks a couple of integration test cases, since they were
expecting a particular message in the second line, and now due to
"signal: killed" removed it's in the first line. Fix those, too.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-09-29 23:55:02 -07:00
Kir Kolyshkin
dc42459130 libct/(*initProcess).start: fix removing cgroups on error
In case cgroup configuration is invalid (some parameters can't be set
etc.), p.manager.Set fails, the error is returned, and then we try to
remove cgroups (by calling p.manager.Destroy) in a defer.

The problem is, the container init is not yet killed (as it is killed in
the caller, i.e. (*linuxContainer).start), so cgroup removal fails like
this:

> time="2020-09-26T07:46:25Z" level=warning msg="Failed to remove cgroup (will retry)" error="rmdir /sys/fs/cgroup/net_cls,net_prio/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod28ce6e74_694c_4b77_a953_dc01e182ac76.slice/crio-f6984c5eeb6c6b49ff3f036bdcb9ded317b3d0b2469ebbb35705442a2afd98c2.scope: device or resource busy"
> ...
> time="2020-09-26T07:46:27Z" level=error msg="Failed to remove cgroup" error="rmdir /sys/fs/cgroup/net_cls,net_prio/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod28ce6e74_694c_4b77_a953_dc01e182ac76.slice/crio-f6984c5eeb6c6b49ff3f036bdcb9ded317b3d0b2469ebbb35705442a2afd98c2.scope: device or resource busy"

The above is repeated for every controller, and looks quite scary.

To fix, move the init termination to the abovementioned defer.

Do the same for (*setnsProcess).start() for uniformity.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-09-29 18:49:29 -07:00
Adrian Reber
2ccefa63e2 restore: tell CRIU to use existing namespaces
runc already tells CRIU to restore into an existing network or PID
namespace if there is a path to a namespace specified in config.json.

PID and network have special handling in CRIU using CRIU's inherit_fd
interface.

For UTS, IPC and MOUNT namespaces CRIU can join those existing
namespaces using CRIU's join_ns interface.

This is especially interesting for environments where containers are
running in a pod which already has running containers (pause for
example) with namespaces configured and the restored container needs to
join these namespaces.

CRIU has no support to join an existing CGROUP namespace (yet?) why
restoring a container with a path specified to a CGROUP namespace will
be aborted by runc.

CRIU would have support to restore a container into an existing time
namespace, but runc does not yet seem to support time namespaces.

Signed-off-by: Adrian Reber <areber@redhat.com>
2020-09-17 15:32:44 +02:00
Wei Fu
ba0246da75 libcontainer: Store state.json before sync procRun
If the procRun state has been synced and the runc-create process has
been killed for some reason, the runc-init[2:stage] process will be
leaky. And the runc command also fails to parse root directory because
the container doesn't have state.json.

In order to make it possible to clean the leaky runc-init[2:stage]
process , we should store the status before sync procRun.

```before
current workflow:

[  child  ] <-> [   parent   ]

procHooks   --> [run hooks]
            <-- procResume

procReady   --> [final setup]
            <-- procRun

                ( killed for some reason)
                ( store state.json )
```

```expected
expected workflow:

[  child  ] <-> [   parent   ]

procHooks   --> [run hooks]
            <-- procResume

procReady   --> [final setup]
                store state.json
            <-- procRun
```

Signed-off-by: Wei Fu <fuweid89@gmail.com>
2020-09-07 23:22:42 +08:00
Mrunal Patel
9ada2e6d4f Merge pull request #2539 from kolyshkin/ext-pidns-nits
external pidns c/r code nits
2020-08-17 11:41:46 -07:00
Akihiro Suda
234d15ecd0 Merge pull request #2520 from thaJeztah/bump_runtime_spec
vendor: update runtime-spec v1.0.3-0.20200728170252-4d89ac9fbff6
2020-08-04 14:05:33 +09:00
Akihiro Suda
78d02e8563 Merge pull request #2534 from adrianreber/go-criu-4-1-0
Pass location of CRIU binary to go-criu
2020-08-03 16:21:50 +09:00
Kir Kolyshkin
e54d1e4715 libct: initialize inheritFD in place
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-07-31 17:55:34 -07:00
Kir Kolyshkin
8b973997a4 libct: criuNsToKey doesn't have to be a method
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-07-31 17:52:09 -07:00
Adrian Reber
6f4616dd73 Pass location of CRIU binary to go-criu
If the CRIU binary is in a non $PATH location and passed to runc via
'--criu /path/to/criu', this information has not been passed to go-criu
and since the switch to use go-criu for CRIU version detection, non
$PATH CRIU usage was broken. This uses the newly added go-criu interface
to pass the location of the binary to go-criu.

Signed-off-by: Adrian Reber <areber@redhat.com>
2020-07-31 11:14:15 +02:00
Sebastiaan van Stijn
901dccf05d vendor: update runtime-spec v1.0.3-0.20200728170252-4d89ac9fbff6
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-07-30 22:08:54 +02:00
Adrian Reber
09e103b01e Tell CRIU to use an external pid namespace if necessary
Trying to checkpoint a container out of pod in cri-o fails with:

  Error (criu/namespaces.c:1081): Can't dump a pid namespace without the process init

Starting with the upcoming CRIU release 3.15, CRIU can be told to ignore
the PID namespace during checkpointing and to restore processes into an
existing network namespace.

With the changes from this commit and CRIU 3.15 it is possible to
checkpoint a container out of a pod in cri-o.

Signed-off-by: Adrian Reber <areber@redhat.com>
2020-07-27 10:14:08 +02:00
Adrian Reber
610c5ad75c Factor out checkpointing with external namespace code
To checkpoint and restore a container with an external network namespace
(like with Podman and CNI), runc tells CRIU to ignore the network
namespace during checkpoint and restore.

This commit moves that code to their own functions to be able to reuse
the same code path for external PID namespaces which are necessary for
checkpointing and restoring containers out of a pod in cri-o.

Signed-off-by: Adrian Reber <areber@redhat.com>
2020-07-27 10:14:07 +02:00
Kir Kolyshkin
108ee85b82 libct/cgroups: add SkipDevices to Resources
The kubelet uses libct/cgroups code to set up cgroups. It creates a
parent cgroup (kubepods) to put the containers into.

The problem (for cgroupv2 that uses eBPF for device configuration) is
the hard requirement to have devices cgroup configured results in
leaking an eBPF program upon every kubelet restart.  program. If kubelet
is restarted 64+ times, the cgroup can't be configured anymore.

Work around this by adding a SkipDevices flag to Resources.

A check was added so that if SkipDevices is set, such a "container"
can't be started (to make sure it is only used for non-containers).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-07-02 15:19:31 -07:00
Kir Kolyshkin
dff7685c18 Merge pull request #2459 from tedyu/linux-cont-set-cfg
Set configs back when intelrdt configs cannot be set

LGTMS: @AkihiroSuda @kolyshkin
2020-06-19 12:57:53 -07:00