zishuo/runc

mirror of https://github.com/opencontainers/runc.git synced 2025-10-06 16:07:09 +08:00

Author	SHA1	Message	Date
Mrunal Patel	1079288bef	Merge pull request #2902 from liusdu/checkpoint checkpoint: resolve symlink for external bind mount	2021-06-24 22:52:02 -04:00
Mrunal Patel	245fe2b678	Merge pull request #3029 from liusdu/work checkpoint: set default work-dir to image-path	2021-06-24 22:44:48 -04:00
Kir Kolyshkin	a7cfb23b88	*: stop using pkg/errors Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-06-22 16:09:47 -07:00
Kir Kolyshkin	56e478046a	*: ignore errorlint warnings about unix.* errors Errors from unix.* are always bare and thus can be used directly. Add //nolint:errorlint annotation to ignore errors such as these: libcontainer/system/xattrs_linux.go:18:7: comparing with == will fail on wrapped errors. Use errors.Is to check for a specific error (errorlint) case errno == unix.ERANGE: ^ libcontainer/container_linux.go:1259:9: comparing with != will fail on wrapped errors. Use errors.Is to check for a specific error (errorlint) if e != unix.EINVAL { ^ libcontainer/rootfs_linux.go:919:7: comparing with != will fail on wrapped errors. Use errors.Is to check for a specific error (errorlint) if err != unix.EINVAL && err != unix.EPERM { ^ libcontainer/rootfs_linux.go:1002:4: switch on an error will fail on wrapped errors. Use errors.Is to check for specific errors (errorlint) switch err { ^ Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-06-22 16:09:47 -07:00
Kir Kolyshkin	7be93a66b9	*: fmt.Errorf: use %w when appropriate This should result in no change when the error is printed, but make the errors returned unwrappable, meaning errors.As and errors.Is will work. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-06-22 16:09:47 -07:00
Kir Kolyshkin	36aefad45d	libct: wrap unix.Mount/Unmount errors Errors returned by unix are bare. In some cases it's impossible to find out what went wrong because there is not enough context. Add a mountError type (mostly copy-pasted from github.com/moby/sys/mount), and mount/unmount helpers. Use these where appropriate, and convert error checks to use errors.Is. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-06-22 16:09:37 -07:00
Kir Kolyshkin	627a06ad92	Replace fmt.Errorf w/o %-style to errors.New Using fmt.Errorf for errors that do not have %-style formatting directives is overkill. Switch to errors.New. Found by git grep fmt.Errorf | grep -v ^vendor | grep -v '%' Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-06-22 11:42:07 -07:00
Liu Hua	85aabe233e	C/R: let criu use its default if --work-path is not set Now runc puts dump/restore logs in c.root by default, which is deleted when the container exits. So if checkpointing/restoring fails, we cannot get these logs to analyze why. This patch lets criu use its default if --work-path is not set: - Use WorkDirectory found in criu's configfile. - Use ImageDirectory. Signed-off-by: Liu Hua <weldonliu@tencent.com>	2021-06-16 20:47:25 +08:00
Adrian Reber	535f25c44f	Allow restoring with a different LSM profile Restoring an SELinux enabled container with Podman will result in a container with exactly the same SELinux process labels as during checkpointing. CRIU takes care of all the process labels. Restoring multiple copies of a checkpointed container will result in all containers having the same SELinux process labels, which might be undesired. When looking at Pods, all containers in a Pod share the process label of the infrastructure container. To restore a container into an existing Pod it is necessary to tell CRIU to restore the container with the infrastructure container process label. CRIU has supported setting different process labels using --lsm-profile for a long time and this just passes the process label information from runc to CRIU. Unfortunately CRIU has a bug as no one was using the --lsm-profile option so this change requires the upcoming CRIU version 3.16. Signed-off-by: Adrian Reber <areber@redhat.com>	2021-06-07 18:05:24 +02:00
Kir Kolyshkin	e6048715e4	Use gofumpt to format code gofumpt (mvdan.cc/gofumpt) is a fork of gofmt with stricter rules. Brought to you by git ls-files \*.go | grep -v ^vendor/ | xargs gofumpt -s -w Looking at the diff, all these changes make sense. Also, replace gofmt with gofumpt in golangci.yml. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-06-01 12:17:27 -07:00
Sebastiaan van Stijn	b45fbd43b8	errcheck: libcontainer Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2021-05-20 14:19:26 +02:00
Aleksa Sarai	0ca91f44f1	rootfs: add mount destination validation Because the target of a mount is inside a container (which may be a volume that is shared with another container), there exists a race condition where the target of the mount may change to a path containing a symlink after we have sanitised the path -- resulting in us inadvertently mounting the path outside of the container. This is not immediately useful because we are in a mount namespace with MS_SLAVE mount propagation applied to "/", so we cannot mount on top of host paths in the host namespace. However, if any subsequent mountpoints in the configuration use a subdirectory of that host path as a source, those subsequent mounts will use an attacker-controlled source path (resolved within the host rootfs) -- allowing the bind-mounting of "/" into the container. While arguably configuration issues like this are not entirely within runc's threat model, within the context of Kubernetes (and possibly other container managers that provide semi-arbitrary container creation privileges to untrusted users) this is a legitimate issue. Since we cannot block mounting from the host into the container, we need to block the first stage of this attack (mounting onto a path outside the container). The long-term plan to solve this would be to migrate to libpathrs, but as a stop-gap we implement libpathrs-like path verification through readlink(/proc/self/fd/$n) and then do mount operations through the procfd once it's been verified to be inside the container. The target could move after we've checked it, but if it is inside the container then we can assume that it is safe for the same reason that libpathrs operations would be safe. A slight wrinkle is the "copyup" functionality we provide for tmpfs, which is the only case where we want to do a mount on the host filesystem. To facilitate this, I split out the copy-up functionality entirely so that the logic isn't interspersed with the regular tmpfs logic. 
In addition, all dependencies on m.Destination being overwritten have been removed since that pattern was just begging to be a source of more mount-target bugs (we do still have to modify m.Destination for tmpfs-copyup but we only do it temporarily). Fixes: CVE-2021-30465 Reported-by: Etienne Champetier <champetier.etienne@gmail.com> Co-authored-by: Noah Meyerhans <nmeyerha@amazon.com> Reviewed-by: Samuel Karp <skarp@amazon.com> Reviewed-by: Kir Kolyshkin <kolyshkin@gmail.com> (@kolyshkin) Reviewed-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2021-05-19 16:58:35 +10:00
Kir Kolyshkin	3f65946756	libct/cg: make Set accept configs.Resources A cgroup manager's Set method sets cgroup resources, but historically it was accepting configs.Cgroups. Refactor it to accept resources only. This is an improvement from the API point of view, as the method can not change cgroup configuration (such as path to the cgroup etc), it can only set (modify) its resources/limits. This also lays the foundation for complicated resource updates, as now Set has two sets of resources -- the one that was previously specified during cgroup manager creation (or the previous Set), and the one passed in the argument, so it could deduce the difference between these. This is a long term goal though. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-04-29 15:24:19 -07:00
Liu Hua	da22625f69	checkpoint: resolve symlink for external bind mount runc resolves symlink before doing bind mount. So we should save original path while formatting CriuReq for checkpoint. Signed-off-by: Liu Hua <weldonliu@tencent.com>	2021-04-25 09:50:00 +08:00
Kir Kolyshkin	ff692f289b	Fix cgroup2 mount for rootless case In case of rootless, cgroup2 mount is not possible (see [1] for more details), so since commit `9c81440fb5` runc bind-mounts the whole /sys/fs/cgroup into container. Problem is, if cgroupns is enabled, /sys/fs/cgroup inside the container is supposed to show the cgroup files for this cgroup, not the root one. The fix is to pass through and use the cgroup path in case cgroup2 mount failed, cgroupns is enabled, and the path is non-empty. Surely this requires the /sys/fs/cgroup mount in the spec, so modify runc spec --rootless to keep it. Before: $ ./runc run aaa # find /sys/fs/cgroup/ -type d /sys/fs/cgroup /sys/fs/cgroup/user.slice /sys/fs/cgroup/user.slice/user-1000.slice /sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service ... # ls -l /sys/fs/cgroup/cgroup.controllers -r--r--r-- 1 nobody nogroup 0 Feb 24 02:22 /sys/fs/cgroup/cgroup.controllers # wc -w /sys/fs/cgroup/cgroup.procs 142 /sys/fs/cgroup/cgroup.procs # cat /sys/fs/cgroup/memory.current cat: can't open '/sys/fs/cgroup/memory.current': No such file or directory After: # find /sys/fs/cgroup/ -type d /sys/fs/cgroup/ # ls -l /sys/fs/cgroup/cgroup.controllers -r--r--r-- 1 root root 0 Feb 24 02:43 /sys/fs/cgroup/cgroup.controllers # wc -w /sys/fs/cgroup/cgroup.procs 2 /sys/fs/cgroup/cgroup.procs # cat /sys/fs/cgroup/memory.current 577536 [1] https://github.com/opencontainers/runc/issues/2158 Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-04-20 12:35:40 -07:00
Kir Kolyshkin	deb8a8dd77	libct/newInitConfig: nit Move the initialization of Console* fields as they are unconditional. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-04-20 12:30:02 -07:00
Akihiro Suda	c453f1a523	Merge pull request #2881 from kolyshkin/test-rand-cg tests/int: some refactoring, fix a flake	2021-04-06 13:27:19 +09:00
Kir Kolyshkin	2dd62b3d42	libct/checkCriuFeatures: rm excessive debug 1. Remove printing criu args as now they are always swrk 3. 2. Remove duplicated "feature check says" debug. Before: > DEBU[0000] Using CRIU with following args: [swrk 3] > DEBU[0000] Using CRIU in FEATURE_CHECK mode > DEBU[0000] Feature check says: type:FEATURE_CHECK success:true features:<mem_track:false lazy_pages:true > > DEBU[0000] Feature check says: mem_track:false lazy_pages:true After: > DEBU[0000] Using CRIU in FEATURE_CHECK mode > DEBU[0000] Feature check says: mem_track:false lazy_pages:true Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-04-01 12:57:41 -07:00
Kir Kolyshkin	bed4d89f57	Merge pull request #2807 from kolyshkin/google-golang-protobuf go.mod, libct: switch to google.golang.org/protobuf	2021-03-31 20:34:16 -07:00
Kir Kolyshkin	b3be2b0b4f	libct: close execFifo after start Apparently, the parent never closes execFifo fd. Not a problem for runc per se, but can be an issue for a user of libcontainer. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-03-30 19:58:09 -07:00
Kir Kolyshkin	201d60c51d	runc run/start/exec: fix init log forwarding race Sometimes debug.bats test cases are failing like this: > not ok 27 global --debug to --log --log-format 'json' > # (in test file tests/integration/debug.bats, line 77) > # `[[ "${output}" == "child process in init()" ]]' failed It happens more often when writing to disk. This issue is caused by the fact that runc spawns a log forwarding goroutine (ForwardLogs) but does not wait for it to finish, resulting in missing debug lines from nsexec. ForwardLogs itself, though, never finishes, because it reads from the read end of a pipe whose write end is not closed. This is especially true in case of runc create, which spawns runc init and exits; meanwhile runc init waits on the exec fifo for an arbitrarily long time before doing execve. So, to fix the failure described above, we need to: 1. Make runc create/run/exec wait for ForwardLogs to finish; 2. Make runc init close its log pipe file descriptor (i.e. the one whose value is passed in the _LIBCONTAINER_LOGPIPE environment variable). This is exactly what this commit does: 1. Amend ForwardLogs to return a channel, and wait for it in start(). 2. In runc init, save the log fd and close it as late as possible. PS I have to admit I still do not understand why an explicit close of the log pipe fd is required in e.g. (*linuxSetnsInit).Init, right before the execve which (thanks to CLOEXEC) closes the fd anyway. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-03-25 19:18:55 -07:00
Akihiro Suda	0ae1475066	Merge pull request #2798 from adrianreber/2021-02-08-nested-bind.mounts Correctly restore containers with nested bind mounts	2021-03-17 12:47:21 +09:00
Kir Kolyshkin	97f2e351a8	go.mod, libct: bump go-criu to v5, use google.golang.org/protobuf Switch from github.com/golang/protobuf (which appears to be obsoleted) to google.golang.org/protobuf (which appears to be a replacement). This needs a bump to go-criu v5. [v2: fix debug print in criuSwrk] [v3: switch to go-criu v5] Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-03-04 08:59:55 -08:00
Kir Kolyshkin	db025aba75	libct: criuSwrk: only iterate over CriuOpts if debug is set In case log level is less than debug, this code does nothing, so let's add a condition and skip it entirely. Add a test case to make sure this code path is hit. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-03-04 08:58:42 -08:00
Kir Kolyshkin	38b2dd391d	runc exec: report possible OOM kill An exec may fail due to memory shortage (cgroup memory limits being too tight), and an error message provided in this case is clueless: > $ sudo ../runc/runc exec xx56 top > ERRO[0000] exec failed: container_linux.go:367: starting container process caused: read init-p: connection reset by peer Same as the previous commit for run/start, check the OOM kill counter and report an OOM kill. The differences from run are 1. The container is already running and OOM kill counter might not be zero. This is why we have to read the counter before exec and after it failed. 2. An unrelated OOM kill event might occur in parallel with our exec (and I see no way to find out which process was killed, except to parse kernel logs which seems excessive and not very reliable). This is why we report _possible_ OOM kill. With this commit, the error message looks like: > ERRO[0000] exec failed: container_linux.go:367: starting container process caused: process_linux.go:105: possibly OOM-killed caused: read init-p: connection reset by peer Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-02-23 16:16:33 -08:00
Adrian Reber	705b6cc723	Re-create mountpoints during restore runc was already re-creating mountpoints before calling CRIU to restore a container. But mountpoints inside a bind mount were not re-created. During initial container creation runc will mount bind mounts and then create the necessary mountpoints for further mounts inside those bind mounts. If, for example, one of the bind mounts is a tmpfs and empty before restore, CRIU will fail re-mounting all mounts because the mountpoints in the bind mounted tmpfs no longer exist. CRIU expects all mount points to exist as during checkpointing. This changes runc to mount bind mounts after mountpoint creation to ensure nested bind mounts have their mountpoints created before CRIU does the restore. Signed-off-by: Adrian Reber <areber@redhat.com>	2021-02-10 08:55:57 +01:00
Kir Kolyshkin	72f463891d	libct: add TODO about os.ErrProcessDone This is a new variable added by go 1.16 so we'll have to wait until 1.16 is the minimum supported version, thus TODO for now. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-01-06 14:57:39 -08:00
Kir Kolyshkin	d7df30181b	libct: suppress bogus "unable to terminate" warnings While working on a test case for [1], I got the following warning: > level=warning msg="unable to terminate initProcess" error="exit status 1" Obviously, the warning is bogus since the initProcess is terminated. This is happening because terminate() can return errors from either Kill() or Wait(), and the latter returns an error if the process has not finished successfully (i.e. exit status is not 0 or it was killed). Check for a particular error type and filter out those errors. [1] https://github.com/opencontainers/runc/issues/2683 Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-01-06 14:57:39 -08:00
Kir Kolyshkin	6437086ef5	libct/addCriu*Mount: fix gosimple warning > libcontainer/container_linux.go:768:2: S1017: should replace this `if` statement with an unconditional `strings.TrimPrefix` (gosimple) > if strings.HasPrefix(mountDest, c.config.Rootfs) { > ^ > libcontainer/container_linux.go:1150:2: S1017: should replace this `if` statement with an unconditional `strings.TrimPrefix` (gosimple) > if strings.HasPrefix(mountDest, c.config.Rootfs) { > ^ Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-12-03 10:24:27 -08:00
Kir Kolyshkin	d0b5954826	libct/checkCriuFeatures: fix gosimple linter warning > libcontainer/container_linux.go:683:2: S1021: should merge variable declaration with assignment on next line (gosimple) > var t criurpc.CriuReqType > ^ Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-12-03 10:24:27 -08:00
Kir Kolyshkin	11680cd2c7	libct: fix "unused" linter warning Commit `4415446c32` introduces this function which is never used. Remove it. This fixes > libcontainer/container_linux.go:1813:26: func `(*linuxContainer).deleteState` is unused (unused) Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-12-03 10:24:27 -08:00
Akihiro Suda	fb59e6f726	Merge pull request #2583 from adrianreber/join-more-namespaces restore: tell CRIU to use existing namespaces	2020-11-10 00:19:40 +09:00
Amim Knabben	978fa6e906	Fixing some lint issues Signed-off-by: Amim Knabben <amim.knabben@gmail.com>	2020-10-06 14:44:14 -04:00
Akihiro Suda	890cc2aa60	Merge pull request #2612 from thaJeztah/concat use string-concatenation instead of sprintf for simple cases	2020-10-01 01:56:28 +09:00
Sebastiaan van Stijn	8bf216728c	use string-concatenation instead of sprintf for simple cases Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2020-09-30 10:51:59 +02:00
Kir Kolyshkin	a4d5e8a27b	libcontainer/ignoreTerminateError: ignore SIGKILL When we call terminate(), we kill the process, and wait returns the error indicating the process was killed. This is exactly what we expect here, so there is no reason to treat it as an error. Before this patch, when a container with invalid cgroup parameters is started: > WARN[0000] unable to terminate initProcess error="signal: killed" > ERRO[0000] container_linux.go:366: starting container process caused: process_linux.go:495: container init caused: process_linux.go:458: setting cgroup config for procHooks process caused: failed to write "555": open /sys/fs/cgroup/blkio/user.slice/xx33/blkio.weight: permission denied After: > ERRO[0000] container_linux.go:366: starting container process caused: process_linux.go:495: container init caused: process_linux.go:458: setting cgroup config for procHooks process caused: failed to write "555": open /sys/fs/cgroup/blkio/user.slice/xx33/blkio.weight: permission denied I.e. the useless warning is gone. NOTE this breaks a couple of integration test cases, since they were expecting a particular message in the second line, and now due to "signal: killed" removed it's in the first line. Fix those, too. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-09-29 23:55:02 -07:00
Kir Kolyshkin	dc42459130	libct/(initProcess).start: fix removing cgroups on error In case cgroup configuration is invalid (some parameters can't be set etc.), p.manager.Set fails, the error is returned, and then we try to remove cgroups (by calling p.manager.Destroy) in a defer. The problem is, the container init is not yet killed (as it is killed in the caller, i.e. (linuxContainer).start), so cgroup removal fails like this: > time="2020-09-26T07:46:25Z" level=warning msg="Failed to remove cgroup (will retry)" error="rmdir /sys/fs/cgroup/net_cls,net_prio/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod28ce6e74_694c_4b77_a953_dc01e182ac76.slice/crio-f6984c5eeb6c6b49ff3f036bdcb9ded317b3d0b2469ebbb35705442a2afd98c2.scope: device or resource busy" > ... > time="2020-09-26T07:46:27Z" level=error msg="Failed to remove cgroup" error="rmdir /sys/fs/cgroup/net_cls,net_prio/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod28ce6e74_694c_4b77_a953_dc01e182ac76.slice/crio-f6984c5eeb6c6b49ff3f036bdcb9ded317b3d0b2469ebbb35705442a2afd98c2.scope: device or resource busy" The above is repeated for every controller, and looks quite scary. To fix, move the init termination to the abovementioned defer. Do the same for (*setnsProcess).start() for uniformity. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-09-29 18:49:29 -07:00
Adrian Reber	2ccefa63e2	restore: tell CRIU to use existing namespaces runc already tells CRIU to restore into an existing network or PID namespace if there is a path to a namespace specified in config.json. PID and network have special handling in CRIU using CRIU's inherit_fd interface. For UTS, IPC and MOUNT namespaces CRIU can join those existing namespaces using CRIU's join_ns interface. This is especially interesting for environments where containers are running in a pod which already has running containers (pause for example) with namespaces configured and the restored container needs to join these namespaces. CRIU has no support to join an existing CGROUP namespace (yet?), which is why restoring a container with a path specified to a CGROUP namespace will be aborted by runc. CRIU would have support to restore a container into an existing time namespace, but runc does not yet seem to support time namespaces. Signed-off-by: Adrian Reber <areber@redhat.com>	2020-09-17 15:32:44 +02:00
Wei Fu	ba0246da75	libcontainer: Store state.json before sync procRun If the procRun state has been synced and the runc-create process has been killed for some reason, the runc-init[2:stage] process will be leaked. And the runc command also fails to parse the root directory because the container doesn't have state.json. In order to make it possible to clean up the leaked runc-init[2:stage] process, we should store the status before sync procRun. ```before current workflow: [ child ] <-> [ parent ] procHooks --> [run hooks] <-- procResume procReady --> [final setup] <-- procRun ( killed for some reason) ( store state.json ) ``` ```expected expected workflow: [ child ] <-> [ parent ] procHooks --> [run hooks] <-- procResume procReady --> [final setup] store state.json <-- procRun ``` Signed-off-by: Wei Fu <fuweid89@gmail.com>	2020-09-07 23:22:42 +08:00
Mrunal Patel	9ada2e6d4f	Merge pull request #2539 from kolyshkin/ext-pidns-nits external pidns c/r code nits	2020-08-17 11:41:46 -07:00
Akihiro Suda	234d15ecd0	Merge pull request #2520 from thaJeztah/bump_runtime_spec vendor: update runtime-spec v1.0.3-0.20200728170252-4d89ac9fbff6	2020-08-04 14:05:33 +09:00
Akihiro Suda	78d02e8563	Merge pull request #2534 from adrianreber/go-criu-4-1-0 Pass location of CRIU binary to go-criu	2020-08-03 16:21:50 +09:00
Kir Kolyshkin	e54d1e4715	libct: initialize inheritFD in place Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-07-31 17:55:34 -07:00
Kir Kolyshkin	8b973997a4	libct: criuNsToKey doesn't have to be a method Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-07-31 17:52:09 -07:00
Adrian Reber	6f4616dd73	Pass location of CRIU binary to go-criu If the CRIU binary is in a non $PATH location and passed to runc via '--criu /path/to/criu', this information has not been passed to go-criu and since the switch to use go-criu for CRIU version detection, non $PATH CRIU usage was broken. This uses the newly added go-criu interface to pass the location of the binary to go-criu. Signed-off-by: Adrian Reber <areber@redhat.com>	2020-07-31 11:14:15 +02:00
Sebastiaan van Stijn	901dccf05d	vendor: update runtime-spec v1.0.3-0.20200728170252-4d89ac9fbff6 Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2020-07-30 22:08:54 +02:00
Adrian Reber	09e103b01e	Tell CRIU to use an external pid namespace if necessary Trying to checkpoint a container out of pod in cri-o fails with: Error (criu/namespaces.c:1081): Can't dump a pid namespace without the process init Starting with the upcoming CRIU release 3.15, CRIU can be told to ignore the PID namespace during checkpointing and to restore processes into an existing network namespace. With the changes from this commit and CRIU 3.15 it is possible to checkpoint a container out of a pod in cri-o. Signed-off-by: Adrian Reber <areber@redhat.com>	2020-07-27 10:14:08 +02:00
Adrian Reber	610c5ad75c	Factor out checkpointing with external namespace code To checkpoint and restore a container with an external network namespace (like with Podman and CNI), runc tells CRIU to ignore the network namespace during checkpoint and restore. This commit moves that code to their own functions to be able to reuse the same code path for external PID namespaces which are necessary for checkpointing and restoring containers out of a pod in cri-o. Signed-off-by: Adrian Reber <areber@redhat.com>	2020-07-27 10:14:07 +02:00
Kir Kolyshkin	108ee85b82	libct/cgroups: add SkipDevices to Resources The kubelet uses libct/cgroups code to set up cgroups. It creates a parent cgroup (kubepods) to put the containers into. The problem (for cgroupv2, which uses eBPF for device configuration) is that the hard requirement to have the devices cgroup configured results in leaking an eBPF program upon every kubelet restart. If kubelet is restarted 64+ times, the cgroup can't be configured anymore. Work around this by adding a SkipDevices flag to Resources. A check was added so that if SkipDevices is set, such a "container" can't be started (to make sure it is only used for non-containers). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-07-02 15:19:31 -07:00
Kir Kolyshkin	dff7685c18	Merge pull request #2459 from tedyu/linux-cont-set-cfg Set configs back when intelrdt configs cannot be set LGTMS: @AkihiroSuda @kolyshkin	2020-06-19 12:57:53 -07:00

284 Commits
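
The gosimple S1017 warning quoted in commit 6437086ef5 asks for an unconditional strings.TrimPrefix in place of an `if strings.HasPrefix(...)` guard. A small sketch of why the two forms are equivalent follows; relDest and the paths are hypothetical and do not come from the runc source.

```go
package main

import (
	"fmt"
	"strings"
)

// relDest returns mountDest relative to rootfs. strings.TrimPrefix returns
// its input unchanged when the prefix is absent, so no HasPrefix check is
// needed around it.
func relDest(mountDest, rootfs string) string {
	return strings.TrimPrefix(mountDest, rootfs)
}

func main() {
	// Hypothetical paths, for illustration only.
	fmt.Println(relDest("/run/c/rootfs/proc", "/run/c/rootfs"))
	fmt.Println(relDest("/proc", "/run/c/rootfs"))
}
```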