zishuo/runc

mirror of https://github.com/opencontainers/runc.git synced 2025-10-12 19:01:55 +08:00

Author	SHA1	Message	Date
lifubang	f07d92dbcd	drop runc-dmz solution according to overlay solution Because we have the overlay solution, we can drop runc-dmz binary solution since it has too many limitations. Signed-off-by: lifubang <lifubang@acmcoder.com> (cherry picked from commit `871057d863`) Signed-off-by: lifubang <lifubang@acmcoder.com>	2024-10-29 17:52:12 +08:00
Kir Kolyshkin	f479676cce	libct: rm x/sys/execabs usage Since Go 1.19, the same functionality is there in os/exec package. As we require go 1.22 now, there's no need to have this. This basically reverts commit `9258eac0` ("libct/start: use execabs for newuidmap lookup"). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com> (cherry picked from commit `eb2ff52ace`) Signed-off-by: Austin Vazquez <macedonv@amazon.com>	2024-10-26 18:27:18 +00:00
lfbzhm	e669926691	fix an error caused by fd reuse race when starting runc init There is a race when we are opening a file: if a small fd is closed at that time, it may be reused by safeExe. Because of a Go stdlib fd shuffling bug, if the fd of safeExe is too small, the Go stdlib will dup3 it to another fd, or dup3 another fd to this fd. The fd-based path in cmd.Path then refers to a random file, which can lead to a "permission denied" error when starting the process. Please see #4294 and <https://github.com/golang/go/issues/61751>. So we should not use the original fd of safeExe, but the fd after it has been shuffled by the Go stdlib, because the Go stdlib guarantees that this fd refers to the correct file. Signed-off-by: lfbzhm <lifubang@acmcoder.com>	2024-10-21 06:53:44 +00:00
lifubang	10c951e335	add ErrCgroupNotExist For some rootless containers, runc has no access to the cgroup, but the container is still running. So we should return the `ErrNotRunning` and `ErrCgroupNotExist` errors separately. Signed-off-by: lifubang <lifubang@acmcoder.com>	2024-09-23 23:27:35 +00:00
Akihiro Suda	429e06a518	libct: Signal: honor RootlessCgroups `signalAllProcesses()` depends on the cgroup and is expected to fail when runc is running in rootless without an access to the cgroup. When `RootlessCgroups` is set to `true`, runc just ignores the error from `signalAllProcesses` and may leak some processes running. (See the comments in PR 4395) In the future, runc should walk the process tree to avoid such a leak. Note that `RootlessCgroups` is a misnomer; it is set to `false` despite the name when cgroup v2 delegation is configured. This is expected to be renamed in a separate commit. Fix issue 4394 Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>	2024-09-11 03:54:52 +09:00
Kir Kolyshkin	bb2db7b4fd	libct: drop error from (*Container).currentState return This function never returns error since 2016 (commit `556f798a`), so let's remove it. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-07-10 16:24:36 -07:00
Kir Kolyshkin	c8395b6e53	Enable govet nilness, fix an issue The code already checked if err == nil above, so the linter complains: > libcontainer/container_linux.go:534:18: nilness: tautological condition: non-nil != nil (govet) > } else if err != nil { > ^ Fix the issue, enable the check. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-07-10 16:04:38 -07:00
Kir Kolyshkin	ab010ae1ef	Merge pull request #4319 from lifubang/fix-execfifo-delete-error Try to delete exec fifo file when failure in creation	2024-06-26 11:53:34 -07:00
Kir Kolyshkin	42cea2ecb4	libct: don't allow to start second init process By definition, every container has only 1 init (i.e. PID 1) process. Apparently, libcontainer API supported running more than 1 init, and at least one test mistakenly used it. Let's not allow that, erroring out if we already have init. Doing otherwise _probably_ results in some confusion inside the library. Fix two cases in libct/int which ran two inits inside a container. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-06-10 22:30:03 -07:00
Kir Kolyshkin	e3e1072575	libct: fix locking in Start/Run/Exec 1. The code to call c.exec from c.Run was initially added by commit `3aacff695`. At the time, there was a lock in c.Run. That lock was removed by commit `bd3c4f84`, which resulted in part of c.Run executing without the lock. 2. All the Start/Run/Exec calls were a mere wrappers for start/run/exec adding a lock, but some more code crept into Start at some point, e.g. by commits `805b8c73` and `108ee85b8`. Since the reason mentioned in commit `805b8c73` is no longer true after refactoring, we can fix this. Fix both issues by moving code out of wrappers, and adding locking into c.Run. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-06-10 22:30:03 -07:00
Kir Kolyshkin	304a4c0fee	libct: createExecFifo: rm unneeded os.Stat In case file already exists, mknod(2) will return EEXIST. This os.Stat call was (inadvertently?) added by commit `805b8c73`. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-06-10 22:30:03 -07:00
lifubang	e7294527e2	try to delete exec fifo file when failure in creation Signed-off-by: lifubang <lifubang@acmcoder.com>	2024-06-10 19:51:48 +08:00
Kir Kolyshkin	b7fdd524cb	libct: use slices package As we're no longer supporting Go < 1.21. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-06-07 10:18:59 -07:00
Aleksa Sarai	37581ad340	dmz: remove SELinux special-casing Now that runc-dmz is opt-in, we no longer need to try to detect whether SELinux would cause issues for us. We can also remove the special-purpose build-tag we added. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2024-03-13 18:18:09 +11:00
Aleksa Sarai	ee73091a8d	libcontainer: mark all non-stdio fds O_CLOEXEC before spawning init Given the core issue in GHSA-xr7r-f8xq-vfvv was that we were unknowingly leaking file descriptors to "runc init", it seems prudent to make sure we proactively prevent this in the future. The solution is to simply mark all non-stdio file descriptors as O_CLOEXEC before we spawn "runc init". For libcontainer library users, this could result in unrelated files being marked as O_CLOEXEC -- however (for the same reason we are doing this for runc), for security reasons those files should've been marked as O_CLOEXEC anyway. Fixes: GHSA-xr7r-f8xq-vfvv CVE-2024-21626 Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2024-01-24 00:20:59 +11:00
lfbzhm	371ff9c5e7	Merge pull request #3985 from cyphar/idmap-generic libcontainer: remove all mount logic from nsexec	2023-12-18 13:10:45 +08:00
Aleksa Sarai	482e56379a	configs: make id mappings int64 to better handle 32-bit Using ints for all of our mapping structures means that a 32-bit binary errors out when trying to parse /proc/self/id_map: failed to cache mappings for userns: failed to parse uid_map of userns /proc/1/ns/user: parsing id map failed: invalid format in line " 0 0 4294967295": integer overflow on token 4294967295 This issue was unearthed by commit `1912d5988b` (": actually support joining a userns with a new container") but the underlying issue has been present since the docker/libcontainer days. In theory, switching to uint32 (to match the spec) instead of int64 would also work, but keeping everything signed seems much less error-prone. It's also important to note that a mapping might be too large for an int on 32-bit, so we detect this during the mapping. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-12-14 12:14:32 +11:00
Aleksa Sarai	8e8b136c49	tree-wide: use /proc/thread-self for thread-local state With the idmap work, we will have a tainted Go thread in our thread-group that has a different mount namespace to the other threads. It seems that (due to some bad luck) the Go scheduler tends to make this thread the thread-group leader in our tests, which results in very baffling failures where /proc/self/mountinfo produces gibberish results. In order to avoid this, switch to using /proc/thread-self for everything that is thread-local. This primarily includes switching all file descriptor paths (CLONE_FS), all of the places that check the current cgroup (technically we never will run a single runc thread in a separate cgroup, but better to be safe than sorry), and the aforementioned mountinfo code. We don't need to do anything for the following because the results we need aren't thread-local: * Checks that certain namespaces are supported by stat(2)ing /proc/self/ns/... * /proc/self/exe and /proc/self/cmdline are not thread-local. * While threads can be in different cgroups, we do not do this for the runc binary (or libcontainer) and thus we do not need to switch to the thread-local version of /proc/self/cgroups. * All of the CLONE_NEWUSER files are not thread-local because you cannot set the usernamespace of a single thread (setns(CLONE_NEWUSER) is blocked for multi-threaded programs). Note that we have to use runtime.LockOSThread when we have an open handle to a tid-specific procfs file that we are operating on multiple times. Go can reschedule us such that we are running on a different thread and then kill the original thread (causing -ENOENT or similarly confusing errors). This is not strictly necessary for most usages of /proc/thread-self (such as using /proc/thread-self/fd/$n directly) since only operating on the actual inodes associated with the tid requires this locking, but because of the pre-3.17 fallback for CentOS, we have to do this in most cases. 
In addition, CentOS's kernel is too old for /proc/thread-self, which requires us to emulate it -- however in rootfs_linux.go, we are in the container pid namespace but /proc is the host's procfs. This leads to the incredibly frustrating situation where there is no way (on pre-4.1 Linux) to figure out which /proc/self/task/... entry refers to the current tid. We can just use /proc/self in this case. Yes this is all pretty ugly. I also wish it wasn't necessary. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-12-14 11:36:41 +11:00
Aleksa Sarai	ba0b5e2698	libcontainer: remove all mount logic from nsexec With open_tree(OPEN_TREE_CLONE), it is possible to implement both the id-mapped mounts and bind-mount source file descriptor logic entirely in Go without requiring any complicated handling from nsexec. However, implementing it the naive way (do the OPEN_TREE_CLONE in the host namespace before the rootfs is set up -- which is what the existing implementation did) exposes issues in how mount ordering (in particular when handling mount sources from inside the container rootfs, but also in relation to mount propagation) was handled for idmapped mounts and bind-mount sources. In order to solve this problem completely, it is necessary to spawn a thread which joins the container mount namespace and provides mountfds when requested by the rootfs setup code (ensuring that the mount order and mount propagation of the source of the bind-mount are handled correctly). While the need to join the mount namespace leads to other complicated (such as with the usage of /proc/self -- fixed in a later patch) the resulting code is still reasonable and is the only real way to solve the issue. This allows us to reduce the amount of C code we have in nsexec, as well as simplifying a whole host of places that were made more complicated with the addition of id-mapped mounts and the bind sourcefd logic. Because we join the container namespace, we can continue to use regular O_PATH file descriptors for non-id-mapped bind-mount sources (which means we don't have to raise the kernel requirement for that case). In addition, we can easily add support for id-mappings that don't match the container's user namespace. The approach taken here is to use Go's officially supported mechanism for spawning a process in a user namespace, but (ab)use PTRACE_TRACEME to avoid actually having to exec a different process. 
The most efficient way to implement this would be to do clone() in cgo directly to run a function that just does kill(getpid(), SIGSTOP) -- we can always switch to that if it turns out this approach is too slow. It should be noted that the included micro-benchmark seems to indicate this is Fast Enough(TM): goos: linux goarch: amd64 pkg: github.com/opencontainers/runc/libcontainer/userns cpu: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz BenchmarkSpawnProc BenchmarkSpawnProc-8 1670 770065 ns/op Fixes: `fda12ab101` ("Support idmap mounts on volumes") Fixes: `9c444070ec` ("Open bind mount sources from the host userns") Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-12-14 11:36:40 +11:00
Kir Kolyshkin	7396ca90fa	runc delete: do not ignore error from destroy If container.Destroy() has failed, runc delete still returns 0, which is wrong and can result in other issues down the line. Let's always return error from destroy in runc delete. For runc checkpoint and runc run, we still treat it as a warning. Co-authored-by: Zhang Tianyang <burning9699@gmail.com> Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-11-27 09:15:39 -08:00
Kir Kolyshkin	dcf1b731f5	runc kill: fix sending KILL to non-pidns container Commit `f8ad20f` made it impossible to kill leftover processes in a stopped container that does not have its own PID namespace. In other words, if a container init is gone, it is no longer possible to use `runc kill` to kill the leftover processes. Fix this by moving the check if container init exists to after the special case of handling the container without own PID namespace. While at it, fix the minor issue introduced by commit `9583b3d`: if signalAllProcesses is used, there is no need to thaw the container (as freeze/thaw is either done in signalAllProcesses already, or not needed at all). Also, make signalAllProcesses return an error early if the container cgroup does not exist (as it relies on it to do its job). This way, the error message returned is more generic and easier to understand ("container not running" instead of "can't open file"). Finally, add a test case. Fixes: `f8ad20f` Fixes: `9583b3d` Co-authored-by: lifubang <lifubang@acmcoder.com> Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-11-27 09:15:39 -08:00
Kir Kolyshkin	542cce0122	libct: Signal: slight refactor Let's use c.hasInit and c.isPaused where needed instead of c.currentStatus for simplicity. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-11-27 09:15:39 -08:00
Kir Kolyshkin	d9f2a24a5b	libct: replace runType with hasInit The semantics of runType is slightly complicated, and the only place where we need to distinguish between Created and Running is refreshState. Replace runType with simpler hasInit, simplifying its users (except the refreshState, which now figures out on its own whether the container is Created or Running). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-11-27 09:15:39 -08:00
lfbzhm	95a93c132c	Merge pull request #4045 from fuweid/support-pidfd-socket [feature request] *: introduce pidfd-socket flag	2023-11-22 09:13:55 +08:00
Wei Fu	94505a046a	*: introduce pidfd-socket flag The container manager like containerd-shim can't use cgroup.kill feature or freeze all the processes in cgroup to terminate the exec init process. It's unsafe to call kill(2) since the pid can be recycled. It's good to provide the pidfd of init process through the pidfd-socket. It's similar to the console-socket. With the pidfd, the container manager like containerd-shim can send the signal to target process safely. And for the standard init process, we can have polling support to get exit event instead of blocking on wait4. Signed-off-by: Wei Fu <fuweid89@gmail.com>	2023-11-21 18:28:50 +08:00
Kir Kolyshkin	87bd784614	Add dmz-vs-selinux kludge and a way to disable it Add a workaround for a problem of older container-selinux not allowing runc to use dmz feature. If runc sees that SELinux is in enforced mode and the container's SELinux label is set, it disables dmz. Add a build tag, runc_dmz_selinux_nocompat, which disables the workaround. Newer distros that ship container-selinux >= 2.224.0 (currently CentOS Stream 8 and 9, RHEL 8 and 9, and Fedora 38+) may build runc with this build tag set to benefit from dmz working with SELinux. Document the build tag in the top-level and libct/dmz READMEs. Use the build tag in our CI builds for CentOS Stream 9 and Fedora 38, as they already have container-selinux 2.224.0 available in updates. Add a TODO to use the build tag for CentOS Stream 8 once it has container-selinux updated. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-10-30 16:55:41 -07:00
Kir Kolyshkin	efbebb39b5	libct: rename root to stateDir in struct Container The name "root" (or "containerRoot") is confusing; one might think it is the root of container's file system (the directory we chroot into). Rename to stateDir for clarity. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-10-04 14:57:10 +11:00
Kir Kolyshkin	c89faacc13	libc: rm _LIBCONTAINER_STATEDIR Its only user was recently removed. Fixes: `0e9a3358f8` Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-10-04 14:57:10 +11:00
Aleksa Sarai	9350f9013e	merge #4039 into opencontainers/runc:main Kir Kolyshkin (1): libct: use chmod instead of umask LGTMs: lifubang cyphar	2023-10-04 14:55:07 +11:00
Kir Kolyshkin	2e2ecf29ff	libct: use chmod instead of umask Umask is problematic for Go programs as it affects other goroutines (see [1] for more details). Instead of using it, let's just prop up with Chmod. Note this patch misses the MkdirAll call in createDeviceNode. Since the runtime spec does not say anything about creating intermediary directories for device nodes, let's assume that doing it via mkdir with the current umask set is sufficient (if not, we have to reimplement MkdirAll from scratch, with added call to os.Chmod). [1] https://github.com/opencontainers/runc/pull/3563#discussion_r990293788 Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-09-27 16:46:53 -07:00
Aleksa Sarai	8da42aaec2	sync: split init config (stream) and synchronisation (seqpacket) pipes We have different requirements for the initial configuration and initWaiter pipe (just send netlink and JSON blobs with no complicated handling needed for message coalescing) and the packet-based synchronisation pipe. Tests with switching everything to SOCK_SEQPACKET lead to endless issues with runc hanging on start-up because random things would try to do short reads (which SOCK_SEQPACKET will not allow and the Go stdlib explicitly treats as a streaming source), so splitting it was the only reasonable solution. Even doing somewhat dodgy tricks such as adding a Read() wrapper which actually calls ReadPacket() and makes it seem like a stream source doesn't work -- and is a bit too magical. One upside is that doing it this way makes the difference between the modes clearer -- INITPIPE is still used for initWaiter synchronisation but aside from that all other synchronisation is done by SYNCPIPE. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-09-24 20:31:14 +08:00
Aleksa Sarai	f8348f64ae	tests: integration: add runc-dmz smoke tests Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-09-22 15:38:21 +10:00
lifubang	dac4171746	runc-dmz: reduce memfd binary cloning cost with small C binary The idea is to remove the need for cloning the entire runc binary by replacing the final execve() call of the container process with an execve() call to a clone of a small C binary which just does an execve() of its arguments. This provides similar protection against CVE-2019-5736 but without requiring a >10MB binary copy for each "runc init". When compiled with musl, runc-dmz is 13kB (though unfortunately with glibc, it is 1.1MB which is still quite large). It should be noted that there is still a window where the container processes could get access to the host runc binary, but because we set ourselves as non-dumpable the container would need CAP_SYS_PTRACE (which is not enabled by default in Docker) in order to get around the proc_fd_access_allowed() checks. In addition, since Linux 4.10[1] the kernel blocks access entirely for user namespaced containers in this scenario. For those cases we cannot use runc-dmz, but most containers won't have this issue. This new runc-dmz binary can be opted out of at compile time by setting the "runc_nodmz" buildtag, and at runtime by setting the RUNC_DMZ=legacy environment variable. In both cases, runc will fall back to the classic /proc/self/exe-based cloning trick. If /proc/self/exe is already a sealed memfd (namely if the user is using contrib/cmd/memfd-bind to create a persistent sealed memfd for runc), neither runc-dmz nor /proc/self/exe cloning will be used because they are not necessary. 
[1]: `bfedb58925` Co-authored-by: lifubang <lifubang@acmcoder.com> Signed-off-by: lifubang <lifubang@acmcoder.com> [cyphar: address various review nits] [cyphar: fix runc-dmz cross-compilation] [cyphar: embed runc-dmz into runc binary and clone in Go code] [cyphar: make runc-dmz optional, with fallback to /proc/self/exe cloning] [cyphar: do not use runc-dmz when the container has certain privs] Co-authored-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-09-22 15:38:19 +10:00
Aleksa Sarai	0e9a3358f8	nsexec: migrate memfd /proc/self/exe logic to Go code This allows us to reduce the amount of C code in runc quite substantially, as well as removing a whole execve(2) from the nsexec path because we no longer spawn "runc init" only to re-exec "runc init" after doing the clone. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-09-22 15:13:18 +10:00
Kir Kolyshkin	6a4870e4ac	libct: better errors for hooks When a hook has failed, the error message looks like this: > error running hook: error running hook #1: exit status 1, stdout: ... The two problems here are: 1. it is impossible to know what kind of hook it was; 2. "error running hook" stuttering; Change that to > error running createContainer hook #1: exit status 1, stdout: ... Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-08-24 19:44:05 -07:00
Aleksa Sarai	9acfd7b1a3	timens: minor cleanups Fix up a few things that were flagged in the review of the original timens PR, namely around error handling and validation. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-08-10 18:59:55 +10:00
Aleksa Sarai	0866112e81	merge #3876 into opencontainers/runc:main Chethan Suresh (1): Support time namespace LGTMs: kolyskin cyphar Closes #3876	2023-08-10 18:27:17 +10:00
Kir Kolyshkin	cb44958162	Merge pull request #3385 from kolyshkin/init-logger-setup init simplification	2023-08-08 19:05:02 -07:00
Kir Kolyshkin	789a73db22	init.go: move logger setup to StartInitialization Currently, logrus is used from the Go part of runc init, mostly for a few debug messages (see setns_init_linux.go and standard_init_linux.go), and a single warning (see rootfs_linux.go). This means logrus is part of init implementation, and thus, its setup belongs to StartInitialization(). Move the code there. As a nice side effect, now we don't have to convert _LIBCONTAINER_LOGPIPE twice. Note that since this initialization is now also called from libct/int tests, which do not set _LIBCONTAINER_LOGLEVEL, let's make _LIBCONTAINER_LOGLEVEL optional. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-08-04 13:00:34 -07:00
Kir Kolyshkin	cb981e510b	libct: move criu-related stuff to separate file No code change, only added periods to some comments to make godot happy. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-08-03 10:16:01 -07:00
Chethan Suresh	ebc2e7c435	Support time namespace The "time" namespace was introduced in Linux v5.6. Support the new time namespace to set boottime and monotonic time offsets. Example runtime spec: "timeOffsets": { "monotonic": { "secs": 172800, "nanosecs": 0 }, "boottime": { "secs": 604800, "nanosecs": 0 } } Signed-off-by: Chethan Suresh <chethan.suresh@sony.com>	2023-08-03 10:12:01 +05:30
Francis Laniel	46ada59ba2	Use an int for srcFD Prior to this commit, we used a string for srcFD, such as /proc/self/fd/NN. This commit modifies this behavior so that srcFD is only an int, and the full path is constructed in mountViaFDs() if srcFD is not nil. Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com>	2023-07-21 13:55:34 +02:00
Francis Laniel	c47f58c4e9	Capitalize [UG]idMappings as [UG]IDMappings Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com>	2023-07-21 13:55:34 +02:00
Rodrigo Campos	fda12ab101	Support idmap mounts on volumes This commit adds support for idmap mounts as specified in the runtime-spec. We open the idmap source paths and call mount_setattr() in runc PARENT, as we need privileges in the init userns for that, and then send the fds to the child process. For this fd passing we use the same mechanism used in other parts of the code, the _LIBCONTAINER_ env vars. The mount is finished (unix.MoveMount) from go code, inside the userns, so we reuse all the prepareBindMount() security checks and the remount logic for some flags too. This commit only supports idmap mounts when userns are used AND the mappings are the same specified for the userns mapping. This limitation is to simplify the initial implementation, as all our users so far only need this, and we can avoid sending over netlink the mappings, creating a userns with this custom mapping, etc. Future PRs will remove this limitation. Co-authored-by: Francis Laniel <flaniel@linux.microsoft.com> Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>	2023-07-17 13:30:12 +02:00
Rodrigo Campos	f5814a1007	libcontainer: Add generic sendFdsSources() Let's move the code to send mount sources to a generic function. Future patches will use it for idmap sources too. Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>	2023-07-11 16:17:48 +02:00
Rodrigo Campos	fbf183c6f8	Add uid and gid mappings to mounts Co-authored-by: Francis Laniel <flaniel@linux.microsoft.com> Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>	2023-07-11 16:17:48 +02:00
Kir Kolyshkin	f8ad20f500	runc kill: drop -a option As of previous commit, this is implied in a particular scenario. In fact, this is the one and only scenario that justifies the use of -a. Drop the option from the documentation. For backward compatibility, do recognize it, and retain the feature of ignoring the "container is stopped" error when set. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-06-08 09:30:40 -07:00
Kir Kolyshkin	9583b3d1c2	libct: move killing logic to container.Signal By default, the container has its own PID namespace, and killing (with SIGKILL) its init process from the parent PID namespace also kills all the other processes. Obviously, it does not work that way when the container is sharing its PID namespace with the host or another container, since init is no longer special (it's not PID 1). In this case, killing container's init will result in a bunch of other processes left running (and thus the inability to remove the cgroup). The solution to the above problem is killing all the container processes, not just init. The problem with the current implementation is, the killing logic is implemented in libcontainer's initProcess.wait, and thus only available to libcontainer users, but not the runc kill command (which uses nonChildProcess.kill and does not use wait at all). So, some workarounds exist: - func destroy(c *Container) calls signalAllProcesses; - runc kill implements -a flag. This code became very tangled over time. Let's simplify things by moving the killing all processes from initProcess.wait to container.Signal, and documents the new behavior. In essence, this also makes `runc kill` to automatically kill all container processes when the container does not have its own PID namespace. Document that as well. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-06-08 09:29:25 -07:00
Kir Kolyshkin	2a7dcbbb40	libct: fix shared pidns detection When someone is using libcontainer to start and kill containers from a long lived process (i.e. the same process creates and removes the container), initProcess.wait method is used, which has a kludge to work around killing containers that do not have their own PID namespace. The code that checks for own PID namespace is not entirely correct. To be exact, it does not set sharePidns flag when the host/caller PID namespace is implicitly used. As a result, the above mentioned kludge does not work. Fix the issue, add a test case (which fails without the fix). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-06-08 09:23:29 -07:00
Kir Kolyshkin	5b8f8712a4	libct: signalAllProcesses: remove child reaping There are two very distinct usage scenarios for signalAllProcesses: * when used from the runc binary ("runc kill" command), the processes that it kills are not the children of "runc kill", and so calling wait(2) on each process is totally useless, as it will return ECHILD; * when used from a program that has created the container (such as libcontainer/integration test suite), that program can and should call wait(2), not the signalling code. So, the child reaping code is totally useless in the first case, and should be implemented by the program using libcontainer in the second case. I was not able to track down how this code was added, my best guess is it happened when this code was part of dockerd, which did not have a proper child reaper implemented at that time. Remove it, and add a proper documentation piece. Change the integration test accordingly. PS the first attempt to disable the child reaping code in signalAllProcesses was made in commit `bb912eb00c`, which used a questionable heuristic to figure out whether wait(2) should be called. This heuristic worked for a particular use case, but is not correct in general. While at it: - simplify signalAllProcesses to use unix.Kill; - document (container).Signal. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-06-08 09:23:29 -07:00
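
Commit `482e56379a` in the log above describes an integer overflow when a 32-bit binary parses an id map line such as " 0 0 4294967295". The following is a minimal sketch of that failure mode, not runc's actual parser: the `idMap` type and `parseIDMapLine` helper are hypothetical names introduced for illustration, and the `bitSize` argument stands in for the width of `int` on the build target.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// idMap is a hypothetical stand-in for one "container-id host-id size"
// line of a uid_map or gid_map file. It is not runc's own type.
type idMap struct {
	ContainerID int64
	HostID      int64
	Size        int64
}

// parseIDMapLine parses one id map line. bitSize selects the integer
// width used for parsing: 32 mimics a plain int on a 32-bit build,
// 64 mimics the int64 fields the commit message describes.
func parseIDMapLine(line string, bitSize int) (idMap, error) {
	fields := strings.Fields(line)
	if len(fields) != 3 {
		return idMap{}, fmt.Errorf("invalid format in line %q", line)
	}
	var vals [3]int64
	for i, f := range fields {
		v, err := strconv.ParseInt(f, 10, bitSize)
		if err != nil {
			return idMap{}, fmt.Errorf("parsing token %q: %w", f, err)
		}
		vals[i] = v
	}
	return idMap{ContainerID: vals[0], HostID: vals[1], Size: vals[2]}, nil
}

func main() {
	const line = "         0          0 4294967295"

	// 4294967295 does not fit in a signed 32-bit integer, so this fails.
	if _, err := parseIDMapLine(line, 32); err != nil {
		fmt.Println("32-bit parse failed:", err)
	}

	// With 64-bit parsing the full mapping size is representable.
	m, err := parseIDMapLine(line, 64)
	fmt.Println("64-bit parse:", m, err)
}
```

As the commit message notes, a value that parses as int64 may still be too large for an `int` on 32-bit, so a caller would have to range-check it before narrowing.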


377 Commits