zishuo/runc

mirror of https://github.com/opencontainers/runc.git synced 2025-10-05 07:27:03 +08:00

Author	SHA1	Message	Date
Kir Kolyshkin	6a4870e4ac	libct: better errors for hooks When a hook has failed, the error message looks like this: > error running hook: error running hook #1: exit status 1, stdout: ... The two problems here are: 1. it is impossible to know what kind of hook it was; 2. "error running hook" stuttering; Change that to > error running createContainer hook #1: exit status 1, stdout: ... Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-08-24 19:44:05 -07:00
Aleksa Sarai	9acfd7b1a3	timens: minor cleanups Fix up a few things that were flagged in the review of the original timens PR, namely around error handling and validation. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-08-10 18:59:55 +10:00
Aleksa Sarai	0866112e81	merge #3876 into opencontainers/runc:main Chethan Suresh (1): Support time namespace LGTMs: kolyshkin cyphar Closes #3876	2023-08-10 18:27:17 +10:00
Kir Kolyshkin	cb44958162	Merge pull request #3385 from kolyshkin/init-logger-setup init simplification	2023-08-08 19:05:02 -07:00
Kir Kolyshkin	789a73db22	init.go: move logger setup to StartInitialization Currently, logrus is used from the Go part of runc init, mostly for a few debug messages (see setns_init_linux.go and standard_init_linux.go), and a single warning (see rootfs_linux.go). This means logrus is part of init implementation, and thus, its setup belongs to StartInitialization(). Move the code there. As a nice side effect, now we don't have to convert _LIBCONTAINER_LOGPIPE twice. Note that since this initialization is now also called from libct/int tests, which do not set _LIBCONTAINER_LOGLEVEL, let's make _LIBCONTAINER_LOGLEVEL optional. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-08-04 13:00:34 -07:00
Kir Kolyshkin	cb981e510b	libct: move criu-related stuff to separate file No code change, only added periods to some comments to make godot happy. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-08-03 10:16:01 -07:00
Chethan Suresh	ebc2e7c435	Support time namespace The "time" namespace was introduced in Linux v5.6. Support the new time namespace to set the boottime and monotonic time offsets. Example runtime spec: "timeOffsets": { "monotonic": { "secs": 172800, "nanosecs": 0 }, "boottime": { "secs": 604800, "nanosecs": 0 } } Signed-off-by: Chethan Suresh <chethan.suresh@sony.com>	2023-08-03 10:12:01 +05:30
Francis Laniel	46ada59ba2	Use an int for srcFD Prior to this commit, srcFD was a string of the form /proc/self/fd/NN. This commit changes that behavior: srcFD is now only an int, and the full path is constructed in mountViaFDs() if srcFD is not nil. Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com>	2023-07-21 13:55:34 +02:00
Francis Laniel	c47f58c4e9	Capitalize [UG]idMappings as [UG]IDMappings Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com>	2023-07-21 13:55:34 +02:00
Rodrigo Campos	fda12ab101	Support idmap mounts on volumes This commit adds support for idmap mounts as specified in the runtime-spec. We open the idmap source paths and call mount_setattr() in runc PARENT, as we need privileges in the init userns for that, and then send the fds to the child process. For this fd passing we use the same mechanism used in other parts of the code, the _LIBCONTAINER_ env vars. The mount is finished (unix.MoveMount) from go code, inside the userns, so we reuse all the prepareBindMount() security checks and the remount logic for some flags too. This commit only supports idmap mounts when userns are used AND the mappings are the same specified for the userns mapping. This limitation is to simplify the initial implementation, as all our users so far only need this, and we can avoid sending over netlink the mappings, creating a userns with this custom mapping, etc. Future PRs will remove this limitation. Co-authored-by: Francis Laniel <flaniel@linux.microsoft.com> Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>	2023-07-17 13:30:12 +02:00
Rodrigo Campos	f5814a1007	libcontainer: Add generic sendFdsSources() Let's move the code to send mount sources to a generic function. Future patches will use it for idmap sources too. Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>	2023-07-11 16:17:48 +02:00
Rodrigo Campos	fbf183c6f8	Add uid and gid mappings to mounts Co-authored-by: Francis Laniel <flaniel@linux.microsoft.com> Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>	2023-07-11 16:17:48 +02:00
Kir Kolyshkin	f8ad20f500	runc kill: drop -a option As of previous commit, this is implied in a particular scenario. In fact, this is the one and only scenario that justifies the use of -a. Drop the option from the documentation. For backward compatibility, do recognize it, and retain the feature of ignoring the "container is stopped" error when set. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-06-08 09:30:40 -07:00
Kir Kolyshkin	9583b3d1c2	libct: move killing logic to container.Signal By default, the container has its own PID namespace, and killing (with SIGKILL) its init process from the parent PID namespace also kills all the other processes. Obviously, it does not work that way when the container is sharing its PID namespace with the host or another container, since init is no longer special (it's not PID 1). In this case, killing container's init will result in a bunch of other processes left running (and thus the inability to remove the cgroup). The solution to the above problem is killing all the container processes, not just init. The problem with the current implementation is that the killing logic is implemented in libcontainer's initProcess.wait, and thus only available to libcontainer users, but not the runc kill command (which uses nonChildProcess.kill and does not use wait at all). So, some workarounds exist: - func destroy(c *Container) calls signalAllProcesses; - runc kill implements -a flag. This code became very tangled over time. Let's simplify things by moving the killing of all processes from initProcess.wait to container.Signal, and document the new behavior. In essence, this also makes `runc kill` automatically kill all container processes when the container does not have its own PID namespace. Document that as well. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-06-08 09:29:25 -07:00
Kir Kolyshkin	2a7dcbbb40	libct: fix shared pidns detection When someone is using libcontainer to start and kill containers from a long lived process (i.e. the same process creates and removes the container), initProcess.wait method is used, which has a kludge to work around killing containers that do not have their own PID namespace. The code that checks for own PID namespace is not entirely correct. To be exact, it does not set sharePidns flag when the host/caller PID namespace is implicitly used. As a result, the above mentioned kludge does not work. Fix the issue, add a test case (which fails without the fix). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-06-08 09:23:29 -07:00
Kir Kolyshkin	5b8f8712a4	libct: signalAllProcesses: remove child reaping There are two very distinct usage scenarios for signalAllProcesses: * when used from the runc binary ("runc kill" command), the processes that it kills are not the children of "runc kill", and so calling wait(2) on each process is totally useless, as it will return ECHILD; * when used from a program that has created the container (such as libcontainer/integration test suite), that program can and should call wait(2), not the signalling code. So, the child reaping code is totally useless in the first case, and should be implemented by the program using libcontainer in the second case. I was not able to track down how this code was added, my best guess is it happened when this code was part of dockerd, which did not have a proper child reaper implemented at that time. Remove it, and add a proper documentation piece. Change the integration test accordingly. PS the first attempt to disable the child reaping code in signalAllProcesses was made in commit `bb912eb00c`, which used a questionable heuristic to figure out whether wait(2) should be called. This heuristic worked for a particular use case, but is not correct in general. While at it: - simplify signalAllProcesses to use unix.Kill; - document (container).Signal. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-06-08 09:23:29 -07:00
Kir Kolyshkin	a60933bb24	libct/rootfs: introduce and use mountEntry Adding fd field to mountConfig was not a good thing since mountConfig contains data that is not specific to a particular mount, while fd is a mount entry attribute. Introduce mountEntry structure, which embeds configs.Mount and adds srcFd to replace the removed mountConfig.fd. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-05-02 18:54:38 -07:00
Kir Kolyshkin	976748e8d6	libct: add mountViaFDs, simplify mount 1. Simplify mount call by removing the procfd argument, and use the new mount() where procfd is not used. Now, the mount() arguments are the same as for unix.Mount. 2. Introduce a new mountViaFDs function, which is similar to the old mount(), except it can take procfd for both source and target. The new arguments are called srcFD and dstFD. 3. Modify the mount error to show both srcFD and dstFD so it's clear which one is used for which purpose. This fixes the issue of having somewhat cryptic errors like this: > mount /proc/self/fd/11:/sys/fs/cgroup/systemd (via /proc/self/fd/12), flags: 0x20502f: operation not permitted (in which fd 11 is actually the source, and fd 12 is the target). After this change, it looks like > mount src=/proc/self/fd/11, dst=/sys/fs/cgroup/systemd, dstFD=/proc/self/fd/12, flags=0x20502f: operation not permitted so it's clear that 12 is a destination fd. 4. Fix the mountViaFDs callers to use dstFD (rather than procfd) for the variable name. 5. Use srcFD where mountFd is set. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-05-02 18:41:09 -07:00
Kir Kolyshkin	fe278b9caa	libct: fix a race with systemd removal For a previous attempt to fix that (and added test cases), see commit `9087f2e827`. Alas, it's not always working because of cgroup directory TOCTOU. To solve this and avoid the race, add an error _after_ the operation. Implement it as a method that ignores the error that should be ignored. Instead of currentStatus(), use faster runType(), since we are not interested in Paused status here. For Processes(), remove the pre-op check, and only use it after getting an error, making the non-error path more straightforward. For Signal(), add a second check after getting an error. The first check is left as is because signalAllProcesses might print a warning if the cgroup does not exist, and we'd like to avoid that. This should fix an occasional failure like this one: not ok 84 kill detached busybox # (in test file tests/integration/kill.bats, line 27) # `[ "$status" -eq 0 ]' failed .... # runc kill test_busybox KILL (status=0): # runc kill -a test_busybox 0 (status=1): # time="2023-04-04T18:24:27Z" level=error msg="lstat /sys/fs/cgroup/devices/system.slice/runc-test_busybox.scope: no such file or directory" Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-04-20 17:50:23 -07:00
Kir Kolyshkin	e8cf8783d1	libct/criuApplyCgroups: add a TODO I don't want to implement it now, because this might result in some new issues, but this is definitely something that is worth implementing. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-12-15 15:37:42 -08:00
Kir Kolyshkin	3438ef30b2	restore: fix --manage-cgroups-mode ignore on cgroup v2 When manage-cgroups-mode: ignore is used, criu still needs to know the cgroup path to work properly (see [1]). Revert "libct/criuApplyCgroups: don't set cgroup paths for v2" This reverts commit `d5c57dcea6`. [1]: https://github.com/checkpoint-restore/criu/issues/1793#issuecomment-1086675168 Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-12-15 15:37:42 -08:00
Radostin Stoyanov	fbce47a6b6	deps: bump github.com/checkpoint-restore/go-criu to 6.3.0 Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>	2022-11-01 10:08:14 +00:00
Prajwal S N	746f45807d	deps: bump go-criu to v6 The v6.0.0 release of go-criu has deprecated the `rpc` package in favour of the `crit` package. This commit provides the changes required to use this version in runc. Signed-off-by: Prajwal S N <prajwalnadig21@gmail.com>	2022-09-06 11:55:17 +05:30
Kir Kolyshkin	102b8abd26	libct: rm BaseContainer and Container interfaces The only implementation of these is linuxContainer. It does not make sense to have an interface with a single implementation, and we do not foresee other types of containers being added to runc. Remove BaseContainer and Container interfaces, moving their methods documentation to linuxContainer. Rename linuxContainer to Container. Adopt users from using interface to using struct. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-03-23 11:04:12 -07:00
Kir Kolyshkin	7cec81e060	libct: suppress strings.Title deprecation warning Function strings.Title is deprecated as of Go 1.18, because it does not handle some corner cases well enough. In this case, though, it is perfectly fine to use it since we have a single ASCII word as an argument, and strings.Title won't be removed until at least Go 2.0. Suppress the deprecation warning. The alternative is to not capitalize the namespace string; this would break restoring a container checkpointed by an earlier version of runc. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-03-22 12:22:10 -07:00
lifubang	01f00e1fd5	ensure the path is a sub-cgroup path Signed-off-by: lifubang <lifubang@acmcoder.com>	2022-02-19 09:45:09 +08:00
Sebastiaan van Stijn	949111237a	Merge pull request #3303 from kolyshkin/labels libcontainer: optimize utils.SearchLabels	2022-02-16 16:37:00 +01:00
Kir Kolyshkin	dbd990d555	libct: rm intelrdt.Manager interface, NewIntelRdtManager Remove intelrdt.Manager interface, since we only have a single implementation, and do not expect another one. Rename intelRdtManager to Manager, and modify its users accordingly. Remove NewIntelRdtManager from factory. Remove IntelRdtfs. Instead, make intelrdt.NewManager return nil if the feature is not available. Remove TestFactoryNewIntelRdt as it is now identical to TestFactoryNew. Add internal function newManager to be used for tests (to make sure some testing is done even when the feature is not available in kernel/hardware). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-02-03 17:33:03 -08:00
Kir Kolyshkin	9258eac072	libct/start: use execabs for newuidmap lookup Since we are looking up the path to newuidmap/newgidmap in one context, and executing those in another (libct/nsenter), it might make sense to use stricter rules for looking up the path to those binaries. Practically it means that if someone wants to use custom newuidmap and newgidmap binaries from $PATH, it would be impossible to use these from the current directory by means of PATH=.:$PATH; instead one would have to do something like PATH=$(pwd):$PATH. See https://go.dev/blog/path-security for background. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-02-03 17:33:00 -08:00
Kir Kolyshkin	39bd7b7217	libct: Container, Factory: rm newuidmap/newgidmap These were introduced in commit `d8b669400` back in 2017, with a TODO of "make binary names configurable". Apparently, everyone is happy with the hardcoded names. In fact, they are configurable (by prepending the PATH with a directory containing own version of newuidmap/newgidmap). Now, these binaries are only needed in a few specific cases (when rootless is set etc.), so let's look them up only when needed. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-02-03 11:40:29 -08:00
Kir Kolyshkin	630c0d7e8c	libct: Container, Factory: rm InitPath, InitArgs Those are always /proc/self/exe init, and it does not make sense to ever change these. More to say, if InitArgs option func (removed by this commit) is used to change these parameters, it will break things, since "init" is hardcoded elsewhere. Remove this. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-02-03 11:40:29 -08:00
Sebastiaan van Stijn	e4e2a9dda4	Merge pull request #3360 from danishprakash/remove-pausing libcontainer: remove "pausing" state	2022-02-01 23:40:31 +01:00
Akihiro Suda	e9190d3ae1	Merge pull request #3353 from kolyshkin/rm-criu-opt runc: remove --criu option	2022-02-01 08:28:14 +09:00
danishprakash	7346dda332	libcontainer: remove "pausing" state Signed-off-by: danishprakash <grafitykoncept@gmail.com>	2022-01-29 14:27:11 +05:30
Kir Kolyshkin	6e1d476aad	runc: remove --criu option This was introduced in an initial commit, back in the day when criu was a highly experimental thing. Today it's not; most users who need it have it packaged by their distro vendor. The usual way to run a binary is to look it up in directories listed in $PATH. This is flexible enough and allows for multiple scenarios (custom binaries, extra binaries, etc.). This is the way criu should be run. Make --criu a hidden option (thus removing it from help). Remove the option from man pages, integration tests, etc. Remove all traces of CriuPath from data structures. Add a warning that --criu is ignored and will be removed. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-01-26 20:25:56 -08:00
Kir Kolyshkin	bb6a838876	libct: initContainer: rename Id -> ID Since the next commit is going to touch this structure, our CI (lint-extra) is about to complain about improperly named field: > Warning: var-naming: struct field ContainerId should be ContainerID (revive) Make it happy. Brought to use by gopls rename. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-01-26 18:59:47 -08:00
Kir Kolyshkin	dffb8db7e1	libct: handleCriuConfigurationFile: use utils.SearchLabels The utils.Annotations was used here before only because it made it possible to distinguish between "key not found" and "empty value" cases. With the previous commit, utils.SearchLabels can do that, and so it makes sense to use it. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-01-26 14:01:11 -08:00
Aleksa Sarai	d72d057ba7	runc init: avoid netlink message length overflows When writing netlink messages, it is possible to have a byte array larger than UINT16_MAX which would result in the length field overflowing and allowing user-controlled data to be parsed as control characters (such as creating custom mount points, changing which set of namespaces to allow, and so on). Co-authored-by: Kir Kolyshkin <kolyshkin@gmail.com> Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2021-12-03 16:07:40 +11:00
Aleksa Sarai	dde509df4e	specconv: do not permit null bytes in mount fields Using null bytes as control characters for sending strings via netlink opens us up to a user explicitly putting a null byte in a mount string (which JSON will happily let you do) and then causing us to open a mount path different to the one expected. In practice this is more of an issue in an environment such as Kubernetes where you may have path-based access control policies (which are more susceptible to these kinds of flaws). Found by Google Project Zero. Fixes: `9c444070ec` ("Open bind mount sources from the host userns") Reported-by: Felix Wilhelm <fwilhelm@google.com> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2021-11-19 11:41:05 +11:00
Akihiro Suda	4d17654479	Merge pull request #2576 from kinvolk/alban/userns-2484-take2 Open bind mount sources from the host userns	2021-10-28 14:50:33 +09:00
Kir Kolyshkin	5516294172	Remove io/ioutil use See https://golang.org/doc/go1.16#ioutil Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-10-14 13:46:02 -07:00
Kir Kolyshkin	6a4f4a6a37	libcontainer/ignoreTerminateErrors: simplify for Go 1.16+ One less TODO in the code, yay! Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-10-14 13:46:02 -07:00
Alban Crequy	9c444070ec	Open bind mount sources from the host userns The source of the bind mount might not be accessible in a different user namespace because a component of the source path might not be traversable under the users and groups mapped inside the user namespace. This caused errors such as the following: # time="2020-06-22T13:48:26Z" level=error msg="container_linux.go:367: starting container process caused: process_linux.go:459: container init caused: rootfs_linux.go:58: mounting \"/tmp/busyboxtest/source-inaccessible/dir\" to rootfs at \"/tmp/inaccessible\" caused: stat /tmp/busyboxtest/source-inaccessible/dir: permission denied" To solve this problem, this patch performs the following: 1. in nsexec.c, it opens the source path in the host userns (so we have the right permissions to open it) but in the container mntns (so the kernel cross mntns mount check lets us mount it later: https://github.com/torvalds/linux/blob/v5.8/fs/namespace.c#L2312). 2. in nsexec.c, it passes the file descriptors of the source to the child process with SCM_RIGHTS. 3. In runc-init in Golang, it finishes the mounts while inside the userns even without access to some components of the source paths. Passing the fds with SCM_RIGHTS is necessary because once the child process is in the container mntns, it is already in the container userns so it cannot temporarily join the host mntns. This patch uses the existing mechanism with _LIBCONTAINER_* environment variables to pass the file descriptors from runc to runc init. This patch uses the existing mechanism with the Netlink-style bootstrap to pass information about the list of source mounts to nsexec.c. Rootless containers don't use this bind mount sources fdpassing mechanism because we can't setns() to the target mntns in a rootless container (we don't have the privileges when we are in the host userns). This patch takes care of using O_CLOEXEC on mount fds, and closes them early. Fixes: #2484. Signed-off-by: Alban Crequy <alban@kinvolk.io> Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io> Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>	2021-10-12 15:13:45 +02:00
Kir Kolyshkin	0202c398ff	runc exec: implement --cgroup In some setups, multiple cgroups are used inside a container, and sometimes there is a need to execute a process in a particular sub-cgroup (in case of cgroup v1, for a particular controller). This is what this commit implements. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-09-27 10:25:42 -07:00
Kir Kolyshkin	03244ef2cf	Merge pull request #3217 from kolyshkin/delete-paused runc delete -f: fix for cg v1 + paused container	2021-09-20 10:51:40 -07:00
Adrian Reber	43b36dc4ac	Support changing of lsm mount context on restore Wire through CRIU's support to change the mount context on restore. This is especially useful if restoring a container in a different pod. Single container restore uses the same SELinux process label and same mount context as during checkpointing. If a container is being restored into an existing pod, the process label and the mount context need to be changed to the context of the pod. Changing process label on restore is already supported by runc. This patch adds the possibility to change the mount context. Signed-off-by: Adrian Reber <areber@redhat.com>	2021-09-20 10:01:16 +02:00
Kir Kolyshkin	6806b2c1c4	runc delete -f: fix for cg v1 + paused container runc delete -f is not working for a paused container, since in cgroup v1 SIGKILL does nothing if a process is frozen (unlike cgroup v2, in which you can kill a frozen process with a fatal signal). Theoretically, we only need this for v1, but doing it for v2 as well is OK. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-09-15 14:55:14 -07:00
Aleksa Sarai	8bf032602a	merge branch 'pr-3047' Liu Hua (1): checkpoint: resolve symlink for external bind mount(fix ci broken) LGTMs: kolyshkin cyphar	2021-09-09 14:24:26 +10:00
Akihiro Suda	bd75bc2dc6	Merge pull request #3176 from kolyshkin/rm-config-error-alt libct/error.go: rm ConfigError (alt)	2021-09-02 14:34:32 +09:00
Kir Kolyshkin	9ff64c3d97	*: rm redundant linux build tag For files that end with _linux.go or _linux_test.go, there is no need to specify linux build tag, as it is assumed from the file name. In addition, rename libcontainer/notify_linux_v2.go -> libcontainer/notify_v2_linux.go for the file name to make sense. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-08-30 20:15:00 -07:00
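
The netlink length overflow described in commit d72d057ba7 above can be illustrated with a minimal Go sketch. This is not runc's actual code: `encodeAttr`, `attrHeaderLen`, and `errTooLong` are hypothetical names, little-endian byte order is chosen only for the illustration, and the 4-byte padding that real netlink attributes use is omitted. The sketch only shows the general idea of refusing a payload whose total length would not fit in a 16-bit length field, rather than letting the length silently wrap around.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"math"
)

// attrHeaderLen is the size of a netlink-style attribute header:
// a 16-bit length followed by a 16-bit type.
const attrHeaderLen = 4

// errTooLong is a hypothetical error used only in this sketch.
var errTooLong = errors.New("attribute does not fit in a 16-bit length field")

// encodeAttr is a hypothetical helper, not part of runc. It serializes one
// attribute as [len uint16][type uint16][payload...] and rejects any payload
// whose total length (header included) exceeds math.MaxUint16. Without this
// check, converting the length to uint16 would wrap around, and a parser
// would treat the tail of the payload as further attributes.
func encodeAttr(typ uint16, payload []byte) ([]byte, error) {
	total := attrHeaderLen + len(payload)
	if total > math.MaxUint16 {
		return nil, errTooLong
	}
	buf := make([]byte, attrHeaderLen, total)
	binary.LittleEndian.PutUint16(buf[0:2], uint16(total))
	binary.LittleEndian.PutUint16(buf[2:4], typ)
	return append(buf, payload...), nil
}

func main() {
	// A small payload encodes normally.
	out, err := encodeAttr(1, []byte("rootfs"))
	fmt.Println(len(out), err)

	// An oversized payload is rejected instead of overflowing the length.
	_, err = encodeAttr(1, make([]byte, math.MaxUint16))
	fmt.Println(err)
}
```

Returning an error at encoding time keeps the check next to the narrowing conversion, so no caller can produce a message whose length field disagrees with its contents.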

343 Commits