zishuo/runc

mirror of https://github.com/opencontainers/runc.git synced 2025-10-06 16:07:09 +08:00

Author	SHA1	Message	Date
Kir Kolyshkin	a56f85f87b	libct/*: switch from configs to cgroups Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-12-11 19:08:40 -08:00
Sebastiaan van Stijn	30b530ca94	libct/userns: split userns detection from internal userns code Commit `4316df8b53` isolated RunningInUserNS to a separate package to make it easier to consume without bringing in additional dependencies, and with the potential to move it separate in a similar fashion as libcontainer/user was moved to a separate module in commit `ca32014adb`. While RunningInUserNS is fairly trivial to implement, it (or variants of this utility) is used in many codebases, and moving to a separate module could consolidate those implementations, as well as making it easier to consume without large dependency trees (when being a package as part of a larger code base). Commit `1912d5988b` and follow-ups introduced cgo code into the userns package, and code introduced in those commits are not intended for external use, therefore complicating the potential of moving the userns package separate. This commit moves the new code to a separate package; some of this code was included in v1.1.11 and up, but I could not find external consumers of `GetUserNamespaceMappings` and `IsSameMapping`. The `Mapping` and `Handles` types (added in `ba0b5e2698`) only exist in main and in non-stable releases (v1.2.0-rc.x), so don't need an alias / deprecation. Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2024-06-30 20:06:30 +02:00
lifubang	4ea0bf88fd	update/add some tests for rlimit issues: https://github.com/opencontainers/runc/issues/4195 https://github.com/opencontainers/runc/pull/4265#discussion_r1588599809 Signed-off-by: lifubang <lifubang@acmcoder.com>	2024-05-08 10:57:10 +00:00
Aleksa Sarai	8e8b136c49	tree-wide: use /proc/thread-self for thread-local state With the idmap work, we will have a tainted Go thread in our thread-group that has a different mount namespace to the other threads. It seems that (due to some bad luck) the Go scheduler tends to make this thread the thread-group leader in our tests, which results in very baffling failures where /proc/self/mountinfo produces gibberish results. In order to avoid this, switch to using /proc/thread-self for everything that is thread-local. This primarily includes switching all file descriptor paths (CLONE_FS), all of the places that check the current cgroup (technically we never will run a single runc thread in a separate cgroup, but better to be safe than sorry), and the aforementioned mountinfo code. We don't need to do anything for the following because the results we need aren't thread-local: * Checks that certain namespaces are supported by stat(2)ing /proc/self/ns/... * /proc/self/exe and /proc/self/cmdline are not thread-local. * While threads can be in different cgroups, we do not do this for the runc binary (or libcontainer) and thus we do not need to switch to the thread-local version of /proc/self/cgroups. * All of the CLONE_NEWUSER files are not thread-local because you cannot set the usernamespace of a single thread (setns(CLONE_NEWUSER) is blocked for multi-threaded programs). Note that we have to use runtime.LockOSThread when we have an open handle to a tid-specific procfs file that we are operating on multiple times. Go can reschedule us such that we are running on a different thread and then kill the original thread (causing -ENOENT or similarly confusing errors). This is not strictly necessary for most usages of /proc/thread-self (such as using /proc/thread-self/fd/$n directly) since only operating on the actual inodes associated with the tid requires this locking, but because of the pre-3.17 fallback for CentOS, we have to do this in most cases. 
In addition, CentOS's kernel is too old for /proc/thread-self, which requires us to emulate it -- however in rootfs_linux.go, we are in the container pid namespace but /proc is the host's procfs. This leads to the incredibly frustrating situation where there is no way (on pre-4.1 Linux) to figure out which /proc/self/task/... entry refers to the current tid. We can just use /proc/self in this case. Yes this is all pretty ugly. I also wish it wasn't necessary. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-12-14 11:36:41 +11:00
Aleksa Sarai	09822c3da8	configs: disallow ambiguous userns and timens configurations For userns and timens, the mappings (and offsets, respectively) cannot be changed after the namespace is first configured. Thus, configuring a container with a namespace path to join means that you cannot also provide configuration for said namespace. Previously we would silently ignore the configuration (and just join the provided path), but we really should be returning an error (especially when you consider that the configuration userns mappings are used quite a bit in runc with the assumption that they are the correct mapping for the userns -- but in this case they are not). In the case of userns, the mappings are also required if you _do not_ specify a path, while in the case of the time namespace you can have a container with a timens but no mappings specified. It should be noted that the case checking that the user has not specified a userns path and a userns mapping needs to be handled in specconv (as opposed to the configuration validator) because with this patchset we now cache the mappings of path-based userns configurations and thus the validator can't be sure whether the mapping is a cached mapping or a user-specified one. So we do the validation in specconv, and thus the test for this needs to be an integration test. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-12-05 17:46:09 +11:00
Francis Laniel	c47f58c4e9	Capitalize [UG]idMappings as [UG]IDMappings Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com>	2023-07-21 13:55:34 +02:00
Kir Kolyshkin	f8ad20f500	runc kill: drop -a option As of previous commit, this is implied in a particular scenario. In fact, this is the one and only scenario that justifies the use of -a. Drop the option from the documentation. For backward compatibility, do recognize it, and retain the feature of ignoring the "container is stopped" error when set. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-06-08 09:30:40 -07:00
Kir Kolyshkin	9583b3d1c2	libct: move killing logic to container.Signal By default, the container has its own PID namespace, and killing (with SIGKILL) its init process from the parent PID namespace also kills all the other processes. Obviously, it does not work that way when the container is sharing its PID namespace with the host or another container, since init is no longer special (it's not PID 1). In this case, killing container's init will result in a bunch of other processes left running (and thus the inability to remove the cgroup). The solution to the above problem is killing all the container processes, not just init. The problem with the current implementation is, the killing logic is implemented in libcontainer's initProcess.wait, and thus only available to libcontainer users, but not the runc kill command (which uses nonChildProcess.kill and does not use wait at all). So, some workarounds exist: - func destroy(c *Container) calls signalAllProcesses; - runc kill implements -a flag. This code became very tangled over time. Let's simplify things by moving the killing of all processes from initProcess.wait to container.Signal, and document the new behavior. In essence, this also makes `runc kill` automatically kill all container processes when the container does not have its own PID namespace. Document that as well. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-06-08 09:29:25 -07:00
Kir Kolyshkin	2a7dcbbb40	libct: fix shared pidns detection When someone is using libcontainer to start and kill containers from a long lived process (i.e. the same process creates and removes the container), initProcess.wait method is used, which has a kludge to work around killing containers that do not have their own PID namespace. The code that checks for own PID namespace is not entirely correct. To be exact, it does not set sharePidns flag when the host/caller PID namespace is implicitly used. As a result, the above mentioned kludge does not work. Fix the issue, add a test case (which fails without the fix). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-06-08 09:23:29 -07:00
Kir Kolyshkin	5b8f8712a4	libct: signalAllProcesses: remove child reaping There are two very distinct usage scenarios for signalAllProcesses: * when used from the runc binary ("runc kill" command), the processes that it kills are not the children of "runc kill", and so calling wait(2) on each process is totally useless, as it will return ECHILD; * when used from a program that has created the container (such as libcontainer/integration test suite), that program can and should call wait(2), not the signalling code. So, the child reaping code is totally useless in the first case, and should be implemented by the program using libcontainer in the second case. I was not able to track down how this code was added, my best guess is it happened when this code was part of dockerd, which did not have a proper child reaper implemented at that time. Remove it, and add a proper documentation piece. Change the integration test accordingly. PS the first attempt to disable the child reaping code in signalAllProcesses was made in commit `bb912eb00c`, which used a questionable heuristic to figure out whether wait(2) should be called. This heuristic worked for a particular use case, but is not correct in general. While at it: - simplify signalAllProcesses to use unix.Kill; - document (container).Signal. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-06-08 09:23:29 -07:00
Kir Kolyshkin	f2e71b085d	libct/int: make TestFdLeaks more robust The purpose of this test is to check that there are no extra file descriptors left open after repeated calls to runContainer. In fact, the first call to runContainer leaves a few file descriptors opened, and this is by design. Previously, this test relied on two things: 1. some other tests were run before it (and thus all such opened-once file descriptors are already opened); 2. explicitly excluding fd opened to /sys/fs/cgroup. Now, if we run this test separately, it will fail (because of 1 above). The same may happen if the tests are run in a random order. To fix this, add a container run before collecting the initial fd list, so those fds that are opened once are included and won't be reported. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-02-22 02:58:47 -08:00
Kir Kolyshkin	be7e03940f	libct/int: wording nits Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-02-22 02:58:47 -08:00
Kir Kolyshkin	7c75e84e22	libc/int: add/use runContainerOk wrapper This is to de-duplicate the code that checks that err is nil and that the exit code is zero. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-02-22 02:58:47 -08:00
Kir Kolyshkin	98fe566c52	runc: do not set inheritable capabilities Do not set inheritable capabilities in runc spec, runc exec --cap, and in libcontainer integration tests. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-05-12 08:14:50 +10:00
Kir Kolyshkin	0fec1c2d8c	libct: Mount: rm {Pre,Post}mountCmds Those were added by commit `59c5c3ac0` back in Apr 2015, but AFAICS were never used and are obsoleted by more generic container hooks (initially added by commit `05567f2c94` in Sep 2015). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-01-26 15:51:55 -08:00
Kir Kolyshkin	953e56c56f	libct/int: runContainer: drop console arg It is not and was never ever used. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-11-29 20:10:22 -08:00
Kir Kolyshkin	972aea3af0	libct/configs/validate: allow / in sysctl names Runtime spec says: > sysctl (object, OPTIONAL) allows kernel parameters to be modified at > runtime for the container. For more information, see the sysctl(8) > man page. and sysctl(8) says: > variable > The name of a key to read from. An example is > kernel.ostype. The '/' separator is also accepted in place of a '.'. Apparently, runc config validator does not support sysctls with / as a separator. Fortunately this is a one-line fix. Add some more test data where / is used as a separator. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-10-29 09:45:55 -07:00
Akihiro Suda	95f8ecdd53	fix `libcontainer/integration/exec_test.go:1859:8: undefined: ioutil` Fix `4d17654479` Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>	2021-10-28 14:56:03 +09:00
Akihiro Suda	4d17654479	Merge pull request #2576 from kinvolk/alban/userns-2484-take2 Open bind mount sources from the host userns	2021-10-28 14:50:33 +09:00
Mauricio Vásquez	8542322dfe	libcontainer: Add unit tests with userns and mounts Add a unit test to check that bind mounts that have a part of their path not accessible by others still work when using user namespaces. To do this, we also modify newRoot() to return rootfs directories that can be traversed by others, so the rootfs created works for all tests (either running in a userns or not). Signed-off-by: Mauricio Vásquez <mauricio@kinvolk.io> Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io> Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>	2021-10-16 17:29:33 +02:00
Kir Kolyshkin	5516294172	Remove io/ioutil use See https://golang.org/doc/go1.16#ioutil Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-10-14 13:46:02 -07:00
Kir Kolyshkin	3bc606e9d3	libct/int: adapt to Go 1.15 1. Use t.TempDir instead of ioutil.TempDir. This means no need for an explicit cleanup, which removes some code, including newTestBundle and newTestRoot. 2. Move newRootfs invocation down to newTemplateConfig, removing a need for explicit rootfs creation. Also, remove rootfs from tParam as it is no longer needed (there was a single test case in which two containers shared the same rootfs, but it does not look like it's required for the test). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-07-27 01:41:47 -07:00
Kir Kolyshkin	5dc3260431	libct/int/TestFreeze: test freeze/thaw via Set In addition to freezing and thawing a container via Pause/Resume, there is a way to also do so via Set. This way was broken though and is being fixed by a few preceding commits. The test is added to make sure this is fixed and won't regress. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-07-14 23:42:35 -07:00
Kir Kolyshkin	e969d42156	libct/int/testPids: logging nits Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-06-02 17:31:03 -07:00
Kir Kolyshkin	e6048715e4	Use gofumpt to format code gofumpt (mvdan.cc/gofumpt) is a fork of gofmt with stricter rules. Brought to you by git ls-files \*.go \| grep -v ^vendor/ \| xargs gofumpt -s -w Looking at the diff, all these changes make sense. Also, replace gofmt with gofumpt in golangci.yml. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-06-01 12:17:27 -07:00
Aleksa Sarai	ed4781029f	merge branch 'pr-2781' Sebastiaan van Stijn (7): errcheck: utils errcheck: signals errcheck: tty errcheck: libcontainer errcheck: libcontainer/nsenter errcheck: libcontainer/configs errcheck: libcontainer/integration LGTM: AkihiroSuda cyphar Closes #2781	2021-05-25 12:31:52 +10:00
Aleksa Sarai	54904516e6	libcontainer: fix integration failure in "make test" When running inside a Docker container, systemd is not available. The new TestFdLeaksSystemd forgot to include the relevant t.Skip section. Fixes: `a7feb42395` ("libct/int: add TestFdLeaksSystemd") Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2021-05-23 17:55:09 +10:00
Aleksa Sarai	c7c70ce810	*: clean t.Skip messages Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2021-05-23 17:53:01 +10:00
Sebastiaan van Stijn	a899505377	errcheck: libcontainer/integration Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2021-05-20 14:17:40 +02:00
Kir Kolyshkin	a7feb42395	libct/int: add TestFdLeaksSystemd Add a test to check that container.Run does not leak file descriptors. Before the previous commit, it fails like this: exec_test.go:2030: extra fd 8 -> socket:[659703] exec_test.go:2030: extra fd 11 -> socket:[658715] exec_test.go:2033: found 2 extra fds after container.Run Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-05-06 12:37:55 -07:00
Kir Kolyshkin	6faed0e486	libct/int: use ok(t, err) ... in all the places it makes sense to use it. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-04-15 13:03:17 -07:00
Kir Kolyshkin	af3c5699a5	libct/int: remove unused code Since commit `88e8350de2` the error message is different, so the check is not working. In addition, for the cgroup v2 case, and it seems that PID controller is always available these days. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-04-15 12:46:17 -07:00
Kir Kolyshkin	7b802a7da4	libct/int: better test container names 1. Do not create the same container named "test" over and over. 2. Fix randomization issues when generating container and cgroup names. The issues were: * math/rand used without seeding * complex rand/md5/hexencode sequence In both cases, replace with nanosecond time encoded with digits and lowercase letters. 3. Add test name to container and cgroup names. For example, this is how systemd log has changed: Before: Started libcontainer container test16ddfwutxgjte. After: Started libcontainer container TestPidsSystemd-4oaqvr. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-04-15 12:37:59 -07:00
Qiang Huang	2d38476c96	Merge pull request #2840 from kolyshkin/ignore-kmem Ignore kernel memory settings	2021-04-13 09:44:14 +08:00
Kir Kolyshkin	52390d6804	Ignore kernel memory settings This is a somewhat radical approach to deal with kernel memory. Per-cgroup kernel memory limiting was always problematic. A few examples: - older kernels had bugs and were even oopsing sometimes (best example is RHEL7 kernel); - kernel is unable to reclaim the kernel memory so once the limit is hit a cgroup is toasted; - some kernel memory allocations don't allow failing. In addition to that, - users don't have a clue about how to set kernel memory limits (as the concept is much more complicated than e.g. [user] memory); - different kernels might have different kernel memory usage, which is sort of unexpected; - cgroup v2 does not have a [dedicated] kmem limit knob, and thus runc silently ignores kernel memory limits for v2; - kernel v5.4 made cgroup v1 kmem.limit obsolete (see https://github.com/torvalds/linux/commit/0158115f702b). In view of all this, and as the runtime-spec lists memory.kernel and memory.kernelTCP as OPTIONAL, let's ignore kernel memory limits (for cgroup v1, same as we're already doing for v2). This should result in fewer bugs and better user experience. The only bad side effect from it might be that stat can show kernel memory usage as 0 (since the accounting is not enabled). [v2: add a warning in specconv that limits are ignored] Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-04-12 12:18:11 -07:00
Kir Kolyshkin	79a8647b81	libct/int: add TestFdLeaks This is a very simple test that checks that container.Run does not leak opened file descriptors. In fact it does, so we have to add two exclusions: 1. /sys/fs/cgroup is opened once per lifetime in prepareOpenat2(), provided that cgroupv2 is used and openat2 is available. This works as intended ("it's not a bug, it's a feature"). 2. ebpf program fd is leaked every time we call setDevices() for cgroupv2 (iow, every container.Run or container.Set leaks 1 fd). This needs to be fixed, thus FIXME. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-03-30 19:58:09 -07:00
Kir Kolyshkin	3de5c51454	libct/int: don't hardcode CAP_NET_ADMIN ... use the one from unix instead. Coincidentally, this fixes this warning from gosimple linter: > libcontainer/integration/exec_test.go:448:2: S1021: should merge variable declaration with assignment on next line (gosimple) > var netAdminBit uint > ^ Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-12-03 10:24:27 -08:00
Kir Kolyshkin	3387422bf9	libct/int: fix "simple" linter warnings This fixes the following warnings: > libcontainer/integration/exec_test.go:369:18: S1030: should use stdout.String() instead of string(stdout.Bytes()) (gosimple) > outputStatus := string(stdout.Bytes()) > ^ > libcontainer/integration/exec_test.go:422:18: S1030: should use stdout.String() instead of string(stdout.Bytes()) (gosimple) > outputStatus := string(stdout.Bytes()) > ^ > libcontainer/integration/exec_test.go:486:18: S1030: should use stdout.String() instead of string(stdout.Bytes()) (gosimple) > outputGroups := string(stdout.Bytes()) > ^ > libcontainer/integration/execin_test.go:191:18: S1030: should use stdout.String() instead of string(stdout.Bytes()) (gosimple) > outputGroups := string(stdout.Bytes()) > ^ > libcontainer/integration/execin_test.go:474:9: S1030: should use stdout.String() instead of string(stdout.Bytes()) (gosimple) > out := string(stdout.Bytes()) > ^ Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-12-03 10:24:27 -08:00
Mauricio Vásquez	ac5ec5e32f	libcontainer/integration: fix unit test Fix a merge issue between `0aa0fae393` ("Kill all processes in cgroup even if init process Wait fails") & `73d93eeb01` ("libct/int: make newTemplateConfig argument a struct") that resulted in passing the wrong datatype to newTemplateConfig in TestPIDHostInitProcessWait. Signed-off-by: Mauricio Vásquez <mauricio@kinvolk.io>	2020-10-23 07:56:11 -05:00
Mrunal Patel	07e35a7a40	Merge pull request #2600 from kolyshkin/libct-int-wut libcontainer/integration: fix cgroupv1 + systemd tests	2020-10-22 21:05:15 -07:00
Kir Kolyshkin	44f221e2fc	Merge pull request #2633 from chaitanyabandi/2632-fix Kill processes in cgroup even if process Wait fails	2020-10-08 10:51:13 -07:00
Chaitanya Bandi	0aa0fae393	Kill all processes in cgroup even if init process Wait fails If the cgroup's init process doesn't complete successfully, Wait returns a non-nil error. We should still kill all the process in the cgroup if process namespace is shared. Otherwise, it may result in process leak. Fixes #2632 Signed-off-by: Chaitanya Bandi <kbandi@cs.stonybrook.edu>	2020-10-08 01:26:34 +00:00
Amim Knabben	978fa6e906	Fixing some lint issues Signed-off-by: Amim Knabben <amim.knabben@gmail.com>	2020-10-06 14:44:14 -04:00
Kir Kolyshkin	fa47f95872	libct/int/newTemplateConfig: add systemd support ... and properly set ScopePrefix, Name and Parent for a systemd case. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-10-05 20:51:02 -07:00
Kir Kolyshkin	9135d99c94	libct/int/newTemplateConfig: add userns param It seems that a few tests add a cgroup mount in case userns is not set. Let's do it inside newTemplateConfig() for all tests. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-10-05 20:51:02 -07:00
Kir Kolyshkin	73d93eeb01	libct/int: make newTemplateConfig argument a struct ...so we can add more fields later. This commit is mostly courtesy of sed -i 's/newTemplateConfig(rootfs)/newTemplateConfig(\&tParam{rootfs: rootfs})/g' Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-10-05 20:51:02 -07:00
Sebastiaan van Stijn	28b452bf65	libcontainer: unconvert Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2020-10-01 18:36:56 +02:00
Sebastiaan van Stijn	b3a8b0742c	libcontainer: prefer bytes.TrimSpace() over strings.TrimSpace() Perform trimming before converting to a string, which should be somewhat more performant. Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2020-10-01 18:36:53 +02:00
Sebastiaan van Stijn	8bf216728c	use string-concatenation instead of sprintf for simple cases Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2020-09-30 10:51:59 +02:00
Kir Kolyshkin	b006f4a180	libct/cgroups: support Cgroups.Resources.Unified Add support for unified resource map (as per [1]), and add some test cases for the new functionality. [1] https://github.com/opencontainers/runtime-spec/pull/1040 Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-09-24 15:29:35 -07:00
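
Commit `972aea3af0` in the list above makes the config validator accept `/` as a separator in sysctl names, as sysctl(8) does. The following is a minimal sketch of that kind of normalization, not runc's actual code: `normalizeSysctl` is a hypothetical helper, and it assumes the simple case where every `/` stands in for a `.` (it does not try to handle keys whose components themselves contain dots).

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeSysctl is a hypothetical helper (not part of runc). It rewrites a
// sysctl key that uses '/' as the separator into the dotted form, so that
// later checks only have to deal with one spelling.
func normalizeSysctl(name string) string {
	return strings.ReplaceAll(name, "/", ".")
}

func main() {
	// Both spellings are shown; the dotted one passes through unchanged.
	for _, key := range []string{"kernel.ostype", "net/ipv4/ip_forward"} {
		fmt.Printf("%s -> %s\n", key, normalizeSysctl(key))
	}
}
```

Normalizing once up front keeps any prefix checks (for example on `net.` or `kernel.` keys) independent of which separator the user typed.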


110 Commits
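
Commit `7b802a7da4` in the list above describes naming test containers with the test name plus a nanosecond timestamp encoded with digits and lowercase letters. A rough sketch of that idea is shown here; `uniqueName` is a hypothetical function written for illustration, and the use of base 36 via `strconv.FormatInt` is an assumption about one way to get such an encoding, not a copy of the code in the commit.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// uniqueName is a hypothetical helper (not runc's implementation). It joins
// the test name and a timestamp in nanoseconds, with the timestamp written in
// base 36 so that it only uses the characters 0-9 and a-z.
func uniqueName(testName string, nanos int64) string {
	return testName + "-" + strconv.FormatInt(nanos, 36)
}

func main() {
	// The timestamp is passed in as an argument so the helper stays
	// deterministic and easy to test.
	fmt.Println(uniqueName("TestPidsSystemd", time.Now().UnixNano()))
}
```

Putting the test name first means a systemd log line identifies the test that created the unit, while the time-based suffix avoids relying on an unseeded random source.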