zishuo/runc

mirror of https://github.com/opencontainers/runc.git synced 2025-11-03 01:43:44 +08:00

Author	SHA1	Message	Date
Aleksa Sarai	8e8b136c49	tree-wide: use /proc/thread-self for thread-local state With the idmap work, we will have a tainted Go thread in our thread-group that has a different mount namespace to the other threads. It seems that (due to some bad luck) the Go scheduler tends to make this thread the thread-group leader in our tests, which results in very baffling failures where /proc/self/mountinfo produces gibberish results. In order to avoid this, switch to using /proc/thread-self for everything that is thread-local. This primarily includes switching all file descriptor paths (CLONE_FS), all of the places that check the current cgroup (technically we never will run a single runc thread in a separate cgroup, but better to be safe than sorry), and the aforementioned mountinfo code. We don't need to do anything for the following because the results we need aren't thread-local: * Checks that certain namespaces are supported by stat(2)ing /proc/self/ns/... * /proc/self/exe and /proc/self/cmdline are not thread-local. * While threads can be in different cgroups, we do not do this for the runc binary (or libcontainer) and thus we do not need to switch to the thread-local version of /proc/self/cgroups. * All of the CLONE_NEWUSER files are not thread-local because you cannot set the usernamespace of a single thread (setns(CLONE_NEWUSER) is blocked for multi-threaded programs). Note that we have to use runtime.LockOSThread when we have an open handle to a tid-specific procfs file that we are operating on multiple times. Go can reschedule us such that we are running on a different thread and then kill the original thread (causing -ENOENT or similarly confusing errors). This is not strictly necessary for most usages of /proc/thread-self (such as using /proc/thread-self/fd/$n directly) since only operating on the actual inodes associated with the tid requires this locking, but because of the pre-3.17 fallback for CentOS, we have to do this in most cases. 
In addition, CentOS's kernel is too old for /proc/thread-self, which requires us to emulate it -- however in rootfs_linux.go, we are in the container pid namespace but /proc is the host's procfs. This leads to the incredibly frustrating situation where there is no way (on pre-4.1 Linux) to figure out which /proc/self/task/... entry refers to the current tid. We can just use /proc/self in this case. Yes this is all pretty ugly. I also wish it wasn't necessary. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-12-14 11:36:41 +11:00
Kir Kolyshkin	fd5debf3aa	libct/cg: rm GetInitCgroup[Path] These functions were added in ancient times, facilitating the docker-in-docker case when cgroup namespace was not available. As pointed out in commit `2b28b3c276`, using init 1 cgroup is not correct because it won't work in case of host PID namespace. The last user of GetInitCgroup was removed by commit `54e20217a8`. GetInitCgroupPath was never used as far as I can see, nor was I able to find any external users. Remove both functions. Modify the comment in libct/cg/fs.subsysPath to not refer to GetInitCgroupPath. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-04-04 11:19:25 -07:00
Mauricio Vásquez	a8435007d9	cgroups: join cgroup v2 when using hybrid mode Currently the parent process of the container is moved to the right cgroup v2 tree when systemd is using a hybrid model (last line with 0::): $ runc --systemd-cgroup run myid / # cat /proc/self/cgroup 12:cpuset:/system.slice/runc-myid.scope 11:blkio:/system.slice/runc-myid.scope 10:devices:/system.slice/runc-myid.scope 9:hugetlb:/system.slice/runc-myid.scope 8:memory:/system.slice/runc-myid.scope 7:rdma:/ 6:perf_event:/system.slice/runc-myid.scope 5:net_cls,net_prio:/system.slice/runc-myid.scope 4:freezer:/system.slice/runc-myid.scope 3:pids:/system.slice/runc-myid.scope 2:cpu,cpuacct:/system.slice/runc-myid.scope 1:name=systemd:/system.slice/runc-myid.scope 0::/system.slice/runc-myid.scope However, if a second process is executed in the same container, it is not moved to the right cgroup v2 tree: $ runc exec myid /bin/sh -c 'cat /proc/self/cgroup' 12:cpuset:/system.slice/runc-myid.scope 11:blkio:/system.slice/runc-myid.scope 10:devices:/system.slice/runc-myid.scope 9:hugetlb:/system.slice/runc-myid.scope 8:memory:/system.slice/runc-myid.scope 7:rdma:/ 6:perf_event:/system.slice/runc-myid.scope 5:net_cls,net_prio:/system.slice/runc-myid.scope 4:freezer:/system.slice/runc-myid.scope 3:pids:/system.slice/runc-myid.scope 2:cpu,cpuacct:/system.slice/runc-myid.scope 1:name=systemd:/system.slice/runc-myid.scope 0::/user.slice/user-1000.slice/session-8.scope This commit makes that processes executed with exec are placed into the right cgroup v2 tree. The implementation checks if systemd is using a hybrid mode (by checking if cgroups v2 is mounted in /sys/fs/cgroup/unified), if yes, the path of the cgroup v2 slice for this container is saved into the cgroup path list. The fs group driver has a similar issue, in this case none of the runc run or runc exec commands put the process in the right cgroups v2. This commit also fixes that. 
Having the processes of the container in its own cgroup v2 is useful for any BPF programs that rely on bpf_get_current_cgroup_id(), like https://github.com/kinvolk/inspektor-gadget/ for instance. [@kolyshkin: rebased] Signed-off-by: Mauricio Vásquez <mauricio@kinvolk.io> Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-09-23 19:29:11 -07:00
Kir Kolyshkin	f6a0899b7f	*: use errors.As and errors.Is Do this for all errors except one from unix.*. This fixes a bunch of errorlint warnings, like these libcontainer/generic_error.go:25:15: type assertion on error will fail on wrapped errors. Use errors.As to check for specific errors (errorlint) if le, ok := err.(Error); ok { ^ libcontainer/factory_linux_test.go:145:14: type assertion on error will fail on wrapped errors. Use errors.As to check for specific errors (errorlint) lerr, ok := err.(Error) ^ libcontainer/state_linux_test.go:28:11: type assertion on error will fail on wrapped errors. Use errors.As to check for specific errors (errorlint) _, ok := err.(*stateTransitionError) ^ libcontainer/seccomp/patchbpf/enosys_linux.go:88:4: switch on an error will fail on wrapped errors. Use errors.Is to check for specific errors (errorlint) switch err { ^ Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-06-22 16:09:47 -07:00
Kir Kolyshkin	627a06ad92	Replace fmt.Errorf w/o %-style to errors.New Using fmt.Errorf for errors that do not have %-style formatting directives is an overkill. Switch to errors.New. Found by git grep fmt.Errorf | grep -v ^vendor | grep -v '%' Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-06-22 11:42:07 -07:00
Kir Kolyshkin	ed70dfa732	libct/cgroups/v1_utils: implement mountinfo cache Apparently it is inevitable that we have to read mountinfo multiple times when dealing with cgroup v1. It seems we can only do it once and reuse the data, without major modifications to the code. This commit does a few things. 1. Drop our custom mountinfo parser implementation in favor of moby/sys/mountinfo. While the custom parser is faster (about 2x according to benchmark) for this particular case, the one from the package is more correct and future-proof. 2. Read mountinfo only once, caching all the entries with fstype of cgroup. With this, there's no need to worry about performance degradation introduced above. 3. Drop "isSubsystemAvailable" optimization (introduced by commit `2a1a6cdf44`) because now with the cache it is probably slowing things down. 4. Modify the tests accordingly. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-01-06 14:54:22 -08:00
Kir Kolyshkin	7cd062d7be	libct/cgroup/utils: fix GetCgroupMounts(all=true) The `all` argument was introduced by commit `f557996401` specifically for use by cAdvisor (see [1]), but there were no test cases added, so it was later broken by `5ee0648bfb` which started incrementing numFound unconditionally. Fix this (by not checking numFound in case all is true), and add a simple test case to avoid future regressions. [1] https://github.com/google/cadvisor/pull/1476 Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-12-01 15:15:30 -08:00
Kir Kolyshkin	b006f4a180	libct/cgroups: support Cgroups.Resources.Unified Add support for unified resource map (as per [1]), and add some test cases for the new functionality. [1] https://github.com/opencontainers/runtime-spec/pull/1040 Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-09-24 15:29:35 -07:00
Kir Kolyshkin	a73ce38d16	cgroupv1/FindCgroupMountpoint: add a fast path In case cgroupPath is under the default cgroup prefix, let's try to guess the mount point by adding the subsystem name to the default prefix, and resolving the resulting path in case it's a symlink. In most cases, given the default cgroup setup, this trick should result in returning the same result faster, and avoiding /proc/self/mountinfo parsing which is relatively slow and problematic. Be very careful with the default path, checking it is - a directory; - a mount point; - has cgroup fstype. If something is not right, fall back to parsing mountinfo. While at it, remove the obsoleted comment about mountinfo parsing. The comment belongs to findCgroupMountpointAndRootFromReader(), but rather than moving it there, let's just remove it, since it does not add any value in understanding the current code. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-07-07 13:57:33 -07:00
Kir Kolyshkin	cec5ae7c2d	libct/cgroupv1/getCgroupMountsHelper: minor nit It is easy to just use TrimPrefix which does nothing in case the prefix does not exist. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-06-16 12:45:50 -07:00
Kir Kolyshkin	0681d456fc	libct/cgroups/utils: move cgroup v1 code to separate file In most projects, "utils" is a big mess, and this is not an exception. Try to clean it up a bit by moving cgroup v1 specific code to a separate source file. There are no code changes in this commit, just moving it from one file to another. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-06-16 12:45:07 -07:00

11 Commits