With the idmap work, we will have a tainted Go thread in our
thread-group that has a different mount namespace to the other threads.
It seems that (due to some bad luck) the Go scheduler tends to make this
thread the thread-group leader in our tests, which results in very
baffling failures where /proc/self/mountinfo produces gibberish results.
In order to avoid this, switch to using /proc/thread-self for everything
that is thread-local. This primarily includes switching all file
descriptor paths (CLONE_FS), all of the places that check the current
cgroup (technically we never will run a single runc thread in a separate
cgroup, but better to be safe than sorry), and the aforementioned
mountinfo code. We don't need to do anything for the following because
the results we need aren't thread-local:
* Checks that certain namespaces are supported by stat(2)ing
/proc/self/ns/...
* /proc/self/exe and /proc/self/cmdline are not thread-local.
* While threads can be in different cgroups, we do not do this for the
runc binary (or libcontainer) and thus we do not need to switch to
the thread-local version of /proc/self/cgroups.
* All of the CLONE_NEWUSER files are not thread-local because you
cannot set the usernamespace of a single thread (setns(CLONE_NEWUSER)
is blocked for multi-threaded programs).
Note that we have to use runtime.LockOSThread when we have an open
handle to a tid-specific procfs file that we are operating on multiple
times. Go can reschedule us such that we are running on a different
thread and then kill the original thread (causing -ENOENT or similarly
confusing errors). This is not strictly necessary for most usages of
/proc/thread-self (such as using /proc/thread-self/fd/$n directly) since
only operating on the actual inodes associated with the tid requires
this locking, but because of the pre-3.17 fallback for CentOS, we have
to do this in most cases.
In addition, CentOS's kernel is too old for /proc/thread-self, which
requires us to emulate it -- however in rootfs_linux.go, we are in the
container pid namespace but /proc is the host's procfs. This leads to
the incredibly frustrating situation where there is no way (on pre-4.1
Linux) to figure out which /proc/self/task/... entry refers to the
current tid. We can just use /proc/self in this case.
Yes this is all pretty ugly. I also wish it wasn't necessary.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
These functions were added in ancient times, facilitating the
docker-in-docker case when cgroup namespace was not available.
As pointed out in commit 2b28b3c276, using init 1 cgroup is not
correct because it won't work in case of host PID namespace.
The last user of GetInitCgroup was removed by commit
54e20217a8. GetInitCgroupPath was never used
as far as I can see, nor was I able to find any external users.
Remove both functions. Modify the comment in libct/cg/fs.subsysPath
to not refer to GetInitCgroupPath.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Currently the parent process of the container is moved to the right
cgroup v2 tree when systemd is using a hybrid model (last line with 0::):
$ runc --systemd-cgroup run myid
/ # cat /proc/self/cgroup
12:cpuset:/system.slice/runc-myid.scope
11:blkio:/system.slice/runc-myid.scope
10:devices:/system.slice/runc-myid.scope
9:hugetlb:/system.slice/runc-myid.scope
8:memory:/system.slice/runc-myid.scope
7:rdma:/
6:perf_event:/system.slice/runc-myid.scope
5:net_cls,net_prio:/system.slice/runc-myid.scope
4:freezer:/system.slice/runc-myid.scope
3:pids:/system.slice/runc-myid.scope
2:cpu,cpuacct:/system.slice/runc-myid.scope
1:name=systemd:/system.slice/runc-myid.scope
0::/system.slice/runc-myid.scope
However, if a second process is executed in the same container, it is
not moved to the right cgroup v2 tree:
$ runc exec myid /bin/sh -c 'cat /proc/self/cgroup'
12:cpuset:/system.slice/runc-myid.scope
11:blkio:/system.slice/runc-myid.scope
10:devices:/system.slice/runc-myid.scope
9:hugetlb:/system.slice/runc-myid.scope
8:memory:/system.slice/runc-myid.scope
7:rdma:/
6:perf_event:/system.slice/runc-myid.scope
5:net_cls,net_prio:/system.slice/runc-myid.scope
4:freezer:/system.slice/runc-myid.scope
3:pids:/system.slice/runc-myid.scope
2:cpu,cpuacct:/system.slice/runc-myid.scope
1:name=systemd:/system.slice/runc-myid.scope
0::/user.slice/user-1000.slice/session-8.scope
This commit makes that processes executed with exec are placed into the
right cgroup v2 tree. The implementation checks if systemd is using a
hybrid mode (by checking if cgroups v2 is mounted in
/sys/fs/cgroup/unified), if yes, the path of the cgroup v2 slice for
this container is saved into the cgroup path list.
The fs group driver has a similar issue, in this case none of the runc
run or runc exec commands put the process in the right cgroups v2. This
commit also fixes that.
Having the processes of the container in its own cgroup v2 is useful
for any BPF programs that rely on bpf_get_current_cgroup_id(), like
https://github.com/kinvolk/inspektor-gadget/ for instance.
[@kolyshkin: rebased]
Signed-off-by: Mauricio Vásquez <mauricio@kinvolk.io>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Do this for all errors except one from unix.*.
This fixes a bunch of errorlint warnings, like these
libcontainer/generic_error.go:25:15: type assertion on error will fail on wrapped errors. Use errors.As to check for specific errors (errorlint)
if le, ok := err.(Error); ok {
^
libcontainer/factory_linux_test.go:145:14: type assertion on error will fail on wrapped errors. Use errors.As to check for specific errors (errorlint)
lerr, ok := err.(Error)
^
libcontainer/state_linux_test.go:28:11: type assertion on error will fail on wrapped errors. Use errors.As to check for specific errors (errorlint)
_, ok := err.(*stateTransitionError)
^
libcontainer/seccomp/patchbpf/enosys_linux.go:88:4: switch on an error will fail on wrapped errors. Use errors.Is to check for specific errors (errorlint)
switch err {
^
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Using fmt.Errorf for errors that do not have %-style formatting
directives is an overkill. Switch to errors.New.
Found by
git grep fmt.Errorf | grep -v ^vendor | grep -v '%'
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Apparently it is inevitable that we have to read mountinfo multiple
times when dealing with cgroup v1. It seems we can only do it once
and reuse the data, without major modifications to the code.
This commit does a few things.
1. Drop our custom mountinfo parser implementation in favor of
moby/sys/mountinfo. While the custom parser is faster
(about 2x according to benchmark) for this particular case,
the one from the package is more correct and future-proof.
2. Read mountinfo only once, caching all the entries with fstype of
cgroup. With this, there's no need to worry about performance
degradation introduced above.
3. Drop "isSubsystemAvailable" optimization (introduced by commit
2a1a6cdf44) because now with the cache it is probably slowing
things down.
4. Modify the tests accordingly.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The `all` argument was introduced by commit f557996401 specifically
for use by cAdvisor (see [1]), but there were no test cases added,
so it was later broken by 5ee0648bfb which started incrementing
numFound unconditionally.
Fix this (by not checking numFound in case all is true), and add a
simple test case to avoid future regressions.
[1] https://github.com/google/cadvisor/pull/1476
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In case cgroupPath is under the default cgroup prefix, let's try to
guess the mount point by adding the subsystem name to the default
prefix, and resolving the resulting path in case it's a symlink.
In most cases, given the default cgroup setup, this trick
should result in returning the same result faster, and avoiding
/proc/self/mountinfo parsing which is relatively slow and problematic.
Be very careful with the default path, checking it is
- a directory;
- a mount point;
- has cgroup fstype.
If something is not right, fall back to parsing mountinfo.
While at it, remove the obsoleted comment about mountinfo parsing. The
comment belongs to findCgroupMountpointAndRootFromReader(), but rather
than moving it there, let's just remove it, since it does not add any
value in understanding the current code.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In most project, "utils" is a big mess, and this is not an exception.
Try to clean it up a bit by moving cgroup v1 specific code to a separate
source file.
There are no code changes in this commit, just moving it from one file
to another.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>