It has been pointed out that some controllers can not accept multiple
lines of output at once. In particular, io.max can only set one device
at a time.
Practically, the only multi-line resource values we can get come from
unified.* -- let's write those line by line.
Add a test case.
Reported-by: Tao Shen <shentaoskyking@gmail.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
We auto-close this file descriptor in the final exec step, but it's
probably a good idea to not possibly leak the file descriptor to "runc
init" (we've had issues like this in the past) especially since it is a
directory handle from the host mount namespace.
In practice, on runc 1.1 this does leak to "runc init" but on main the
handle has a low enough file descriptor that it gets clobbered by the
ForkExec of "runc init".
OPEN_TREE_CLONE would let us protect this handle even further, but the
performance impact of creating an anonymous mount namespace is probably
not worth it.
Also, switch to using an *os.File for the handle so if it goes out of
scope during setup (i.e. an error occurs during setup) it will get
cleaned up by the GC.
Fixes: GHSA-xr7r-f8xq-vfvv CVE-2024-21626
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
With the idmap work, we will have a tainted Go thread in our
thread-group that has a different mount namespace to the other threads.
It seems that (due to some bad luck) the Go scheduler tends to make this
thread the thread-group leader in our tests, which results in very
baffling failures where /proc/self/mountinfo produces gibberish results.
In order to avoid this, switch to using /proc/thread-self for everything
that is thread-local. This primarily includes switching all file
descriptor paths (CLONE_FS), all of the places that check the current
cgroup (technically we never will run a single runc thread in a separate
cgroup, but better to be safe than sorry), and the aforementioned
mountinfo code. We don't need to do anything for the following because
the results we need aren't thread-local:
* Checks that certain namespaces are supported by stat(2)ing
/proc/self/ns/...
* /proc/self/exe and /proc/self/cmdline are not thread-local.
* While threads can be in different cgroups, we do not do this for the
runc binary (or libcontainer) and thus we do not need to switch to
the thread-local version of /proc/self/cgroups.
* All of the CLONE_NEWUSER files are not thread-local because you
cannot set the usernamespace of a single thread (setns(CLONE_NEWUSER)
is blocked for multi-threaded programs).
Note that we have to use runtime.LockOSThread when we have an open
handle to a tid-specific procfs file that we are operating on multiple
times. Go can reschedule us such that we are running on a different
thread and then kill the original thread (causing -ENOENT or similarly
confusing errors). This is not strictly necessary for most usages of
/proc/thread-self (such as using /proc/thread-self/fd/$n directly) since
only operating on the actual inodes associated with the tid requires
this locking, but because of the pre-3.17 fallback for CentOS, we have
to do this in most cases.
In addition, CentOS's kernel is too old for /proc/thread-self, which
requires us to emulate it -- however in rootfs_linux.go, we are in the
container pid namespace but /proc is the host's procfs. This leads to
the incredibly frustrating situation where there is no way (on pre-4.1
Linux) to figure out which /proc/self/task/... entry refers to the
current tid. We can just use /proc/self in this case.
Yes this is all pretty ugly. I also wish it wasn't necessary.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
This prevents potential exploit of using "../" in cgroups.OpenFile
(as well as other methods that use OpenFile) to read or write to
other cgroups.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit f34eb2c00 introduced a workaround to retry on EINTR due to changes in Go 1.14.
It was fixed in Go 1.15 [1], meaning a custom retry loop is no longer
necessary.
Keep the test case to avoid future regressions.
[1] https://github.com/golang/go/issues/38033
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
golangci-lint v1.54.2 comes with errorlint v1.4.4, which contains
the fix [1] whitelisting all errno comparisons for errors coming from
x/sys/unix.
Thus, these annotations are no longer necessary. Hooray!
[1] https://github.com/polyfloyd/go-errorlint/pull/47
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
opencontainers/runc issue 3026 describes a scenario in which OpenFile
failed to open a legitimate existing cgroupfs file. Added debug
(similar to what this commit does) shown that cgroupFd is no longer
opened to "/sys/fs/cgroup", but to "/" (it's not clear what caused it,
and the source code is not available, but they might be using the same
process on the both sides of the container/chroot/pivot_root/mntns
boundary, or remounting /sys/fs/cgroup).
Consider such use incorrect, but give a helpful hint as two what is
going on by wrapping the error in a more useful message.
NB: this can potentially be fixed by reopening the cgroupFd once we
detected that it's screwed, and retrying openat2. Alas I do not have
a test case for this, so left this as a TODO suggestion.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Fix reading cgroup files from the top cgroup directory, i.e.
/sys/fs/cgroup.
The code was working for for any subdirectory of /sys/fs/cgroup, but
for dir="/sys/fs/cgroup" a fallback (open and fstatfs) was used, because
of the way the function worked with the dir argument.
Fix those cases, and add unit tests to make sure they work. While at it,
make the rules for dir and name components more relaxed, and add test
cases for this, too.
While at it, improve OpenFile documentation, and remove a duplicated
doc comment for openFile.
Without these fixes, the unit test fails the following cases:
file_test.go:67: case {path:/sys/fs/cgroup name:cgroup.controllers}: fallback
file_test.go:67: case {path:/sys/fs/cgroup/ name:cgroup.controllers}: openat2 /sys/fs/cgroup//cgroup.controllers: invalid cross-device link
file_test.go:67: case {path:/sys/fs/cgroup/ name:/cgroup.controllers}: openat2 /sys/fs/cgroup///cgroup.controllers: invalid cross-device link
file_test.go:67: case {path:/ name:/sys/fs/cgroup/cgroup.controllers}: fallback
file_test.go:67: case {path:/ name:sys/fs/cgroup/cgroup.controllers}: fallback
file_test.go:67: case {path:/sys/fs/cgroup/cgroup.controllers name:}: openat2 /sys/fs/cgroup/cgroup.controllers/: not a directory
Here "fallback" means openat2-based implementation fails, and the fallback code
is used (and works).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Errors from unix.* are always bare and thus can be used directly.
Add //nolint:errorlint annotation to ignore errors such as these:
libcontainer/system/xattrs_linux.go:18:7: comparing with == will fail on wrapped errors. Use errors.Is to check for a specific error (errorlint)
case errno == unix.ERANGE:
^
libcontainer/container_linux.go:1259:9: comparing with != will fail on wrapped errors. Use errors.Is to check for a specific error (errorlint)
if e != unix.EINVAL {
^
libcontainer/rootfs_linux.go:919:7: comparing with != will fail on wrapped errors. Use errors.Is to check for a specific error (errorlint)
if err != unix.EINVAL && err != unix.EPERM {
^
libcontainer/rootfs_linux.go:1002:4: switch on an error will fail on wrapped errors. Use errors.Is to check for specific errors (errorlint)
switch err {
^
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This is a better place as cgroups itself is using these.
Should help with moving more stuff common in between fs and fs2 to
fscommon.
Looks big, but this is just moving the code around:
fscommon/{fscommon,open}.go -> cgroups/file.go
fscommon/fscommon_test.go -> cgroups/file_test.go
and fixes for TestMode moved to a different package.
There's no functional change.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>