Commit 4316df8b53 isolated RunningInUserNS
to a separate package to make it easier to consume without bringing in
additional dependencies, and with the potential to move it separate in
a similar fashion as libcontainer/user was moved to a separate module
in commit ca32014adb. While RunningInUserNS
is fairly trivial to implement, it (or variants of this utility) is used
in many codebases, and moving to a separate module could consolidate
those implementations, as well as making it easier to consume without
large dependency trees (when being a package as part of a larger code
base).
Commit 1912d5988b and follow-ups introduced
cgo code into the userns package, and code introduced in those commits
are not intended for external use, therefore complicating the potential
of moving the userns package separate.
This commit moves the new code to a separate package; some of this code
was included in v1.1.11 and up, but I could not find external consumers
of `GetUserNamespaceMappings` and `IsSameMapping`. The `Mapping` and
`Handles` types (added in ba0b5e2698) only
exist in main and in non-stable releases (v1.2.0-rc.x), so don't need
an alias / deprecation.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
With the idmap work, we will have a tainted Go thread in our
thread-group that has a different mount namespace to the other threads.
It seems that (due to some bad luck) the Go scheduler tends to make this
thread the thread-group leader in our tests, which results in very
baffling failures where /proc/self/mountinfo produces gibberish results.
In order to avoid this, switch to using /proc/thread-self for everything
that is thread-local. This primarily includes switching all file
descriptor paths (CLONE_FS), all of the places that check the current
cgroup (technically we never will run a single runc thread in a separate
cgroup, but better to be safe than sorry), and the aforementioned
mountinfo code. We don't need to do anything for the following because
the results we need aren't thread-local:
* Checks that certain namespaces are supported by stat(2)ing
/proc/self/ns/...
* /proc/self/exe and /proc/self/cmdline are not thread-local.
* While threads can be in different cgroups, we do not do this for the
runc binary (or libcontainer) and thus we do not need to switch to
the thread-local version of /proc/self/cgroups.
* All of the CLONE_NEWUSER files are not thread-local because you
cannot set the usernamespace of a single thread (setns(CLONE_NEWUSER)
is blocked for multi-threaded programs).
Note that we have to use runtime.LockOSThread when we have an open
handle to a tid-specific procfs file that we are operating on multiple
times. Go can reschedule us such that we are running on a different
thread and then kill the original thread (causing -ENOENT or similarly
confusing errors). This is not strictly necessary for most usages of
/proc/thread-self (such as using /proc/thread-self/fd/$n directly) since
only operating on the actual inodes associated with the tid requires
this locking, but because of the pre-3.17 fallback for CentOS, we have
to do this in most cases.
In addition, CentOS's kernel is too old for /proc/thread-self, which
requires us to emulate it -- however in rootfs_linux.go, we are in the
container pid namespace but /proc is the host's procfs. This leads to
the incredibly frustrating situation where there is no way (on pre-4.1
Linux) to figure out which /proc/self/task/... entry refers to the
current tid. We can just use /proc/self in this case.
Yes this is all pretty ugly. I also wish it wasn't necessary.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
For userns and timens, the mappings (and offsets, respectively) cannot
be changed after the namespace is first configured. Thus, configuring a
container with a namespace path to join means that you cannot also
provide configuration for said namespace. Previously we would silently
ignore the configuration (and just join the provided path), but we
really should be returning an error (especially when you consider that
the configuration userns mappings are used quite a bit in runc with the
assumption that they are the correct mapping for the userns -- but in
this case they are not).
In the case of userns, the mappings are also required if you _do not_
specify a path, while in the case of the time namespace you can have a
container with a timens but no mappings specified.
It should be noted that the case checking that the user has not
specified a userns path and a userns mapping needs to be handled in
specconv (as opposed to the configuration validator) because with this
patchset we now cache the mappings of path-based userns configurations
and thus the validator can't be sure whether the mapping is a cached
mapping or a user-specified one. So we do the validation in specconv,
and thus the test for this needs to be an integration test.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
As of previous commit, this is implied in a particular scenario. In
fact, this is the one and only scenario that justifies the use of -a.
Drop the option from the documentation. For backward compatibility, do
recognize it, and retain the feature of ignoring the "container is
stopped" error when set.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
By default, the container has its own PID namespace, and killing (with
SIGKILL) its init process from the parent PID namespace also kills all
the other processes.
Obviously, it does not work that way when the container is sharing its
PID namespace with the host or another container, since init is no
longer special (it's not PID 1). In this case, killing container's init
will result in a bunch of other processes left running (and thus the
inability to remove the cgroup).
The solution to the above problem is killing all the container
processes, not just init.
The problem with the current implementation is, the killing logic is
implemented in libcontainer's initProcess.wait, and thus only available
to libcontainer users, but not the runc kill command (which uses
nonChildProcess.kill and does not use wait at all). So, some workarounds
exist:
- func destroy(c *Container) calls signalAllProcesses;
- runc kill implements -a flag.
This code became very tangled over time. Let's simplify things by moving
the killing all processes from initProcess.wait to container.Signal,
and documents the new behavior.
In essence, this also makes `runc kill` to automatically kill all container
processes when the container does not have its own PID namespace.
Document that as well.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
When someone is using libcontainer to start and kill containers from a
long lived process (i.e. the same process creates and removes the
container), initProcess.wait method is used, which has a kludge to work
around killing containers that do not have their own PID namespace.
The code that checks for own PID namespace is not entirely correct.
To be exact, it does not set sharePidns flag when the host/caller PID
namespace is implicitly used. As a result, the above mentioned kludge
does not work.
Fix the issue, add a test case (which fails without the fix).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
There are two very distinct usage scenarios for signalAllProcesses:
* when used from the runc binary ("runc kill" command), the processes
that it kills are not the children of "runc kill", and so calling
wait(2) on each process is totally useless, as it will return ECHLD;
* when used from a program that have created the container (such as
libcontainer/integration test suite), that program can and should call
wait(2), not the signalling code.
So, the child reaping code is totally useless in the first case, and
should be implemented by the program using libcontainer in the second
case. I was not able to track down how this code was added, my best
guess is it happened when this code was part of dockerd, which did not
have a proper child reaper implemented at that time.
Remove it, and add a proper documentation piece.
Change the integration test accordingly.
PS the first attempt to disable the child reaping code in
signalAllProcesses was made in commit bb912eb00c, which used a
questionable heuristic to figure out whether wait(2) should be called.
This heuristic worked for a particular use case, but is not correct in
general.
While at it:
- simplify signalAllProcesses to use unix.Kill;
- document (container).Signal.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The purpose of this test is to check that there are no extra file
descriptors left open after repeated calls to runContainer. In fact,
the first call to runContainer leaves a few file descriptors opened,
and this is by design.
Previously, this test relied on two things:
1. some other tests were run before it (and thus all such opened-once
file descriptors are already opened);
2. explicitly excluding fd opened to /sys/fs/cgroup.
Now, if we run this test separately, it will fail (because of 1 above).
The same may happen if the tests are run in a random order.
To fix this, add a container run before collection the initial fd list,
so those fds that are opened once are included and won't be reported.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Do not set inheritable capabilities in runc spec, runc exec --cap,
and in libcontainer integration tests.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Those were added by commit 59c5c3ac0 back in Apr 2015, but AFAICS were
never used and are obsoleted by more generic container hooks (initially
added by commit 05567f2c94 in Sep 2015).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Runtime spec says:
> sysctl (object, OPTIONAL) allows kernel parameters to be modified at
> runtime for the container. For more information, see the sysctl(8)
> man page.
and sysctl(8) says:
> variable
> The name of a key to read from. An example is
> kernel.ostype. The '/' separator is also accepted in place of a '.'.
Apparently, runc config validator do not support sysctls with / as a
separator. Fortunately this is a one-line fix.
Add some more test data where / is used as a separator.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Add a unit test to check that bind mounts that have a part of its
path non accessible by others still work when using user namespaces.
To do this, we also modify newRoot() to return rootfs directories that
can be traverse by others, so the rootfs created works for all test
(either running in a userns or not).
Signed-off-by: Mauricio Vásquez <mauricio@kinvolk.io>
Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>
1. Use t.TempDir instead of ioutil.TempDir. This means no need for an
explicit cleanup, which removes some code, including newTestBundle
and newTestRoot.
2. Move newRootfs invocation down to newTemplateConfig, removing a need
for explicit rootfs creation. Also, remove rootfs from tParam as it
is no longer needed (there was a since test case in which two
containers shared the same rootfs, but it does not look like it's
required for the test).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In addition to freezing and thawing a container via Pause/Resume,
there is a way to also do so via Set.
This way was broken though and is being fixed by a few preceding
commits. The test is added to make sure this is fixed and won't regress.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
gofumpt (mvdan.cc/gofumpt) is a fork of gofmt with stricter rules.
Brought to you by
git ls-files \*.go | grep -v ^vendor/ | xargs gofumpt -s -w
Looking at the diff, all these changes make sense.
Also, replace gofmt with gofumpt in golangci.yml.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
When running inside a Docker container, systemd is not available. The
new TestFdLeaksSystemd forgot to include the relevant t.Skip section.
Fixes: a7feb42395 ("libct/int: add TestFdLeaksSystemd")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Add a test to check that container.Run do not leak file descriptors.
Before the previous commit, it fails like this:
exec_test.go:2030: extra fd 8 -> socket:[659703]
exec_test.go:2030: extra fd 11 -> socket:[658715]
exec_test.go:2033: found 2 extra fds after container.Run
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Since commit 88e8350de2 the error message is different, so the check
is not working. In addition, for the cgroup v2 case, and it seems that
PID controller is always available these days.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. Do not create the same container named "test" over and over.
2. Fix randomization issues when generating container and cgroup names.
The issues were:
* math/rand used without seeding
* complex rand/md5/hexencode sequence
In both cases, replace with nanosecond time encoded with digits and
lowercase letters.
3. Add test name to container and cgroup names. For example, this is
how systemd log has changed:
Before: Started libcontainer container test16ddfwutxgjte.
After: Started libcontainer container TestPidsSystemd-4oaqvr.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This is somewhat radical approach to deal with kernel memory.
Per-cgroup kernel memory limiting was always problematic. A few
examples:
- older kernels had bugs and were even oopsing sometimes (best example
is RHEL7 kernel);
- kernel is unable to reclaim the kernel memory so once the limit is
hit a cgroup is toasted;
- some kernel memory allocations don't allow failing.
In addition to that,
- users don't have a clue about how to set kernel memory limits
(as the concept is much more complicated than e.g. [user] memory);
- different kernels might have different kernel memory usage,
which is sort of unexpected;
- cgroup v2 do not have a [dedicated] kmem limit knob, and thus
runc silently ignores kernel memory limits for v2;
- kernel v5.4 made cgroup v1 kmem.limit obsoleted (see
https://github.com/torvalds/linux/commit/0158115f702b).
In view of all this, and as the runtime-spec lists memory.kernel
and memory.kernelTCP as OPTIONAL, let's ignore kernel memory
limits (for cgroup v1, same as we're already doing for v2).
This should result in less bugs and better user experience.
The only bad side effect from it might be that stat can show kernel
memory usage as 0 (since the accounting is not enabled).
[v2: add a warning in specconv that limits are ignored]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This is a very simple test that checks that container.Run do not leak
opened file descriptors.
In fact it does, so we have to add two exclusions:
1. /sys/fs/cgroup is opened once per lifetime in prepareOpenat2(),
provided that cgroupv2 is used and openat2 is available. This
works as intended ("it's not a bug, it's a feature").
2. ebpf program fd is leaked every time we call setDevices() for
cgroupv2 (iow, every container.Run or container.Set leaks 1 fd).
This needs to be fixed, thus FIXME.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
... use the one from unix instead.
Coincidentally, this fixes this warning from gosimple linter:
> libcontainer/integration/exec_test.go:448:2: S1021: should merge variable declaration with assignment on next line (gosimple)
> var netAdminBit uint
> ^
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This fixes the following warnings:
> libcontainer/integration/exec_test.go:369:18: S1030: should use stdout.String() instead of string(stdout.Bytes()) (gosimple)
> outputStatus := string(stdout.Bytes())
> ^
> libcontainer/integration/exec_test.go:422:18: S1030: should use stdout.String() instead of string(stdout.Bytes()) (gosimple)
> outputStatus := string(stdout.Bytes())
> ^
> libcontainer/integration/exec_test.go:486:18: S1030: should use stdout.String() instead of string(stdout.Bytes()) (gosimple)
> outputGroups := string(stdout.Bytes())
> ^
> libcontainer/integration/execin_test.go:191:18: S1030: should use stdout.String() instead of string(stdout.Bytes()) (gosimple)
> outputGroups := string(stdout.Bytes())
> ^
> libcontainer/integration/execin_test.go:474:9: S1030: should use stdout.String() instead of string(stdout.Bytes()) (gosimple)
> out := string(stdout.Bytes())
> ^
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Fix a merge issue between 0aa0fae393 ("Kill all processes in cgroup even if init process Wait fails")
& 73d93eeb01 ("libct/int: make newTemplateConfig argument a struct") that
resulted in passing the wrong datatype to newTemplateConfig in
TestPIDHostInitProcessWait.
Signed-off-by: Mauricio Vásquez <mauricio@kinvolk.io>
If the cgroup's init process doesn't complete successfully, Wait returns a
non-nil error. We should still kill all the process in the cgroup if process
namespace is shared. Otherwise, it may result in process leak.
Fixes#2632
Signed-off-by: Chaitanya Bandi <kbandi@cs.stonybrook.edu>
It seems that a few tests add a cgroup mount in case userns is not set.
Let's do it inside newTemplateConfig() for all tests.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
...so we can add more fields later.
This commit is mostly courtesy of
sed -i 's/newTemplateConfig(rootfs)/newTemplateConfig(\&tParam{rootfs: rootfs})/g'
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>