There are two very distinct usage scenarios for signalAllProcesses:
* when used from the runc binary ("runc kill" command), the processes
that it kills are not the children of "runc kill", and so calling
wait(2) on each process is totally useless, as it will return ECHILD;
* when used from a program that has created the container (such as
libcontainer/integration test suite), that program can and should call
wait(2), not the signalling code.
So, the child reaping code is totally useless in the first case, and
should be implemented by the program using libcontainer in the second
case. I was not able to track down how this code was added; my best
guess is that it happened when this code was part of dockerd, which did
not have a proper child reaper implemented at that time.
Remove it, and add proper documentation.
Change the integration test accordingly.
PS the first attempt to disable the child reaping code in
signalAllProcesses was made in commit bb912eb00c, which used a
questionable heuristic to figure out whether wait(2) should be called.
This heuristic worked for a particular use case, but is not correct in
general.
While at it:
- simplify signalAllProcesses to use unix.Kill;
- document (container).Signal.
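For illustration only, a minimal sketch of the first scenario, using the standard syscall package rather than the libcontainer code itself. The helper name notOurChild is hypothetical, and PID 1 is merely an assumed stand-in for "a process we did not fork":

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// notOurChild calls wait4(2) on pid and reports whether the kernel
// answered ECHILD, i.e. pid is not a child of the calling process.
func notOurChild(pid int) bool {
	_, err := syscall.Wait4(pid, nil, syscall.WNOHANG, nil)
	return errors.Is(err, syscall.ECHILD)
}

func main() {
	// PID 1 is assumed not to be a child of this program, just like
	// container processes are not children of "runc kill".
	fmt.Println(notOurChild(1))
}
```

This is why reaping belongs to whoever actually forked the processes, not to the signalling code.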
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This is a cosmetic change to improve code readability, making it easier
to distinguish between a local error and the error being returned.
While at it, rename e to err (it was originally called e to not clash
with returned error named err) and ee to err2.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Instead of having newContainerInit return an interface, and let its
caller call Init(), it is easier to call Init directly.
Do that, and rename newContainerInit to containerInit.
I think it makes the code more readable and straightforward.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
When given an environment variable that is invalid, it's not a good idea
to output the contents in case they are supposed to be private (though
such a container wouldn't start anyway so it seems unlikely there's a
real way to use this to exfiltrate environment variables you didn't
already know).
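A rough sketch of the idea (not necessarily the exact code in this commit): validate the entry, but keep its contents out of the error text. The names checkEnv and leaks are hypothetical and exist only for this example.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// checkEnv validates a single "NAME=value" entry. The returned errors
// deliberately never include the entry itself, so a malformed secret
// cannot end up in logs.
func checkEnv(pair string) error {
	name, _, ok := strings.Cut(pair, "=")
	if !ok {
		return errors.New("invalid environment variable: missing '='")
	}
	if name == "" {
		return errors.New("invalid environment variable: name cannot be empty")
	}
	return nil
}

// leaks reports whether the error text for pair contains pair itself.
func leaks(pair string) bool {
	err := checkEnv(pair)
	return err != nil && strings.Contains(err.Error(), pair)
}

func main() {
	fmt.Println(checkEnv("TOKEN-s3cr3t"))
	fmt.Println(leaks("TOKEN-s3cr3t"))
}
```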
Reported-by: Carl Henrik Lunde <chlunde@ifi.uio.no>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
In 18c4760a (libct: fixStdioPermissions: skip chown if not needed)
the check whether the STDIO file descriptors point to /dev/null was
removed, which can cause /dev/null to change ownership, e.g. when using
docker exec on a running container:
$ ls -l /dev/null
crw-rw-rw- 1 root root 1, 3 Aug 1 14:12 /dev/null
$ docker exec -u test 0ad6d3064e9d ls
$ ls -l /dev/null
crw-rw-rw- 1 test root 1, 3 Aug 1 14:12 /dev/null
Signed-off-by: Jaroslav Jindrak <dzejrou@gmail.com>
This is needed since future commits will touch this code, and the
lint-extra CI job would then complain:
> libcontainer/factory.go#L245
> var-naming: var fdsJson should be fdsJSON (revive)
and
> libcontainer/init_linux.go#L181
> error-strings: error strings should not be capitalized or end with punctuation or a newline (revive)
and
> notify_socket.go#L94
> receiver-naming: receiver name n should be consistent with previous receiver name s for notifySocket (revive)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Since the next commit is going to touch this structure, our CI
(lint-extra) is about to complain about an improperly named field:
> Warning: var-naming: struct field ContainerId should be ContainerID (revive)
Make it happy.
Brought to you by gopls rename.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In case of a read-only /dev, it's better to move on and let whatever
runs in the container handle any possible errors.
This fixes runc exec for a user with read-only /dev.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Since we already called fstat, we know the current file uid. In case it
is the same as the one we want it to be, there's no point in trying
chown.
Remove the specific /dev/null check, as the above also covers it
(comparing /dev/null uid with itself is true).
This also fixes runc exec with read-only /dev for root user.
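A simplified sketch of the check, using the standard syscall package instead of x/sys/unix; chownIfNeeded and devNullDemo are hypothetical names, not the functions touched by this commit:

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// chownIfNeeded changes the owner of f to uid only when the current
// owner differs. It reports whether a chown was actually attempted.
func chownIfNeeded(f *os.File, uid int) (bool, error) {
	var st syscall.Stat_t
	if err := syscall.Fstat(int(f.Fd()), &st); err != nil {
		return false, &os.PathError{Op: "fstat", Path: f.Name(), Err: err}
	}
	if int(st.Uid) == uid {
		// Already owned by the target uid: nothing to do, so this
		// cannot fail even on a read-only filesystem.
		return false, nil
	}
	return true, f.Chown(uid, int(st.Gid))
}

// devNullDemo asks for /dev/null to be owned by its current owner,
// which must be a no-op.
func devNullDemo() (bool, error) {
	f, err := os.Open(os.DevNull)
	if err != nil {
		return false, err
	}
	defer f.Close()
	var st syscall.Stat_t
	if err := syscall.Fstat(int(f.Fd()), &st); err != nil {
		return false, err
	}
	return chownIfNeeded(f, int(st.Uid))
}

func main() {
	attempted, err := devNullDemo()
	fmt.Println(attempted, err)
}
```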
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Use the (*os.File).Chown method instead of bare unix.Fchown, as it
already has access to the underlying fd and produces nice-looking
errors. This allows us to remove our error wrapping and some linter
annotations.
We still use unix.Fstat, since accessing OS-specific fields like uid/gid
via os.Stat is not very straightforward. The only change here is to use
the file name (rather than the fd) in the error text.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
When I tried to start a rootless container under a different/wrong user,
I got:
$ ../runc/runc --systemd-cgroup --root /tmp/runc.$$ run 445
ERRO[0000] runc run failed: operation not permitted
This is obviously not good enough. With this commit, the error is:
ERRO[0000] runc run failed: fchown fd 9: operation not permitted
Alas, there is still some code that returns unwrapped errnos from
various unix calls.
This is a followup to commit d8ba4128b2 which wrapped many, but not
all, bare unix errors. Do wrap some more, using either os.PathError or
os.SyscallError.
While at it,
- use os.SyscallError instead of os.NewSyscallError;
- use errors.Is(err, os.ErrXxx) instead of os.IsXxx(err).
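To illustrate the pattern (a sketch with hypothetical helper names, using syscall.EPERM as a stand-in for an errno returned by a unix call):

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// wrapSyscall attaches the name of the failed syscall to a bare errno.
func wrapSyscall(name string, errno syscall.Errno) error {
	return &os.SyscallError{Syscall: name, Err: errno}
}

// demoErr returns the kind of error a failed fchown(2) could produce.
func demoErr() error {
	return wrapSyscall("fchown", syscall.EPERM)
}

// isPermission works on wrapped errors, unlike comparing with ==.
func isPermission(err error) bool {
	return errors.Is(err, os.ErrPermission)
}

func main() {
	err := demoErr()
	fmt.Println(err)               // fchown: operation not permitted
	fmt.Println(isPermission(err)) // true
}
```

The wrapped error names the failing call, and errors.Is still sees through the wrapper.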
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The source of the bind mount might not be accessible in a different user
namespace because a component of the source path might not be traversed
under the users and groups mapped inside the user namespace. This caused
errors such as the following:
# time="2020-06-22T13:48:26Z" level=error msg="container_linux.go:367:
starting container process caused: process_linux.go:459:
container init caused: rootfs_linux.go:58:
mounting \"/tmp/busyboxtest/source-inaccessible/dir\"
to rootfs at \"/tmp/inaccessible\" caused:
stat /tmp/busyboxtest/source-inaccessible/dir: permission denied"
To solve this problem, this patch performs the following:
1. in nsexec.c, it opens the source path in the host userns (so we have
the right permissions to open it) but in the container mntns (so the
kernel cross mntns mount check lets us mount it later:
https://github.com/torvalds/linux/blob/v5.8/fs/namespace.c#L2312).
2. in nsexec.c, it passes the file descriptors of the source to the
child process with SCM_RIGHTS.
3. in runc init, in Go, it finishes the mounts while inside the
userns, even without access to some components of the source
paths.
Passing the fds with SCM_RIGHTS is necessary because once the child
process is in the container mntns, it is already in the container userns
so it cannot temporarily join the host mntns.
This patch uses the existing mechanism with _LIBCONTAINER_* environment
variables to pass the file descriptors from runc to runc init.
This patch uses the existing mechanism with the Netlink-style bootstrap
to pass information about the list of source mounts to nsexec.c.
Rootless containers don't use this bind mount sources fdpassing
mechanism because we can't setns() to the target mntns in a rootless
container (we don't have the privileges when we are in the host userns).
This patch takes care of using O_CLOEXEC on mount fds, and closes them
early.
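For readers unfamiliar with SCM_RIGHTS, here is a small self-contained Go sketch of passing one fd over a socketpair. It is not the nsexec.c code from this patch (which is in C); sendFd, recvFd and roundTrip are hypothetical names, and a pipe stands in for a mount source fd.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// sendFd passes fd to the peer of sock as SCM_RIGHTS ancillary data.
// One byte of regular data is sent along to carry the control message.
func sendFd(sock, fd int) error {
	return syscall.Sendmsg(sock, []byte{0}, syscall.UnixRights(fd), nil, 0)
}

// recvFd receives a single fd sent by sendFd. MSG_CMSG_CLOEXEC makes
// the new descriptor close-on-exec from the start.
func recvFd(sock int) (int, error) {
	buf := make([]byte, 1)
	oob := make([]byte, syscall.CmsgSpace(4)) // room for one 32-bit fd
	_, oobn, _, _, err := syscall.Recvmsg(sock, buf, oob, syscall.MSG_CMSG_CLOEXEC)
	if err != nil {
		return -1, fmt.Errorf("recvmsg: %w", err)
	}
	msgs, err := syscall.ParseSocketControlMessage(oob[:oobn])
	if err != nil {
		return -1, err
	}
	if len(msgs) != 1 {
		return -1, errors.New("expected exactly one control message")
	}
	fds, err := syscall.ParseUnixRights(&msgs[0])
	if err != nil {
		return -1, err
	}
	if len(fds) != 1 {
		return -1, errors.New("expected exactly one fd")
	}
	return fds[0], nil
}

// roundTrip sends the write end of a pipe over a socketpair, writes msg
// through the received copy, and returns what shows up on the read end.
func roundTrip(msg string) (string, error) {
	pair, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_STREAM|syscall.SOCK_CLOEXEC, 0)
	if err != nil {
		return "", fmt.Errorf("socketpair: %w", err)
	}
	defer syscall.Close(pair[0])
	defer syscall.Close(pair[1])

	r, w, err := os.Pipe()
	if err != nil {
		return "", err
	}
	defer r.Close()
	defer w.Close()

	if err := sendFd(pair[0], int(w.Fd())); err != nil {
		return "", fmt.Errorf("sendmsg: %w", err)
	}
	fd, err := recvFd(pair[1])
	if err != nil {
		return "", err
	}
	received := os.NewFile(uintptr(fd), "received")
	defer received.Close()

	if _, err := received.WriteString(msg); err != nil {
		return "", err
	}
	buf := make([]byte, len(msg))
	n, err := r.Read(buf)
	return string(buf[:n]), err
}

func main() {
	fmt.Println(roundTrip("ok"))
}
```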
Fixes: #2484.
Signed-off-by: Alban Crequy <alban@kinvolk.io>
Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>
This commit implements support for the SCMP_ACT_NOTIFY action. It
requires libseccomp-2.5.0 to work but runc still works with older
libseccomp if the seccomp policy does not use the SCMP_ACT_NOTIFY
action.
A new synchronization step between runc[INIT] and runc run is introduced
to pass the seccomp fd. runc run fetches the seccomp fd with
pidfd_getfd(2) from the runc[INIT] process and sends it to the seccomp
agent using SCM_RIGHTS.
As suggested by @kolyshkin, we also make writeSync() a wrapper of
writeSyncWithFd() and wrap the error there. To avoid pointless errors,
we made some existing code paths just return the error instead of
re-wrapping it. Otherwise, the errors would look like this:
writing syncT <act>: writing syncT: <err>
With the code paths adjusted, they now look like this:
writing syncT <act>: <err>
Signed-off-by: Alban Crequy <alban@kinvolk.io>
Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>
For files that end with _linux.go or _linux_test.go, there is no need to
specify linux build tag, as it is assumed from the file name.
In addition, rename libcontainer/notify_linux_v2.go -> libcontainer/notify_v2_linux.go
for the file name to make sense.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Errors from unix.* are always bare and thus can be used directly.
Add //nolint:errorlint annotation to ignore errors such as these:
libcontainer/system/xattrs_linux.go:18:7: comparing with == will fail on wrapped errors. Use errors.Is to check for a specific error (errorlint)
case errno == unix.ERANGE:
^
libcontainer/container_linux.go:1259:9: comparing with != will fail on wrapped errors. Use errors.Is to check for a specific error (errorlint)
if e != unix.EINVAL {
^
libcontainer/rootfs_linux.go:919:7: comparing with != will fail on wrapped errors. Use errors.Is to check for a specific error (errorlint)
if err != unix.EINVAL && err != unix.EPERM {
^
libcontainer/rootfs_linux.go:1002:4: switch on an error will fail on wrapped errors. Use errors.Is to check for specific errors (errorlint)
switch err {
^
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Do this for all errors except the ones from unix.*.
This fixes a bunch of errorlint warnings, like these:
libcontainer/generic_error.go:25:15: type assertion on error will fail on wrapped errors. Use errors.As to check for specific errors (errorlint)
if le, ok := err.(Error); ok {
^
libcontainer/factory_linux_test.go:145:14: type assertion on error will fail on wrapped errors. Use errors.As to check for specific errors (errorlint)
lerr, ok := err.(Error)
^
libcontainer/state_linux_test.go:28:11: type assertion on error will fail on wrapped errors. Use errors.As to check for specific errors (errorlint)
_, ok := err.(*stateTransitionError)
^
libcontainer/seccomp/patchbpf/enosys_linux.go:88:4: switch on an error will fail on wrapped errors. Use errors.Is to check for specific errors (errorlint)
switch err {
^
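A small sketch of why the linter complains; stateError is a hypothetical type made up for this example, not the one from libcontainer:

```go
package main

import (
	"errors"
	"fmt"
)

// stateError is a hypothetical custom error type used only here.
type stateError struct{ from, to string }

func (e *stateError) Error() string {
	return "invalid transition " + e.from + " -> " + e.to
}

// byAssertion uses a plain type assertion, which only inspects the
// outermost error.
func byAssertion(err error) bool {
	_, ok := err.(*stateError)
	return ok
}

// byErrorsAs walks the whole chain of wrapped errors.
func byErrorsAs(err error) bool {
	var se *stateError
	return errors.As(err, &se)
}

// wrapped returns a *stateError with extra context added via %w.
func wrapped() error {
	return fmt.Errorf("starting container: %w", &stateError{"stopped", "paused"})
}

func main() {
	err := wrapped()
	fmt.Println(byAssertion(err), byErrorsAs(err)) // false true
}
```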
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This should result in no change when the error is printed, but make the
errors returned unwrappable, meaning errors.As and errors.Is will work.
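As a quick illustration (a sketch, with hypothetical helper names and fs.ErrNotExist as an arbitrary example error), %v and %w produce the same text, but only %w keeps the chain:

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
)

// wrapV formats the cause into the message and discards it.
func wrapV(err error) error { return fmt.Errorf("open config: %v", err) }

// wrapW formats the cause the same way but keeps it unwrappable.
func wrapW(err error) error { return fmt.Errorf("open config: %w", err) }

// compare reports whether both messages are identical, and whether
// errors.Is finds the cause in the %v and %w variants respectively.
func compare() (sameText, vMatches, wMatches bool) {
	v, w := wrapV(fs.ErrNotExist), wrapW(fs.ErrNotExist)
	return v.Error() == w.Error(),
		errors.Is(v, fs.ErrNotExist),
		errors.Is(w, fs.ErrNotExist)
}

func main() {
	fmt.Println(compare()) // true false true
}
```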
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In case of rootless, cgroup2 mount is not possible (see [1] for more
details), so since commit 9c81440fb5 runc bind-mounts the whole
/sys/fs/cgroup into container.
Problem is, if cgroupns is enabled, /sys/fs/cgroup inside the container
is supposed to show the cgroup files for this cgroup, not the root one.
The fix is to pass through and use the cgroup path in case cgroup2
mount failed, cgroupns is enabled, and the path is non-empty.
Surely this requires the /sys/fs/cgroup mount in the spec, so modify
runc spec --rootless to keep it.
Before:
$ ./runc run aaa
# find /sys/fs/cgroup/ -type d
/sys/fs/cgroup
/sys/fs/cgroup/user.slice
/sys/fs/cgroup/user.slice/user-1000.slice
/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service
...
# ls -l /sys/fs/cgroup/cgroup.controllers
-r--r--r-- 1 nobody nogroup 0 Feb 24 02:22 /sys/fs/cgroup/cgroup.controllers
# wc -w /sys/fs/cgroup/cgroup.procs
142 /sys/fs/cgroup/cgroup.procs
# cat /sys/fs/cgroup/memory.current
cat: can't open '/sys/fs/cgroup/memory.current': No such file or directory
After:
# find /sys/fs/cgroup/ -type d
/sys/fs/cgroup/
# ls -l /sys/fs/cgroup/cgroup.controllers
-r--r--r-- 1 root root 0 Feb 24 02:43 /sys/fs/cgroup/cgroup.controllers
# wc -w /sys/fs/cgroup/cgroup.procs
2 /sys/fs/cgroup/cgroup.procs
# cat /sys/fs/cgroup/memory.current
577536
[1] https://github.com/opencontainers/runc/issues/2158
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In order to make 'runc --debug' actually useful for debugging nsexec
bugs, provide information about all the internal operations when in
debug mode.
[@kolyshkin: rebasing; fix formatting via indent for make validate to pass]
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Alas, the EPERM on chdir saga continues...
Unfortunately, there were two releases between when 5e0e67d76c was released
and when the workaround https://github.com/opencontainers/runc/pull/2712 was added.
In the meantime, folks started relying on the ability to have a workdir that the container user doesn't have access to.
Since this case was previously valid, we should continue to support it.
Now, we try the chdir twice:
- once at the top of the function (to catch cases where the runc user has access, but the container user does not);
- once after we set up the user (to catch cases where the container user has access, and the runc user does not).
Add a test case for this as well.
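The control flow can be sketched roughly as follows. This is a simplified model, not the actual function: finalize and simulate are hypothetical names, and chdir/setupUser are injected so the example runs without privileges.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
)

// finalize tries chdir before and, if needed, after switching users.
func finalize(chdir, setupUser func() error) error {
	// First attempt, as the user running runc. Only a permission
	// error is tolerated here; anything else is fatal.
	firstErr := chdir()
	if firstErr != nil && !errors.Is(firstErr, fs.ErrPermission) {
		return fmt.Errorf("chdir: %w", firstErr)
	}
	if err := setupUser(); err != nil {
		return fmt.Errorf("setup user: %w", err)
	}
	if firstErr != nil {
		// Second attempt, now as the container user.
		if err := chdir(); err != nil {
			return fmt.Errorf("chdir: %w", err)
		}
	}
	return nil
}

// simulate models a workdir that only the container user can enter
// and returns how many chdir attempts were made.
func simulate() (int, error) {
	attempts, isContainerUser := 0, false
	chdir := func() error {
		attempts++
		if !isContainerUser {
			return fs.ErrPermission
		}
		return nil
	}
	setupUser := func() error {
		isContainerUser = true
		return nil
	}
	err := finalize(chdir, setupUser)
	return attempts, err
}

func main() {
	fmt.Println(simulate()) // 2 <nil>
}
```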
Signed-off-by: Peter Hunt <pehunt@redhat.com>
Sometimes debug.bats test cases are failing like this:
> not ok 27 global --debug to --log --log-format 'json'
> # (in test file tests/integration/debug.bats, line 77)
> # `[[ "${output}" == *"child process in init()"* ]]' failed
It happens more when writing to disk.
This issue is caused by the fact that runc spawns log forwarding goroutine
(ForwardLogs) but does not wait for it to finish, resulting in missing
debug lines from nsexec.
ForwardLogs itself, though, never finishes, because it reads from the
read end of a pipe whose write end is never closed. This is
especially true in case of runc create, which spawns runc init and
exits; meanwhile runc init waits on the exec fifo for an arbitrarily
long time before doing execve.
So, to fix the failure described above, we need to:
1. Make runc create/run/exec wait for ForwardLogs to finish;
2. Make runc init close its log pipe file descriptor (i.e.
the one whose value is passed in the _LIBCONTAINER_LOGPIPE
environment variable).
This is exactly what this commit does:
1. Amend ForwardLogs to return a channel, and wait for it in start().
2. In runc init, save the log fd and close it as late as possible.
PS I have to admit I still do not understand why an explicit close of
log pipe fd is required in e.g. (*linuxSetnsInit).Init, right before
the execve which (thanks to CLOEXEC) closes the fd anyway.
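A simplified sketch of the mechanism (not the actual ForwardLogs code; forwardLogs and demo are hypothetical, and an in-process pipe stands in for the log pipe shared with runc init):

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"strings"
)

// forwardLogs copies lines from r to out in a goroutine and returns a
// channel that yields the final error once r reaches EOF.
func forwardLogs(r io.Reader, out io.Writer) <-chan error {
	done := make(chan error, 1)
	go func() {
		s := bufio.NewScanner(r)
		for s.Scan() {
			fmt.Fprintln(out, s.Text())
		}
		done <- s.Err()
		close(done)
	}()
	return done
}

// demo writes one log line, closes the write end, and waits for the
// forwarder before looking at the collected output.
func demo() (string, error) {
	r, w, err := os.Pipe()
	if err != nil {
		return "", err
	}
	defer r.Close()

	var collected strings.Builder
	done := forwardLogs(r, &collected)

	fmt.Fprintln(w, "child process in init()")
	// Closing the write end is what lets the reader see EOF; without
	// it, waiting on done would block forever.
	w.Close()

	err = <-done
	return collected.String(), err
}

func main() {
	fmt.Print(demo())
}
```

Waiting on the channel guarantees that no log lines are lost, and closing the write end guarantees that the wait terminates.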
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This reverts most of commit 24c05b7, as otherwise it causes
a few regressions (docker cli, TestDockerSwarmSuite/TestServiceLogsTTY).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The TestExecInTTY test case is sometimes failing like this:
> execin_test.go:332: unexpected carriage-return in output "PID USER TIME COMMAND\r\n 1 root 0:00 cat\r\n 7 root 0:00 ps\r\n"
or this:
> execin_test.go:332: unexpected carriage-return in output "PID USER TIME COMMAND\r\n 1 root 0:00 cat\n 7 root 0:00 ps\n"
(this is easy to repro with `go test -run TestExecInTTY -count 1000`).
This is caused by a race between
- an Init() (in this case it is (*linuxSetnsInit).Init(), but
(*linuxStandardInit).Init() is no different in this regard),
which creates a pty pair, sends pty master to runc, and execs
the container process,
and
- a parent runc process, which receives the pty master fd and calls
ClearONLCR() on it.
One way of fixing it would be to add a synchronization mechanism
between these two, so Init() won't exec the process until the parent
sets the flag. This seems excessive, though, as we can just move
the ClearONLCR() call to Init(), putting it right after console.NewPty().
Note that the bug only happens in the TestExecInTTY test case, but
from looking at the code it seems like it can happen in runc run
or runc exec, too.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
I noticed this was the only place in this function where we didn't
handle errors on freezing/thawing. Logging as a warning, consistent
with the other cases.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Commit 5e0e67d76c moved the chdir to be one of the
first steps of finalizing the namespace of the container.
However, this causes issues when the cwd is not accessible by the user running runc, but is
accessible by the container user.
Thus, setupUser has to happen before we call chdir. setupUser still happens before setting the caps,
so the user should be privileged enough to mitigate the issues fixed in 5e0e67d76c.
Signed-off-by: Peter Hunt <pehunt@redhat.com>
In many places (not all of them though) we can use `unix.`
instead of `syscall.` as these are identical.
In particular, x/sys/unix defines:
```go
type Signal = syscall.Signal
type Errno = syscall.Errno
type SysProcAttr = syscall.SysProcAttr
const ENODEV = syscall.Errno(0x13)
```
and unix.Exec() calls syscall.Exec().
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
When exec runs as root and config.Cwd is not owned by root, exec will
fail because by then root no longer has the caps needed to enter it.
So, Chdir should be done before setting the caps.
Signed-off-by: Kurnia D Win <kurnia.d.win@gmail.com>
This is a regression from 06f789cf26
when the user namespace was configured without a privileged helper.
To allow a single mapping in a user namespace, it is necessary to set
/proc/self/setgroups to "deny".
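For comparison, this is how the same requirement looks when a Go program creates such a namespace through the standard library (a sketch, not the runc code path; singleMappingAttr and deniesSetgroups are hypothetical names). With GidMappingsEnableSetgroups left false, the Go runtime writes "deny" to the child's setgroups file before writing gid_map.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// singleMappingAttr maps uid/gid to root inside a new user namespace,
// one id each, without needing a privileged helper.
func singleMappingAttr(uid, gid int) *syscall.SysProcAttr {
	return &syscall.SysProcAttr{
		Cloneflags:  syscall.CLONE_NEWUSER,
		UidMappings: []syscall.SysProcIDMap{{ContainerID: 0, HostID: uid, Size: 1}},
		GidMappings: []syscall.SysProcIDMap{{ContainerID: 0, HostID: gid, Size: 1}},
		// false means "deny" is written to setgroups, which the
		// kernel requires for an unprivileged gid mapping.
		GidMappingsEnableSetgroups: false,
	}
}

// deniesSetgroups checks the attribute without spawning anything.
func deniesSetgroups() bool {
	attr := singleMappingAttr(1000, 1000)
	return !attr.GidMappingsEnableSetgroups && len(attr.GidMappings) == 1
}

func main() {
	cmd := exec.Command("id")
	cmd.SysProcAttr = singleMappingAttr(os.Getuid(), os.Getgid())
	cmd.Stdout = os.Stdout
	if err := cmd.Run(); err != nil {
		// Unprivileged user namespaces may be disabled on this host.
		fmt.Println("could not run in a new userns:", err)
	}
}
```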
For a simple reproducer, the user namespace can be created with
"unshare -r".
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
This PR decomposes `libcontainer/configs.Config.Rootless bool` into `RootlessEUID bool` and
`RootlessCgroups bool`, so as to make "runc-in-userns" more compatible with "rootful" runc.
`RootlessEUID` denotes that runc is being executed as a non-root user (euid != 0) in
the current user namespace. `RootlessEUID` is almost identical to the former `Rootless`
except for cgroups handling.
`RootlessCgroups` denotes that runc is unlikely to have full access to cgroups.
`RootlessCgroups` is set to false if runc is executed as the root (euid == 0) in the initial namespace.
Otherwise `RootlessCgroups` is set to true.
(Hint: if `RootlessEUID` is true, `RootlessCgroups` becomes true as well)
When runc is executed as the root (euid == 0) in a user namespace (e.g. by Docker-in-LXD, Podman, Usernetes),
`RootlessEUID` is set to false but `RootlessCgroups` is set to true.
So, "runc-in-userns" behaves almost the same as "rootful" runc, except that cgroups errors are ignored.
This PR does not have any impact on CLI flags and `state.json`.
Note about CLI:
* Now `runc --rootless=(auto|true|false)` CLI flag is only used for setting `RootlessCgroups`.
* Now `runc spec --rootless` is only required when `RootlessEUID` is set to true.
For runc-in-userns, `runc spec` without `--rootless` should work, when sufficient numbers of
UID/GID are mapped.
Note about `$XDG_RUNTIME_DIR` (e.g. `/run/user/1000`):
* `$XDG_RUNTIME_DIR` is ignored if runc is being executed as the root (euid == 0) in the initial namespace, for backward compatibility.
(`/run/runc` is used)
* If runc is executed as the root (euid == 0) in a user namespace, `$XDG_RUNTIME_DIR` is honored if `$USER != "" && $USER != "root"`.
This allows unprivileged users to execute runc as the root in userns, without mounting writable `/run/runc`.
Note about `state.json`:
* `rootless` is set to true when `RootlessEUID == true && RootlessCgroups == true`.
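The rules above can be summarized in a short sketch. It only models the default (`--rootless=auto`) behaviour described in this message; the type and function names are hypothetical, and detecting whether runc runs in a user namespace is left to the caller.

```go
package main

import "fmt"

// rootlessFlags mirrors the two new config fields.
type rootlessFlags struct {
	RootlessEUID    bool
	RootlessCgroups bool
}

// decide derives both flags from the effective uid and from whether
// runc runs inside a (non-initial) user namespace.
func decide(euid int, inUserNS bool) rootlessFlags {
	return rootlessFlags{
		// Non-root in the current user namespace.
		RootlessEUID: euid != 0,
		// False only for real root in the initial namespace.
		RootlessCgroups: euid != 0 || inUserNS,
	}
}

// stateRootless mirrors the rule for `rootless` in state.json.
func stateRootless(f rootlessFlags) bool {
	return f.RootlessEUID && f.RootlessCgroups
}

func main() {
	fmt.Println(decide(0, false))   // rootful runc
	fmt.Println(decide(0, true))    // runc-in-userns
	fmt.Println(decide(1000, true)) // non-root user
}
```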
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
When a subreaper is enabled, it might expect to reap a process and
retrieve its exit code. For that reason, this patch lets a consumer of
libcontainer declare that it uses a subreaper. Relying on this
information, libcontainer will not wait for signalled processes in case
a subreaper has been set.
Fixes #1677
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>