This commit implements support for the SCMP_ACT_NOTIFY action. It
requires libseccomp-2.5.0 to work but runc still works with older
libseccomp if the seccomp policy does not use the SCMP_ACT_NOTIFY
action.
A new synchronization step between runc[INIT] and runc run is introduced
to pass the seccomp fd. runc run fetches the seccomp fd with pidfd_get
from the runc[INIT] process and sends it to the seccomp agent using
SCM_RIGHTS.
As suggested by @kolyshkin, we also make writeSync() a wrapper of
writeSyncWithFd() and wrap the error there. To avoid pointless errors,
we made some existing code paths just return the error instead of
re-wrapping it. If we don't do it, error will look like:
writing syncT <act>: writing syncT: <err>
By adjusting the code path, now they just look like this
writing syncT <act>: <err>
Signed-off-by: Alban Crequy <alban@kinvolk.io>
Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>
For files that end with _linux.go or _linux_test.go, there is no need to
specify linux build tag, as it is assumed from the file name.
In addition, rename libcontainer/notify_linux_v2.go -> libcontainer/notify_v2_linux.go
for the file name to make sense.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Errors from unix.* are always bare and thus can be used directly.
Add //nolint:errorlint annotation to ignore errors such as these:
libcontainer/system/xattrs_linux.go:18:7: comparing with == will fail on wrapped errors. Use errors.Is to check for a specific error (errorlint)
case errno == unix.ERANGE:
^
libcontainer/container_linux.go:1259:9: comparing with != will fail on wrapped errors. Use errors.Is to check for a specific error (errorlint)
if e != unix.EINVAL {
^
libcontainer/rootfs_linux.go:919:7: comparing with != will fail on wrapped errors. Use errors.Is to check for a specific error (errorlint)
if err != unix.EINVAL && err != unix.EPERM {
^
libcontainer/rootfs_linux.go:1002:4: switch on an error will fail on wrapped errors. Use errors.Is to check for specific errors (errorlint)
switch err {
^
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Do this for all errors except one from unix.*.
This fixes a bunch of errorlint warnings, like these
libcontainer/generic_error.go:25:15: type assertion on error will fail on wrapped errors. Use errors.As to check for specific errors (errorlint)
if le, ok := err.(Error); ok {
^
libcontainer/factory_linux_test.go:145:14: type assertion on error will fail on wrapped errors. Use errors.As to check for specific errors (errorlint)
lerr, ok := err.(Error)
^
libcontainer/state_linux_test.go:28:11: type assertion on error will fail on wrapped errors. Use errors.As to check for specific errors (errorlint)
_, ok := err.(*stateTransitionError)
^
libcontainer/seccomp/patchbpf/enosys_linux.go:88:4: switch on an error will fail on wrapped errors. Use errors.Is to check for specific errors (errorlint)
switch err {
^
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This should result in no change when the error is printed, but make the
errors returned unwrappable, meaning errors.As and errors.Is will work.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In case of rootless, cgroup2 mount is not possible (see [1] for more
details), so since commit 9c81440fb5 runc bind-mounts the whole
/sys/fs/cgroup into container.
Problem is, if cgroupns is enabled, /sys/fs/cgroup inside the container
is supposed to show the cgroup files for this cgroup, not the root one.
The fix is to pass through and use the cgroup path in case cgroup2
mount failed, cgroupns is enabled, and the path is non-empty.
Surely this requires the /sys/fs/cgroup mount in the spec, so modify
runc spec --rootless to keep it.
Before:
$ ./runc run aaa
# find /sys/fs/cgroup/ -type d
/sys/fs/cgroup
/sys/fs/cgroup/user.slice
/sys/fs/cgroup/user.slice/user-1000.slice
/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service
...
# ls -l /sys/fs/cgroup/cgroup.controllers
-r--r--r-- 1 nobody nogroup 0 Feb 24 02:22 /sys/fs/cgroup/cgroup.controllers
# wc -w /sys/fs/cgroup/cgroup.procs
142 /sys/fs/cgroup/cgroup.procs
# cat /sys/fs/cgroup/memory.current
cat: can't open '/sys/fs/cgroup/memory.current': No such file or directory
After:
# find /sys/fs/cgroup/ -type d
/sys/fs/cgroup/
# ls -l /sys/fs/cgroup/cgroup.controllers
-r--r--r-- 1 root root 0 Feb 24 02:43 /sys/fs/cgroup/cgroup.controllers
# wc -w /sys/fs/cgroup/cgroup.procs
2 /sys/fs/cgroup/cgroup.procs
# cat /sys/fs/cgroup/memory.current
577536
[1] https://github.com/opencontainers/runc/issues/2158
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In order to make 'runc --debug' actually useful for debugging nsexec
bugs, provide information about all the internal operations when in
debug mode.
[@kolyshkin: rebasing; fix formatting via indent for make validate to pass]
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Alas, the EPERM on chdir saga continues...
Unfortunately, the there were two releases between when 5e0e67d76c was released
and when the workaround https://github.com/opencontainers/runc/pull/2712 was added.
Between this, folks started relying on the ability to have a workdir that the container user doesn't have access to.
Since this case was previously valid, we should continue support for it.
Now, we retry the chdir:
Once at the top of the function (to catch cases where the runc user has access, but container user does not)
and once after we setup user (to catch cases where the container user has access, and the runc user does not)
Add a test case for this as well.
Signed-off-by: Peter Hunt <pehunt@redhat.com>
Sometimes debug.bats test cases are failing like this:
> not ok 27 global --debug to --log --log-format 'json'
> # (in test file tests/integration/debug.bats, line 77)
> # `[[ "${output}" == *"child process in init()"* ]]' failed
It happens more when writing to disk.
This issue is caused by the fact that runc spawns log forwarding goroutine
(ForwardLogs) but does not wait for it to finish, resulting in missing
debug lines from nsexec.
ForwardLogs itself, though, never finishes, because it reads from a
reading side of a pipe which writing side is not closed. This is
especially true in case of runc create, which spawns runc init and
exits; meanwhile runc init waits on exec fifo for arbitrarily long
time before doing execve.
So, to fix the failure described above, we need to:
1. Make runc create/run/exec wait for ForwardLogs to finish;
2. Make runc init close its log pipe file descriptor (i.e.
the one which value is passed in _LIBCONTAINER_LOGPIPE
environment variable).
This is exactly what this commit does:
1. Amend ForwardLogs to return a channel, and wait for it in start().
2. In runc init, save the log fd and close it as late as possible.
PS I have to admit I still do not understand why an explicit close of
log pipe fd is required in e.g. (*linuxSetnsInit).Init, right before
the execve which (thanks to CLOEXEC) closes the fd anyway.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This reverts most of commit 24c05b7, as otherwise it causes
a few regressions (docker cli, TestDockerSwarmSuite/TestServiceLogsTTY).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The TestExecInTTY test case is sometimes failing like this:
> execin_test.go:332: unexpected carriage-return in output "PID USER TIME COMMAND\r\n 1 root 0:00 cat\r\n 7 root 0:00 ps\r\n"
or this:
> execin_test.go:332: unexpected carriage-return in output "PID USER TIME COMMAND\r\n 1 root 0:00 cat\n 7 root 0:00 ps\n"
(this is easy to repro with `go test -run TestExecInTTY -count 1000`).
This is caused by a race between
- an Init() (in this case it is is (*linuxSetnsInit.Init(), but
(*linuxStandardInit).Init() is no different in this regard),
which creates a pty pair, sends pty master to runc, and execs
the container process,
and
- a parent runc process, which receives the pty master fd and calls
ClearONLCR() on it.
One way of fixing it would be to add a synchronization mechanism
between these two, so Init() won't exec the process until the parent
sets the flag. This seems excessive, though, as we can just move
the ClearONLCR() call to Init(), putting it right after console.NewPty().
Note that bug only happens in the TestExecInTTY test case, but
from looking at the code it seems like it can happen in runc run
or runc exec, too.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
I noticed this was the only place in this function where we didn't
handle errors on freezing/thawing. Logging as a warning, consistent
with the other cases.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
commit 5e0e67d76c moved the chdir to be one of the
first steps of finalizing the namespace of the container.
However, this causes issues when the cwd is not accessible by the user running runc, but rather
as the container user.
Thus, setupUser has to happen before we call chdir. setupUser still happens before setting the caps,
so the user should be privileged enough to mitigate the issues fixed in 5e0e67d76c
Signed-off-by: Peter Hunt <pehunt@redhat.com>
In many places (not all of them though) we can use `unix.`
instead of `syscall.` as these are indentical.
In particular, x/sys/unix defines:
```go
type Signal = syscall.Signal
type Errno = syscall.Errno
type SysProcAttr = syscall.SysProcAttr
const ENODEV = syscall.Errno(0x13)
```
and unix.Exec() calls syscall.Exec().
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
when exec as root and config.Cwd is not owned by root, exec will fail
because root doesn't have the caps.
So, Chdir should be done before setting the caps.
Signed-off-by: Kurnia D Win <kurnia.d.win@gmail.com>
This is a regression from 06f789cf26
when the user namespace was configured without a privileged helper.
To allow a single mapping in an user namespace, it is necessary to set
/proc/self/setgroups to "deny".
For a simple reproducer, the user namespace can be created with
"unshare -r".
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
This PR decomposes `libcontainer/configs.Config.Rootless bool` into `RootlessEUID bool` and
`RootlessCgroups bool`, so as to make "runc-in-userns" to be more compatible with "rootful" runc.
`RootlessEUID` denotes that runc is being executed as a non-root user (euid != 0) in
the current user namespace. `RootlessEUID` is almost identical to the former `Rootless`
except cgroups stuff.
`RootlessCgroups` denotes that runc is unlikely to have the full access to cgroups.
`RootlessCgroups` is set to false if runc is executed as the root (euid == 0) in the initial namespace.
Otherwise `RootlessCgroups` is set to true.
(Hint: if `RootlessEUID` is true, `RootlessCgroups` becomes true as well)
When runc is executed as the root (euid == 0) in an user namespace (e.g. by Docker-in-LXD, Podman, Usernetes),
`RootlessEUID` is set to false but `RootlessCgroups` is set to true.
So, "runc-in-userns" behaves almost same as "rootful" runc except that cgroups errors are ignored.
This PR does not have any impact on CLI flags and `state.json`.
Note about CLI:
* Now `runc --rootless=(auto|true|false)` CLI flag is only used for setting `RootlessCgroups`.
* Now `runc spec --rootless` is only required when `RootlessEUID` is set to true.
For runc-in-userns, `runc spec` without `--rootless` should work, when sufficient numbers of
UID/GID are mapped.
Note about `$XDG_RUNTIME_DIR` (e.g. `/run/user/1000`):
* `$XDG_RUNTIME_DIR` is ignored if runc is being executed as the root (euid == 0) in the initial namespace, for backward compatibility.
(`/run/runc` is used)
* If runc is executed as the root (euid == 0) in an user namespace, `$XDG_RUNTIME_DIR` is honored if `$USER != "" && $USER != "root"`.
This allows unprivileged users to allow execute runc as the root in userns, without mounting writable `/run/runc`.
Note about `state.json`:
* `rootless` is set to true when `RootlessEUID == true && RootlessCgroups == true`.
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
When a subreaper is enabled, it might expect to reap a process and
retrieve its exit code. That's the reason why this patch is giving
the possibility to define the usage of a subreaper as a consumer of
libcontainer. Relying on this information, libcontainer will not
wait for signalled processes in case a subreaper has been set.
Fixes#1677
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Previously we would handle the "unmapped stdio" case by just doing a
simple check, however this didn't handle cases where the overflow_uid
was actually mapped in the user namespace. Instead of doing some
userspace checks, just try to do the fchown(2) and ignore EINVAL
(unmapped) or EPERM (lacking privilege over inode) errors.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Now that rootless containers have support for multiple uid and gid
mappings, allow --user to work as expected. If the user is not mapped,
an error occurs (as usual).
Signed-off-by: Aleksa Sarai <asarai@suse.de>
While we have significant protections in place against CVE-2016-9962, we
still were holding onto a file descriptor that referenced the host
filesystem. This meant that in certain scenarios it was still possible
for a semi-privileged container to gain access to the host filesystem
(if they had CAP_SYS_PTRACE).
Instead, open the FIFO itself using a O_PATH. This allows us to
reference the FIFO directly without providing the ability for
directory-level access. When opening the FIFO inside the init process,
open it through procfs to re-open the actual FIFO (this is currently the
only supported way to open such a file descriptor).
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This allows the libcontainer to automatically clean up runc:[1:CHILD]
processes created as part of nsenter.
Signed-off-by: Alex Fang <littlelightlittlefire@gmail.com>
This moves all console code to use github.com/containerd/console library to
handle console I/O. Also move to use EpollConsole by default when user requests
a terminal so we can still cope when the other side temporarily goes away.
Signed-off-by: Daniel Dao <dqminh89@gmail.com>
Updated logrus to use v1 which includes a breaking name change Sirupsen -> sirupsen.
This includes a manual edit of the docker term package to also correct the name there too.
Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>
Since syscall is outdated and broken for some architectures,
use x/sys/unix instead.
There are still some dependencies on the syscall package that will
remain in syscall for the forseeable future:
Errno
Signal
SysProcAttr
Additionally:
- os still uses syscall, so it needs to be kept for anything
returning *os.ProcessState, such as process.Wait.
Signed-off-by: Christy Perez <christy@linux.vnet.ibm.com>
If the stdio of the container is owned by a group which is not mapped in
the user namespace, attempting to fchown the file descriptor will result
in EINVAL. Counteract this by simply not doing an fchown if the group
owner of the file descriptor has no host mapping according to the
configured GIDMappings.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This enables the support for the rootless container mode. There are many
restrictions on what rootless containers can do, so many different runC
commands have been disabled:
* runc checkpoint
* runc events
* runc pause
* runc ps
* runc restore
* runc resume
* runc update
The following commands work:
* runc create
* runc delete
* runc exec
* runc kill
* runc list
* runc run
* runc spec
* runc state
In addition, any specification options that imply joining cgroups have
also been disabled. This is due to support for unprivileged subtree
management not being available from Linux upstream.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Effectively, !dumpable makes implementing rootless containers quite
hard, due to a bunch of different operations on /proc/self no longer
being possible without reordering everything.
!dumpable only really makes sense when you are switching between
different security contexts, which is only the case when we are joining
namespaces. Unfortunately this means that !dumpable will still have
issues in this instance, and it should only be necessary to set
!dumpable if we are not joining USER namespaces (new kernels have
protections that make !dumpable no longer necessary). But that's a topic
for another time.
This also includes code to unset and then re-set dumpable when doing the
USER namespace mappings. This should also be safe because in principle
processes in a container can't see us until after we fork into the PID
namespace (which happens after the user mapping).
In rootless containers, it is not possible to set a non-dumpable
process's /proc/self/oom_score_adj (it's owned by root and thus not
writeable). Thus, it needs to be set inside nsexec before we set
ourselves as non-dumpable.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
When process config doesnt specify capabilities anywhere, we should not panic
because setting capabilities are optional.
Signed-off-by: Daniel Dao <dqminh89@gmail.com>
In container process's Init function, we use
fd + execFifoFilename to open exec fifo, so this
field in init config is never used.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
If we pass a file descriptor to the host filesystem while joining a
container, there is a race condition where a process inside the
container can ptrace(2) the joining process and stop it from closing its
file descriptor to the stateDirFd. Then the process can access the
*host* filesystem from that file descriptor. This was fixed in part by
5d93fed3d2 ("Set init processes as non-dumpable"), but that fix is
more of a hail-mary than an actual fix for the underlying issue.
To fix this, don't open or pass the stateDirFd to the init process
unless we're creating a new container. A proper fix for this would be to
remove the need for even passing around directory file descriptors
(which are quite dangerous in the context of mount namespaces).
There is still an issue with containers that have CAP_SYS_PTRACE and are
using the setns(2)-style of joining a container namespace. Currently I'm
not really sure how to fix it without rampant layer violation.
Fixes: CVE-2016-9962
Fixes: 5d93fed3d2 ("Set init processes as non-dumpable")
Signed-off-by: Aleksa Sarai <asarai@suse.de>