This commit implements support for the SCMP_ACT_NOTIFY action. It
requires libseccomp-2.5.0 to work but runc still works with older
libseccomp if the seccomp policy does not use the SCMP_ACT_NOTIFY
action.
A new synchronization step between runc[INIT] and runc run is introduced
to pass the seccomp fd. runc run fetches the seccomp fd with pidfd_get
from the runc[INIT] process and sends it to the seccomp agent using
SCM_RIGHTS.
As suggested by @kolyshkin, we also make writeSync() a wrapper of
writeSyncWithFd() and wrap the error there. To avoid pointless errors,
we made some existing code paths just return the error instead of
re-wrapping it. If we don't do it, error will look like:
writing syncT <act>: writing syncT: <err>
By adjusting the code path, now they just look like this
writing syncT <act>: <err>
Signed-off-by: Alban Crequy <alban@kinvolk.io>
Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>
For files that end with _linux.go or _linux_test.go, there is no need to
specify linux build tag, as it is assumed from the file name.
In addition, rename libcontainer/notify_linux_v2.go -> libcontainer/notify_v2_linux.go
for the file name to make sense.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The two exceptions I had to add to codespellrc are:
- CLOS (used by intelrtd);
- creat (syscall name used in tests/integration/testdata/seccomp_*.json).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
When running a script from an azure file share interrupted syscall
occurs quite frequently, to remedy this add retries around execve
syscall, when EINTR is returned.
Signed-off-by: Maksim An <maksiman@microsoft.com>
This removes libcontainer's own error wrapping system, consisting of a
few types and functions, aimed at typization, wrapping and unwrapping
of errors, as well as saving error stack traces.
Since Go 1.13 now provides its own error wrapping mechanism and a few
related functions, it makes sense to switch to it.
While doing that, improve some error messages so that they start
with "error", "unable to", or "can't".
A few things that are worth mentioning:
1. We lose stack traces (which were never shown anyway).
2. Users of libcontainer that relied on particular errors (like
ContainerNotExists) need to switch to using errors.Is with
the new errors defined in error.go.
3. encoding/json is unable to unmarshal the built-in error type,
so we have to introduce initError and wrap the errors into it
(basically passing the error as a string). This is the same
as it was before, just a tad simpler (actually the initError
is a type that got removed in commit afa844311; also suddenly
ierr variable name makes sense now).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Use fmt.Errorf with %w instead.
Convert the users to the new wrapping.
This fixes an errorlint warning.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
First, add runc --debug exec test cases, very similar to those in
debug.bats but for runc exec (rather than runc run). Do not include json
tests as it is already tested in debug.bats.
Second, add logrus debug to late stages of runc init, and amend the
integration tests to check for those messages. This serves two purposes:
- demonstrate that runc init can be amended with debug logrus which is
properly forwarded to and logged by the parent runc create/run/exec;
- improve the chances to catch the race fixed by the previous commit.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Sometimes debug.bats test cases are failing like this:
> not ok 27 global --debug to --log --log-format 'json'
> # (in test file tests/integration/debug.bats, line 77)
> # `[[ "${output}" == *"child process in init()"* ]]' failed
It happens more when writing to disk.
This issue is caused by the fact that runc spawns log forwarding goroutine
(ForwardLogs) but does not wait for it to finish, resulting in missing
debug lines from nsexec.
ForwardLogs itself, though, never finishes, because it reads from a
reading side of a pipe which writing side is not closed. This is
especially true in case of runc create, which spawns runc init and
exits; meanwhile runc init waits on exec fifo for arbitrarily long
time before doing execve.
So, to fix the failure described above, we need to:
1. Make runc create/run/exec wait for ForwardLogs to finish;
2. Make runc init close its log pipe file descriptor (i.e.
the one which value is passed in _LIBCONTAINER_LOGPIPE
environment variable).
This is exactly what this commit does:
1. Amend ForwardLogs to return a channel, and wait for it in start().
2. In runc init, save the log fd and close it as late as possible.
PS I have to admit I still do not understand why an explicit close of
log pipe fd is required in e.g. (*linuxSetnsInit).Init, right before
the execve which (thanks to CLOEXEC) closes the fd anyway.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In many places (not all of them though) we can use `unix.`
instead of `syscall.` as these are indentical.
In particular, x/sys/unix defines:
```go
type Signal = syscall.Signal
type Errno = syscall.Errno
type SysProcAttr = syscall.SysProcAttr
const ENODEV = syscall.Errno(0x13)
```
and unix.Exec() calls syscall.Exec().
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Work is ongoing in the kernel to support different kernel
keyrings per user namespace. We want to allow SELinux to manage
kernel keyrings inside of the container.
Currently when runc creates the kernel keyring it gets the label which runc is
running with ususally `container_runtime_t`, with this change the kernel keyring
will be labeled with the container process label container_t:s0:C1,c2.
Container running as container_t:s0:c1,c2 can manage keyrings with the same label.
This change required a revendoring or the SELinux go bindings.
github.com/opencontainers/selinux.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
While all modern kernels (and I do mean _all_ of them -- this syscall
was added in 2.6.10 before git had begun development!) have support for
this syscall, LXC has a default seccomp profile that returns ENOSYS for
this syscall. For most syscalls this would be a deal-breaker, and our
use of session keyrings is security-based there are a few mitigating
factors that make this change not-completely-insane:
* We already have a flag that disables the use of session keyrings
(for older kernels that had system-wide keyring limits and so
on). So disabling it is not a new idea.
* While the primary justification of using session keys *is*
security-based, it's more of a security-by-obscurity protection.
The main defense keyrings have is VFS credentials -- which is
something that users already have better security tools for
(setuid(2) and user namespaces).
* Given the security justification you might argue that we
shouldn't silently ignore this. However, the only way for the
kernel to return -ENOSYS is either being ridiculously old (at
which point we wouldn't work anyway) or that there is a seccomp
profile in place blocking it.
Given that the seccomp profile (if malicious) could very easily
just return 0 or a silly return code (or something even more
clever with seccomp-bpf) and trick us without this patch, there
isn't much of a significant change in how much seccomp can trick
us with or without this patch.
Given all of that over-analysis, I'm pretty convinced there isn't a
security problem in this very specific case and it will help out the
ChromeOS folks by allowing Docker to run inside their LXC container
setup. I'd be happy to be proven wrong.
Ref: https://bugs.chromium.org/p/chromium/issues/detail?id=860565
Signed-off-by: Aleksa Sarai <asarai@suse.de>
We need to lock the threads for the SetProcessLabel to work,
should also call SetProcessLabel("") after the container starts
to go back to the default SELinux behaviour.
Once you call SetProcessLabel, then any process executed by runc
will run with this label, even if the process is for setup rather
then the container.
It is always safest to call the SELinux calls just before the exec of the
container, so that other processes do not get started with the incorrect label.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
Currently if a confined container process tries to list these directories
AVC's are generated because they are labeled with external labels. Adding
the mountlabel will remove these AVC's.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
The benefit for doing this within runc is that it works well with
userns.
Actually, runc already does the same thing for mount points.
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
While we have significant protections in place against CVE-2016-9962, we
still were holding onto a file descriptor that referenced the host
filesystem. This meant that in certain scenarios it was still possible
for a semi-privileged container to gain access to the host filesystem
(if they had CAP_SYS_PTRACE).
Instead, open the FIFO itself using a O_PATH. This allows us to
reference the FIFO directly without providing the ability for
directory-level access. When opening the FIFO inside the init process,
open it through procfs to re-open the actual FIFO (this is currently the
only supported way to open such a file descriptor).
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Since syscall is outdated and broken for some architectures,
use x/sys/unix instead.
There are still some dependencies on the syscall package that will
remain in syscall for the forseeable future:
Errno
Signal
SysProcAttr
Additionally:
- os still uses syscall, so it needs to be kept for anything
returning *os.ProcessState, such as process.Wait.
Signed-off-by: Christy Perez <christy@linux.vnet.ibm.com>
This sets the init processes that join and setup the container's
namespaces as non-dumpable before they setns to the container's pid (or
any other ) namespace.
This settings is automatically reset to the default after the Exec in
the container so that it does not change functionality for the
applications that are running inside, just our init processes.
This prevents parent processes, the pid 1 of the container, to ptrace
the init process before it drops caps and other sets LSMs.
This patch also ensures that the stateDirFD being used is still closed
prior to exec, even though it is set as O_CLOEXEC, because of the order
in the kernel.
https://github.com/torvalds/linux/blob/v4.9/fs/exec.c#L1290-L1318
The order during the exec syscall is that the process is set back to
dumpable before O_CLOEXEC are processed.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
A remount of a mount point must include all the current flags or
these will be cleared:
```
The mountflags and data arguments should match the values used in the
original mount() call, except for those parameters that are being
deliberately changed.
```
The current code does not do this; the bug manifests in the specified
flags for `/dev` being lost on remount read only at present. As we
need to specify flags, split the code path for this from remounting
paths which are not mount points, as these can only inherit the
existing flags of the path, and these cannot be changed.
In the bind case, remove extra flags from the bind remount. A bind
mount can only be remounted read only, no other flags can be set,
all other flags are inherited from the parent. From the man page:
```
Since Linux 2.6.26, this flag can also be used to make an existing
bind mount read-only by specifying mountflags as:
MS_REMOUNT | MS_BIND | MS_RDONLY
Note that only the MS_RDONLY setting of the bind mount can be changed
in this manner.
```
MS_REC can only be set on the original bind, so move this. See note
in man page on bind mounts:
```
The remaining bits in the mountflags argument are also ignored, with
the exception of MS_REC.
```
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
This implements {createTTY, detach} and all of the combinations and
negations of the two that were previously implemented. There are some
valid questions about out-of-OCI-scope topics like !createTTY and how
things should be handled (why do we dup the current stdio to the
process, and how is that not a security issue). However, these will be
dealt with in a separate patchset.
In order to allow for late console setup, split setupRootfs into the
"preparation" section where all of the mounts are created and the
"finalize" section where we pivot_root and set things as ro. In between
the two we can set up all of the console mountpoints and symlinks we
need.
We use two-stage synchronisation to ensures that when the syscalls are
reordered in a suboptimal way, an out-of-place read() on the parentPipe
will not gobble the ancilliary information.
This patch is part of the console rewrite patchset.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
For example, the /sys/firmware directory should be masked because it can contain some sensitive files:
- /sys/firmware/acpi/tables/{SLIC,MSDM}: Windows license information:
- /sys/firmware/ibft/target0/chap-secret: iSCSI CHAP secret
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
This avoid the goimports tool from remove the libcontainer/keys import line due the package name is diferent from folder name
Signed-off-by: Guilherme Rezende <guilhermebr@gmail.com>
This removes the use of a signal handler and SIGCONT to signal the init
process to exec the users process.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
This adds an `--no-new-keyring` flag to run and create so that a new
session keyring is not created for the container and the calling
processes keyring is inherited.
Fixes#818
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
See https://github.com/docker/docker/issues/22252
Previously we would apply seccomp rules before applying
capabilities, because it requires CAP_SYS_ADMIN. This
however means that a seccomp profile needs to allow
operations such as setcap() and setuid() which you
might reasonably want to disallow.
If prctl(PR_SET_NO_NEW_PRIVS) has been applied however
setting a seccomp filter is an unprivileged operation.
Therefore if this has been set, apply the seccomp
filter as late as possible, after capabilities have
been dropped and the uid set.
Note a small number of syscalls will take place
after the filter is applied, such as `futex`,
`stat` and `execve`, so these still need to be allowed
in addition to any the program itself needs.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
Fixes#680
This changes setupRlimit to use the Prlimit syscall (rather than
Setrlimit) and moves the call to the parent process. This is necessary
because Setrlimit would affect the libcontainer consumer if called in
the parent, and would fail if called from the child if the
child process is in a user namespace and the requested rlimit is higher
than that in the parent.
Signed-off-by: Julian Friedman <julz.friedman@uk.ibm.com>
This updates runc and libcontainer to handle rlimits per process and set
them correctly for the container.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
The re-work of namespace entering lost the setuid/setgid that was part
of the Go-routine based process exec in the prior code. A side issue was
found with setting oom_score_adj before execve() in a userns that is
also solved here.
Docker-DCO-1.1-Signed-off-by: Phil Estes <estesp@linux.vnet.ibm.com> (github: estesp)
This commit adds support to libcontainer to allow caps, no new privs,
apparmor, and selinux process label to the process struct so that it can
be used together of override the base settings on the container config
per individual process.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
An init process can join other namespaces (pidns, ipc etc.). This leverages
C code defined in nsenter package to spawn a process with correct namespaces
and clone if necessary.
This moves all setns and cloneflags related code to nsenter layer, which mean
that we dont use Go os/exec to create process with cloneflags and set
uid/gid_map or setgroups anymore. The necessary data is passed from Go to C
using a netlink binary-encoding format.
With this change, setns and init processes are almost the same, which brings
some opportunity for refactoring.
Signed-off-by: Daniel, Dao Quang Minh <dqminh89@gmail.com>
[mickael.laventure@docker.com: adapted to apply on master @ d97d5e]
Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@docker.com>