We will need to pass more slices of fds to these functions in future
patches. Let's add a struct that just contains them all, instead of
adding a lot of parameters to these functions.
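
As a rough illustration only (the type, field, and function names below
are hypothetical, not the ones used in the patch), grouping the fd slices
in one struct keeps function signatures stable as more slices are added:

```go
package main

import "fmt"

// mountFds is a hypothetical container for the fd slices that would
// otherwise be passed as separate parameters.
type mountFds struct {
	sourceFds []int // fds of mount sources (illustrative field)
	idmapFds  []int // a second slice added later (illustrative field)
}

// countFds stands in for a function that used to take each slice as its
// own parameter. Adding a field to mountFds does not change its signature.
func countFds(fds mountFds) int {
	return len(fds.sourceFds) + len(fds.idmapFds)
}

func main() {
	fds := mountFds{sourceFds: []int{3, 4}, idmapFds: []int{5}}
	fmt.Println(countFds(fds))
}
```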
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
When starting a new container, and the very last step of executing a
user process fails (last lines of (*linuxStandardInit).Init), it is too
late to print a proper error since both the log pipe and the init pipe
are closed.
This is partially mitigated by using exec.LookPath() which is supposed
to say whether we will be able to execute or not. Alas, it fails to do
so when the binary to be executed resides on a filesystem mounted with
noexec flag.
A workaround would be to use access(2) with X_OK flag. Alas, it is not
working when runc itself is a setuid (or setgid) binary. In this case,
faccessat2(2) with AT_EACCESS can be used, but it is only available
since Linux v5.8.
So, use faccessat2(2) with AT_EACCESS if available. If not, fall back to
access(2) for non-setuid runc, and do nothing for setuid runc (as there
is nothing we can do). Note that this check is in addition to whatever
exec.LookPath does.
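
A minimal sketch of the access(2) fallback only, using the standard
syscall package on Linux. The faccessat2(2) path is left out here, and
both the function name checkExecutable and the way setuid/setgid is
detected (comparing real and effective IDs) are assumptions made for
illustration:

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// xOK is the X_OK mode bit from <unistd.h>.
const xOK = 0x1

// checkExecutable is a hypothetical helper showing the fallback logic.
func checkExecutable(path string) error {
	if os.Geteuid() != os.Getuid() || os.Getegid() != os.Getgid() {
		// access(2) checks permissions against the real IDs, so its
		// answer would be misleading for a setuid/setgid binary.
		// Skip the check in that case.
		return nil
	}
	if err := syscall.Access(path, xOK); err != nil {
		return &os.PathError{Op: "access", Path: path, Err: err}
	}
	return nil
}

func main() {
	if err := checkExecutable("/nonexistent/binary"); err != nil {
		fmt.Println(err)
	}
}
```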
Fixes https://github.com/opencontainers/runc/issues/3520
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Since the next commit is going to touch this structure, our CI
(lint-extra) is about to complain about improperly named field:
> Warning: var-naming: struct field ContainerId should be ContainerID (revive)
Make it happy.
Brought to you by gopls rename.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The staticcheck linter points out that the err != nil comparison
after system.Exec is always true:
> libcontainer/standard_init_linux.go#L253
> SA4023: this comparison is always true (staticcheck)
> libcontainer/system/linux.go#L43
> SA4023(related information): github.com/opencontainers/runc/libcontainer/system.Exec never returns a nil interface value (staticcheck)
Indeed, Exec either returns an error or does not return at all.
Remove the (useless) check.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The source of the bind mount might not be accessible in a different user
namespace because a component of the source path might not be
traversable by the users and groups mapped inside the user namespace.
This caused errors such as the following:
# time="2020-06-22T13:48:26Z" level=error msg="container_linux.go:367:
starting container process caused: process_linux.go:459:
container init caused: rootfs_linux.go:58:
mounting \"/tmp/busyboxtest/source-inaccessible/dir\"
to rootfs at \"/tmp/inaccessible\" caused:
stat /tmp/busyboxtest/source-inaccessible/dir: permission denied"
To solve this problem, this patch performs the following:
1. in nsexec.c, it opens the source path in the host userns (so we have
the right permissions to open it) but in the container mntns (so the
kernel cross mntns mount check lets us mount it later:
https://github.com/torvalds/linux/blob/v5.8/fs/namespace.c#L2312).
2. in nsexec.c, it passes the file descriptors of the source to the
child process with SCM_RIGHTS.
3. in runc init (Go), it finishes the mounts while inside the userns,
even without access to some components of the source paths.
Passing the fds with SCM_RIGHTS is necessary because once the child
process is in the container mntns, it is already in the container userns
so it cannot temporarily join the host mntns.
This patch uses the existing mechanism with _LIBCONTAINER_* environment
variables to pass the file descriptors from runc to runc init.
This patch uses the existing mechanism with the Netlink-style bootstrap
to pass information about the list of source mounts to nsexec.c.
Rootless containers don't use this bind mount sources fdpassing
mechanism because we can't setns() to the target mntns in a rootless
container (we don't have the privileges when we are in the host userns).
This patch takes care of using O_CLOEXEC on mount fds, and closes them
early.
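
To illustrate the environment variable handoff mentioned above, here is
a small sketch. The variable name _LIBCONTAINER_EXAMPLE_FDS and the JSON
encoding are assumptions chosen for the example, not necessarily what
the patch uses:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// fdsEnvVar is a hypothetical name following the _LIBCONTAINER_* pattern.
const fdsEnvVar = "_LIBCONTAINER_EXAMPLE_FDS"

// encodeFds is what the parent would do before starting the child.
func encodeFds(fds []int) (string, error) {
	b, err := json.Marshal(fds)
	return string(b), err
}

// decodeFds is what the child would do with the variable's value.
func decodeFds(value string) ([]int, error) {
	var fds []int
	if err := json.Unmarshal([]byte(value), &fds); err != nil {
		return nil, fmt.Errorf("parsing %s: %w", fdsEnvVar, err)
	}
	return fds, nil
}

func main() {
	val, err := encodeFds([]int{7, 8})
	if err != nil {
		panic(err)
	}
	// Here the variable is set in our own environment for brevity; a
	// real parent would place it in the child's environment instead.
	os.Setenv(fdsEnvVar, val)
	fds, err := decodeFds(os.Getenv(fdsEnvVar))
	fmt.Println(fds, err)
}
```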
Fixes: #2484.
Signed-off-by: Alban Crequy <alban@kinvolk.io>
Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>
If the container binary to be run is removed in between runc create
and runc start, the latter spits the following error:
> can't exec user process: no such file or directory
This is a bit confusing since we don't see what file is missing.
Wrap the unix.Exec error into os.PathError, like in many other cases,
to provide some context. Remove the error wrapping from
(*linuxStandardInit).Init as it is now redundant.
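
A minimal sketch of that wrapping, using only the standard library. The
helper names wrapExecError and demoError are made up for the example:

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// wrapExecError adds the operation and the path to a bare errno so the
// message says which file could not be executed.
func wrapExecError(path string, err error) error {
	return &os.PathError{Op: "exec", Path: path, Err: err}
}

// demoError simulates exec failing because the binary is gone.
func demoError() error {
	return wrapExecError("/bin/false", syscall.ENOENT)
}

func main() {
	err := demoError()
	fmt.Println(err)
	// The original errno is still reachable through the wrapper.
	fmt.Println(errors.Is(err, syscall.ENOENT))
}
```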
With this patch, the error is now:
> exec /bin/false: no such file or directory
Reported-by: Daniel J Walsh <dwalsh@redhat.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This commit implements support for the SCMP_ACT_NOTIFY action. It
requires libseccomp-2.5.0 to work but runc still works with older
libseccomp if the seccomp policy does not use the SCMP_ACT_NOTIFY
action.
A new synchronization step between runc[INIT] and runc run is introduced
to pass the seccomp fd. runc run fetches the seccomp fd with
pidfd_getfd(2) from the runc[INIT] process and sends it to the seccomp
agent using SCM_RIGHTS.
As suggested by @kolyshkin, we also make writeSync() a wrapper of
writeSyncWithFd() and wrap the error there. To avoid pointless errors,
we made some existing code paths just return the error instead of
re-wrapping it. Otherwise, the error would look like:
writing syncT <act>: writing syncT: <err>
With the code paths adjusted, errors now look like this:
writing syncT <act>: <err>
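
The wrap-once idea can be sketched as follows. The names writeSync and
writeSyncWithFd come from the description above, but their signatures,
the message format, and the failingWriter type are assumptions made for
this example:

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// failingWriter always fails, to trigger the error path.
type failingWriter struct{}

func (failingWriter) Write([]byte) (int, error) {
	return 0, errors.New("broken pipe")
}

// writeSyncWithFd is the only place where the error gets wrapped.
func writeSyncWithFd(w io.Writer, act string, fd int) error {
	if _, err := io.WriteString(w, fmt.Sprintf("%s %d\n", act, fd)); err != nil {
		return fmt.Errorf("writing syncT %q: %w", act, err)
	}
	return nil
}

// writeSync returns the error as is, so the prefix is not repeated.
func writeSync(w io.Writer, act string) error {
	return writeSyncWithFd(w, act, -1)
}

func main() {
	fmt.Println(writeSync(failingWriter{}, "procReady"))
}
```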
Signed-off-by: Alban Crequy <alban@kinvolk.io>
Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>
For files that end with _linux.go or _linux_test.go, there is no need to
specify linux build tag, as it is assumed from the file name.
In addition, rename libcontainer/notify_linux_v2.go -> libcontainer/notify_v2_linux.go
for the file name to make sense.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The two exceptions I had to add to codespellrc are:
- CLOS (used by intelrtd);
- creat (syscall name used in tests/integration/testdata/seccomp_*.json).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
When running a script from an Azure file share, interrupted syscalls
occur quite frequently. To remedy this, retry the execve syscall
when it returns EINTR.
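
A sketch of such a retry loop. The exec call is injected as a function
so the loop can be exercised without replacing the current process; the
names execRetry and flakyExec are hypothetical:

```go
package main

import (
	"fmt"
	"syscall"
)

// execRetry calls execFn until it fails with something other than
// EINTR. A successful execve never returns, so every value that comes
// back from execFn is an error.
func execRetry(execFn func() error) error {
	for {
		if err := execFn(); err != syscall.EINTR {
			return err
		}
	}
}

// flakyExec simulates an exec that is interrupted a number of times
// before failing for a different reason.
func flakyExec(interrupts int, calls *int) func() error {
	return func() error {
		*calls++
		if *calls <= interrupts {
			return syscall.EINTR
		}
		return syscall.ENOENT
	}
}

func main() {
	calls := 0
	err := execRetry(flakyExec(2, &calls))
	fmt.Println(calls, err)
}
```

In real use, execFn would be something like
func() error { return syscall.Exec(path, args, env) }.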
Signed-off-by: Maksim An <maksiman@microsoft.com>
This removes libcontainer's own error wrapping system, consisting of a
few types and functions, aimed at typing, wrapping and unwrapping
errors, as well as saving error stack traces.
Since Go 1.13 now provides its own error wrapping mechanism and a few
related functions, it makes sense to switch to it.
While doing that, improve some error messages so that they start
with "error", "unable to", or "can't".
A few things that are worth mentioning:
1. We lose stack traces (which were never shown anyway).
2. Users of libcontainer that relied on particular errors (like
ContainerNotExists) need to switch to using errors.Is with
the new errors defined in error.go.
3. encoding/json is unable to unmarshal the built-in error type,
so we have to introduce initError and wrap the errors into it
(basically passing the error as a string). This is the same
as it was before, just a tad simpler (actually the initError
is a type that got removed in commit afa844311; also suddenly
ierr variable name makes sense now).
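
Points 2 and 3 can be sketched like this. ErrNotExist, load, and the
fields of initError are illustrative assumptions; the real definitions
live in libcontainer:

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// ErrNotExist is an illustrative sentinel error.
var ErrNotExist = errors.New("container does not exist")

// load wraps the sentinel with %w so callers can still match it.
func load(id string) error {
	return fmt.Errorf("unable to load container %q: %w", id, ErrNotExist)
}

// isNotExist shows the check that replaces comparing error codes.
func isNotExist(err error) bool {
	return errors.Is(err, ErrNotExist)
}

// initError carries an error across JSON as a plain string, since
// encoding/json cannot unmarshal into the built-in error type.
type initError struct {
	Message string `json:"message,omitempty"`
}

func (e initError) Error() string { return e.Message }

// roundTrip marshals err as an initError and unmarshals it again.
func roundTrip(err error) (*initError, error) {
	data, jerr := json.Marshal(initError{Message: err.Error()})
	if jerr != nil {
		return nil, jerr
	}
	var ierr initError
	if jerr := json.Unmarshal(data, &ierr); jerr != nil {
		return nil, jerr
	}
	return &ierr, nil
}

func main() {
	err := load("demo")
	fmt.Println(isNotExist(err))
	ierr, _ := roundTrip(err)
	fmt.Println(ierr)
}
```

Note that the error identity is lost after the JSON round trip: only
the message survives, so errors.Is no longer matches on the far side.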
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Use fmt.Errorf with %w instead.
Convert the users to the new wrapping.
This fixes an errorlint warning.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
First, add runc --debug exec test cases, very similar to those in
debug.bats but for runc exec (rather than runc run). Do not include json
tests as it is already tested in debug.bats.
Second, add logrus debug to late stages of runc init, and amend the
integration tests to check for those messages. This serves two purposes:
- demonstrate that runc init can be amended with debug logrus which is
properly forwarded to and logged by the parent runc create/run/exec;
- improve the chances to catch the race fixed by the previous commit.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Sometimes debug.bats test cases are failing like this:
> not ok 27 global --debug to --log --log-format 'json'
> # (in test file tests/integration/debug.bats, line 77)
> # `[[ "${output}" == *"child process in init()"* ]]' failed
It happens more often when writing to disk.
This issue is caused by the fact that runc spawns log forwarding goroutine
(ForwardLogs) but does not wait for it to finish, resulting in missing
debug lines from nsexec.
ForwardLogs itself, though, never finishes, because it reads from the
reading side of a pipe whose writing side is not closed. This is
especially true in case of runc create, which spawns runc init and
exits; meanwhile runc init waits on exec fifo for arbitrarily long
time before doing execve.
So, to fix the failure described above, we need to:
1. Make runc create/run/exec wait for ForwardLogs to finish;
2. Make runc init close its log pipe file descriptor (i.e.
the one whose value is passed in the _LIBCONTAINER_LOGPIPE
environment variable).
This is exactly what this commit does:
1. Amend ForwardLogs to return a channel, and wait for it in start().
2. In runc init, save the log fd and close it as late as possible.
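
The pattern can be sketched with an in-memory pipe. The forwardLogs
function below is a simplified stand-in written for this example, not
the actual ForwardLogs implementation:

```go
package main

import (
	"bufio"
	"fmt"
	"io"
)

// forwardLogs copies lines from r to sink in a goroutine and returns a
// channel that receives the result once r reaches EOF.
func forwardLogs(r io.Reader, sink func(string)) <-chan error {
	done := make(chan error, 1)
	go func() {
		s := bufio.NewScanner(r)
		for s.Scan() {
			sink(s.Text())
		}
		done <- s.Err()
	}()
	return done
}

// runDemo plays both roles: the child writing a log line and the
// parent waiting for the forwarder to drain the pipe.
func runDemo() []string {
	pr, pw := io.Pipe()
	var lines []string
	done := forwardLogs(pr, func(l string) { lines = append(lines, l) })

	fmt.Fprintln(pw, "child process in init()")
	// Without closing the write end, the forwarder never sees EOF.
	pw.Close()
	// Without waiting here, the caller could exit before the line is
	// forwarded.
	<-done
	return lines
}

func main() {
	fmt.Println(runDemo())
}
```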
PS I have to admit I still do not understand why an explicit close of
log pipe fd is required in e.g. (*linuxSetnsInit).Init, right before
the execve which (thanks to CLOEXEC) closes the fd anyway.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In many places (not all of them though) we can use `unix.`
instead of `syscall.` as these are identical.
In particular, x/sys/unix defines:
```go
type Signal = syscall.Signal
type Errno = syscall.Errno
type SysProcAttr = syscall.SysProcAttr
const ENODEV = syscall.Errno(0x13)
```
and unix.Exec() calls syscall.Exec().
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Work is ongoing in the kernel to support different kernel
keyrings per user namespace. We want to allow SELinux to manage
kernel keyrings inside of the container.
Currently, when runc creates the kernel keyring, it gets the label that runc is
running with, usually `container_runtime_t`. With this change, the kernel keyring
will be labeled with the container process label (for example, container_t:s0:c1,c2).
A container running as container_t:s0:c1,c2 can manage keyrings with the same label.
This change required revendoring the SELinux Go bindings,
github.com/opencontainers/selinux.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
While all modern kernels (and I do mean _all_ of them -- this syscall
was added in 2.6.10 before git had begun development!) have support for
this syscall, LXC has a default seccomp profile that returns ENOSYS for
this syscall. For most syscalls this would be a deal-breaker, and while
our use of session keyrings is security-based, there are a few
mitigating factors that make this change not-completely-insane:
* We already have a flag that disables the use of session keyrings
(for older kernels that had system-wide keyring limits and so
on). So disabling it is not a new idea.
* While the primary justification of using session keys *is*
security-based, it's more of a security-by-obscurity protection.
The main defense keyrings have is VFS credentials -- which is
something that users already have better security tools for
(setuid(2) and user namespaces).
* Given the security justification you might argue that we
shouldn't silently ignore this. However, the only way for the
kernel to return -ENOSYS is either being ridiculously old (at
which point we wouldn't work anyway) or that there is a seccomp
profile in place blocking it.
Given that the seccomp profile (if malicious) could very easily
just return 0 or a silly return code (or something even more
clever with seccomp-bpf) and trick us without this patch, there
isn't much of a significant change in how much seccomp can trick
us with or without this patch.
Given all of that over-analysis, I'm pretty convinced there isn't a
security problem in this very specific case and it will help out the
ChromeOS folks by allowing Docker to run inside their LXC container
setup. I'd be happy to be proven wrong.
Ref: https://bugs.chromium.org/p/chromium/issues/detail?id=860565
Signed-off-by: Aleksa Sarai <asarai@suse.de>
We need to lock the threads for the SetProcessLabel to work. We
should also call SetProcessLabel("") after the container starts
to go back to the default SELinux behaviour.
Once you call SetProcessLabel, then any process executed by runc
will run with this label, even if the process is for setup rather
than the container.
It is always safest to call the SELinux calls just before the exec of the
container, so that other processes do not get started with the incorrect label.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
Currently, if a confined container process tries to list these directories,
AVCs are generated because they are labeled with external labels. Adding
the mountlabel will remove these AVCs.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
The benefit for doing this within runc is that it works well with
userns.
Actually, runc already does the same thing for mount points.
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
While we have significant protections in place against CVE-2016-9962, we
still were holding onto a file descriptor that referenced the host
filesystem. This meant that in certain scenarios it was still possible
for a semi-privileged container to gain access to the host filesystem
(if they had CAP_SYS_PTRACE).
Instead, open the FIFO itself using O_PATH. This allows us to
reference the FIFO directly without providing the ability for
directory-level access. When opening the FIFO inside the init process,
open it through procfs to re-open the actual FIFO (this is currently the
only supported way to open such a file descriptor).
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Since syscall is outdated and broken for some architectures,
use x/sys/unix instead.
There are still some dependencies on the syscall package that will
remain in syscall for the foreseeable future:
Errno
Signal
SysProcAttr
Additionally:
- os still uses syscall, so it needs to be kept for anything
returning *os.ProcessState, such as process.Wait.
Signed-off-by: Christy Perez <christy@linux.vnet.ibm.com>
This sets the init processes that join and setup the container's
namespaces as non-dumpable before they setns to the container's pid (or
any other) namespace.
This setting is automatically reset to the default after the Exec in
the container so that it does not change functionality for the
applications that are running inside, just our init processes.
This prevents parent processes, such as the pid 1 of the container, from
ptracing the init process before it drops caps and sets other LSMs.
This patch also ensures that the stateDirFD being used is still closed
prior to exec, even though it is set as O_CLOEXEC, because of the order
in the kernel.
https://github.com/torvalds/linux/blob/v4.9/fs/exec.c#L1290-L1318
The order during the exec syscall is that the process is set back to
dumpable before O_CLOEXEC fds are closed.
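
For reference, toggling the dumpable flag boils down to a prctl(2) call.
The sketch below uses the standard syscall package on Linux. The helper
names are made up for the example, and it restores the flag at the end
only so that the demo leaves the process as it found it:

```go
package main

import (
	"fmt"
	"syscall"
)

// setDumpable calls prctl(PR_SET_DUMPABLE, v). The flag applies to the
// whole process, not just the calling thread.
func setDumpable(v uintptr) error {
	_, _, errno := syscall.Syscall(syscall.SYS_PRCTL, syscall.PR_SET_DUMPABLE, v, 0)
	if errno != 0 {
		return errno
	}
	return nil
}

// getDumpable calls prctl(PR_GET_DUMPABLE) and returns the flag value.
func getDumpable() (int, error) {
	r, _, errno := syscall.Syscall(syscall.SYS_PRCTL, syscall.PR_GET_DUMPABLE, 0, 0)
	if errno != 0 {
		return 0, errno
	}
	return int(r), nil
}

func main() {
	if err := setDumpable(0); err != nil {
		panic(err)
	}
	v, err := getDumpable()
	fmt.Println(v, err)
	// Restore the default for the rest of this demo process.
	if err := setDumpable(1); err != nil {
		panic(err)
	}
}
```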
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
A remount of a mount point must include all the current flags or
these will be cleared:
```
The mountflags and data arguments should match the values used in the
original mount() call, except for those parameters that are being
deliberately changed.
```
The current code does not do this; at present the bug manifests as the
flags specified for `/dev` being lost when it is remounted read-only. As we
need to specify flags, split the code path for this from remounting
paths which are not mount points, as these can only inherit the
existing flags of the path, and these cannot be changed.
In the bind case, remove extra flags from the bind remount. A bind
mount can only be remounted read only, no other flags can be set,
all other flags are inherited from the parent. From the man page:
```
Since Linux 2.6.26, this flag can also be used to make an existing
bind mount read-only by specifying mountflags as:
MS_REMOUNT | MS_BIND | MS_RDONLY
Note that only the MS_RDONLY setting of the bind mount can be changed
in this manner.
```
MS_REC can only be set on the original bind, so move this. See note
in man page on bind mounts:
```
The remaining bits in the mountflags argument are also ignored, with
the exception of MS_REC.
```
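
The two flag computations described above can be sketched as follows.
This only builds the flag values from the standard syscall constants on
Linux and does not call mount(2); the function names are hypothetical:

```go
package main

import (
	"fmt"
	"syscall"
)

// remountReadonlyFlags is for a real mount point: the original flags
// must be passed again, otherwise the remount clears them.
func remountReadonlyFlags(orig uintptr) uintptr {
	return orig | syscall.MS_REMOUNT | syscall.MS_RDONLY
}

// bindRemountReadonlyFlags is for a bind mount: only MS_RDONLY can be
// changed, so no other flags are added.
func bindRemountReadonlyFlags() uintptr {
	return syscall.MS_REMOUNT | syscall.MS_BIND | syscall.MS_RDONLY
}

func main() {
	// Example flags for a mount point (chosen for illustration).
	orig := uintptr(syscall.MS_NOSUID | syscall.MS_NOEXEC)
	fmt.Printf("%#x\n", remountReadonlyFlags(orig))
	fmt.Printf("%#x\n", bindRemountReadonlyFlags())
}
```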
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
This implements {createTTY, detach} and all of the combinations and
negations of the two that were previously implemented. There are some
valid questions about out-of-OCI-scope topics like !createTTY and how
things should be handled (why do we dup the current stdio to the
process, and how is that not a security issue). However, these will be
dealt with in a separate patchset.
In order to allow for late console setup, split setupRootfs into the
"preparation" section where all of the mounts are created and the
"finalize" section where we pivot_root and set things as ro. In between
the two we can set up all of the console mountpoints and symlinks we
need.
We use two-stage synchronisation to ensure that when the syscalls are
reordered in a suboptimal way, an out-of-place read() on the parentPipe
will not gobble the ancillary information.
This patch is part of the console rewrite patchset.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
For example, the /sys/firmware directory should be masked because it can contain some sensitive files:
- /sys/firmware/acpi/tables/{SLIC,MSDM}: Windows license information
- /sys/firmware/ibft/target0/chap-secret: iSCSI CHAP secret
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
This prevents the goimports tool from removing the libcontainer/keys import line, since the package name is different from the folder name.
Signed-off-by: Guilherme Rezende <guilhermebr@gmail.com>