Commit Graph

68 Commits

Author SHA1 Message Date
Kir Kolyshkin
c690b66d7f int/linux: add/use Exec
Drop the libcontainer/system/exec, and use the linux.Exec instead.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-03-26 14:16:53 -07:00
Kir Kolyshkin
8598f6ec4a Merge pull request #4354 from ningmingxiao/dev3
skip read /proc/filesystems if process_label is null
2025-03-24 12:09:03 -07:00
Kir Kolyshkin
539315534f libct: log a warning on join session keyring failure
This addresses a TODO item added by commit 40f146841
("keyring: handle ENOSYS with keyctl(KEYCTL_JOIN_SESSION_KEYRING)"),
as we do have runc init logging working fine for quite some time.

While at it, fix a typo in a comment (standart -> standard).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-03-13 08:42:22 -07:00
ningmingxiao
6a3f8ea3b4 skip read /proc/filesystems if process_label is null
Signed-off-by: ningmingxiao <ning.mingxiao@zte.com.cn>
2025-02-12 12:44:39 +08:00
Kir Kolyshkin
99f9ed94dc runc exec: fix setting process.Scheduler
Commit 770728e1 added Scheduler field into both Config and Process,
but forgot to add a mechanism to actually use Process.Scheduler.
As a result, runc exec does not set Process.Scheduler ever.

Fix it, and a test case (which fails before the fix).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-11 18:01:30 -08:00
Kir Kolyshkin
b9114d91e2 runc exec: fix setting process.ioPriority
Commit bfbd0305b added IOPriority field into both Config and Process,
but forgot to add a mechanism to actually use Process.IOPriority.
As a result, runc exec does not set Process.IOPriority ever.

Fix it, and a test case (which fails before the fix).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-11 18:01:30 -08:00
Kir Kolyshkin
7dc2486889 libct: switch to numeric UID/GID/groups
This addresses the following TODO in the code (added back in 2015
by commit 845fc65e5):

> // TODO: fix libcontainer's API to better support uid/gid in a typesafe way.

Historically, libcontainer internally uses strings for user, group, and
additional (aka supplementary) groups.

Yet, runc receives those credentials as part of runtime-spec's process,
which uses integers for all of them (see [1], [2]).

What happens next is:

1. runc start/run/exec converts those credentials to strings (a User
   string containing "UID:GID", and a []string for additional GIDs) and
   passes those onto runc init.
2. runc init converts them back to int, in the most complicated way
   possible (parsing container's /etc/passwd and /etc/group).

All this conversion and, especially, parsing is totally unnecessary,
but is performed on every container exec (and start).

The only benefit of all this is, a libcontainer user could use user and
group names instead of numeric IDs (but runc itself is not using this
feature, and we don't know if there are any other users of this).

Let's remove this back and forth translation, hopefully increasing
runc exec performance.

The only remaining need to parse /etc/passwd is to set HOME environment
variable for a specified UID, in case $HOME is not explicitly set in
process.Env. This can now be done right in prepareEnv, which simplifies
the code flow a lot. Alas, we can not use standard os/user.LookupId, as
it could cache host's /etc/passwd or the current user (even with the
osusergo tag).

PS Note that the structures being changed (initConfig and Process) are
never saved to disk as JSON by runc, so there is no compatibility issue
for runc users.

Still, this is a breaking change in libcontainer, but we never promised
that libcontainer API will be stable (and there's a special package
that can handle it -- github.com/moby/sys/user). Reflect this in
CHANGELOG.

For 3998.

[1]: https://github.com/opencontainers/runtime-spec/blob/v1.0.2/config.md#posix-platform-user
[2]: https://github.com/opencontainers/runtime-spec/blob/v1.0.2/specs-go/config.go#L86

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-06 17:49:17 -08:00
Kir Kolyshkin
06f1e07655 libct: speedup process.Env handling
The current implementation sets all the environment variables passed in
Process.Env in the current process, one by one, then uses os.Environ to
read those back.

As pointed out in [1], this is slow, as runc calls os.Setenv for every
variable, and there may be a few thousands of those. Looking into how
os.Setenv is implemented, it is indeed slow, especially when cgo is
enabled.

Looking into why it was implemented the way it is, I found commit
9744d72c and traced it to [2], which discusses the actual reasons.
It boils down to these two:

 - HOME is not passed into container as it is set in setupUser by
   os.Setenv and has no effect on config.Env;
 - there is a need to deduplicate the environment variables.

Yet it was decided in [2] to not go ahead with this patch, but
later [3] was opened with the carry of this patch, and merged.

Now, from what I see:

1. Passing environment to exec is way faster than using os.Setenv and
   os.Environ (tests show ~20x speed improvement in a simple Go test,
   and ~3x improvement in real-world test, see below).
2. Setting environment variables in the runc context may result is some
   ugly side effects (think GODEBUG, LD_PRELOAD, or _LIBCONTAINER_*).
3. Nothing in runtime spec says that the environment needs to be
   deduplicated, or the order of preference (whether the first or the
   last value of a variable with the same name is to be used). We should
   stick to what we have in order to maintain backward compatibility.

So, this patch:
 - switches to passing env directly to exec;
 - adds deduplication mechanism to retain backward compatibility;
 - takes care to set PATH from process.Env in the current process
   (so that supplied PATH is used to find the binary to execute),
   also to retain backward compatibility;
 - adds HOME to process.Env if not set;
 - ensures any StartContainer CommandHook entries with no environment
   set explicitly are run with the same environment as before. Thanks
   to @lifubang who noticed that peculiarity.

The benchmark added by the previous commit shows ~3x improvement:

	                │   before    │                after                 │
	                │   sec/op    │    sec/op     vs base                │
	ExecInBigEnv-20   61.53m ± 1%   21.87m ± 16%  -64.46% (p=0.000 n=10)

[1]: https://github.com/opencontainers/runc/pull/1983
[2]: https://github.com/docker-archive/libcontainer/pull/418
[3]: https://github.com/docker-archive/libcontainer/pull/432

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-01-09 18:22:53 +08:00
Kir Kolyshkin
5d3942eec3 libct: unify IOPriority setting
For some reason, io priority is set in different places between runc
start/run and runc exec:

 - for runc start/run, it is done in the middle of (*linuxStandardInit).Init,
   close to the place where we exec runc init.
 - for runc exec, it is done much earlier, in (*setnsProcess) start().

Let's move setIOPriority call for runc exec to (*linuxSetnsInit).Init,
so it is in the same logical place as for runc start/run.

Also, move the function itself to init_linux.go as it's part of init.

Should not have any visible effect, except part of runc init is run with
a different I/O priority.

While at it, rename setIOPriority to setupIOPriority, and make it accept
the whole *configs.Config, for uniformity with other similar functions.

Fixes: bfbd0305 ("Add I/O priority")
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-12-22 18:15:31 -08:00
Kir Kolyshkin
2dc3ea4b87 libct: simplify setIOPriority/setupScheduler calls
Move the nil check inside, simplifying the callers.

Fixes: bfbd0305 ("Add I/O priority")
Fixes: 770728e1 ("Support `process.scheduler`")
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-12-22 18:06:20 -08:00
Kir Kolyshkin
5586d7caa1 libct: rm obsoleted comment
This was added by commit f2f16213e when runc-dmz was still a thing.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-10-29 17:11:56 -07:00
lifubang
871057d863 drop runc-dmz solution according to overlay solution
Because we have the overlay solution, we can drop runc-dmz binary
solution since it has too many limitations.

Signed-off-by: lifubang <lifubang@acmcoder.com>
2024-10-28 15:18:07 +00:00
Kir Kolyshkin
a1e87f8d76 libct: rm eaccess
It is not needed since Go 1.20 (which was released in February 2023 and
is no longer supported since February 2024).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-06-07 10:18:59 -07:00
ls-ggg
da68c8e37b libct: clean cached rlimit nofile in go runtime
As reported in issue #4195, the new version(since 1.19) of go runtime
will cache rlimit-nofile. Before executing execve, the rlimit-nofile
of the process will be restored with the cache. In runc, this will
cause the rlimit-nofile set by the parent process for the container
to become invalid. It can be solved by clearing the cache.

Signed-off-by: ls-ggg <335814617@qq.com>
(cherry picked from commit f9f8abf310)
Signed-off-by: lifubang <lifubang@acmcoder.com>
2024-05-08 10:40:13 +00:00
lifubang
a853a82677 runc exec: setupRlimits after syscall.rlimit.init() completed
Issue: https://github.com/opencontainers/runc/issues/4195
Since https://go-review.googlesource.com/c/go/+/476097, there is
a get/set race between runc exec and syscall.rlimit.init, so we
need to call setupRlimits after syscall.rlimit.init() completed.

Signed-off-by: lifubang <lifubang@acmcoder.com>
2024-05-08 10:40:07 +00:00
lifubang
d075058717 close the sync pipe explicitly in exec
Signed-off-by: lifubang <lifubang@acmcoder.com>
2024-02-08 01:45:06 +00:00
Aleksa Sarai
02120488a4 Merge pull request from GHSA-xr7r-f8xq-vfvv
fix GHSA-xr7r-f8xq-vfvv and harden fd leaks
2024-02-01 07:04:29 +11:00
Aleksa Sarai
f2f16213e1 init: close internal fds before execve
If we leak a file descriptor referencing the host filesystem, an
attacker could use a /proc/self/fd magic-link as the source for execve
to execute a host binary in the container. This would allow the binary
itself (or a process inside the container in the 'runc exec' case) to
write to a host binary, leading to a container escape.

The simple solution is to make sure we close all file descriptors
immediately before the execve(2) step. Doing this earlier can lead to very
serious issues in Go (as file descriptors can be reused, any (*os.File)
reference could start silently operating on a different file) so we have
to do it as late as possible.

Unfortunately, there are some Go runtime file descriptors that we must
not close (otherwise the Go scheduler panics randomly). The only way of
being sure which file descriptors cannot be closed is to sneakily
go:linkname the runtime internal "internal/poll.IsPollDescriptor"
function. This is almost certainly not recommended but there isn't any
other way to be absolutely sure, while also closing any other possible
files.

In addition, we can keep the logrus forwarding logfd open because you
cannot execve a pipe and the contents of the pipe are so restricted
(JSON-encoded in a format we pick) that it seems unlikely you could even
construct shellcode. Closing the logfd causes issues if there is an
error returned from execve.

In mainline runc, runc-dmz protects us against this attack because the
intermediate execve(2) closes all of the O_CLOEXEC internal runc file
descriptors and thus runc-dmz cannot access them to attack the host.

Fixes: GHSA-xr7r-f8xq-vfvv CVE-2024-21626
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2024-01-24 00:20:58 +11:00
Aleksa Sarai
7094efb192 init: use *os.File for passed file descriptors
While it doesn't make much of a practical difference, it seems far more
reasonable to use os.NewFile to wrap all of our passed file descriptors
to make sure they're tracked by the Go runtime and that we don't
double-close them.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2024-01-22 17:34:14 +11:00
lfbzhm
95a93c132c Merge pull request #4045 from fuweid/support-pidfd-socket
[feature request] *: introduce pidfd-socket flag
2023-11-22 09:13:55 +08:00
Wei Fu
94505a046a *: introduce pidfd-socket flag
The container manager like containerd-shim can't use cgroup.kill feature or
freeze all the processes in cgroup to terminate the exec init process.
It's unsafe to call kill(2) since the pid can be recycled. It's good to
provide the pidfd of init process through the pidfd-socket. It's similar to
the console-socket. With the pidfd, the container manager like containerd-shim
can send the signal to target process safely.

And for the standard init process, we can have polling support to get
exit event instead of blocking on wait4.

Signed-off-by: Wei Fu <fuweid89@gmail.com>
2023-11-21 18:28:50 +08:00
Zheao.Li
98511bb40e linux: Support setting execution domain via linux personality
carry #3126

Co-authored-by: Aditya R <arajan@redhat.com>
Signed-off-by: Zheao.Li <me@manjusaka.me>
2023-10-27 19:33:37 +08:00
utam0k
770728e16e Support process.scheduler
Spec: https://github.com/opencontainers/runtime-spec/pull/1188
Fix: https://github.com/opencontainers/runc/issues/3895

Co-authored-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: utam0k <k0ma@utam0k.jp>
Signed-off-by: lifubang <lifubang@acmcoder.com>
2023-10-04 15:53:18 +08:00
Aleksa Sarai
8da42aaec2 sync: split init config (stream) and synchronisation (seqpacket) pipes
We have different requirements for the initial configuration and
initWaiter pipe (just send netlink and JSON blobs with no complicated
handling needed for message coalescing) and the packet-based
synchronisation pipe.

Tests with switching everything to SOCK_SEQPACKET lead to endless issues
with runc hanging on start-up because random things would try to do
short reads (which SOCK_SEQPACKET will not allow and the Go stdlib
explicitly treats as a streaming source), so splitting it was the only
reasonable solution. Even doing somewhat dodgy tricks such as adding a
Read() wrapper which actually calls ReadPacket() and makes it seem like
a stream source doesn't work -- and is a bit too magical.

One upside is that doing it this way makes the difference between the
modes clearer -- INITPIPE is still used for initWaiter syncrhonisation
but aside from that all other synchronisation is done by SYNCPIPE.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-09-24 20:31:14 +08:00
lifubang
dac4171746 runc-dmz: reduce memfd binary cloning cost with small C binary
The idea is to remove the need for cloning the entire runc binary by
replacing the final execve() call of the container process with an
execve() call to a clone of a small C binary which just does an execve()
of its arguments.

This provides similar protection against CVE-2019-5736 but without
requiring a >10MB binary copy for each "runc init". When compiled with
musl, runc-dmz is 13kB (though unfortunately with glibc, it is 1.1MB
which is still quite large).

It should be noted that there is still a window where the container
processes could get access to the host runc binary, but because we set
ourselves as non-dumpable the container would need CAP_SYS_PTRACE (which
is not enabled by default in Docker) in order to get around the
proc_fd_access_allowed() checks. In addition, since Linux 4.10[1] the
kernel blocks access entirely for user namespaced containers in this
scenario. For those cases we cannot use runc-dmz, but most containers
won't have this issue.

This new runc-dmz binary can be opted out of at compile time by setting
the "runc_nodmz" buildtag, and at runtime by setting the RUNC_DMZ=legacy
environment variable. In both cases, runc will fall back to the classic
/proc/self/exe-based cloning trick. If /proc/self/exe is already a
sealed memfd (namely if the user is using contrib/cmd/memfd-bind to
create a persistent sealed memfd for runc), neither runc-dmz nor
/proc/self/exe cloning will be used because they are not necessary.

[1]: bfedb58925

Co-authored-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: lifubang <lifubang@acmcoder.com>
[cyphar: address various review nits]
[cyphar: fix runc-dmz cross-compilation]
[cyphar: embed runc-dmz into runc binary and clone in Go code]
[cyphar: make runc-dmz optional, with fallback to /proc/self/exe cloning]
[cyphar: do not use runc-dmz when the container has certain privs]
Co-authored-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-09-22 15:38:19 +10:00
Aleksa Sarai
f81ef1493d libcontainer: sync: cleanup synchronisation code
This includes quite a few cleanups and improvements to the way we do
synchronisation. The core behaviour is unchanged, but switching to
embedding json.RawMessage into the synchronisation structure will allow
us to do more complicated synchronisation operations in future patches.

The file descriptor passing through the synchronisation system feature
will be used as part of the idmapped-mount and bind-mount-source
features when switching that code to use the new mount API outside of
nsexec.c.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-08-15 19:54:24 -07:00
Wang-squirrel
7b4c3fc111 Add support for umask when exec container
Signed-off-by: WangXiaoSong <wang.xiaosong1@zte.com.cn>
2023-02-23 10:04:47 +08:00
Kir Kolyshkin
bb6a838876 libct: initContainer: rename Id -> ID
Since the next commit is going to touch this structure, our CI
(lint-extra) is about to complain about improperly named field:

>  Warning: var-naming: struct field ContainerId should be ContainerID (revive)

Make it happy.

Brought to use by gopls rename.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-01-26 18:59:47 -08:00
Kir Kolyshkin
e395d2dc50 libct: Init: remove LockOSThread
This call is already made in init.go, no need for a duplicate.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-10-05 19:19:40 -07:00
Alban Crequy
2b025c0173 Implement Seccomp Notify
This commit implements support for the SCMP_ACT_NOTIFY action. It
requires libseccomp-2.5.0 to work but runc still works with older
libseccomp if the seccomp policy does not use the SCMP_ACT_NOTIFY
action.

A new synchronization step between runc[INIT] and runc run is introduced
to pass the seccomp fd. runc run fetches the seccomp fd with pidfd_get
from the runc[INIT] process and sends it to the seccomp agent using
SCM_RIGHTS.

As suggested by @kolyshkin, we also make writeSync() a wrapper of
writeSyncWithFd() and wrap the error there. To avoid pointless errors,
we made some existing code paths just return the error instead of
re-wrapping it. If we don't do it, error will look like:

	writing syncT <act>: writing syncT: <err>

By adjusting the code path, now they just look like this
	writing syncT <act>: <err>

Signed-off-by: Alban Crequy <alban@kinvolk.io>
Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>
2021-09-07 13:04:24 +02:00
Kir Kolyshkin
9ff64c3d97 *: rm redundant linux build tag
For files that end with _linux.go or _linux_test.go, there is no need to
specify linux build tag, as it is assumed from the file name.

In addition, rename libcontainer/notify_linux_v2.go -> libcontainer/notify_v2_linux.go
for the file name to make sense.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-08-30 20:15:00 -07:00
Kir Kolyshkin
e918d02139 libcontainer: rm own error system
This removes libcontainer's own error wrapping system, consisting of a
few types and functions, aimed at typization, wrapping and unwrapping
of errors, as well as saving error stack traces.

Since Go 1.13 now provides its own error wrapping mechanism and a few
related functions, it makes sense to switch to it.

While doing that, improve some error messages so that they start
with "error", "unable to", or "can't".

A few things that are worth mentioning:

1. We lose stack traces (which were never shown anyway).

2. Users of libcontainer that relied on particular errors (like
   ContainerNotExists) need to switch to using errors.Is with
   the new errors defined in error.go.

3. encoding/json is unable to unmarshal the built-in error type,
   so we have to introduce initError and wrap the errors into it
   (basically passing the error as a string). This is the same
   as it was before, just a tad simpler (actually the initError
   is a type that got removed in commit afa844311; also suddenly
   ierr variable name makes sense now).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-24 10:21:04 -07:00
Kir Kolyshkin
a7cfb23b88 *: stop using pkg/errors
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-22 16:09:47 -07:00
Kir Kolyshkin
c6fed264da libct/keys: stop using pkg/errors
Use fmt.Errorf with %w instead.

Convert the users to the new wrapping.

This fixes an errorlint warning.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-22 16:09:47 -07:00
Sebastiaan van Stijn
b45fbd43b8 errcheck: libcontainer
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-05-20 14:19:26 +02:00
Kir Kolyshkin
2726146b04 runc --debug: more tests
First, add runc --debug exec test cases, very similar to those in
debug.bats but for runc exec (rather than runc run). Do not include json
tests as it is already tested in debug.bats.

Second, add logrus debug to late stages of runc init, and amend the
integration tests to check for those messages. This serves two purposes:

 - demonstrate that runc init can be amended with debug logrus which is
   properly forwarded to and logged by the parent runc create/run/exec;

 - improve the chances to catch the race fixed by the previous commit.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-03-25 19:19:20 -07:00
Kir Kolyshkin
201d60c51d runc run/start/exec: fix init log forwarding race
Sometimes debug.bats test cases are failing like this:

> not ok 27 global --debug to --log --log-format 'json'
> # (in test file tests/integration/debug.bats, line 77)
> #   `[[ "${output}" == *"child process in init()"* ]]' failed

It happens more when writing to disk.

This issue is caused by the fact that runc spawns log forwarding goroutine
(ForwardLogs) but does not wait for it to finish, resulting in missing
debug lines from nsexec.

ForwardLogs itself, though, never finishes, because it reads from a
reading side of a pipe which writing side is not closed. This is
especially true in case of runc create, which spawns runc init and
exits; meanwhile runc init waits on exec fifo for arbitrarily long
time before doing execve.

So, to fix the failure described above, we need to:

 1. Make runc create/run/exec wait for ForwardLogs to finish;

 2. Make runc init close its log pipe file descriptor (i.e.
    the one which value is passed in _LIBCONTAINER_LOGPIPE
    environment variable).

This is exactly what this commit does:

 1. Amend ForwardLogs to return a channel, and wait for it in start().

 2. In runc init, save the log fd and close it as late as possible.

PS I have to admit I still do not understand why an explicit close of
log pipe fd is required in e.g. (*linuxSetnsInit).Init, right before
the execve which (thanks to CLOEXEC) closes the fd anyway.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-03-25 19:18:55 -07:00
Sebastiaan van Stijn
8bf216728c use string-concatenation instead of sprintf for simple cases
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-09-30 10:51:59 +02:00
Sebastiaan van Stijn
b48bbdd08d vendor: opencontainers/selinux v1.5.1, update deprecated uses
full diff: https://github.com/opencontainers/selinux/v1.4.0...v1.5.1

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-05-05 15:53:40 +02:00
Daniel J Walsh
cd96170c10 Need to setup labeling of kernel keyrings.
Work is ongoing in the kernel to support different kernel
keyrings per user namespace.  We want to allow SELinux to manage
kernel keyrings inside of the container.

Currently when runc creates the kernel keyring it gets the label which runc is
running with ususally `container_runtime_t`, with this change the kernel keyring
will be labeled with the container process label container_t:s0:C1,c2.

Container running as container_t:s0:c1,c2 can manage keyrings with the same label.

This change required a revendoring or the SELinux go bindings.

github.com/opencontainers/selinux.

Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
2019-03-13 17:57:30 -04:00
Michael Crosby
b1068fb925 Merge pull request #1814 from rhatdan/selinux
SELinux labels are tied to the thread
2018-11-05 10:00:11 -05:00
Aleksa Sarai
40f1468413 keyring: handle ENOSYS with keyctl(KEYCTL_JOIN_SESSION_KEYRING)
While all modern kernels (and I do mean _all_ of them -- this syscall
was added in 2.6.10 before git had begun development!) have support for
this syscall, LXC has a default seccomp profile that returns ENOSYS for
this syscall. For most syscalls this would be a deal-breaker, and our
use of session keyrings is security-based there are a few mitigating
factors that make this change not-completely-insane:

  * We already have a flag that disables the use of session keyrings
    (for older kernels that had system-wide keyring limits and so
    on). So disabling it is not a new idea.

  * While the primary justification of using session keys *is*
    security-based, it's more of a security-by-obscurity protection.
    The main defense keyrings have is VFS credentials -- which is
    something that users already have better security tools for
    (setuid(2) and user namespaces).

  * Given the security justification you might argue that we
    shouldn't silently ignore this. However, the only way for the
    kernel to return -ENOSYS is either being ridiculously old (at
    which point we wouldn't work anyway) or that there is a seccomp
    profile in place blocking it.

    Given that the seccomp profile (if malicious) could very easily
    just return 0 or a silly return code (or something even more
    clever with seccomp-bpf) and trick us without this patch, there
    isn't much of a significant change in how much seccomp can trick
    us with or without this patch.

Given all of that over-analysis, I'm pretty convinced there isn't a
security problem in this very specific case and it will help out the
ChromeOS folks by allowing Docker to run inside their LXC container
setup. I'd be happy to be proven wrong.

Ref: https://bugs.chromium.org/p/chromium/issues/detail?id=860565
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2018-09-17 21:38:30 +10:00
Daniel J Walsh
aa3fee6c80 SELinux labels are tied to the thread
We need to lock the threads for the SetProcessLabel to work,
should also call SetProcessLabel("") after the container starts
to go back to the default SELinux behaviour.

Once you call SetProcessLabel, then any process executed by runc
will run with this label, even if the process is for setup rather
then the container.

It is always safest to call the SELinux calls just before the exec of the
container, so that other processes do not get started with the incorrect label.

Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
2018-06-11 08:34:58 -04:00
Aleksa Sarai
1f32fff46d setns init: delay seccomp as late as possible
This mirrors the standard_init_linux.go seccomp code, which only applies
seccomp early if NoNewPrivileges is enabled. Otherwise it's done
immediately before execve to reduce the amount of syscalls necessary for
users to enable in their seccomp profiles.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-08-26 13:42:30 +10:00
Tobias Klauser
4019833d46 libcontainer: use PR_SET_NO_NEW_PRIVS from x/sys/unix
Use PR_SET_NO_NEW_PRIVS defined in golang.org/x/sys/unix instead of
manually defining it.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-07-13 15:31:33 +02:00
Tobias Klauser
553016d7da Use Prctl() from x/sys/unix instead of own wrapper
Use unix.Prctl() instead of reimplemnting it as system.Prctl().

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-06-07 15:03:15 +02:00
Qiang Huang
5e7b48f7c0 Use opencontainers/selinux package
It's splitted as a separate project.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-03-23 08:21:19 +08:00
Michael Crosby
00a0ecf554 Add separate console socket
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2017-03-16 10:23:59 -07:00
Aleksa Sarai
e034cedce7 libcontainer: init: only pass stateDirFd when creating a container
If we pass a file descriptor to the host filesystem while joining a
container, there is a race condition where a process inside the
container can ptrace(2) the joining process and stop it from closing its
file descriptor to the stateDirFd. Then the process can access the
*host* filesystem from that file descriptor. This was fixed in part by
5d93fed3d2 ("Set init processes as non-dumpable"), but that fix is
more of a hail-mary than an actual fix for the underlying issue.

To fix this, don't open or pass the stateDirFd to the init process
unless we're creating a new container. A proper fix for this would be to
remove the need for even passing around directory file descriptors
(which are quite dangerous in the context of mount namespaces).

There is still an issue with containers that have CAP_SYS_PTRACE and are
using the setns(2)-style of joining a container namespace. Currently I'm
not really sure how to fix it without rampant layer violation.

Fixes: CVE-2016-9962
Fixes: 5d93fed3d2 ("Set init processes as non-dumpable")
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-02-02 00:41:11 +11:00
Michael Crosby
5d93fed3d2 Set init processes as non-dumpable
This sets the init processes that join and setup the container's
namespaces as non-dumpable before they setns to the container's pid (or
any other ) namespace.

This settings is automatically reset to the default after the Exec in
the container so that it does not change functionality for the
applications that are running inside, just our init processes.

This prevents parent processes, the pid 1 of the container, to ptrace
the init process before it drops caps and other sets LSMs.

This patch also ensures that the stateDirFD being used is still closed
prior to exec, even though it is set as O_CLOEXEC, because of the order
in the kernel.

https://github.com/torvalds/linux/blob/v4.9/fs/exec.c#L1290-L1318

The order during the exec syscall is that the process is set back to
dumpable before O_CLOEXEC are processed.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2017-01-11 09:56:56 -08:00