Commit Graph

161 Commits

Author SHA1 Message Date
lifubang
a56f2bc836 libct: we should set envs after we are in the jail of the container
Because we have to set a default HOME env for the current container
user, so we should set it after we are in the jail of the container,
or else we'll use host's `/etc/passwd` to get a wrong HOME value.
Please see: #4688.

Signed-off-by: lifubang <lifubang@acmcoder.com>
(cherry picked from commit bf38646497)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-04-01 17:25:45 -07:00
Kir Kolyshkin
10ca66bff5 runc exec: implement CPU affinity
As per
- https://github.com/opencontainers/runtime-spec/pull/1253
- https://github.com/opencontainers/runtime-spec/pull/1261

CPU affinity can be set in two ways:
1. When creating/starting a container, in config.json's
   Process.ExecCPUAffinity, which is when applied to all execs.
2. When running an exec, in process.json's CPUAffinity, which
   applied to a given exec and overrides the value from (1).

Add some basic tests.

Note that older kernels (RHEL8, Ubuntu 20.04) change CPU affinity of a
process to that of a container's cgroup, as soon as it is moved to that
cgroup, while newer kernels (Ubuntu 24.04, Fedora 41) don't do that.

Because of the above,
 - it's impossible to really test initial CPU affinity without adding
   debug logging to libcontainer/nsenter;
 - for older kernels, there can be a brief moment when exec's affinity
   is different than either initial or final affinity being set;
 - exec's final CPU affinity, if not specified, can be different
   depending on the kernel, therefore we don't test it.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-03-02 19:17:41 -08:00
Kir Kolyshkin
a75076b4a4 Switch to opencontainers/cgroups
This removes libcontainer/cgroups packages and starts
using those from github.com/opencontainers/cgroups repo.

Mostly generated by:

  git rm -f libcontainer/cgroups

  find . -type f -name "*.go" -exec sed -i \
    's|github.com/opencontainers/runc/libcontainer/cgroups|github.com/opencontainers/cgroups|g' \
    {} +

  go get github.com/opencontainers/cgroups@v0.0.1
  make vendor
  gofumpt -w .

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-28 15:20:33 -08:00
Kir Kolyshkin
99f9ed94dc runc exec: fix setting process.Scheduler
Commit 770728e1 added Scheduler field into both Config and Process,
but forgot to add a mechanism to actually use Process.Scheduler.
As a result, runc exec does not set Process.Scheduler ever.

Fix it, and a test case (which fails before the fix).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-11 18:01:30 -08:00
Kir Kolyshkin
b9114d91e2 runc exec: fix setting process.ioPriority
Commit bfbd0305b added IOPriority field into both Config and Process,
but forgot to add a mechanism to actually use Process.IOPriority.
As a result, runc exec does not set Process.IOPriority ever.

Fix it, and a test case (which fails before the fix).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-11 18:01:30 -08:00
Kir Kolyshkin
73849e797f libct: simplify Caps inheritance
For all other properties that are available in both Config and Process,
the merging is performed by newInitConfig.

Let's do the same for Capabilities for the sake of code uniformity.

Also, thanks to the previous commit, we no longer have to make sure we
do not call capabilities.New(nil).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-11 18:01:30 -08:00
Kir Kolyshkin
f26ec92221 libct: rm Rootless* properties from initConfig
They are passed in initConfig twice, so it does not make sense.

NB: the alternative to that would be to remove Config field from
initConfig, but it results in a much bigger patch and more maintenance
down the road.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-11 18:01:30 -08:00
Kir Kolyshkin
2a86c35768 libct: document initConfig and friends
This is one of the dark corners of runc / libcontainer, so let's shed
some light on it.

initConfig is a structure which is filled in [mostly] by newInitConfig,
and one of its hidden aspects is it contains a process config which is
the result of merge between the container and the process configs.

Let's document how all this happens, where the fields are coming from,
which one has a preference, and how it all works.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-11 18:01:30 -08:00
Kir Kolyshkin
52f702af56 libct: earlier Rootless vs AdditionalGroups check
Since the UID/GID/AdditonalGroups fields are now numeric,
we can address the following TODO item in the code (added
by commit d2f49696 back in 2016):

> TODO: We currently can't do
> this check earlier, but if libcontainer.Process.User was typesafe
> this might work.

Move the check to much earlier phase, when we're preparing
to start a process in a container.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-06 17:49:17 -08:00
Kir Kolyshkin
7dc2486889 libct: switch to numeric UID/GID/groups
This addresses the following TODO in the code (added back in 2015
by commit 845fc65e5):

> // TODO: fix libcontainer's API to better support uid/gid in a typesafe way.

Historically, libcontainer internally uses strings for user, group, and
additional (aka supplementary) groups.

Yet, runc receives those credentials as part of runtime-spec's process,
which uses integers for all of them (see [1], [2]).

What happens next is:

1. runc start/run/exec converts those credentials to strings (a User
   string containing "UID:GID", and a []string for additional GIDs) and
   passes those onto runc init.
2. runc init converts them back to int, in the most complicated way
   possible (parsing container's /etc/passwd and /etc/group).

All this conversion and, especially, parsing is totally unnecessary,
but is performed on every container exec (and start).

The only benefit of all this is, a libcontainer user could use user and
group names instead of numeric IDs (but runc itself is not using this
feature, and we don't know if there are any other users of this).

Let's remove this back and forth translation, hopefully increasing
runc exec performance.

The only remaining need to parse /etc/passwd is to set HOME environment
variable for a specified UID, in case $HOME is not explicitly set in
process.Env. This can now be done right in prepareEnv, which simplifies
the code flow a lot. Alas, we can not use standard os/user.LookupId, as
it could cache host's /etc/passwd or the current user (even with the
osusergo tag).

PS Note that the structures being changed (initConfig and Process) are
never saved to disk as JSON by runc, so there is no compatibility issue
for runc users.

Still, this is a breaking change in libcontainer, but we never promised
that libcontainer API will be stable (and there's a special package
that can handle it -- github.com/moby/sys/user). Reflect this in
CHANGELOG.

For 3998.

[1]: https://github.com/opencontainers/runtime-spec/blob/v1.0.2/config.md#posix-platform-user
[2]: https://github.com/opencontainers/runtime-spec/blob/v1.0.2/specs-go/config.go#L86

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-06 17:49:17 -08:00
Kir Kolyshkin
06f1e07655 libct: speedup process.Env handling
The current implementation sets all the environment variables passed in
Process.Env in the current process, one by one, then uses os.Environ to
read those back.

As pointed out in [1], this is slow, as runc calls os.Setenv for every
variable, and there may be a few thousands of those. Looking into how
os.Setenv is implemented, it is indeed slow, especially when cgo is
enabled.

Looking into why it was implemented the way it is, I found commit
9744d72c and traced it to [2], which discusses the actual reasons.
It boils down to these two:

 - HOME is not passed into container as it is set in setupUser by
   os.Setenv and has no effect on config.Env;
 - there is a need to deduplicate the environment variables.

Yet it was decided in [2] to not go ahead with this patch, but
later [3] was opened with the carry of this patch, and merged.

Now, from what I see:

1. Passing environment to exec is way faster than using os.Setenv and
   os.Environ (tests show ~20x speed improvement in a simple Go test,
   and ~3x improvement in real-world test, see below).
2. Setting environment variables in the runc context may result is some
   ugly side effects (think GODEBUG, LD_PRELOAD, or _LIBCONTAINER_*).
3. Nothing in runtime spec says that the environment needs to be
   deduplicated, or the order of preference (whether the first or the
   last value of a variable with the same name is to be used). We should
   stick to what we have in order to maintain backward compatibility.

So, this patch:
 - switches to passing env directly to exec;
 - adds deduplication mechanism to retain backward compatibility;
 - takes care to set PATH from process.Env in the current process
   (so that supplied PATH is used to find the binary to execute),
   also to retain backward compatibility;
 - adds HOME to process.Env if not set;
 - ensures any StartContainer CommandHook entries with no environment
   set explicitly are run with the same environment as before. Thanks
   to @lifubang who noticed that peculiarity.

The benchmark added by the previous commit shows ~3x improvement:

	                │   before    │                after                 │
	                │   sec/op    │    sec/op     vs base                │
	ExecInBigEnv-20   61.53m ± 1%   21.87m ± 16%  -64.46% (p=0.000 n=10)

[1]: https://github.com/opencontainers/runc/pull/1983
[2]: https://github.com/docker-archive/libcontainer/pull/418
[3]: https://github.com/docker-archive/libcontainer/pull/432

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-01-09 18:22:53 +08:00
Kir Kolyshkin
7334ee01e6 libct/configs: rm IOPrioClassMapping
This is an internal implementation detail and should not be either
public or visible.

Amend setIOPriority to do own class conversion.

Fixes: bfbd0305 ("Add I/O priority")
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-12-22 18:17:44 -08:00
Kir Kolyshkin
5d3942eec3 libct: unify IOPriority setting
For some reason, io priority is set in different places between runc
start/run and runc exec:

 - for runc start/run, it is done in the middle of (*linuxStandardInit).Init,
   close to the place where we exec runc init.
 - for runc exec, it is done much earlier, in (*setnsProcess) start().

Let's move setIOPriority call for runc exec to (*linuxSetnsInit).Init,
so it is in the same logical place as for runc start/run.

Also, move the function itself to init_linux.go as it's part of init.

Should not have any visible effect, except part of runc init is run with
a different I/O priority.

While at it, rename setIOPriority to setupIOPriority, and make it accept
the whole *configs.Config, for uniformity with other similar functions.

Fixes: bfbd0305 ("Add I/O priority")
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-12-22 18:15:31 -08:00
Kir Kolyshkin
2dc3ea4b87 libct: simplify setIOPriority/setupScheduler calls
Move the nil check inside, simplifying the callers.

Fixes: bfbd0305 ("Add I/O priority")
Fixes: 770728e1 ("Support `process.scheduler`")
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-12-22 18:06:20 -08:00
Kir Kolyshkin
a56f85f87b libct/*: switch from configs to cgroups
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-12-11 19:08:40 -08:00
lifubang
871057d863 drop runc-dmz solution according to overlay solution
Because we have the overlay solution, we can drop runc-dmz binary
solution since it has too many limitations.

Signed-off-by: lifubang <lifubang@acmcoder.com>
2024-10-28 15:18:07 +00:00
Kir Kolyshkin
f2d56241d8 Merge pull request #4405 from amghazanfari/main
replace strings.SplitN with strings.Cut
2024-10-04 14:01:23 -07:00
Amir M. Ghazanfari
faffe1b9ee replace strings.SplitN with strings.Cut
Signed-off-by: Amir M. Ghazanfari <a.m.ghazanfari76@gmail.com>
2024-09-28 10:02:21 +03:30
lifubang
10c951e335 add ErrCgroupNotExist
For some rootless container, runc has no access to cgroup,
But the container is still running. So we should return the
`ErrNotRunning` and `ErrCgroupNotExist` error seperatlly.

Signed-off-by: lifubang <lifubang@acmcoder.com>
2024-09-23 23:27:35 +00:00
Akihiro Suda
429e06a518 libct: Signal: honor RootlessCgroups
`signalAllProcesses()` depends on the cgroup and is expected to fail
when runc is running in rootless without an access to the cgroup.

When `RootlessCgroups` is set to `true`, runc just ignores the error
from `signalAllProcesses` and may leak some processes running.
(See the comments in PR 4395)
In the future, runc should walk the process tree to avoid such a leak.

Note that `RootlessCgroups` is a misnomer; it is set to `false` despite
the name when cgroup v2 delegation is configured.
This is expected to be renamed in a separate commit.

Fix issue 4394

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2024-09-11 03:54:52 +09:00
Akihiro Suda
e7848482e2 Revert "libcontainer: seccomp: pass around *os.File for notifyfd"
This reverts commit 20b95f23ca.

> Conflicts:
>	libcontainer/init_linux.go

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2024-07-03 17:28:12 +09:00
Kir Kolyshkin
584afc6756 libct/system: ClearRlimitNofileCache for go 1.23
Go 1.23 tightens access to internal symbols, and even puts runc into
"hall of shame" for using an internal symbol (recently added by commit
da68c8e3). So, while not impossible, it becomes harder to access those
internal symbols, and it is a bad idea in general.

Since Go 1.23 includes https://go.dev/cl/588076, we can clean the
internal rlimit cache by setting the RLIMIT_NOFILE for ourselves,
essentially disabling the rlimit cache.

Once Go 1.22 is no longer supported, we will remove the go:linkname hack.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-06-01 13:02:29 -07:00
ls-ggg
da68c8e37b libct: clean cached rlimit nofile in go runtime
As reported in issue #4195, the new version(since 1.19) of go runtime
will cache rlimit-nofile. Before executing execve, the rlimit-nofile
of the process will be restored with the cache. In runc, this will
cause the rlimit-nofile set by the parent process for the container
to become invalid. It can be solved by clearing the cache.

Signed-off-by: ls-ggg <335814617@qq.com>
(cherry picked from commit f9f8abf310)
Signed-off-by: lifubang <lifubang@acmcoder.com>
2024-05-08 10:40:13 +00:00
Kir Kolyshkin
0eb8bb5f66 Format sources with gofumpt v0.6
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-04-25 08:39:26 -07:00
Aleksa Sarai
02120488a4 Merge pull request from GHSA-xr7r-f8xq-vfvv
fix GHSA-xr7r-f8xq-vfvv and harden fd leaks
2024-02-01 07:04:29 +11:00
Kir Kolyshkin
8454bbb613 Merge pull request #4175 from cyphar/fd-file-switch
init: use *os.File for passed file descriptors
2024-01-31 10:40:28 -08:00
lifubang
35aa63ea87 never send procError after the socket closed
Signed-off-by: lifubang <lifubang@acmcoder.com>
2024-01-25 04:52:11 +00:00
Aleksa Sarai
f2f16213e1 init: close internal fds before execve
If we leak a file descriptor referencing the host filesystem, an
attacker could use a /proc/self/fd magic-link as the source for execve
to execute a host binary in the container. This would allow the binary
itself (or a process inside the container in the 'runc exec' case) to
write to a host binary, leading to a container escape.

The simple solution is to make sure we close all file descriptors
immediately before the execve(2) step. Doing this earlier can lead to very
serious issues in Go (as file descriptors can be reused, any (*os.File)
reference could start silently operating on a different file) so we have
to do it as late as possible.

Unfortunately, there are some Go runtime file descriptors that we must
not close (otherwise the Go scheduler panics randomly). The only way of
being sure which file descriptors cannot be closed is to sneakily
go:linkname the runtime internal "internal/poll.IsPollDescriptor"
function. This is almost certainly not recommended but there isn't any
other way to be absolutely sure, while also closing any other possible
files.

In addition, we can keep the logrus forwarding logfd open because you
cannot execve a pipe and the contents of the pipe are so restricted
(JSON-encoded in a format we pick) that it seems unlikely you could even
construct shellcode. Closing the logfd causes issues if there is an
error returned from execve.

In mainline runc, runc-dmz protects us against this attack because the
intermediate execve(2) closes all of the O_CLOEXEC internal runc file
descriptors and thus runc-dmz cannot access them to attack the host.

Fixes: GHSA-xr7r-f8xq-vfvv CVE-2024-21626
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2024-01-24 00:20:58 +11:00
Aleksa Sarai
8e1cd2f56d init: verify after chdir that cwd is inside the container
If a file descriptor of a directory in the host's mount namespace is
leaked to runc init, a malicious config.json could use /proc/self/fd/...
as a working directory to allow for host filesystem access after the
container runs. This can also be exploited by a container process if it
knows that an administrator will use "runc exec --cwd" and the target
--cwd (the attacker can change that cwd to be a symlink pointing to
/proc/self/fd/... and wait for the process to exec and then snoop on
/proc/$pid/cwd to get access to the host). The former issue can lead to
a critical vulnerability in Docker and Kubernetes, while the latter is a
container breakout.

We can (ab)use the fact that getcwd(2) on Linux detects this exact case,
and getcwd(3) and Go's Getwd() return an error as a result. Thus, if we
just do os.Getwd() after chdir we can easily detect this case and error
out.

In runc 1.1, a /sys/fs/cgroup handle happens to be leaked to "runc
init", making this exploitable. On runc main it just so happens that the
leaked /sys/fs/cgroup gets clobbered and thus this is only consistently
exploitable for runc 1.1.

Fixes: GHSA-xr7r-f8xq-vfvv CVE-2024-21626
Co-developed-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: lifubang <lifubang@acmcoder.com>
[refactored the implementation and added more comments]
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2024-01-24 00:20:58 +11:00
Aleksa Sarai
7094efb192 init: use *os.File for passed file descriptors
While it doesn't make much of a practical difference, it seems far more
reasonable to use os.NewFile to wrap all of our passed file descriptors
to make sure they're tracked by the Go runtime and that we don't
double-close them.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2024-01-22 17:34:14 +11:00
Aleksa Sarai
8e8b136c49 tree-wide: use /proc/thread-self for thread-local state
With the idmap work, we will have a tainted Go thread in our
thread-group that has a different mount namespace to the other threads.
It seems that (due to some bad luck) the Go scheduler tends to make this
thread the thread-group leader in our tests, which results in very
baffling failures where /proc/self/mountinfo produces gibberish results.

In order to avoid this, switch to using /proc/thread-self for everything
that is thread-local. This primarily includes switching all file
descriptor paths (CLONE_FS), all of the places that check the current
cgroup (technically we never will run a single runc thread in a separate
cgroup, but better to be safe than sorry), and the aforementioned
mountinfo code. We don't need to do anything for the following because
the results we need aren't thread-local:

 * Checks that certain namespaces are supported by stat(2)ing
   /proc/self/ns/...

 * /proc/self/exe and /proc/self/cmdline are not thread-local.

 * While threads can be in different cgroups, we do not do this for the
   runc binary (or libcontainer) and thus we do not need to switch to
   the thread-local version of /proc/self/cgroups.

 * All of the CLONE_NEWUSER files are not thread-local because you
   cannot set the usernamespace of a single thread (setns(CLONE_NEWUSER)
   is blocked for multi-threaded programs).

Note that we have to use runtime.LockOSThread when we have an open
handle to a tid-specific procfs file that we are operating on multiple
times. Go can reschedule us such that we are running on a different
thread and then kill the original thread (causing -ENOENT or similarly
confusing errors). This is not strictly necessary for most usages of
/proc/thread-self (such as using /proc/thread-self/fd/$n directly) since
only operating on the actual inodes associated with the tid requires
this locking, but because of the pre-3.17 fallback for CentOS, we have
to do this in most cases.

In addition, CentOS's kernel is too old for /proc/thread-self, which
requires us to emulate it -- however in rootfs_linux.go, we are in the
container pid namespace but /proc is the host's procfs. This leads to
the incredibly frustrating situation where there is no way (on pre-4.1
Linux) to figure out which /proc/self/task/... entry refers to the
current tid. We can just use /proc/self in this case.

Yes this is all pretty ugly. I also wish it wasn't necessary.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-12-14 11:36:41 +11:00
Aleksa Sarai
ba0b5e2698 libcontainer: remove all mount logic from nsexec
With open_tree(OPEN_TREE_CLONE), it is possible to implement both the
id-mapped mounts and bind-mount source file descriptor logic entirely in
Go without requiring any complicated handling from nsexec.

However, implementing it the naive way (do the OPEN_TREE_CLONE in the
host namespace before the rootfs is set up -- which is what the existing
implementation did) exposes issues in how mount ordering (in particular
when handling mount sources from inside the container rootfs, but also
in relation to mount propagation) was handled for idmapped mounts and
bind-mount sources. In order to solve this problem completely, it is
necessary to spawn a thread which joins the container mount namespace
and provides mountfds when requested by the rootfs setup code (ensuring
that the mount order and mount propagation of the source of the
bind-mount are handled correctly). While the need to join the mount
namespace leads to other complicated (such as with the usage of
/proc/self -- fixed in a later patch) the resulting code is still
reasonable and is the only real way to solve the issue.

This allows us to reduce the amount of C code we have in nsexec, as well
as simplifying a whole host of places that were made more complicated
with the addition of id-mapped mounts and the bind sourcefd logic.
Because we join the container namespace, we can continue to use regular
O_PATH file descriptors for non-id-mapped bind-mount sources (which
means we don't have to raise the kernel requirement for that case).

In addition, we can easily add support for id-mappings that don't match
the container's user namespace. The approach taken here is to use Go's
officially supported mechanism for spawning a process in a user
namespace, but (ab)use PTRACE_TRACEME to avoid actually having to exec a
different process. The most efficient way to implement this would be to
do clone() in cgo directly to run a function that just does
kill(getpid(), SIGSTOP) -- we can always switch to that if it turns out
this approach is too slow. It should be noted that the included
micro-benchmark seems to indicate this is Fast Enough(TM):

  goos: linux
  goarch: amd64
  pkg: github.com/opencontainers/runc/libcontainer/userns
  cpu: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
  BenchmarkSpawnProc
  BenchmarkSpawnProc-8        1670            770065 ns/op

Fixes: fda12ab101 ("Support idmap mounts on volumes")
Fixes: 9c444070ec ("Open bind mount sources from the host userns")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-12-14 11:36:40 +11:00
Aleksa Sarai
9387eac3a5 init: don't pre-flight-check the set[ug]id arguments
While we do cache the mappings when using userns paths, there's no need
to do this in this particular case, since we are in the namespace and
set[ug]id() give unambiguous EINVAL error codes if the id is unmapped.
This appears to also be the only code which does Host[UG]ID calculations
from inside "runc init".

Ref: 1a5fdc1c5f ("init: support setting -u with rootless containers")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-12-05 17:46:08 +11:00
Kir Kolyshkin
dcf1b731f5 runc kill: fix sending KILL to non-pidns container
Commit f8ad20f made it impossible to kill leftover processes in a
stopped container that does not have its own PID namespace. In other
words, if a container init is gone, it is no longer possible to use
`runc kill` to kill the leftover processes.

Fix this by moving the check if container init exists to after the
special case of handling the container without own PID namespace.

While at it, fix the minor issue introduced by commit 9583b3d:
if signalAllProcesses is used, there is no need to thaw the
container (as freeze/thaw is either done in signalAllProcesses already,
or not needed at all).

Also, make signalAllProcesses return an error early if the container
cgroup does not exist (as it relies on it to do its job). This way, the
error message returned is more generic and easier to understand
("container not running" instead of "can't open file").

Finally, add a test case.

Fixes: f8ad20f
Fixes: 9583b3d
Co-authored-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-11-27 09:15:39 -08:00
lfbzhm
95a93c132c Merge pull request #4045 from fuweid/support-pidfd-socket
[feature request] *: introduce pidfd-socket flag
2023-11-22 09:13:55 +08:00
Wei Fu
94505a046a *: introduce pidfd-socket flag
The container manager like containerd-shim can't use cgroup.kill feature or
freeze all the processes in cgroup to terminate the exec init process.
It's unsafe to call kill(2) since the pid can be recycled. It's good to
provide the pidfd of init process through the pidfd-socket. It's similar to
the console-socket. With the pidfd, the container manager like containerd-shim
can send the signal to target process safely.

And for the standard init process, we can have polling support to get
exit event instead of blocking on wait4.

Signed-off-by: Wei Fu <fuweid89@gmail.com>
2023-11-21 18:28:50 +08:00
Zheao.Li
98511bb40e linux: Support setting execution domain via linux personality
carry #3126

Co-authored-by: Aditya R <arajan@redhat.com>
Signed-off-by: Zheao.Li <me@manjusaka.me>
2023-10-27 19:33:37 +08:00
Akihiro Suda
0274ca2580 Merge pull request #4025 from lifubang/feat-sched-carry-3962
[Carry 3962] Support `process.scheduler`
2023-10-12 08:07:50 +09:00
Bjorn Neergaard
6f7266c3f7 libcontainer: drop system.Setxid
Since Go 1.16, [Go issue 1435][1] is solved, and the stdlib syscall
implementations work on Linux. While they are a bit more
flexible/heavier-weight than the implementations that were copied to
libcontainer/system (working across all threads), we compile with Cgo,
and using the libc wrappers should be just as suitable.

  [1]: https://github.com/golang/go/issues/1435

Signed-off-by: Bjorn Neergaard <bjorn.neergaard@docker.com>
2023-10-11 13:04:34 -06:00
utam0k
770728e16e Support process.scheduler
Spec: https://github.com/opencontainers/runtime-spec/pull/1188
Fix: https://github.com/opencontainers/runc/issues/3895

Co-authored-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: utam0k <k0ma@utam0k.jp>
Signed-off-by: lifubang <lifubang@acmcoder.com>
2023-10-04 15:53:18 +08:00
Kir Kolyshkin
6538e6d0bd libct: fix a typo
syncrhonisation ==> synchronisation

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-10-03 17:51:54 -07:00
Aleksa Sarai
8da42aaec2 sync: split init config (stream) and synchronisation (seqpacket) pipes
We have different requirements for the initial configuration and
initWaiter pipe (just send netlink and JSON blobs with no complicated
handling needed for message coalescing) and the packet-based
synchronisation pipe.

Tests with switching everything to SOCK_SEQPACKET lead to endless issues
with runc hanging on start-up because random things would try to do
short reads (which SOCK_SEQPACKET will not allow and the Go stdlib
explicitly treats as a streaming source), so splitting it was the only
reasonable solution. Even doing somewhat dodgy tricks such as adding a
Read() wrapper which actually calls ReadPacket() and makes it seem like
a stream source doesn't work -- and is a bit too magical.

One upside is that doing it this way makes the difference between the
modes clearer -- INITPIPE is still used for initWaiter syncrhonisation
but aside from that all other synchronisation is done by SYNCPIPE.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-09-24 20:31:14 +08:00
Aleksa Sarai
ccc76713a7 sync: rename procResume -> procHooksDone
The old name was quite confusing, and with the addition of the
procMountPlease sync message there are now multiple sync messages that
are related to "resuming" runc-init.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-09-24 20:02:11 +08:00
lifubang
dac4171746 runc-dmz: reduce memfd binary cloning cost with small C binary
The idea is to remove the need for cloning the entire runc binary by
replacing the final execve() call of the container process with an
execve() call to a clone of a small C binary which just does an execve()
of its arguments.

This provides similar protection against CVE-2019-5736 but without
requiring a >10MB binary copy for each "runc init". When compiled with
musl, runc-dmz is 13kB (though unfortunately with glibc, it is 1.1MB
which is still quite large).

It should be noted that there is still a window where the container
processes could get access to the host runc binary, but because we set
ourselves as non-dumpable the container would need CAP_SYS_PTRACE (which
is not enabled by default in Docker) in order to get around the
proc_fd_access_allowed() checks. In addition, since Linux 4.10[1] the
kernel blocks access entirely for user namespaced containers in this
scenario. For those cases we cannot use runc-dmz, but most containers
won't have this issue.

This new runc-dmz binary can be opted out of at compile time by setting
the "runc_nodmz" buildtag, and at runtime by setting the RUNC_DMZ=legacy
environment variable. In both cases, runc will fall back to the classic
/proc/self/exe-based cloning trick. If /proc/self/exe is already a
sealed memfd (namely if the user is using contrib/cmd/memfd-bind to
create a persistent sealed memfd for runc), neither runc-dmz nor
/proc/self/exe cloning will be used because they are not necessary.

[1]: bfedb58925

Co-authored-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: lifubang <lifubang@acmcoder.com>
[cyphar: address various review nits]
[cyphar: fix runc-dmz cross-compilation]
[cyphar: embed runc-dmz into runc binary and clone in Go code]
[cyphar: make runc-dmz optional, with fallback to /proc/self/exe cloning]
[cyphar: do not use runc-dmz when the container has certain privs]
Co-authored-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-09-22 15:38:19 +10:00
Sebastiaan van Stijn
ca32014adb migrate libcontainer/user to github.com/moby/sys/user
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2023-09-19 10:22:23 +02:00
Kir Kolyshkin
f62f0bdfbf Remove nolint annotations for unix errno comparisons
golangci-lint v1.54.2 comes with errorlint v1.4.4, which contains
the fix [1] whitelisting all errno comparisons for errors coming from
x/sys/unix.

Thus, these annotations are no longer necessary. Hooray!

[1] https://github.com/polyfloyd/go-errorlint/pull/47
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-08-24 17:28:10 -07:00
Aleksa Sarai
20b95f23ca libcontainer: seccomp: pass around *os.File for notifyfd
*os.File is correctly tracked by the garbage collector, and there's no
need to use raw file descriptors for this code.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-08-15 19:54:24 -07:00
Aleksa Sarai
f81ef1493d libcontainer: sync: cleanup synchronisation code
This includes quite a few cleanups and improvements to the way we do
synchronisation. The core behaviour is unchanged, but switching to
embedding json.RawMessage into the synchronisation structure will allow
us to do more complicated synchronisation operations in future patches.

The file descriptor passing through the synchronisation system feature
will be used as part of the idmapped-mount and bind-mount-source
features when switching that code to use the new mount API outside of
nsexec.c.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-08-15 19:54:24 -07:00
Kir Kolyshkin
883aef789b libct/init: unify init, fix its error logic
This commit does two things:

1. Consolidate StartInitialization calling logic into Init().
2. Fix init error handling logic.

The main issues at hand are:
- the "unable to convert _LIBCONTAINER_INITPIPE" error from
  StartInitialization is never shown;
- errors from WriteSync and WriteJSON are never shown;
- the StartInit calling code is triplicated;
- using panic is questionable.

Generally, our goals are:
 - if there's any error, do our best to show it;
 - but only show each error once;
 - simplify the code, unify init implementations.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-08-04 13:00:35 -07:00
Kir Kolyshkin
789a73db22 init.go: move logger setup to StartInitialization
Currently, logrus is used from the Go part of runc init, mostly for a
few debug messages (see setns_init_linux.go and standard_init_linux.go),
and a single warning (see rootfs_linux.go).

This means logrus is part of init implementation, and thus, its setup
belongs to StartInitialization().

Move the code there. As a nice side effect, now we don't have to convert
_LIBCONTAINER_LOGPIPE twice.

Note that since this initialization is now also called from libct/int
tests, which do not set _LIBCONTAINER_LOGLEVEL, let's make
_LIBCONTAINER_LOGLEVEL optional.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-08-04 13:00:34 -07:00