Commit 770728e1 added Scheduler field into both Config and Process,
but forgot to add a mechanism to actually use Process.Scheduler.
As a result, runc exec does not set Process.Scheduler ever.
Fix it, and add a test case (which fails before the fix).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit bfbd0305b added IOPriority field into both Config and Process,
but forgot to add a mechanism to actually use Process.IOPriority.
As a result, runc exec does not set Process.IOPriority ever.
Fix it, and add a test case (which fails before the fix).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
For all other properties that are available in both Config and Process,
the merging is performed by newInitConfig.
Let's do the same for Capabilities for the sake of code uniformity.
Also, thanks to the previous commit, we no longer have to make sure we
do not call capabilities.New(nil).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
They are passed in initConfig twice, so it does not make sense.
NB: the alternative to that would be to remove Config field from
initConfig, but it results in a much bigger patch and more maintenance
down the road.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This is one of the dark corners of runc / libcontainer, so let's shed
some light on it.
initConfig is a structure which is filled in [mostly] by newInitConfig,
and one of its hidden aspects is it contains a process config which is
the result of merge between the container and the process configs.
Let's document how all this happens, where the fields are coming from,
which one has a preference, and how it all works.
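As a rough illustration of such a merge (a sketch only, not the actual
newInitConfig code): the types and the merge helper below are
hypothetical, and the rule that a per-process value, when set, wins over
the container-wide one is an assumption made for the example.

```go
package main

import "fmt"

// containerConfig and processConfig are simplified, hypothetical
// stand-ins for the container-wide and per-process configs.
type containerConfig struct {
	NoNewPrivileges bool
	AppArmorProfile string
}

type processConfig struct {
	NoNewPrivileges *bool  // nil means "not set, use the container value"
	AppArmorProfile string // "" means "not set"
}

// merge returns the effective settings, assuming that a per-process
// value (when set) takes preference over the container-wide one.
func merge(c containerConfig, p processConfig) containerConfig {
	out := c
	if p.NoNewPrivileges != nil {
		out.NoNewPrivileges = *p.NoNewPrivileges
	}
	if p.AppArmorProfile != "" {
		out.AppArmorProfile = p.AppArmorProfile
	}
	return out
}

func main() {
	nnp := true
	eff := merge(
		containerConfig{NoNewPrivileges: false, AppArmorProfile: "default"},
		processConfig{NoNewPrivileges: &nnp},
	)
	fmt.Printf("%+v\n", eff)
}
```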
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Every time we call container.Config(), a new copy of
struct Config is created and returned, and we do it twice here.
Accessing container.config directly fixes this.
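A minimal sketch of why the accessor is more expensive than the field
(the Config and Container types here are simplified stand-ins, not the
real libcontainer definitions): a method returning the struct by value
copies it on every call.

```go
package main

import "fmt"

// Config and Container are simplified, hypothetical types.
type Config struct {
	Rootfs string
	Labels []string
}

type Container struct {
	config *Config
}

// Config returns the configuration by value: the struct is copied on
// every call, which is wasted work if the caller only reads from it.
func (c *Container) Config() Config {
	return *c.config
}

func main() {
	c := &Container{config: &Config{Rootfs: "/rootfs"}}
	cp := c.Config() // a fresh copy
	cp.Rootfs = "/elsewhere"
	// The original is unchanged, which shows that cp is a copy.
	fmt.Println(c.config.Rootfs, cp.Rootfs)
}
```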
Fixes: 805b8c73d ("Do not create exec fifo in factory.Create")
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Since the UID/GID/AdditionalGroups fields are now numeric,
we can address the following TODO item in the code (added
by commit d2f49696 back in 2016):
> TODO: We currently can't do
> this check earlier, but if libcontainer.Process.User was typesafe
> this might work.
Move the check to a much earlier phase, when we're preparing
to start a process in a container.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This addresses the following TODO in the code (added back in 2015
by commit 845fc65e5):
> // TODO: fix libcontainer's API to better support uid/gid in a typesafe way.
Historically, libcontainer internally uses strings for user, group, and
additional (aka supplementary) groups.
Yet, runc receives those credentials as part of runtime-spec's process,
which uses integers for all of them (see [1], [2]).
What happens next is:
1. runc start/run/exec converts those credentials to strings (a User
string containing "UID:GID", and a []string for additional GIDs) and
passes those onto runc init.
2. runc init converts them back to int, in the most complicated way
possible (parsing container's /etc/passwd and /etc/group).
All this conversion and, especially, parsing is totally unnecessary,
but is performed on every container exec (and start).
The only benefit of all this is, a libcontainer user could use user and
group names instead of numeric IDs (but runc itself is not using this
feature, and we don't know if there are any other users of this).
Let's remove this back and forth translation, hopefully increasing
runc exec performance.
The only remaining need to parse /etc/passwd is to set HOME environment
variable for a specified UID, in case $HOME is not explicitly set in
process.Env. This can now be done right in prepareEnv, which simplifies
the code flow a lot. Alas, we can not use standard os/user.LookupId, as
it could cache host's /etc/passwd or the current user (even with the
osusergo tag).
PS Note that the structures being changed (initConfig and Process) are
never saved to disk as JSON by runc, so there is no compatibility issue
for runc users.
Still, this is a breaking change in libcontainer, but we never promised
that libcontainer API will be stable (and there's a special package
that can handle it -- github.com/moby/sys/user). Reflect this in
CHANGELOG.
For 3998.
[1]: https://github.com/opencontainers/runtime-spec/blob/v1.0.2/config.md#posix-platform-user
[2]: https://github.com/opencontainers/runtime-spec/blob/v1.0.2/specs-go/config.go#L86
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The previous logic caused runc to hang if CloseExecFrom returned an
error: the deferred wait on logsDone never finished because the parent
process was never started (and it controls the closing of logsDone via
its logsPipe).
This moves the defer to after we have started the parent, which means all
the logic related to managing the logsPipe should also be running.
Signed-off-by: Evan Phoenix <evan@phx.io>
This allows omitting a call to c.currentOCIState (which can be somewhat
costly when there are many annotations) when the hooks of a given kind
won't be run.
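One way to get this effect is to compute the state lazily. The sketch
below uses hypothetical Hook and Hooks types and a getState callback in
place of c.currentOCIState; it is not the actual libcontainer API.

```go
package main

import "fmt"

// Hook and Hooks are simplified, hypothetical types.
type Hook func(state string) error

type Hooks map[string][]Hook

// Run calls all hooks of the given kind. getState is invoked only if
// there is at least one hook to run, so its cost is not paid otherwise.
func (h Hooks) Run(kind string, getState func() (string, error)) error {
	list := h[kind]
	if len(list) == 0 {
		return nil
	}
	state, err := getState()
	if err != nil {
		return err
	}
	for i, hook := range list {
		if err := hook(state); err != nil {
			return fmt.Errorf("error running %s hook #%d: %w", kind, i, err)
		}
	}
	return nil
}

func main() {
	calls := 0
	getState := func() (string, error) {
		calls++
		return "state", nil
	}
	hooks := Hooks{"poststart": {func(string) error { return nil }}}
	_ = hooks.Run("prestart", getState)  // no prestart hooks: getState is skipped
	_ = hooks.Run("poststart", getState) // one hook: getState is called
	fmt.Println("getState calls:", calls)
}
```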
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Introduce a common parent struct `containerProcess`,
and make both `initProcess` and `setnsProcess` inherit
from it.
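In Go terms, "inherit" means struct embedding. A minimal sketch, with
made-up fields and methods (the real structs carry much more state):

```go
package main

import "fmt"

// containerProcess holds what both process kinds have in common.
// The pid field and Pid method are illustrative only.
type containerProcess struct {
	pid int
}

func (p *containerProcess) Pid() int { return p.pid }

// initProcess embeds containerProcess, so Pid() is promoted to it.
type initProcess struct {
	containerProcess
}

// setnsProcess embeds the same parent and adds its own (hypothetical) field.
type setnsProcess struct {
	containerProcess
	initPid int
}

func main() {
	i := &initProcess{containerProcess: containerProcess{pid: 100}}
	s := &setnsProcess{containerProcess: containerProcess{pid: 200}, initPid: 100}
	fmt.Println(i.Pid(), s.Pid())
}
```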
Signed-off-by: lfbzhm <lifubang@acmcoder.com>
Because we have the overlay solution, we can drop runc-dmz binary
solution since it has too many limitations.
Signed-off-by: lifubang <lifubang@acmcoder.com>
Since Go 1.19, the same functionality is there in os/exec package.
As we require Go 1.22 now, there's no need to have this.
This basically reverts commit 9258eac0 ("libct/start: use execabs for
newuidmap lookup").
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
There is a race when we are opening a file: if a small fd was closed
at that time, it may be reused by safeExe.
Because of the Go stdlib fd shuffling bug, if the fd of safeExe is too
small, the Go stdlib will dup3 it to another fd, or dup3 another fd onto
this fd. As a result, the fd that cmd.Path refers to points to a random
path, which can lead to a "permission denied" error when starting the
process.
Please see #4294 and <https://github.com/golang/go/issues/61751>.
So we should not use the original fd of safeExe, but the fd after it
has been shuffled by the Go stdlib, because the Go stdlib guarantees that
this fd refers to the correct file.
Signed-off-by: lfbzhm <lifubang@acmcoder.com>
For some rootless containers, runc has no access to the cgroup,
but the container is still running. So we should return the
`ErrNotRunning` and `ErrCgroupNotExist` errors separately.
Signed-off-by: lifubang <lifubang@acmcoder.com>
`signalAllProcesses()` depends on the cgroup and is expected to fail
when runc is running in rootless mode without access to the cgroup.
When `RootlessCgroups` is set to `true`, runc just ignores the error
from `signalAllProcesses` and may leak some processes running.
(See the comments in PR 4395)
In the future, runc should walk the process tree to avoid such a leak.
Note that `RootlessCgroups` is a misnomer; it is set to `false` despite
the name when cgroup v2 delegation is configured.
This is expected to be renamed in a separate commit.
Fix issue 4394
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
By definition, every container has only 1 init (i.e. PID 1) process.
Apparently, libcontainer API supported running more than 1 init, and
at least one test mistakenly used it.
Let's not allow that, erroring out if we already have init. Doing
otherwise _probably_ results in some confusion inside the library.
Fix two cases in libct/int which ran two inits inside a container.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. The code to call c.exec from c.Run was initially added by commit
3aacff695. At the time, there was a lock in c.Run. That lock was
removed by commit bd3c4f84, which resulted in part of c.Run executing
without the lock.
2. All the Start/Run/Exec calls were mere wrappers for start/run/exec
adding a lock, but some more code crept into Start at some point,
e.g. by commits 805b8c73 and 108ee85b8. Since the reason mentioned in
commit 805b8c73 is no longer true after refactoring, we can fix this.
Fix both issues by moving code out of wrappers, and adding locking into
c.Run.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In case the file already exists, mknod(2) will return EEXIST.
This os.Stat call was (inadvertently?) added by commit 805b8c73.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Now that runc-dmz is opt-in, we no longer need to try to detect whether
SELinux would cause issues for us. We can also remove the
special-purpose build-tag we added.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Given the core issue in GHSA-xr7r-f8xq-vfvv was that we were unknowingly
leaking file descriptors to "runc init", it seems prudent to make sure
we proactively prevent this in the future. The solution is to simply
mark all non-stdio file descriptors as O_CLOEXEC before we spawn "runc
init".
For libcontainer library users, this could result in unrelated files
being marked as O_CLOEXEC -- however (for the same reason we are doing
this for runc), for security reasons those files should've been marked
as O_CLOEXEC anyway.
Fixes: GHSA-xr7r-f8xq-vfvv CVE-2024-21626
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Using ints for all of our mapping structures means that a 32-bit binary
errors out when trying to parse /proc/self/*id_map:
failed to cache mappings for userns: failed to parse uid_map of userns /proc/1/ns/user:
parsing id map failed: invalid format in line " 0 0 4294967295": integer overflow on token 4294967295
This issue was unearthed by commit 1912d5988b ("*: actually support
joining a userns with a new container") but the underlying issue has
been present since the docker/libcontainer days.
In theory, switching to uint32 (to match the spec) instead of int64
would also work, but keeping everything signed seems much less
error-prone. It's also important to note that a mapping might be too
large for an int on 32-bit, so we detect this during the mapping.
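To illustrate, a sketch of parsing one id_map line into int64 fields and
checking whether a value still fits into an int. IDMap, parseIDMapLine
and toInt are hypothetical names, not the actual libcontainer code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// IDMap uses int64 so that values such as 4294967295 fit even when int
// is 32 bits wide.
type IDMap struct {
	ContainerID, HostID, Size int64
}

// parseIDMapLine parses a line in the "<container> <host> <size>" format.
func parseIDMapLine(line string) (IDMap, error) {
	f := strings.Fields(line)
	if len(f) != 3 {
		return IDMap{}, fmt.Errorf("invalid format in line %q", line)
	}
	var v [3]int64
	for i, s := range f {
		n, err := strconv.ParseInt(s, 10, 64)
		if err != nil {
			return IDMap{}, fmt.Errorf("invalid format in line %q: %w", line, err)
		}
		v[i] = n
	}
	return IDMap{ContainerID: v[0], HostID: v[1], Size: v[2]}, nil
}

// toInt converts an ID to int, failing if it does not fit (which can
// happen on 32-bit platforms).
func toInt(v int64) (int, error) {
	if int64(int(v)) != v {
		return 0, fmt.Errorf("value %d does not fit into int", v)
	}
	return int(v), nil
}

func main() {
	m, err := parseIDMapLine("         0          0 4294967295")
	fmt.Println(m, err)
	fmt.Println(toInt(m.Size))
}
```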
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
With the idmap work, we will have a tainted Go thread in our
thread-group that has a different mount namespace to the other threads.
It seems that (due to some bad luck) the Go scheduler tends to make this
thread the thread-group leader in our tests, which results in very
baffling failures where /proc/self/mountinfo produces gibberish results.
In order to avoid this, switch to using /proc/thread-self for everything
that is thread-local. This primarily includes switching all file
descriptor paths (CLONE_FS), all of the places that check the current
cgroup (technically we never will run a single runc thread in a separate
cgroup, but better to be safe than sorry), and the aforementioned
mountinfo code. We don't need to do anything for the following because
the results we need aren't thread-local:
* Checks that certain namespaces are supported by stat(2)ing
/proc/self/ns/...
* /proc/self/exe and /proc/self/cmdline are not thread-local.
* While threads can be in different cgroups, we do not do this for the
runc binary (or libcontainer) and thus we do not need to switch to
the thread-local version of /proc/self/cgroups.
* All of the CLONE_NEWUSER files are not thread-local because you
cannot set the usernamespace of a single thread (setns(CLONE_NEWUSER)
is blocked for multi-threaded programs).
Note that we have to use runtime.LockOSThread when we have an open
handle to a tid-specific procfs file that we are operating on multiple
times. Go can reschedule us such that we are running on a different
thread and then kill the original thread (causing -ENOENT or similarly
confusing errors). This is not strictly necessary for most usages of
/proc/thread-self (such as using /proc/thread-self/fd/$n directly) since
only operating on the actual inodes associated with the tid requires
this locking, but because of the pre-3.17 fallback for CentOS, we have
to do this in most cases.
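The locking pattern can be sketched like this (Linux-only, and assuming
a kernel that has /proc/thread-self; readThreadSelf is a hypothetical
helper, and the emulation for older kernels is left out):

```go
package main

import (
	"fmt"
	"os"
	"runtime"
)

// readThreadSelf reads a file below /proc/thread-self while the
// goroutine is locked to its OS thread, so that the open and the
// following reads are all done by the same tid.
func readThreadSelf(subpath string) ([]byte, error) {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()
	return os.ReadFile("/proc/thread-self/" + subpath)
}

func main() {
	data, err := readThreadSelf("status")
	fmt.Println(len(data) > 0, err)
}
```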
In addition, CentOS's kernel is too old for /proc/thread-self, which
requires us to emulate it -- however in rootfs_linux.go, we are in the
container pid namespace but /proc is the host's procfs. This leads to
the incredibly frustrating situation where there is no way (on pre-4.1
Linux) to figure out which /proc/self/task/... entry refers to the
current tid. We can just use /proc/self in this case.
Yes this is all pretty ugly. I also wish it wasn't necessary.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
With open_tree(OPEN_TREE_CLONE), it is possible to implement both the
id-mapped mounts and bind-mount source file descriptor logic entirely in
Go without requiring any complicated handling from nsexec.
However, implementing it the naive way (do the OPEN_TREE_CLONE in the
host namespace before the rootfs is set up -- which is what the existing
implementation did) exposes issues in how mount ordering (in particular
when handling mount sources from inside the container rootfs, but also
in relation to mount propagation) was handled for idmapped mounts and
bind-mount sources. In order to solve this problem completely, it is
necessary to spawn a thread which joins the container mount namespace
and provides mountfds when requested by the rootfs setup code (ensuring
that the mount order and mount propagation of the source of the
bind-mount are handled correctly). While the need to join the mount
namespace leads to other complications (such as with the usage of
/proc/self -- fixed in a later patch) the resulting code is still
reasonable and is the only real way to solve the issue.
This allows us to reduce the amount of C code we have in nsexec, as well
as simplifying a whole host of places that were made more complicated
with the addition of id-mapped mounts and the bind sourcefd logic.
Because we join the container namespace, we can continue to use regular
O_PATH file descriptors for non-id-mapped bind-mount sources (which
means we don't have to raise the kernel requirement for that case).
In addition, we can easily add support for id-mappings that don't match
the container's user namespace. The approach taken here is to use Go's
officially supported mechanism for spawning a process in a user
namespace, but (ab)use PTRACE_TRACEME to avoid actually having to exec a
different process. The most efficient way to implement this would be to
do clone() in cgo directly to run a function that just does
kill(getpid(), SIGSTOP) -- we can always switch to that if it turns out
this approach is too slow. It should be noted that the included
micro-benchmark seems to indicate this is Fast Enough(TM):
goos: linux
goarch: amd64
pkg: github.com/opencontainers/runc/libcontainer/userns
cpu: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
BenchmarkSpawnProc
BenchmarkSpawnProc-8 1670 770065 ns/op
Fixes: fda12ab101 ("Support idmap mounts on volumes")
Fixes: 9c444070ec ("Open bind mount sources from the host userns")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
If container.Destroy() has failed, runc destroy still returns 0, which is
wrong and can result in other issues down the line.
Let's always return error from destroy in runc delete.
For runc checkpoint and runc run, we still treat it as a warning.
Co-authored-by: Zhang Tianyang <burning9699@gmail.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit f8ad20f made it impossible to kill leftover processes in a
stopped container that does not have its own PID namespace. In other
words, if a container init is gone, it is no longer possible to use
`runc kill` to kill the leftover processes.
Fix this by moving the check if container init exists to after the
special case of handling the container without own PID namespace.
While at it, fix the minor issue introduced by commit 9583b3d:
if signalAllProcesses is used, there is no need to thaw the
container (as freeze/thaw is either done in signalAllProcesses already,
or not needed at all).
Also, make signalAllProcesses return an error early if the container
cgroup does not exist (as it relies on it to do its job). This way, the
error message returned is more generic and easier to understand
("container not running" instead of "can't open file").
Finally, add a test case.
Fixes: f8ad20f
Fixes: 9583b3d
Co-authored-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The semantics of runType is slightly complicated, and the only place
where we need to distinguish between Created and Running is
refreshState.
Replace runType with simpler hasInit, simplifying its users (except the
refreshState, which now figures out on its own whether the container is
Created or Running).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
A container manager like containerd-shim can't use the cgroup.kill feature
or freeze all the processes in the cgroup to terminate the exec init process.
It's unsafe to call kill(2) since the pid can be recycled. It's good to
provide the pidfd of the init process through the pidfd-socket, which is
similar to the console-socket. With the pidfd, a container manager like
containerd-shim can send a signal to the target process safely.
And for the standard init process, we can have polling support to get
exit event instead of blocking on wait4.
Signed-off-by: Wei Fu <fuweid89@gmail.com>
Add a workaround for a problem of older container-selinux not allowing
runc to use dmz feature. If runc sees that SELinux is in enforcing mode
and the container's SELinux label is set, it disables dmz.
Add a build tag, runc_dmz_selinux_nocompat, which disables the workaround.
Newer distros that ship container-selinux >= 2.224.0 (currently CentOS
Stream 8 and 9, RHEL 8 and 9, and Fedora 38+) may build runc with this
build tag set to benefit from dmz working with SELinux.
Document the build tag in the top-level and libct/dmz READMEs.
Use the build tag in our CI builds for CentOS Stream 9 and Fedora 38,
as they already have container-selinux 2.224.0 available in updates.
Add a TODO to use the build tag for CentOS Stream 8 once it has
container-selinux updated.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The name "root" (or "containerRoot") is confusing; one might think it is
the root of container's file system (the directory we chroot into).
Rename to stateDir for clarity.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Umask is problematic for Go programs as it affects other goroutines
(see [1] for more details).
Instead of using it, let's just prop up with Chmod.
Note this patch misses the MkdirAll call in createDeviceNode. Since the
runtime spec does not say anything about creating intermediary
directories for device nodes, let's assume that doing it via mkdir with
the current umask set is sufficient (if not, we have to reimplement
MkdirAll from scratch, with added call to os.Chmod).
[1] https://github.com/opencontainers/runc/pull/3563#discussion_r990293788
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
We have different requirements for the initial configuration and
initWaiter pipe (just send netlink and JSON blobs with no complicated
handling needed for message coalescing) and the packet-based
synchronisation pipe.
Tests with switching everything to SOCK_SEQPACKET led to endless issues
with runc hanging on start-up because random things would try to do
short reads (which SOCK_SEQPACKET will not allow and the Go stdlib
explicitly treats as a streaming source), so splitting it was the only
reasonable solution. Even doing somewhat dodgy tricks such as adding a
Read() wrapper which actually calls ReadPacket() and makes it seem like
a stream source doesn't work -- and is a bit too magical.
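The short-read problem can be reproduced with a small Linux-only sketch
(seqpacketDemo is an illustrative name): with SOCK_SEQPACKET, a read
into a too-small buffer returns the start of the packet and drops the
rest, whereas a stream would hand over the remainder on the next read.

```go
package main

import (
	"fmt"
	"syscall"
)

// seqpacketDemo sends two packets over a SOCK_SEQPACKET socket pair and
// reads each of them through a 5-byte buffer.
func seqpacketDemo() (first, second string, err error) {
	fds, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_SEQPACKET|syscall.SOCK_CLOEXEC, 0)
	if err != nil {
		return "", "", err
	}
	defer syscall.Close(fds[0])
	defer syscall.Close(fds[1])

	for _, msg := range []string{"hello world", "next"} {
		if _, err := syscall.Write(fds[0], []byte(msg)); err != nil {
			return "", "", err
		}
	}

	buf := make([]byte, 5)
	n, err := syscall.Read(fds[1], buf)
	if err != nil {
		return "", "", err
	}
	first = string(buf[:n]) // only the start of the first packet

	n, err = syscall.Read(fds[1], buf)
	if err != nil {
		return "", "", err
	}
	second = string(buf[:n]) // the next packet, not the rest of the first
	return first, second, nil
}

func main() {
	fmt.Println(seqpacketDemo())
}
```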
One upside is that doing it this way makes the difference between the
modes clearer -- INITPIPE is still used for initWaiter synchronisation
but aside from that all other synchronisation is done by SYNCPIPE.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
The idea is to remove the need for cloning the entire runc binary by
replacing the final execve() call of the container process with an
execve() call to a clone of a small C binary which just does an execve()
of its arguments.
This provides similar protection against CVE-2019-5736 but without
requiring a >10MB binary copy for each "runc init". When compiled with
musl, runc-dmz is 13kB (though unfortunately with glibc, it is 1.1MB
which is still quite large).
It should be noted that there is still a window where the container
processes could get access to the host runc binary, but because we set
ourselves as non-dumpable the container would need CAP_SYS_PTRACE (which
is not enabled by default in Docker) in order to get around the
proc_fd_access_allowed() checks. In addition, since Linux 4.10[1] the
kernel blocks access entirely for user namespaced containers in this
scenario. For those cases we cannot use runc-dmz, but most containers
won't have this issue.
This new runc-dmz binary can be opted out of at compile time by setting
the "runc_nodmz" buildtag, and at runtime by setting the RUNC_DMZ=legacy
environment variable. In both cases, runc will fall back to the classic
/proc/self/exe-based cloning trick. If /proc/self/exe is already a
sealed memfd (namely if the user is using contrib/cmd/memfd-bind to
create a persistent sealed memfd for runc), neither runc-dmz nor
/proc/self/exe cloning will be used because they are not necessary.
[1]: bfedb58925
Co-authored-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: lifubang <lifubang@acmcoder.com>
[cyphar: address various review nits]
[cyphar: fix runc-dmz cross-compilation]
[cyphar: embed runc-dmz into runc binary and clone in Go code]
[cyphar: make runc-dmz optional, with fallback to /proc/self/exe cloning]
[cyphar: do not use runc-dmz when the container has certain privs]
Co-authored-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
This allows us to reduce the amount of C code in runc quite
substantially, as well as removing a whole execve(2) from the nsexec
path because we no longer spawn "runc init" only to re-exec "runc init"
after doing the clone.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
When a hook has failed, the error message looks like this:
> error running hook: error running hook #1: exit status 1, stdout: ...
The two problems here are:
1. it is impossible to know what kind of hook it was;
2. "error running hook" stuttering;
Change that to
> error running createContainer hook #1: exit status 1, stdout: ...
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Fix up a few things that were flagged in the review of the original
timens PR, namely around error handling and validation.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>