When a hook has failed, the error message looks like this:
> error running hook: error running hook #1: exit status 1, stdout: ...
The two problems here are:
1. it is impossible to know what kind of hook it was;
2. "error running hook" stuttering;
Change that to
> error running createContainer hook #1: exit status 1, stdout: ...
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Fix up a few things that were flagged in the review of the original
timens PR, namely around error handling and validation.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Currently, logrus is used from the Go part of runc init, mostly for a
few debug messages (see setns_init_linux.go and standard_init_linux.go),
and a single warning (see rootfs_linux.go).
This means logrus is part of init implementation, and thus, its setup
belongs to StartInitialization().
Move the code there. As a nice side effect, now we don't have to convert
_LIBCONTAINER_LOGPIPE twice.
Note that since this initialization is now also called from libct/int
tests, which do not set _LIBCONTAINER_LOGLEVEL, let's make
_LIBCONTAINER_LOGLEVEL optional.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
"time" namespace was introduced in Linux v5.6
support new time namespace to set boottime and monotonic time offset
Example runtime spec
"timeOffsets": {
"monotonic": {
"secs": 172800,
"nanosecs": 0
},
"boottime": {
"secs": 604800,
"nanosecs": 0
}
}
Signed-off-by: Chethan Suresh <chethan.suresh@sony.com>
Previously to this commit, we used a string for srcFD as /proc/self/fd/NN.
This commit modified to this behavior, so srcFD is only an *int and the full path
is constructed in mountViaFDs() if srcFD is different than nil.
Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com>
This commit adds support for idmap mounts as specified in the runtime-spec.
We open the idmap source paths and call mount_setattr() in runc PARENT,
as we need privileges in the init userns for that, and then sends the
fds to the child process. For this fd passing we use the same mechanism
used in other parts of thecode, the _LIBCONTAINER_ env vars.
The mount is finished (unix.MoveMount) from go code, inside the userns,
so we reuse all the prepareBindMount() security checks and the remount
logic for some flags too.
This commit only supports idmap mounts when userns are used AND the mappings
are the same specified for the userns mapping. This limitation is to
simplify the initial implementation, as all our users so far only need
this, and we can avoid sending over netlink the mappings, creating a
userns with this custom mapping, etc. Future PRs will remove this
limitation.
Co-authored-by: Francis Laniel <flaniel@linux.microsoft.com>
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
Let's move the code to send mount sources to a generic function. Future
patches will use it for idmap sources too.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
As of previous commit, this is implied in a particular scenario. In
fact, this is the one and only scenario that justifies the use of -a.
Drop the option from the documentation. For backward compatibility, do
recognize it, and retain the feature of ignoring the "container is
stopped" error when set.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
By default, the container has its own PID namespace, and killing (with
SIGKILL) its init process from the parent PID namespace also kills all
the other processes.
Obviously, it does not work that way when the container is sharing its
PID namespace with the host or another container, since init is no
longer special (it's not PID 1). In this case, killing container's init
will result in a bunch of other processes left running (and thus the
inability to remove the cgroup).
The solution to the above problem is killing all the container
processes, not just init.
The problem with the current implementation is, the killing logic is
implemented in libcontainer's initProcess.wait, and thus only available
to libcontainer users, but not the runc kill command (which uses
nonChildProcess.kill and does not use wait at all). So, some workarounds
exist:
- func destroy(c *Container) calls signalAllProcesses;
- runc kill implements -a flag.
This code became very tangled over time. Let's simplify things by moving
the killing all processes from initProcess.wait to container.Signal,
and documents the new behavior.
In essence, this also makes `runc kill` to automatically kill all container
processes when the container does not have its own PID namespace.
Document that as well.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
When someone is using libcontainer to start and kill containers from a
long lived process (i.e. the same process creates and removes the
container), initProcess.wait method is used, which has a kludge to work
around killing containers that do not have their own PID namespace.
The code that checks for own PID namespace is not entirely correct.
To be exact, it does not set sharePidns flag when the host/caller PID
namespace is implicitly used. As a result, the above mentioned kludge
does not work.
Fix the issue, add a test case (which fails without the fix).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
There are two very distinct usage scenarios for signalAllProcesses:
* when used from the runc binary ("runc kill" command), the processes
that it kills are not the children of "runc kill", and so calling
wait(2) on each process is totally useless, as it will return ECHLD;
* when used from a program that have created the container (such as
libcontainer/integration test suite), that program can and should call
wait(2), not the signalling code.
So, the child reaping code is totally useless in the first case, and
should be implemented by the program using libcontainer in the second
case. I was not able to track down how this code was added, my best
guess is it happened when this code was part of dockerd, which did not
have a proper child reaper implemented at that time.
Remove it, and add a proper documentation piece.
Change the integration test accordingly.
PS the first attempt to disable the child reaping code in
signalAllProcesses was made in commit bb912eb00c, which used a
questionable heuristic to figure out whether wait(2) should be called.
This heuristic worked for a particular use case, but is not correct in
general.
While at it:
- simplify signalAllProcesses to use unix.Kill;
- document (container).Signal.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Adding fd field to mountConfig was not a good thing since mountConfig
contains data that is not specific to a particular mount, while fd is
a mount entry attribute.
Introduce mountEntry structure, which embeds configs.Mount and adds
srcFd to replace the removed mountConfig.fd.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. Simplify mount call by removing the procfd argument, and use the new
mount() where procfd is not used. Now, the mount() arguments are the
same as for unix.Mount.
2. Introduce a new mountViaFDs function, which is similar to the old
mount(), except it can take procfd for both source and target.
The new arguments are called srcFD and dstFD.
3. Modify the mount error to show both srcFD and dstFD so it's clear
which one is used for which purpose. This fixes the issue of having
a somewhat cryptic errors like this:
> mount /proc/self/fd/11:/sys/fs/cgroup/systemd (via /proc/self/fd/12), flags: 0x20502f: operation not permitted
(in which fd 11 is actually the source, and fd 12 is the target).
After this change, it looks like
> mount src=/proc/self/fd/11, dst=/sys/fs/cgroup/systemd, dstFD=/proc/self/fd/12, flags=0x20502f: operation not permitted
so it's clear that 12 is a destination fd.
4. Fix the mountViaFDs callers to use dstFD (rather than procfd) for the
variable name.
5. Use srcFD where mountFd is set.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
For a previous attempt to fix that (and added test cases), see commit
9087f2e827.
Alas, it's not always working because of cgroup directory TOCTOU.
To solve this and avoid the race, add an error _after_ the operation.
Implement it as a method that ignores the error that should be ignored.
Instead of currentStatus(), use faster runType(), since we are not
interested in Paused status here.
For Processes(), remove the pre-op check, and only use it after getting
an error, making the non-error path more straightforward.
For Signal(), add a second check after getting an error. The first check
is left as is because signalAllProcesses might print a warning if the
cgroup does not exist, and we'd like to avoid that.
This should fix an occasional failure like this one:
not ok 84 kill detached busybox
# (in test file tests/integration/kill.bats, line 27)
# `[ "$status" -eq 0 ]' failed
....
# runc kill test_busybox KILL (status=0):
# runc kill -a test_busybox 0 (status=1):
# time="2023-04-04T18:24:27Z" level=error msg="lstat /sys/fs/cgroup/devices/system.slice/runc-test_busybox.scope: no such file or directory"
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
I don't want to implement it now, because this might result in some
new issues, but this is definitely something that is worth implementing.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The v6.0.0 release of go-criu has deprecated the `rpc` package in favour
of the `crit` package. This commit provides the changes required to use
this version in runc.
Signed-off-by: Prajwal S N <prajwalnadig21@gmail.com>
The only implementation of these is linuxContainer. It does not make
sense to have an interface with a single implementation, and we do not
foresee other types of containers being added to runc.
Remove BaseContainer and Container interfaces, moving their methods
documentation to linuxContainer.
Rename linuxContainer to Container.
Adopt users from using interface to using struct.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Function strings.Title is deprecated as of Go 1.18, because it does not
handle some corner cases good enough. In this case, though, it is
perfectly fine to use it since we have a single ASCII word as an
argument, and strings.Title won't be removed until at least Go 2.0.
Suppress the deprecation warning.
The alternative is to not capitalize the namespace string; this will break
restoring of a container checkpointed by earlier version of runc.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Remove intelrtd.Manager interface, since we only have a single
implementation, and do not expect another one.
Rename intelRdtManager to Manager, and modify its users accordingly.
Remove NewIntelRdtManager from factory.
Remove IntelRdtfs. Instead, make intelrdt.NewManager return nil if the
feature is not available.
Remove TestFactoryNewIntelRdt as it is now identical to TestFactoryNew.
Add internal function newManager to be used for tests (to make sure
some testing is done even when the feature is not available in
kernel/hardware).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Since we are looking up the path to newuidmap/newgidmap in one context,
and executing those in another (libct/nsenter), it might make sense to
use a stricter rules for looking up path to those binaries.
Practically it means that if someone wants to use custom newuidmap and
newgidmap binaries from $PATH, it would be impossible to use these from
the current directory by means of PATH=.:$PATH; instead one would have
to do something like PATH=$(pwd):$PATH.
See https://go.dev/blog/path-security for background.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
These were introduced in commit d8b669400 back in 2017, with a TODO
of "make binary names configurable". Apparently, everyone is happy with
the hardcoded names. In fact, they *are* configurable (by prepending the
PATH with a directory containing own version of newuidmap/newgidmap).
Now, these binaries are only needed in a few specific cases (when
rootless is set etc.), so let's look them up only when needed.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Those are *always* /proc/self/exe init, and it does not make sense
to ever change these. More to say, if InitArgs option func (removed
by this commit) is used to change these parameters, it will break
things, since "init" is hardcoded elsewhere.
Remove this.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This was introduced in an initial commit, back in the day when criu was
a highly experimental thing. Today it's not; most users who need it have
it packaged by their distro vendor.
The usual way to run a binary is to look it up in directories listed in
$PATH. This is flexible enough and allows for multiple scenarios (custom
binaries, extra binaries, etc.). This is the way criu should be run.
Make --criu a hidden option (thus removing it from help). Remove the
option from man pages, integration tests, etc. Remove all traces of
CriuPath from data structures.
Add a warning that --criu is ignored and will be removed.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Since the next commit is going to touch this structure, our CI
(lint-extra) is about to complain about improperly named field:
> Warning: var-naming: struct field ContainerId should be ContainerID (revive)
Make it happy.
Brought to use by gopls rename.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The utils.Annotations was used here before only because it made it
possible to distinguish between "key not found" and "empty value" cases.
With the previous commit, utils.SearchLabels can do that, and so it
makes sense to use it.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
When writing netlink messages, it is possible to have a byte array
larger than UINT16_MAX which would result in the length field
overflowing and allowing user-controlled data to be parsed as control
characters (such as creating custom mount points, changing which set of
namespaces to allow, and so on).
Co-authored-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Using null bytes as control characters for sending strings via netlink
opens us up to a user explicitly putting a null byte in a mount string
(which JSON will happily let you do) and then causing us to open a mount
path different to the one expected.
In practice this is more of an issue in an environment such as
Kubernetes where you may have path-based access control policies (which
are more susceptible to these kinds of flaws).
Found by Google Project Zero.
Fixes: 9c444070ec ("Open bind mount sources from the host userns")
Reported-by: Felix Wilhelm <fwilhelm@google.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
The source of the bind mount might not be accessible in a different user
namespace because a component of the source path might not be traversed
under the users and groups mapped inside the user namespace. This caused
errors such as the following:
# time="2020-06-22T13:48:26Z" level=error msg="container_linux.go:367:
starting container process caused: process_linux.go:459:
container init caused: rootfs_linux.go:58:
mounting \"/tmp/busyboxtest/source-inaccessible/dir\"
to rootfs at \"/tmp/inaccessible\" caused:
stat /tmp/busyboxtest/source-inaccessible/dir: permission denied"
To solve this problem, this patch performs the following:
1. in nsexec.c, it opens the source path in the host userns (so we have
the right permissions to open it) but in the container mntns (so the
kernel cross mntns mount check let us mount it later:
https://github.com/torvalds/linux/blob/v5.8/fs/namespace.c#L2312).
2. in nsexec.c, it passes the file descriptors of the source to the
child process with SCM_RIGHTS.
3. In runc-init in Golang, it finishes the mounts while inside the
userns even without access to the some components of the source
paths.
Passing the fds with SCM_RIGHTS is necessary because once the child
process is in the container mntns, it is already in the container userns
so it cannot temporarily join the host mntns.
This patch uses the existing mechanism with _LIBCONTAINER_* environment
variables to pass the file descriptors from runc to runc init.
This patch uses the existing mechanism with the Netlink-style bootstrap
to pass information about the list of source mounts to nsexec.c.
Rootless containers don't use this bind mount sources fdpassing
mechanism because we can't setns() to the target mntns in a rootless
container (we don't have the privileges when we are in the host userns).
This patch takes care of using O_CLOEXEC on mount fds, and close them
early.
Fixes: #2484.
Signed-off-by: Alban Crequy <alban@kinvolk.io>
Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>
In some setups, multiple cgroups are used inside a container,
and sometime there is a need to execute a process in a particular
sub-cgroup (in case of cgroup v1, for a particular controller).
This is what this commit implements.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Wire through CRIU's support to change the mount context on restore.
This is especially useful if restoring a container in a different pod.
Single container restore uses the same SELinux process label and
same mount context as during checkpointing. If a container is being
restored into an existing pod the process label and the mount context
needs to be changed to the context of the pod.
Changing process label on restore is already supported by runc. This
patch adds the possibility to change the mount context.
Signed-off-by: Adrian Reber <areber@redhat.com>
runc delete -f is not working for a paused container, since in cgroup v1
SIGKILL does nothing if a process is frozen (unlike cgroup v2, in which
you can kill a frozen process with a fatal signal).
Theoretically, we only need this for v1, but doing it for v2 as well is
OK.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
For files that end with _linux.go or _linux_test.go, there is no need to
specify linux build tag, as it is assumed from the file name.
In addition, rename libcontainer/notify_linux_v2.go -> libcontainer/notify_v2_linux.go
for the file name to make sense.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>