Commit b6967fa84c moved the functionality of managing cgroup devices
into a separate package, and decoupled libcontainer/cgroups from it.
Yet, some software (e.g. cadvisor) may need to use libcontainer package,
which imports libcontainer/cgroups/devices, thus making it impossible to
use libcontainer without bringing in cgroup/devices dependency.
In fact, we only need to manage devices in runc binary, so move the
import to main.go.
The need to import libct/cg/dev in order to manage devices is already
documented in libcontainer/cgroups, but let's
- update that documentation;
- add a similar note to libcontainer/cgroups/systemd;
- add a note to libct README.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
With open_tree(OPEN_TREE_CLONE), it is possible to implement both the
id-mapped mounts and bind-mount source file descriptor logic entirely in
Go without requiring any complicated handling from nsexec.
However, implementing it the naive way (do the OPEN_TREE_CLONE in the
host namespace before the rootfs is set up -- which is what the existing
implementation did) exposes issues in how mount ordering (in particular
when handling mount sources from inside the container rootfs, but also
in relation to mount propagation) was handled for idmapped mounts and
bind-mount sources. In order to solve this problem completely, it is
necessary to spawn a thread which joins the container mount namespace
and provides mountfds when requested by the rootfs setup code (ensuring
that the mount order and mount propagation of the source of the
bind-mount are handled correctly). While the need to join the mount
namespace leads to other complicated (such as with the usage of
/proc/self -- fixed in a later patch) the resulting code is still
reasonable and is the only real way to solve the issue.
This allows us to reduce the amount of C code we have in nsexec, as well
as simplifying a whole host of places that were made more complicated
with the addition of id-mapped mounts and the bind sourcefd logic.
Because we join the container namespace, we can continue to use regular
O_PATH file descriptors for non-id-mapped bind-mount sources (which
means we don't have to raise the kernel requirement for that case).
In addition, we can easily add support for id-mappings that don't match
the container's user namespace. The approach taken here is to use Go's
officially supported mechanism for spawning a process in a user
namespace, but (ab)use PTRACE_TRACEME to avoid actually having to exec a
different process. The most efficient way to implement this would be to
do clone() in cgo directly to run a function that just does
kill(getpid(), SIGSTOP) -- we can always switch to that if it turns out
this approach is too slow. It should be noted that the included
micro-benchmark seems to indicate this is Fast Enough(TM):
goos: linux
goarch: amd64
pkg: github.com/opencontainers/runc/libcontainer/userns
cpu: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
BenchmarkSpawnProc
BenchmarkSpawnProc-8 1670 770065 ns/op
Fixes: fda12ab101 ("Support idmap mounts on volumes")
Fixes: 9c444070ec ("Open bind mount sources from the host userns")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
The name "root" (or "containerRoot") is confusing; one might think it is
the root of container's file system (the directory we chroot into).
Rename to stateDir for clarity.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
We will add code that uses this function in future patches. So let's
just split it to a new function now and reuse it later.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
Commit d08bc0c1b3 ("runc run: warn on non-empty cgroup") introduced
a warning when a container is started in a non-empty cgroup. Such
configuration has lots of issues.
In addition to that, such configuration is not possible at all when
using the systemd cgroup driver.
As planned, let's promote this warning to an error, and fix the test
case accordingly.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
A new version of staticcheck (included into golangci-lint 1.46.2) gives
this new warning:
> libcontainer/factory_linux.go:230:59: SA9008: e refers to the result of a failed type assertion and is a zero value, not the value that was being type-asserted (staticcheck)
> err = fmt.Errorf("panic from initialization: %v, %s", e, debug.Stack())
> ^
> libcontainer/factory_linux.go:226:7: SA9008(related information): this is the variable being read (staticcheck)
> if e, ok := e.(error); ok {
> ^
Apparently, this is indeed a bug. Fix by using a different name for a
new variable, so we can access the old one under "else".
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This commit separates the functionality of setting cgroup device
rules out of libct/cgroups to libct/cgroups/devices package. This
package, if imported, sets the function variables in libct/cgroups and
libct/cgroups/systemd, so that a cgroup manager can use those to manage
devices. If those function variables are nil (when libct/cgroups/devices
are not imported), a cgroup manager returns the ErrDevicesUnsupported
in case any device rules are set in Resources.
It also consolidates the code from libct/cgroups/ebpf and
libct/cgroups/ebpf/devicefilter into libct/cgroups/devices.
Moved some tests in libct/cg/sd that require device management to
libct/sd/devices.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The only implementation of these is linuxContainer. It does not make
sense to have an interface with a single implementation, and we do not
foresee other types of containers being added to runc.
Remove BaseContainer and Container interfaces, moving their methods
documentation to linuxContainer.
Rename linuxContainer to Container.
Adopt users from using interface to using struct.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Since LinuxFactory has become the means to specify containers state
top directory (aka --root), and is only used by two methods (Create
and Load), it is easier to pass root to them directly.
Modify all the users and the docs accordingly.
While at it, fix Create and Load docs (those that were originally moved
from the Factory interface docs).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
StartInitialization does not have to be a method of Factory (while
it is clear why it was done that way initially, now we only have
Linux containers so it does not make sense).
Fix callers and docs accordingly.
No change in functionality.
Also, since this was the only user of libcontainer.New with the empty
string as an argument, the corresponding check can now be removed
from it.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The only implementation is LinuxFactory, let's use this directly.
Move the piece of documentation about Create from removed factory.go to
the factory_linux.go.
The LinuxFactory is to be removed later.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Remove intelrtd.Manager interface, since we only have a single
implementation, and do not expect another one.
Rename intelRdtManager to Manager, and modify its users accordingly.
Remove NewIntelRdtManager from factory.
Remove IntelRdtfs. Instead, make intelrdt.NewManager return nil if the
feature is not available.
Remove TestFactoryNewIntelRdt as it is now identical to TestFactoryNew.
Add internal function newManager to be used for tests (to make sure
some testing is done even when the feature is not available in
kernel/hardware).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
These were introduced in commit d8b669400 back in 2017, with a TODO
of "make binary names configurable". Apparently, everyone is happy with
the hardcoded names. In fact, they *are* configurable (by prepending the
PATH with a directory containing own version of newuidmap/newgidmap).
Now, these binaries are only needed in a few specific cases (when
rootless is set etc.), so let's look them up only when needed.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
We only have one implementation of config validator, which is always
used. It makes no sense to have Validator interface.
Having validate.Validator field in Factory does not make sense for all
the same reasons.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Those are *always* /proc/self/exe init, and it does not make sense
to ever change these. More to say, if InitArgs option func (removed
by this commit) is used to change these parameters, it will break
things, since "init" is hardcoded elsewhere.
Remove this.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This was introduced in an initial commit, back in the day when criu was
a highly experimental thing. Today it's not; most users who need it have
it packaged by their distro vendor.
The usual way to run a binary is to look it up in directories listed in
$PATH. This is flexible enough and allows for multiple scenarios (custom
binaries, extra binaries, etc.). This is the way criu should be run.
Make --criu a hidden option (thus removing it from help). Remove the
option from man pages, integration tests, etc. Remove all traces of
CriuPath from data structures.
Add a warning that --criu is ignored and will be removed.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This is needed since the future commits will touch this code, and then
the lint-extra CI job complains.
> libcontainer/factory.go#L245
> var-naming: var fdsJson should be fdsJSON (revive)
and
> libcontainer/init_linux.go#L181
> error-strings: error strings should not be capitalized or end with punctuation or a newline (revive)
and
> notify_socket.go#L94
> receiver-naming: receiver name n should be consistent with previous receiver name s for notifySocket (revive)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(on Go 1.18 this is actually an error)
> libcontainer/factory_linux.go:341:10: fmt.Errorf format %w has arg e of wrong type interface{}
Unfortunately, fixing it results in an errorlint warning:
> libcontainer/factory_linux.go#L344 non-wrapping format verb for fmt.Errorf. Use `%w` to format errors (errorlint)
so we have to silence that one.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit 7cfb107f2c started using unix.Geteuid(), unix.Getegid() as
uid/gid argument for chown. It seems that it should have removed chown
entirely, since, according to mkdir(2),
> The newly created directory will be owned by the effective user ID of
> the process. If the directory containing the file has the
> set-group-ID bit set, or if the filesystem is mounted with BSD
> group semantics (mount -o bsdgroups or, synonymously mount -o grpid),
> the new directory will inherit the group ownership from its parent;
> otherwise it will be > owned by the effective group ID of the process.
So, the only effect of the chown after mkdir is ignoring the sgid bit
on the parent directory (which is probably not the right thing to do).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Sometimes a container cgroup already exists but is frozen.
When this happens, runc init hangs, and it's not clear what is going on.
Refuse to run in a frozen cgroup; add a test case.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Currently runc allows multiple containers to share the same cgroup (for
example, by having the same cgroupPath in config.json). While such
shared configuration might be OK, there are some issues:
- When each container has its own resource limits, the order of
containers start determines whose limits will be effectively applied.
- When one of containers is paused, all others are paused, too.
- When a container is paused, any attempt to do runc create/run/exec
end up with runc init stuck inside a frozen cgroup.
- When a systemd cgroup manager is used, this becomes even worse -- such
as, stop (or even failed start) of any container results in
"stopTransientUnit" command being sent to systemd, and so (depending on
unit properties) other containers can receive SIGTERM, be killed after a
timeout etc.
Any of the above may lead to various hard-to-debug situations in production
(runc init stuck, cgroup removal error, wrong resource limits, init not
reaping zombies etc.).
One obvious solution is to refuse a non-empty cgroup when starting a new
container. This would be a breaking change though, so let's make it in
steps, with the first step is issue a warning and a deprecated notice
about a non-empty cgroup.
Later (in runc 1.2) we will replace this warning with an error.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The source of the bind mount might not be accessible in a different user
namespace because a component of the source path might not be traversed
under the users and groups mapped inside the user namespace. This caused
errors such as the following:
# time="2020-06-22T13:48:26Z" level=error msg="container_linux.go:367:
starting container process caused: process_linux.go:459:
container init caused: rootfs_linux.go:58:
mounting \"/tmp/busyboxtest/source-inaccessible/dir\"
to rootfs at \"/tmp/inaccessible\" caused:
stat /tmp/busyboxtest/source-inaccessible/dir: permission denied"
To solve this problem, this patch performs the following:
1. in nsexec.c, it opens the source path in the host userns (so we have
the right permissions to open it) but in the container mntns (so the
kernel cross mntns mount check let us mount it later:
https://github.com/torvalds/linux/blob/v5.8/fs/namespace.c#L2312).
2. in nsexec.c, it passes the file descriptors of the source to the
child process with SCM_RIGHTS.
3. In runc-init in Golang, it finishes the mounts while inside the
userns even without access to the some components of the source
paths.
Passing the fds with SCM_RIGHTS is necessary because once the child
process is in the container mntns, it is already in the container userns
so it cannot temporarily join the host mntns.
This patch uses the existing mechanism with _LIBCONTAINER_* environment
variables to pass the file descriptors from runc to runc init.
This patch uses the existing mechanism with the Netlink-style bootstrap
to pass information about the list of source mounts to nsexec.c.
Rootless containers don't use this bind mount sources fdpassing
mechanism because we can't setns() to the target mntns in a rootless
container (we don't have the privileges when we are in the host userns).
This patch takes care of using O_CLOEXEC on mount fds, and close them
early.
Fixes: #2484.
Signed-off-by: Alban Crequy <alban@kinvolk.io>
Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>
1. Make Rootless and Systemd flags part of config.Cgroups.
2. Make all cgroup managers (not just fs2) return error (so it can do
more initialization -- added by the following commits).
3. Replace complicated cgroup manager instantiation in factory_linux
by a single (and simple) libcontainer/cgroups/manager.New() function.
4. getUnifiedPath is simplified to check that only a single path is
supplied (rather than checking that other paths, if supplied,
are the same).
[v2: can't -> cannot]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This commit implements support for the SCMP_ACT_NOTIFY action. It
requires libseccomp-2.5.0 to work but runc still works with older
libseccomp if the seccomp policy does not use the SCMP_ACT_NOTIFY
action.
A new synchronization step between runc[INIT] and runc run is introduced
to pass the seccomp fd. runc run fetches the seccomp fd with pidfd_get
from the runc[INIT] process and sends it to the seccomp agent using
SCM_RIGHTS.
As suggested by @kolyshkin, we also make writeSync() a wrapper of
writeSyncWithFd() and wrap the error there. To avoid pointless errors,
we made some existing code paths just return the error instead of
re-wrapping it. If we don't do it, error will look like:
writing syncT <act>: writing syncT: <err>
By adjusting the code path, now they just look like this
writing syncT <act>: <err>
Signed-off-by: Alban Crequy <alban@kinvolk.io>
Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>
if function returns error before WriteJSON defer, error will not be
printed out, so move this defer as early as possible and use logrus to
print out error if returns before it.
Signed-off-by: xiadanni <xiadanni1@huawei.com>
For files that end with _linux.go or _linux_test.go, there is no need to
specify linux build tag, as it is assumed from the file name.
In addition, rename libcontainer/notify_linux_v2.go -> libcontainer/notify_v2_linux.go
for the file name to make sense.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
ConfigError was added by commit e918d02139, while removing runc own
error system, to preserve a way for a libcontainer user to distinguish
between a configuration error and something else.
The way ConfigError is implemented requires a different type of check
(compared to all other errors defined by error.go). An attempt was made
to rectify this, but the resulting code became even more complicated.
As no one is using this functionality (of differentiating a "bad config"
type of error from other errors), let's just drop the ConfigError type.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This removes libcontainer's own error wrapping system, consisting of a
few types and functions, aimed at typization, wrapping and unwrapping
of errors, as well as saving error stack traces.
Since Go 1.13 now provides its own error wrapping mechanism and a few
related functions, it makes sense to switch to it.
While doing that, improve some error messages so that they start
with "error", "unable to", or "can't".
A few things that are worth mentioning:
1. We lose stack traces (which were never shown anyway).
2. Users of libcontainer that relied on particular errors (like
ContainerNotExists) need to switch to using errors.Is with
the new errors defined in error.go.
3. encoding/json is unable to unmarshal the built-in error type,
so we have to introduce initError and wrap the errors into it
(basically passing the error as a string). This is the same
as it was before, just a tad simpler (actually the initError
is a type that got removed in commit afa844311; also suddenly
ierr variable name makes sense now).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This should result in no change when the error is printed, but make the
errors returned unwrappable, meaning errors.As and errors.Is will work.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Errors returned by unix are bare. In some cases it's impossible to find
out what went wrong because there's is not enough context.
Add a mountError type (mostly copy-pasted from github.com/moby/sys/mount),
and mount/unmount helpers. Use these where appropriate, and convert error
checks to use errors.Is.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Errors from strconv.Atoi are already descriptive enough, and contain the
value being converted, so our error messages do not need to contain it.
While at it, use %w to wrap errors.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Using fmt.Errorf for errors that do not have %-style formatting
directives is an overkill. Switch to errors.New.
Found by
git grep fmt.Errorf | grep -v ^vendor | grep -v '%'
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
gofumpt (mvdan.cc/gofumpt) is a fork of gofmt with stricter rules.
Brought to you by
git ls-files \*.go | grep -v ^vendor/ | xargs gofumpt -s -w
Looking at the diff, all these changes make sense.
Also, replace gofmt with gofumpt in golangci.yml.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Sometimes debug.bats test cases are failing like this:
> not ok 27 global --debug to --log --log-format 'json'
> # (in test file tests/integration/debug.bats, line 77)
> # `[[ "${output}" == *"child process in init()"* ]]' failed
It happens more when writing to disk.
This issue is caused by the fact that runc spawns log forwarding goroutine
(ForwardLogs) but does not wait for it to finish, resulting in missing
debug lines from nsexec.
ForwardLogs itself, though, never finishes, because it reads from a
reading side of a pipe which writing side is not closed. This is
especially true in case of runc create, which spawns runc init and
exits; meanwhile runc init waits on exec fifo for arbitrarily long
time before doing execve.
So, to fix the failure described above, we need to:
1. Make runc create/run/exec wait for ForwardLogs to finish;
2. Make runc init close its log pipe file descriptor (i.e.
the one which value is passed in _LIBCONTAINER_LOGPIPE
environment variable).
This is exactly what this commit does:
1. Amend ForwardLogs to return a channel, and wait for it in start().
2. In runc init, save the log fd and close it as late as possible.
PS I have to admit I still do not understand why an explicit close of
log pipe fd is required in e.g. (*linuxSetnsInit).Init, right before
the execve which (thanks to CLOEXEC) closes the fd anyway.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Use sync.Once to init Intel RDT when needed for a small speedup to
operations which do not require Intel RDT.
Simplify IntelRdtManager initialization in LinuxFactory.
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>