Remove intelrtd.Manager interface, since we only have a single
implementation, and do not expect another one.
Rename intelRdtManager to Manager, and modify its users accordingly.
Remove NewIntelRdtManager from factory.
Remove IntelRdtfs. Instead, make intelrdt.NewManager return nil if the
feature is not available.
Remove TestFactoryNewIntelRdt as it is now identical to TestFactoryNew.
Add internal function newManager to be used for tests (to make sure
some testing is done even when the feature is not available in
kernel/hardware).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
These were introduced in commit d8b669400 back in 2017, with a TODO
of "make binary names configurable". Apparently, everyone is happy with
the hardcoded names. In fact, they *are* configurable (by prepending the
PATH with a directory containing own version of newuidmap/newgidmap).
Now, these binaries are only needed in a few specific cases (when
rootless is set etc.), so let's look them up only when needed.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
We only have one implementation of config validator, which is always
used. It makes no sense to have Validator interface.
Having validate.Validator field in Factory does not make sense for all
the same reasons.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Those are *always* /proc/self/exe init, and it does not make sense
to ever change these. More to say, if InitArgs option func (removed
by this commit) is used to change these parameters, it will break
things, since "init" is hardcoded elsewhere.
Remove this.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This was introduced in an initial commit, back in the day when criu was
a highly experimental thing. Today it's not; most users who need it have
it packaged by their distro vendor.
The usual way to run a binary is to look it up in directories listed in
$PATH. This is flexible enough and allows for multiple scenarios (custom
binaries, extra binaries, etc.). This is the way criu should be run.
Make --criu a hidden option (thus removing it from help). Remove the
option from man pages, integration tests, etc. Remove all traces of
CriuPath from data structures.
Add a warning that --criu is ignored and will be removed.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This is needed since the future commits will touch this code, and then
the lint-extra CI job complains.
> libcontainer/factory.go#L245
> var-naming: var fdsJson should be fdsJSON (revive)
and
> libcontainer/init_linux.go#L181
> error-strings: error strings should not be capitalized or end with punctuation or a newline (revive)
and
> notify_socket.go#L94
> receiver-naming: receiver name n should be consistent with previous receiver name s for notifySocket (revive)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(on Go 1.18 this is actually an error)
> libcontainer/factory_linux.go:341:10: fmt.Errorf format %w has arg e of wrong type interface{}
Unfortunately, fixing it results in an errorlint warning:
> libcontainer/factory_linux.go#L344 non-wrapping format verb for fmt.Errorf. Use `%w` to format errors (errorlint)
so we have to silence that one.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit 7cfb107f2c started using unix.Geteuid(), unix.Getegid() as
uid/gid argument for chown. It seems that it should have removed chown
entirely, since, according to mkdir(2),
> The newly created directory will be owned by the effective user ID of
> the process. If the directory containing the file has the
> set-group-ID bit set, or if the filesystem is mounted with BSD
> group semantics (mount -o bsdgroups or, synonymously mount -o grpid),
> the new directory will inherit the group ownership from its parent;
> otherwise it will be > owned by the effective group ID of the process.
So, the only effect of the chown after mkdir is ignoring the sgid bit
on the parent directory (which is probably not the right thing to do).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Sometimes a container cgroup already exists but is frozen.
When this happens, runc init hangs, and it's not clear what is going on.
Refuse to run in a frozen cgroup; add a test case.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Currently runc allows multiple containers to share the same cgroup (for
example, by having the same cgroupPath in config.json). While such
shared configuration might be OK, there are some issues:
- When each container has its own resource limits, the order of
containers start determines whose limits will be effectively applied.
- When one of containers is paused, all others are paused, too.
- When a container is paused, any attempt to do runc create/run/exec
end up with runc init stuck inside a frozen cgroup.
- When a systemd cgroup manager is used, this becomes even worse -- such
as, stop (or even failed start) of any container results in
"stopTransientUnit" command being sent to systemd, and so (depending on
unit properties) other containers can receive SIGTERM, be killed after a
timeout etc.
Any of the above may lead to various hard-to-debug situations in production
(runc init stuck, cgroup removal error, wrong resource limits, init not
reaping zombies etc.).
One obvious solution is to refuse a non-empty cgroup when starting a new
container. This would be a breaking change though, so let's make it in
steps, with the first step is issue a warning and a deprecated notice
about a non-empty cgroup.
Later (in runc 1.2) we will replace this warning with an error.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The source of the bind mount might not be accessible in a different user
namespace because a component of the source path might not be traversed
under the users and groups mapped inside the user namespace. This caused
errors such as the following:
# time="2020-06-22T13:48:26Z" level=error msg="container_linux.go:367:
starting container process caused: process_linux.go:459:
container init caused: rootfs_linux.go:58:
mounting \"/tmp/busyboxtest/source-inaccessible/dir\"
to rootfs at \"/tmp/inaccessible\" caused:
stat /tmp/busyboxtest/source-inaccessible/dir: permission denied"
To solve this problem, this patch performs the following:
1. in nsexec.c, it opens the source path in the host userns (so we have
the right permissions to open it) but in the container mntns (so the
kernel cross mntns mount check let us mount it later:
https://github.com/torvalds/linux/blob/v5.8/fs/namespace.c#L2312).
2. in nsexec.c, it passes the file descriptors of the source to the
child process with SCM_RIGHTS.
3. In runc-init in Golang, it finishes the mounts while inside the
userns even without access to the some components of the source
paths.
Passing the fds with SCM_RIGHTS is necessary because once the child
process is in the container mntns, it is already in the container userns
so it cannot temporarily join the host mntns.
This patch uses the existing mechanism with _LIBCONTAINER_* environment
variables to pass the file descriptors from runc to runc init.
This patch uses the existing mechanism with the Netlink-style bootstrap
to pass information about the list of source mounts to nsexec.c.
Rootless containers don't use this bind mount sources fdpassing
mechanism because we can't setns() to the target mntns in a rootless
container (we don't have the privileges when we are in the host userns).
This patch takes care of using O_CLOEXEC on mount fds, and close them
early.
Fixes: #2484.
Signed-off-by: Alban Crequy <alban@kinvolk.io>
Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>
1. Make Rootless and Systemd flags part of config.Cgroups.
2. Make all cgroup managers (not just fs2) return error (so it can do
more initialization -- added by the following commits).
3. Replace complicated cgroup manager instantiation in factory_linux
by a single (and simple) libcontainer/cgroups/manager.New() function.
4. getUnifiedPath is simplified to check that only a single path is
supplied (rather than checking that other paths, if supplied,
are the same).
[v2: can't -> cannot]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This commit implements support for the SCMP_ACT_NOTIFY action. It
requires libseccomp-2.5.0 to work but runc still works with older
libseccomp if the seccomp policy does not use the SCMP_ACT_NOTIFY
action.
A new synchronization step between runc[INIT] and runc run is introduced
to pass the seccomp fd. runc run fetches the seccomp fd with pidfd_get
from the runc[INIT] process and sends it to the seccomp agent using
SCM_RIGHTS.
As suggested by @kolyshkin, we also make writeSync() a wrapper of
writeSyncWithFd() and wrap the error there. To avoid pointless errors,
we made some existing code paths just return the error instead of
re-wrapping it. If we don't do it, error will look like:
writing syncT <act>: writing syncT: <err>
By adjusting the code path, now they just look like this
writing syncT <act>: <err>
Signed-off-by: Alban Crequy <alban@kinvolk.io>
Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>
if function returns error before WriteJSON defer, error will not be
printed out, so move this defer as early as possible and use logrus to
print out error if returns before it.
Signed-off-by: xiadanni <xiadanni1@huawei.com>
For files that end with _linux.go or _linux_test.go, there is no need to
specify linux build tag, as it is assumed from the file name.
In addition, rename libcontainer/notify_linux_v2.go -> libcontainer/notify_v2_linux.go
for the file name to make sense.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
ConfigError was added by commit e918d02139, while removing runc own
error system, to preserve a way for a libcontainer user to distinguish
between a configuration error and something else.
The way ConfigError is implemented requires a different type of check
(compared to all other errors defined by error.go). An attempt was made
to rectify this, but the resulting code became even more complicated.
As no one is using this functionality (of differentiating a "bad config"
type of error from other errors), let's just drop the ConfigError type.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This removes libcontainer's own error wrapping system, consisting of a
few types and functions, aimed at typization, wrapping and unwrapping
of errors, as well as saving error stack traces.
Since Go 1.13 now provides its own error wrapping mechanism and a few
related functions, it makes sense to switch to it.
While doing that, improve some error messages so that they start
with "error", "unable to", or "can't".
A few things that are worth mentioning:
1. We lose stack traces (which were never shown anyway).
2. Users of libcontainer that relied on particular errors (like
ContainerNotExists) need to switch to using errors.Is with
the new errors defined in error.go.
3. encoding/json is unable to unmarshal the built-in error type,
so we have to introduce initError and wrap the errors into it
(basically passing the error as a string). This is the same
as it was before, just a tad simpler (actually the initError
is a type that got removed in commit afa844311; also suddenly
ierr variable name makes sense now).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This should result in no change when the error is printed, but make the
errors returned unwrappable, meaning errors.As and errors.Is will work.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Errors returned by unix are bare. In some cases it's impossible to find
out what went wrong because there's is not enough context.
Add a mountError type (mostly copy-pasted from github.com/moby/sys/mount),
and mount/unmount helpers. Use these where appropriate, and convert error
checks to use errors.Is.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Errors from strconv.Atoi are already descriptive enough, and contain the
value being converted, so our error messages do not need to contain it.
While at it, use %w to wrap errors.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Using fmt.Errorf for errors that do not have %-style formatting
directives is an overkill. Switch to errors.New.
Found by
git grep fmt.Errorf | grep -v ^vendor | grep -v '%'
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
gofumpt (mvdan.cc/gofumpt) is a fork of gofmt with stricter rules.
Brought to you by
git ls-files \*.go | grep -v ^vendor/ | xargs gofumpt -s -w
Looking at the diff, all these changes make sense.
Also, replace gofmt with gofumpt in golangci.yml.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Sometimes debug.bats test cases are failing like this:
> not ok 27 global --debug to --log --log-format 'json'
> # (in test file tests/integration/debug.bats, line 77)
> # `[[ "${output}" == *"child process in init()"* ]]' failed
It happens more when writing to disk.
This issue is caused by the fact that runc spawns log forwarding goroutine
(ForwardLogs) but does not wait for it to finish, resulting in missing
debug lines from nsexec.
ForwardLogs itself, though, never finishes, because it reads from a
reading side of a pipe which writing side is not closed. This is
especially true in case of runc create, which spawns runc init and
exits; meanwhile runc init waits on exec fifo for arbitrarily long
time before doing execve.
So, to fix the failure described above, we need to:
1. Make runc create/run/exec wait for ForwardLogs to finish;
2. Make runc init close its log pipe file descriptor (i.e.
the one which value is passed in _LIBCONTAINER_LOGPIPE
environment variable).
This is exactly what this commit does:
1. Amend ForwardLogs to return a channel, and wait for it in start().
2. In runc init, save the log fd and close it as late as possible.
PS I have to admit I still do not understand why an explicit close of
log pipe fd is required in e.g. (*linuxSetnsInit).Init, right before
the execve which (thanks to CLOEXEC) closes the fd anyway.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Use sync.Once to init Intel RDT when needed for a small speedup to
operations which do not require Intel RDT.
Simplify IntelRdtManager initialization in LinuxFactory.
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
The fs2 cgroup driver has a sanity check for path.
Since systemd driver is relying on the same path,
it makes sense to move this check to the common code.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Having map of per-subsystem paths in systemd unified cgroups
driver does not make sense and makes the code less readable.
To get rid of it, move the systemd v1-or-v2 init code to
libcontainer/factory_linux.go which already has a function
to deduce unified path out of paths map.
End result is much cleaner code. Besides, we no longer write pid
to the same cgroup file 7 times in Apply() like we did before.
While at it
- add `rootless` flag which is passed on to fs2 manager
- merge getv2Path() into GetUnifiedPath(), don't overwrite
path if it is set during initialization (on Load).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Delete libcontainer/mount in favor of github.com/moby/sys/mountinfo,
which is fast mountinfo parser.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
split fs2 package from fs, as mixing up fs and fs2 is very likely to result in
unmaintainable code.
Inspired by containerd/cgroups#109
Fix#2157
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
This will permit us to extend the internals of systemd.Manager to include
further information about the system, such as whether cgroupv1, cgroupv2 or
both are in effect.
Furthermore, it allows a future refactor of moving more of UseSystemd() code
into the factory initialization function.
Signed-off-by: Filipe Brandenburger <filbranden@gmail.com>
Memory Bandwidth Allocation (MBA) is a resource allocation sub-feature
of Intel Resource Director Technology (RDT) which is supported on some
Intel Xeon platforms. Intel RDT/MBA provides indirect and approximate
throttle over memory bandwidth for the software. A user controls the
resource by indicating the percentage of maximum memory bandwidth.
Hardware details of Intel RDT/MBA can be found in section 17.18 of
Intel Software Developer Manual:
https://software.intel.com/en-us/articles/intel-sdm
In Linux 4.12 kernel and newer, Intel RDT/MBA is enabled by kernel
config CONFIG_INTEL_RDT. If hardware support, CPU flags `rdt_a` and
`mba` will be set in /proc/cpuinfo.
Intel RDT "resource control" filesystem hierarchy:
mount -t resctrl resctrl /sys/fs/resctrl
tree /sys/fs/resctrl
/sys/fs/resctrl/
|-- info
| |-- L3
| | |-- cbm_mask
| | |-- min_cbm_bits
| | |-- num_closids
| |-- MB
| |-- bandwidth_gran
| |-- delay_linear
| |-- min_bandwidth
| |-- num_closids
|-- ...
|-- schemata
|-- tasks
|-- <container_id>
|-- ...
|-- schemata
|-- tasks
For MBA support for `runc`, we will reuse the infrastructure and code
base of Intel RDT/CAT which implemented in #1279. We could also make
use of `tasks` and `schemata` configuration for memory bandwidth
resource constraints.
The file `tasks` has a list of tasks that belongs to this group (e.g.,
<container_id>" group). Tasks can be added to a group by writing the
task ID to the "tasks" file (which will automatically remove them from
the previous group to which they belonged). New tasks created by
fork(2) and clone(2) are added to the same group as their parent.
The file `schemata` has a list of all the resources available to this
group. Each resource (L3 cache, memory bandwidth) has its own line and
format.
Memory bandwidth schema:
It has allocation values for memory bandwidth on each socket, which
contains L3 cache id and memory bandwidth percentage.
Format: "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..."
The minimum bandwidth percentage value for each CPU model is predefined
and can be looked up through "info/MB/min_bandwidth". The bandwidth
granularity that is allocated is also dependent on the CPU model and
can be looked up at "info/MB/bandwidth_gran". The available bandwidth
control steps are: min_bw + N * bw_gran. Intermediate values are
rounded to the next control step available on the hardware.
For more information about Intel RDT kernel interface:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt
An example for runc:
Consider a two-socket machine with two L3 caches where the minimum
memory bandwidth of 10% with a memory bandwidth granularity of 10%.
Tasks inside the container may use a maximum memory bandwidth of 20%
on socket 0 and 70% on socket 1.
"linux": {
"intelRdt": {
"memBwSchema": "MB:0=20;1=70"
}
}
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
This PR decomposes `libcontainer/configs.Config.Rootless bool` into `RootlessEUID bool` and
`RootlessCgroups bool`, so as to make "runc-in-userns" to be more compatible with "rootful" runc.
`RootlessEUID` denotes that runc is being executed as a non-root user (euid != 0) in
the current user namespace. `RootlessEUID` is almost identical to the former `Rootless`
except cgroups stuff.
`RootlessCgroups` denotes that runc is unlikely to have the full access to cgroups.
`RootlessCgroups` is set to false if runc is executed as the root (euid == 0) in the initial namespace.
Otherwise `RootlessCgroups` is set to true.
(Hint: if `RootlessEUID` is true, `RootlessCgroups` becomes true as well)
When runc is executed as the root (euid == 0) in an user namespace (e.g. by Docker-in-LXD, Podman, Usernetes),
`RootlessEUID` is set to false but `RootlessCgroups` is set to true.
So, "runc-in-userns" behaves almost same as "rootful" runc except that cgroups errors are ignored.
This PR does not have any impact on CLI flags and `state.json`.
Note about CLI:
* Now `runc --rootless=(auto|true|false)` CLI flag is only used for setting `RootlessCgroups`.
* Now `runc spec --rootless` is only required when `RootlessEUID` is set to true.
For runc-in-userns, `runc spec` without `--rootless` should work, when sufficient numbers of
UID/GID are mapped.
Note about `$XDG_RUNTIME_DIR` (e.g. `/run/user/1000`):
* `$XDG_RUNTIME_DIR` is ignored if runc is being executed as the root (euid == 0) in the initial namespace, for backward compatibility.
(`/run/runc` is used)
* If runc is executed as the root (euid == 0) in an user namespace, `$XDG_RUNTIME_DIR` is honored if `$USER != "" && $USER != "root"`.
This allows unprivileged users to allow execute runc as the root in userns, without mounting writable `/run/runc`.
Note about `state.json`:
* `rootless` is set to true when `RootlessEUID == true && RootlessCgroups == true`.
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
In some cases, /sys/fs/cgroups is mounted read-only. In rootless
containers we can consider this effectively identical to having cgroups
that we don't have write permission to -- because the user isn't
responsible for the read-only setup and cannot modify it. The rules are
identical to when /sys/fs/cgroups is not writable by the unprivileged
user.
An example of this is the default configuration of Docker, where cgroups
are mounted as read-only as a preventative security measure.
Reported-by: Vladimir Rutsky <rutsky@google.com>
Signed-off-by: Aleksa Sarai <asarai@suse.de>