Delegating cgroups to the container enables more complex workloads,
including systemd-based workloads. The OCI runtime-spec was
recently updated to explicitly admit such delegation, through
specification of cgroup ownership semantics:
https://github.com/opencontainers/runtime-spec/pull/1123
Pursuant to the updated OCI runtime-spec, change the ownership of
the container's cgroup directory and particular files therein, when
using cgroups v2 and when the cgroupfs is to be mounted read/write.
As a result of this change, systemd workloads can run in isolated
user namespaces on OpenShift when the sandbox's cgroupfs is mounted
read/write.
It might be possible to implement this feature in other cgroup
managers, but that work is deferred.
Signed-off-by: Fraser Tweedale <ftweedal@redhat.com>
Using null bytes as control characters for sending strings via netlink
opens us up to a user explicitly putting a null byte in a mount string
(which JSON will happily let you do) and then causing us to open a mount
path different to the one expected.
In practice this is more of an issue in an environment such as
Kubernetes where you may have path-based access control policies (which
are more susceptible to these kinds of flaws).
Found by Google Project Zero.
Fixes: 9c444070ec ("Open bind mount sources from the host userns")
Reported-by: Felix Wilhelm <fwilhelm@google.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
1. Decapitalize errors.
2. Rename isValidName to checkPropertyName.
3. Make it return a specific error.
Suggested-by: Sebastiaan van Stijn <github@gone.nl>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Also, add a simple test and a benchmark (just out of sheer curiosity).
Benchmark results:
name old time/op new time/op delta
IsValidName-4 540ns ± 3% 45ns ± 1% -91.76% (p=0.008 n=5+5)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit 1cd71dfd7 added isSecSuffix, but the same thing can be done
easily without a regex. This is faster and saves some init time and
memory.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
parseMountOption already returns way too many values, making the code
kind of hard to read.
Since all of the return values are used as is to populate the fields of
configs.Mount, let's change it to return (semi-)populated *configs.Mount
instead.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This makes the repeated calls to parseMountOptions faster,
and decreases the amount of garbage to collect.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
These two maps are the same, except that mountPropagationMapping
has an extra element with key of "" and value of 0. Since the
code already checks for f != 0, this extra element is not a problem.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Eliminate some of these allocations when starting runc:
> init github.com/opencontainers/runc/libcontainer/specconv @10 ms, 0.11 ms clock, 5408 bytes, 70 allocs
Most of this (4K) is the two regexes, which are left intact for now.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. Make Rootless and Systemd flags part of config.Cgroups.
2. Make all cgroup managers (not just fs2) return error (so it can do
more initialization -- added by the following commits).
3. Replace complicated cgroup manager instantiation in factory_linux
by a single (and simple) libcontainer/cgroups/manager.New() function.
4. getUnifiedPath is simplified to check that only a single path is
supplied (rather than checking that other paths, if supplied,
are the same).
[v2: can't -> cannot]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This commit implements support for the SCMP_ACT_NOTIFY action. It
requires libseccomp-2.5.0 to work but runc still works with older
libseccomp if the seccomp policy does not use the SCMP_ACT_NOTIFY
action.
A new synchronization step between runc[INIT] and runc run is introduced
to pass the seccomp fd. runc run fetches the seccomp fd with pidfd_get
from the runc[INIT] process and sends it to the seccomp agent using
SCM_RIGHTS.
As suggested by @kolyshkin, we also make writeSync() a wrapper of
writeSyncWithFd() and wrap the error there. To avoid pointless errors,
we made some existing code paths just return the error instead of
re-wrapping it. If we don't do it, error will look like:
writing syncT <act>: writing syncT: <err>
By adjusting the code path, now they just look like this
writing syncT <act>: <err>
Signed-off-by: Alban Crequy <alban@kinvolk.io>
Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>
For files that end with _linux.go or _linux_test.go, there is no need to
specify linux build tag, as it is assumed from the file name.
In addition, rename libcontainer/notify_linux_v2.go -> libcontainer/notify_v2_linux.go
for the file name to make sense.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Handle ClosID parameter of IntelRdt. Makes it possible to use
pre-configured classes/ClosIDs and avoid running out of available IDs
which easily happens with per-container classes.
Remove validator checks for empty L3CacheSchema and MemBwSchema fields
in order to be able to leave them empty, and only specify ClosID for
a pre-configured class.
Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
In cases we have something like
if y != "" {
x = y
}
where both x and y are strings, and x was not set before,
it makes no sense to have a condition, as such code is
equivalent to mere
x = y
Simplify such cases by removing "if".
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This should result in no change when the error is printed, but make the
errors returned unwrappable, meaning errors.As and errors.Is will work.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Using fmt.Errorf for errors that do not have %-style formatting
directives is an overkill. Switch to errors.New.
Found by
git grep fmt.Errorf | grep -v ^vendor | grep -v '%'
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commits 1f1e91b1a0 and 2192670a24
added validation for mountpoints to be an absolute path, to match the OCI
specs.
Unfortunately, the old behavior (accepting the path to be a relative path)
has been around for a long time, and although "not according to the spec",
various higher level runtimes rely on this behavior.
While higher level runtime have been updated to address this requirement,
there will be a transition period before all runtimes are updated to carry
these fixes.
This patch relaxes the validation, to generate a WARNING instead of failing,
allowing runtimes to update (but allowing them to update runc to the current
version, which includes security fixes).
We can remove this exception in a future patch release.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
gofumpt (mvdan.cc/gofumpt) is a fork of gofmt with stricter rules.
Brought to you by
git ls-files \*.go | grep -v ^vendor/ | xargs gofumpt -s -w
Looking at the diff, all these changes make sense.
Also, replace gofmt with gofumpt in golangci.yml.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Per OCI runtime spec, mount destination MUST be absolute. Let's check
that and return an error if not.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This is somewhat radical approach to deal with kernel memory.
Per-cgroup kernel memory limiting was always problematic. A few
examples:
- older kernels had bugs and were even oopsing sometimes (best example
is RHEL7 kernel);
- kernel is unable to reclaim the kernel memory so once the limit is
hit a cgroup is toasted;
- some kernel memory allocations don't allow failing.
In addition to that,
- users don't have a clue about how to set kernel memory limits
(as the concept is much more complicated than e.g. [user] memory);
- different kernels might have different kernel memory usage,
which is sort of unexpected;
- cgroup v2 do not have a [dedicated] kmem limit knob, and thus
runc silently ignores kernel memory limits for v2;
- kernel v5.4 made cgroup v1 kmem.limit obsoleted (see
https://github.com/torvalds/linux/commit/0158115f702b).
In view of all this, and as the runtime-spec lists memory.kernel
and memory.kernelTCP as OPTIONAL, let's ignore kernel memory
limits (for cgroup v1, same as we're already doing for v2).
This should result in less bugs and better user experience.
The only bad side effect from it might be that stat can show kernel
memory usage as 0 (since the accounting is not enabled).
[v2: add a warning in specconv that limits are ignored]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
c.Parent is only used by systemd cgroup drivers, and both v1 and v2
drivers do have code to set the default if it is empty, so setting
it here is redundant.
In addition, in case of cgroup v2 rootless container setting it here
is harmful as the default should be user.slice not system.slice.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Having -EPERM is the default was a fairly significant mistake from a
future-proofing standpoint in that it makes any new syscall return a
non-ignorable error (from glibc's point of view). We need to correct
this now because faccessat2(2) is something glibc critically needs to
have support for, but they're blocked on container runtimes because we
return -EPERM unconditionally (leading to confusion in glibc). This is
also a problem we're probably going to keep running into in the future.
Unfortunately there are several issues which stop us from having a clean
solution to this problem:
1. libseccomp has several limitations which require us to emulate
behaviour we want:
a. We cannot do logic based on syscall number, meaning we cannot
specify a "largest known syscall number";
b. libseccomp doesn't know in which kernel version a syscall was
added, and has no API for "minimum kernel version" so we cannot
simply ask libseccomp to generate sane -ENOSYS rules for us.
c. Additional seccomp rules for the same syscall are not treated as
distinct rules -- if rules overlap, seccomp will merge them. This
means we cannot add per-syscall -EPERM fallbacks;
d. There is no inverse operation for SCMP_CMP_MASKED_EQ;
e. libseccomp does not allow you to specify multiple rules for a
single argument, making it impossible to invert OR rules for
arguments.
2. The runtime-spec does not have any way of specifying:
a. The errno for the default action;
b. The minimum kernel version or "newest syscall at time of profile
creation"; nor
c. Which syscalls were intentionally excluded from the allow list
(weird syscalls that are no longer used were excluded entirely,
but Docker et al expect those syscalls to get EPERM not ENOSYS).
3. Certain syscalls should not return -ENOSYS (especially only for
certain argument combinations) because this could also trigger glibc
confusion. This means we have to return -EPERM for certain syscalls
but not as a global default.
4. There is not an obvious (and reasonable) upper limit to syscall
numbers, so we cannot create a set of rules for each syscall above
the largest syscall number in libseccomp. This means we must handle
inverse rules as described below.
5. Any syscall can be specified multiple times, which can make
generation of hotfix rules much harder.
As a result, we have to work around all of these things by coming up
with a heuristic to stop the bleeding. In the future we could hopefully
improve the situation in the runtime-spec and libseccomp.
The solution applied here is to prepend a "stub" filter which returns
-ENOSYS if the requested syscall has a larger syscall number than any
syscall mentioned in the filter. The reason for this specific rule is
that syscall numbers are (roughly) allocated sequentially and thus newer
syscalls will (usually) have a larger syscall number -- thus causing our
filters to produce -ENOSYS if the filter was written before the syscall
existed.
Sadly this is not a perfect solution because syscalls can be added
out-of-order and the syscall table can contain holes for several
releases. Unfortuntely we do not have a nicer solution at the moment
because there is no library which provides information about which Linux
version a syscall was introduced in. Until that exists, this workaround
will have to be good enough.
The above behaviour only happens if the default action is a blocking
action (in other words it is not SCMP_ACT_LOG or SCMP_ACT_ALLOW). If the
default action is permissive then we don't do any patching.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Move the Device-related types to libcontainer/devices, so that
the package can be used in isolation. Aliases have been created
in libcontainer/configs for backward compatibility.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Runc has a set of default devices that it includes in Linux containers
(e.g., /dev/null, /dev/random, /dev/tty, etc.)
However if the container's OCI spec includes all or a subset of those same devices,
runc is currently not detecting the redundancy, causing it to create a lib
container config that has redundant device configurations.
This causes a failure in rootless mode, in particular when the /dev/tty device
has a redundant config:
container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: rootfs_linux.go:70: creating device nodes caused: open /tmp/busyboxtest/rootfs/dev/tty: no such device or address"
The reason this fails in rootless mode only is that in this case runc sets up
/dev/tty not by doing mknod (it's not allowed within a user-ns) but rather by
creating a regular file under /dev/tty and bind-mounting the host's /dev/tty to
the container's /dev/tty. When this operation is done redundantly, it fails the
second time.
This change fixes this problem by ensuring runc checks for redundant devices
between the OCI spec it receives and the default devices it configures. If
a redundant device is detected, the OCI spec takes priority.
The change adds both a unit test and an integration test to verify the
behavior. Without this fix, this new integration test fails as shown above.
Signed-off-by: Cesar Talledo <ctalledo@nestybox.com>
This (and the converting function) is only used by one of the four
cgroup drivers. The other three do some checking and conversion in
place, so let the fs2 do the same.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Making them the same type is simply confusing, but also means that you
could accidentally use one in the wrong context. This eliminates that
problem. This also includes a whole bunch of cleanups for the types
within DeviceRule, so that they can be used more ergonomically.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
/dev/console is a host resouce which gives a bunch of permissions that
we really shouldn't be giving to containers, not to mention that
/dev/console in containers is actually /dev/pts/$n. Drop this since
arguably this is a fairly scary thing to allow...
Signed-off-by: Aleksa Sarai <asarai@suse.de>
These lists have been in the codebase for a very long time, and have
been unused for a large portion of that time -- specconv doesn't
generate them and the only user of these flags has been tests (which
doesn't inspire much confidence).
In addition, we had an incorrect implementation of a white-list policy.
This wasn't exploitable because all of our users explicitly specify
"deny all" as the first rule, but it was a pretty glaring issue that
came from the "feature" that users can select whether they prefer a
white- or black- list. Fix this by always writing a deny-all rule (which
is what our users were doing anyway, to work around this bug).
This is one of many changes needed to clean up the devices cgroup code.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
The change ensures that the passed in value of NoNewPrivileges under spec.Process
is reflected in the container config generated by specconv.CreateLibcontainerConfig
Closes#2397
Signed-off-by: Pradyumna Agrawal <pradyumnaa@vmware.com>