In these cases, this is exactly what we want to find out.
Slightly improves performance and readability.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit 4316df8b53 isolated RunningInUserNS
to a separate package to make it easier to consume without bringing in
additional dependencies, and with the potential to move it separate in
a similar fashion as libcontainer/user was moved to a separate module
in commit ca32014adb. While RunningInUserNS
is fairly trivial to implement, it (or variants of this utility) is used
in many codebases, and moving to a separate module could consolidate
those implementations, as well as making it easier to consume without
large dependency trees (when being a package as part of a larger code
base).
Commit 1912d5988b and follow-ups introduced
cgo code into the userns package, and code introduced in those commits
are not intended for external use, therefore complicating the potential
of moving the userns package separate.
This commit moves the new code to a separate package; some of this code
was included in v1.1.11 and up, but I could not find external consumers
of `GetUserNamespaceMappings` and `IsSameMapping`. The `Mapping` and
`Handles` types (added in ba0b5e2698) only
exist in main and in non-stable releases (v1.2.0-rc.x), so don't need
an alias / deprecation.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Using ints for all of our mapping structures means that a 32-bit binary
errors out when trying to parse /proc/self/*id_map:
failed to cache mappings for userns: failed to parse uid_map of userns /proc/1/ns/user:
parsing id map failed: invalid format in line " 0 0 4294967295": integer overflow on token 4294967295
This issue was unearthed by commit 1912d5988b ("*: actually support
joining a userns with a new container") but the underlying issue has
been present since the docker/libcontainer days.
In theory, switching to uint32 (to match the spec) instead of int64
would also work, but keeping everything signed seems much less
error-prone. It's also important to note that a mapping might be too
large for an int on 32-bit, so we detect this during the mapping.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
ridmap indicates that the id mapping should be applied recursively (only
really relevant for rbind mount entries), and idmap indicates that it
should not be applied recursively (the default). If no mappings are
specified for the mount, we use the userns configuration of the
container. This matches the behaviour in the currently-unreleased
runtime-spec.
This includes a minor change to the state.json serialisation format, but
because there has been no released version of runc with commit
fbf183c6f8 ("Add uid and gid mappings to mounts"), we can safely make
this change without affecting running containers. Doing it this way
makes it much easier to handle m.IsIDMapped() and indicating that a
mapping has been specified.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
If a user specifies a configuration like "rro, rrw", we should have
similar behaviour to "ro, rw" where we clear the previous flags so that
the last specified flag takes precendence.
Fixes: 382eba4354 ("Support recursive mount attrs ("rro", "rnosuid", "rnodev", ...)")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
It turns out that the error added in commit 09822c3da8 ("configs:
disallow ambiguous userns and timens configurations") causes issues with
containerd and CRIO because they pass both userns mappings and a userns
path.
These configurations are broken, but to avoid the regression in this one
case, output a warning to tell the user that the configuration is
incorrect but we will continue to use it if and only if the configured
mappings are identical to the mappings of the provided namespace.
Fixes: 09822c3da8 ("configs: disallow ambiguous userns and timens configurations")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
For userns and timens, the mappings (and offsets, respectively) cannot
be changed after the namespace is first configured. Thus, configuring a
container with a namespace path to join means that you cannot also
provide configuration for said namespace. Previously we would silently
ignore the configuration (and just join the provided path), but we
really should be returning an error (especially when you consider that
the configuration userns mappings are used quite a bit in runc with the
assumption that they are the correct mapping for the userns -- but in
this case they are not).
In the case of userns, the mappings are also required if you _do not_
specify a path, while in the case of the time namespace you can have a
container with a timens but no mappings specified.
It should be noted that the case checking that the user has not
specified a userns path and a userns mapping needs to be handled in
specconv (as opposed to the configuration validator) because with this
patchset we now cache the mappings of path-based userns configurations
and thus the validator can't be sure whether the mapping is a cached
mapping or a user-specified one. So we do the validation in specconv,
and thus the test for this needs to be an integration test.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Our handling for name space paths with user namespaces has been broken
for a long time. In particular, the need to parse /proc/self/*id_map in
quite a few places meant that we would treat userns configurations that
had a namespace path as if they were a userns configuration without
mappings, resulting in errors.
The primary issue was down to the id translation helper functions, which
could only handle configurations that had explicit mappings. Obviously,
when joining a user namespace we need to map the ids but figuring out
the correct mapping is non-trivial in comparison.
In order to get the mapping, you need to read /proc/<pid>/*id_map of a
process inside the userns -- while most userns paths will be of the form
/proc/<pid>/ns/user (and we have a fast-path for this case), this is not
guaranteed and thus it is necessary to spawn a process inside the
container and read its /proc/<pid>/*id_map files in the general case.
As Go does not allow us spawn a subprocess into a target userns,
we have to use CGo to fork a sub-process which does the setns(2). To be
honest, this is a little dodgy in regards to POSIX signal-safety(7) but
since we do no allocations and we are executing in the forked context
from a Go program (not a C program), it should be okay. The other
alternative would be to do an expensive re-exec (a-la nsexec which would
make several other bits of runc more complicated), or to use nsenter(1)
which might not exist on the system and is less than ideal.
Because we need to logically remap users quite a few times in runc
(including in "runc init", where joining the namespace is not feasable),
we cache the mapping inside the libcontainer config struct. A future
patch will make sure that we stop allow invalid user configurations
where a mapping is specified as well as a userns path to join.
Finally, add an integration test to make sure we don't regress this again.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
The original reasoning for this option was to avoid having mount options
be overwritten by runc. However, adding command-line arguments has
historically been a bad idea because it forces strict-runc-compatible
OCI runtimes to copy out-of-spec features directly from runc and these
flags are usually quite difficult to enable by users when using runc
through several layers of engines and orchestrators.
A far more preferable solution is to have a heuristic which detects
whether copying the original mount's mount options would override an
explicit mount option specified by the user. In this case, we should
return an error. You only end up in this path in the userns case, if you
have a bind-mount source with locked flags.
During the course of writing this patch, I discovered that several
aspects of our handling of flags for bind-mounts left much to be
desired. We have completely botched the handling of explicitly cleared
flags since commit 97f5ee4e6a ("Only remount if requested flags differ
from current"), with our behaviour only becoming increasingly more weird
with 50105de1d8 ("Fix failure with rw bind mount of a ro fuse") and
da780e4d27 ("Fix bind mounts of filesystems with certain options
set"). In short, we would only clear flags explicitly request by the
user purely by chance, in ways that it really should've been reported to
us by now. The most egregious is that mounts explicitly marked "rw" were
actually mounted "ro" if the bind-mount source was "ro" and no other
special flags were included. In addition, our handling of atime was
completely broken -- mostly due to how subtle the semantics of atime are
on Linux.
Unfortunately, while the runtime-spec requires us to implement
mount(8)'s behaviour, several aspects of the util-linux mount(8)'s
behaviour are broken and thus copying them makes little sense. Since the
runtime-spec behaviour for this case (should mount options for a "bind"
mount use the "mount --bind -o ..." or "mount --bind -o remount,..."
semantics? Is the fallback code we have for userns actually
spec-compliant?) and the mount(8) behaviour (see [1]) are not
well-defined, this commit simply fixes the most obvious aspects of the
behaviour that are broken while keeping the current spirit of the
implementation.
NOTE: The handling of atime in the base case is left for a future PR to
deal with. This means that the atime of the source mount will be
silently left alone unless the fallback path needs to be taken, and any
flags not explicitly set will be cleared in the base case. Whether we
should always be operating as "mount --bind -o remount,..." (where we
default to the original mount source flags) is a topic for a separate PR
and (probably) associated runtime-spec PR.
So, to resolve this:
* We store which flags were explicitly requested to be cleared by the
user, so that we can detect whether the userns fallback path would end
up setting a flag the user explicitly wished to clear. If so, we
return an error because we couldn't fulfil the configuration settings.
* Revert 97f5ee4e6a ("Only remount if requested flags differ from
current"), as missing flags do not mean we can skip MS_REMOUNT (in
fact, missing flags are how you indicate a flag needs to be cleared
with mount(2)). The original purpose of the patch was to fix the
userns issue, but as mentioned above the correct mechanism is to do a
fallback mount that copies the lockable flags from statfs(2).
* Improve handling of atime in the fallback case by:
- Correctly handling the returned flags in statfs(2).
- Implement the MNT_LOCK_ATIME checks in our code to ensure we
produce errors rather than silently producing incorrect atime
mounts.
* Improve the tests so we correctly detect all of these contingencies,
including a general "bind-mount atime handling" test to ensure that
the behaviour described here is accurate.
This change also inlines the remount() function -- it was only ever used
for the bind-mount remount case, and its behaviour is very bind-mount
specific.
[1]: https://github.com/util-linux/util-linux/issues/2433
Reverts: 97f5ee4e6a ("Only remount if requested flags differ from current")
Fixes: 50105de1d8 ("Fix failure with rw bind mount of a ro fuse")
Fixes: da780e4d27 ("Fix bind mounts of filesystems with certain options set")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
From the Go specification:
"1. For a nil slice, the number of iterations is 0." [1]
Therefore, an additional nil check for before the loop is unnecessary.
[1]: https://go.dev/ref/spec#For_range
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
Burstable CFS controller is introduced in Linux 5.14. This helps with
parallel workloads that might be bursty. They can get throttled even
when their average utilization is under quota. And they may be latency
sensitive at the same time so that throttling them is undesired.
This feature borrows time now against the future underrun, at the cost
of increased interference against the other system users, by introducing
cfs_burst_us into CFS bandwidth control to enact the cap on unused
bandwidth accumulation, which will then used additionally for burst.
The patch adds the support/control for CFS bandwidth burst.
runtime-spec: https://github.com/opencontainers/runtime-spec/pull/1120
Co-authored-by: Akihiro Suda <suda.kyoto@gmail.com>
Co-authored-by: Nadeshiko Manju <me@manjusaka.me>
Signed-off-by: Kailun Qin <kailun.qin@intel.com>
golangci-lint v1.54.2 comes with errorlint v1.4.4, which contains
the fix [1] whitelisting all errno comparisons for errors coming from
x/sys/unix.
Thus, these annotations are no longer necessary. Hooray!
[1] https://github.com/polyfloyd/go-errorlint/pull/47
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Fix up a few things that were flagged in the review of the original
timens PR, namely around error handling and validation.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
"time" namespace was introduced in Linux v5.6
support new time namespace to set boottime and monotonic time offset
Example runtime spec
"timeOffsets": {
"monotonic": {
"secs": 172800,
"nanosecs": 0
},
"boottime": {
"secs": 604800,
"nanosecs": 0
}
}
Signed-off-by: Chethan Suresh <chethan.suresh@sony.com>
Currently bind mounts of filesystems with nodev, nosuid, noexec,
noatime, relatime, strictatime, nodiratime options set fail in rootless
mode if the same options are not set for the bind mount.
For ro filesystems this was resolved by #2570 by remounting again
with ro set.
Follow the same approach for nodev, nosuid, noexec, noatime, relatime,
strictatime, nodiratime but allow to revert back to the old behaviour
via the new `--no-mount-fallback` command line option.
Add a testcase to verify that bind mounts of filesystems with nodev,
nosuid, noexec, noatime options set work in rootless mode.
Add a testcase that mounts a nodev, nosuid, noexec, noatime filesystem
with a ro flag.
Add two further testcases that ensure that the above testcases would
fail if the `--no-mount-fallback` command line option is set.
* contrib/completions/bash/runc:
Add `--no-mount-fallback` command line option for bash completion.
* create.go:
Add `--no-mount-fallback` command line option.
* restore.go:
Add `--no-mount-fallback` command line option.
* run.go:
Add `--no-mount-fallback` command line option.
* libcontainer/configs/config.go:
Add `NoMountFallback` field to the `Config` struct to store
the command line option value.
* libcontainer/specconv/spec_linux.go:
Add `NoMountFallback` field to the `CreateOpts` struct to store
the command line option value and store it in the libcontainer
config.
* utils_linux.go:
Store the command line option value in the `CreateOpts` struct.
* libcontainer/rootfs_linux.go:
In case that `--no-mount-fallback` is not set try to remount the
bind filesystem again with the options nodev, nosuid, noexec,
noatime, relatime, strictatime or nodiratime if they are set on
the source filesystem.
* tests/integration/mounts_sshfs.bats:
Add testcases and rework sshfs setup to allow specifying
different mount options depending on the test case.
Signed-off-by: Ruediger Pluem <ruediger.pluem@vodafone.com>
We just added this helper for other parts of the code, let's switch this
function to use the helper too.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
From caaef1ba8c :
> In fact SB_POSIXACL is an internal flag, and accepting MS_POSIXACL on
> the mount(2) interface is possibly a bug.
Fix issue 3738
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
If no seccomps flags are set in OCI runtime spec (not even the empty
set), set SPEC_ALLOW as the default (if it's supported).
Otherwise, use the flags as they are set (that includes no flags for
empty seccomp.Flags array).
This mimics the crun behavior, and makes runc seccomp performance on par
with crun.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Amend runc features to print seccomp flags. Two set of flags are added:
* known flags are those that this version of runc is aware of;
* supported flags are those that can be set; normally, this is the same
set as known flags, but due to older version of kernel and/or
libseccomp, some known flags might be unsupported.
This commit also consolidates three different switch statements dealing
with flags into one, in func setFlag. A note is added to this function
telling what else to look for when adding new flags.
Unfortunately, it also adds a list of known flags, that should be
kept in sync with the switch statement.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This is aimed at solving the problem of cgroup v2 memory controller
behavior which is not compatible with that of cgroup v1.
In cgroup v1, if the new memory limit being set is lower than the
current usage, setting the new limit fails.
In cgroup v2, same operation succeeds, and the container is OOM killed.
Introduce a new setting, memory.checkBeforeUpdate, and use it to mimic
cgroup v1 behavior.
Note that this is not 100% reliable because of TOCTOU, but this is the
best we can do.
Add some test cases.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Since Go 1.19, godoc recognizes lists, code blocks, headings etc. It
also reformats the sources making it more apparent that these features
are used.
Fix a few places where it misinterpreted the formatting (such as
indented vs unindented), and format the result using the gofumpt
from HEAD, which already incorporates gofmt 1.19 changes.
Some more fixes (and enhancements) might be required.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
List of seccomp flags defined in runtime-spec:
* SECCOMP_FILTER_FLAG_TSYNC
* SECCOMP_FILTER_FLAG_LOG
* SECCOMP_FILTER_FLAG_SPEC_ALLOW
Note that runc does not apply SECCOMP_FILTER_FLAG_TSYNC. It does not
make sense to apply the seccomp filter on only one thread; other threads
will be terminated after exec anyway.
See similar commit in crun:
fefabffa28
Note that SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV (introduced by
https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?id=c2aa2dfef243
in Linux 5.19-rc1) is not added yet because Linux 5.19 is not released
yet.
Signed-off-by: Alban Crequy <albancrequy@microsoft.com>
Looking through git blame, this was added by commit 9fac18329
aka "Initial commit of runc binary", most probably by mistake.
Obviously, a container should not have access to tun/tap device, unless
it is explicitly specified in configuration.
Now, removing this might create a compatibility issue, but I see no
other choice.
Aside from the obvious misconfiguration, this should also fix the
annoying
> Apr 26 03:46:56 foo.bar systemd[1]: Couldn't stat device /dev/char/10:200: No such file or directory
messages from systemd on every container start, when runc uses systemd
cgroup driver, and the system runs an old (< v240) version of systemd
(the message was presumably eliminated by [1]).
[1] d5aecba6e0
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit 029b73c1b replaced a regular expression with code checking the
characters. Despite what the comment said about ASCII, the check was
performed rune by rune, not byte by byte.
Note the check was still correct, basically comparing int32's, but the
byte by byte way is a tad faster and more straightforward. The change
also fixes the issue of a misleading comment.
Benchmark before/after shows a modest improvement:
name old time/op new time/op delta
CheckPropertyName-4 164ns ± 2% 123ns ± 2% -24.73% (p=0.029 n=4+4)
name old alloc/op new alloc/op delta
CheckPropertyName-4 96.0B ± 0% 64.0B ± 0% -33.33% (p=0.029 n=4+4)
name old allocs/op new allocs/op delta
CheckPropertyName-4 6.00 ± 0% 4.00 ± 0% -33.33% (p=0.029 n=4+4)
Fixes: 029b73c1b
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The new mount option "rro" makes the mount point recursively read-only,
by calling `mount_setattr(2)` with `MOUNT_ATTR_RDONLY` and `AT_RECURSIVE`.
https://man7.org/linux/man-pages/man2/mount_setattr.2.html
Requires kernel >= 5.12.
The "rro" option string conforms to the proposal in util-linux/util-linux Issue 1501.
Fix issue 2823
Similary, this commit also adds the following mount options:
- rrw
- r[no]{suid,dev,exec,relatime,atime,strictatime,diratime,symfollow}
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Delegating cgroups to the container enables more complex workloads,
including systemd-based workloads. The OCI runtime-spec was
recently updated to explicitly admit such delegation, through
specification of cgroup ownership semantics:
https://github.com/opencontainers/runtime-spec/pull/1123
Pursuant to the updated OCI runtime-spec, change the ownership of
the container's cgroup directory and particular files therein, when
using cgroups v2 and when the cgroupfs is to be mounted read/write.
As a result of this change, systemd workloads can run in isolated
user namespaces on OpenShift when the sandbox's cgroupfs is mounted
read/write.
It might be possible to implement this feature in other cgroup
managers, but that work is deferred.
Signed-off-by: Fraser Tweedale <ftweedal@redhat.com>
Using null bytes as control characters for sending strings via netlink
opens us up to a user explicitly putting a null byte in a mount string
(which JSON will happily let you do) and then causing us to open a mount
path different to the one expected.
In practice this is more of an issue in an environment such as
Kubernetes where you may have path-based access control policies (which
are more susceptible to these kinds of flaws).
Found by Google Project Zero.
Fixes: 9c444070ec ("Open bind mount sources from the host userns")
Reported-by: Felix Wilhelm <fwilhelm@google.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
1. Decapitalize errors.
2. Rename isValidName to checkPropertyName.
3. Make it return a specific error.
Suggested-by: Sebastiaan van Stijn <github@gone.nl>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Also, add a simple test and a benchmark (just out of sheer curiosity).
Benchmark results:
name old time/op new time/op delta
IsValidName-4 540ns ± 3% 45ns ± 1% -91.76% (p=0.008 n=5+5)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit 1cd71dfd7 added isSecSuffix, but the same thing can be done
easily without a regex. This is faster and saves some init time and
memory.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
parseMountOption already returns way too many values, making the code
kind of hard to read.
Since all of the return values are used as is to populate the fields of
configs.Mount, let's change it to return (semi-)populated *configs.Mount
instead.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This makes the repeated calls to parseMountOptions faster,
and decreases the amount of garbage to collect.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
These two maps are the same, except that mountPropagationMapping
has an extra element with key of "" and value of 0. Since the
code already checks for f != 0, this extra element is not a problem.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Eliminate some of these allocations when starting runc:
> init github.com/opencontainers/runc/libcontainer/specconv @10 ms, 0.11 ms clock, 5408 bytes, 70 allocs
Most of this (4K) is the two regexes, which are left intact for now.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>