The warnings fixed were:
libcontainer/configs/config_test.go:205:12: printf: non-constant format string in call to (*testing.common).Errorf (govet)
t.Errorf(fmt.Sprintf("Expected error to not occur but it was %+v", err))
^
libcontainer/cgroups/fs/blkio_test.go:481:13: printf: non-constant format string in call to (*testing.common).Errorf (govet)
t.Errorf(fmt.Sprintf("test case '%s' failed unexpectedly: %s", testCase.desc, err))
^
libcontainer/cgroups/fs/blkio_test.go:595:13: printf: non-constant format string in call to (*testing.common).Errorf (govet)
t.Errorf(fmt.Sprintf("test case '%s' failed unexpectedly: %s", testCase.desc, err))
^
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Using ints for all of our mapping structures means that a 32-bit binary
errors out when trying to parse /proc/self/*id_map:
failed to cache mappings for userns: failed to parse uid_map of userns /proc/1/ns/user:
parsing id map failed: invalid format in line " 0 0 4294967295": integer overflow on token 4294967295
This issue was unearthed by commit 1912d5988b ("*: actually support
joining a userns with a new container") but the underlying issue has
been present since the docker/libcontainer days.
In theory, switching to uint32 (to match the spec) instead of int64
would also work, but keeping everything signed seems much less
error-prone. It's also important to note that a mapping might be too
large for an int on 32-bit, so we detect this during the mapping.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
ridmap indicates that the id mapping should be applied recursively (only
really relevant for rbind mount entries), and idmap indicates that it
should not be applied recursively (the default). If no mappings are
specified for the mount, we use the userns configuration of the
container. This matches the behaviour in the currently-unreleased
runtime-spec.
This includes a minor change to the state.json serialisation format, but
because there has been no released version of runc with commit
fbf183c6f8 ("Add uid and gid mappings to mounts"), we can safely make
this change without affecting running containers. Doing it this way
makes it much easier to handle m.IsIDMapped() and indicating that a
mapping has been specified.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
With the idmap work, we will have a tainted Go thread in our
thread-group that has a different mount namespace to the other threads.
It seems that (due to some bad luck) the Go scheduler tends to make this
thread the thread-group leader in our tests, which results in very
baffling failures where /proc/self/mountinfo produces gibberish results.
In order to avoid this, switch to using /proc/thread-self for everything
that is thread-local. This primarily includes switching all file
descriptor paths (CLONE_FS), all of the places that check the current
cgroup (technically we never will run a single runc thread in a separate
cgroup, but better to be safe than sorry), and the aforementioned
mountinfo code. We don't need to do anything for the following because
the results we need aren't thread-local:
* Checks that certain namespaces are supported by stat(2)ing
/proc/self/ns/...
* /proc/self/exe and /proc/self/cmdline are not thread-local.
* While threads can be in different cgroups, we do not do this for the
runc binary (or libcontainer) and thus we do not need to switch to
the thread-local version of /proc/self/cgroups.
* All of the CLONE_NEWUSER files are not thread-local because you
cannot set the usernamespace of a single thread (setns(CLONE_NEWUSER)
is blocked for multi-threaded programs).
Note that we have to use runtime.LockOSThread when we have an open
handle to a tid-specific procfs file that we are operating on multiple
times. Go can reschedule us such that we are running on a different
thread and then kill the original thread (causing -ENOENT or similarly
confusing errors). This is not strictly necessary for most usages of
/proc/thread-self (such as using /proc/thread-self/fd/$n directly) since
only operating on the actual inodes associated with the tid requires
this locking, but because of the pre-3.17 fallback for CentOS, we have
to do this in most cases.
In addition, CentOS's kernel is too old for /proc/thread-self, which
requires us to emulate it -- however in rootfs_linux.go, we are in the
container pid namespace but /proc is the host's procfs. This leads to
the incredibly frustrating situation where there is no way (on pre-4.1
Linux) to figure out which /proc/self/task/... entry refers to the
current tid. We can just use /proc/self in this case.
Yes this is all pretty ugly. I also wish it wasn't necessary.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
With the rework of nsexec.c to handle MOUNT_ATTR_IDMAP in our Go code we
can now handle arbitrary mappings without issue, so remove the primary
artificial limit of mappings (must use the same mapping as the
container's userns) and add some tests.
We still only support idmap mounts for bind-mounts because configuring
mappings for other filesystems would require switching our entire mount
machinery to the new mount API. The current design would easily allow
for this but we would need to convert new mount options entirely to the
fsopen/fsconfig/fsmount API. This can be done in the future.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
For userns and timens, the mappings (and offsets, respectively) cannot
be changed after the namespace is first configured. Thus, configuring a
container with a namespace path to join means that you cannot also
provide configuration for said namespace. Previously we would silently
ignore the configuration (and just join the provided path), but we
really should be returning an error (especially when you consider that
the configuration userns mappings are used quite a bit in runc with the
assumption that they are the correct mapping for the userns -- but in
this case they are not).
In the case of userns, the mappings are also required if you _do not_
specify a path, while in the case of the time namespace you can have a
container with a timens but no mappings specified.
It should be noted that the case checking that the user has not
specified a userns path and a userns mapping needs to be handled in
specconv (as opposed to the configuration validator) because with this
patchset we now cache the mappings of path-based userns configurations
and thus the validator can't be sure whether the mapping is a cached
mapping or a user-specified one. So we do the validation in specconv,
and thus the test for this needs to be an integration test.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
If a user has misconfigured their userns mappings, they need to know
which id specifically is not mapped. There's no need to be vague.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Our handling for name space paths with user namespaces has been broken
for a long time. In particular, the need to parse /proc/self/*id_map in
quite a few places meant that we would treat userns configurations that
had a namespace path as if they were a userns configuration without
mappings, resulting in errors.
The primary issue was down to the id translation helper functions, which
could only handle configurations that had explicit mappings. Obviously,
when joining a user namespace we need to map the ids but figuring out
the correct mapping is non-trivial in comparison.
In order to get the mapping, you need to read /proc/<pid>/*id_map of a
process inside the userns -- while most userns paths will be of the form
/proc/<pid>/ns/user (and we have a fast-path for this case), this is not
guaranteed and thus it is necessary to spawn a process inside the
container and read its /proc/<pid>/*id_map files in the general case.
As Go does not allow us spawn a subprocess into a target userns,
we have to use CGo to fork a sub-process which does the setns(2). To be
honest, this is a little dodgy in regards to POSIX signal-safety(7) but
since we do no allocations and we are executing in the forked context
from a Go program (not a C program), it should be okay. The other
alternative would be to do an expensive re-exec (a-la nsexec which would
make several other bits of runc more complicated), or to use nsenter(1)
which might not exist on the system and is less than ideal.
Because we need to logically remap users quite a few times in runc
(including in "runc init", where joining the namespace is not feasable),
we cache the mapping inside the libcontainer config struct. A future
patch will make sure that we stop allow invalid user configurations
where a mapping is specified as well as a userns path to join.
Finally, add an integration test to make sure we don't regress this again.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Bind-mounts cannot have any filesystem-specific "data" arguments,
because the kernel ignores the data argument for MS_BIND and
MS_BIND|MS_REMOUNT and we cannot safely try to override the flags
because those would affect mounts on the host (these flags affect the
superblock).
It should be noted that there are cases where the filesystem-specified
flags will also be ignored for non-bind-mounts but those are kernel
quirks and there's no real way for us to work around them. And users
wouldn't get any real benefit from us adding guardrails to existing
kernel behaviour.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
The original reasoning for this option was to avoid having mount options
be overwritten by runc. However, adding command-line arguments has
historically been a bad idea because it forces strict-runc-compatible
OCI runtimes to copy out-of-spec features directly from runc and these
flags are usually quite difficult to enable by users when using runc
through several layers of engines and orchestrators.
A far more preferable solution is to have a heuristic which detects
whether copying the original mount's mount options would override an
explicit mount option specified by the user. In this case, we should
return an error. You only end up in this path in the userns case, if you
have a bind-mount source with locked flags.
During the course of writing this patch, I discovered that several
aspects of our handling of flags for bind-mounts left much to be
desired. We have completely botched the handling of explicitly cleared
flags since commit 97f5ee4e6a ("Only remount if requested flags differ
from current"), with our behaviour only becoming increasingly more weird
with 50105de1d8 ("Fix failure with rw bind mount of a ro fuse") and
da780e4d27 ("Fix bind mounts of filesystems with certain options
set"). In short, we would only clear flags explicitly request by the
user purely by chance, in ways that it really should've been reported to
us by now. The most egregious is that mounts explicitly marked "rw" were
actually mounted "ro" if the bind-mount source was "ro" and no other
special flags were included. In addition, our handling of atime was
completely broken -- mostly due to how subtle the semantics of atime are
on Linux.
Unfortunately, while the runtime-spec requires us to implement
mount(8)'s behaviour, several aspects of the util-linux mount(8)'s
behaviour are broken and thus copying them makes little sense. Since the
runtime-spec behaviour for this case (should mount options for a "bind"
mount use the "mount --bind -o ..." or "mount --bind -o remount,..."
semantics? Is the fallback code we have for userns actually
spec-compliant?) and the mount(8) behaviour (see [1]) are not
well-defined, this commit simply fixes the most obvious aspects of the
behaviour that are broken while keeping the current spirit of the
implementation.
NOTE: The handling of atime in the base case is left for a future PR to
deal with. This means that the atime of the source mount will be
silently left alone unless the fallback path needs to be taken, and any
flags not explicitly set will be cleared in the base case. Whether we
should always be operating as "mount --bind -o remount,..." (where we
default to the original mount source flags) is a topic for a separate PR
and (probably) associated runtime-spec PR.
So, to resolve this:
* We store which flags were explicitly requested to be cleared by the
user, so that we can detect whether the userns fallback path would end
up setting a flag the user explicitly wished to clear. If so, we
return an error because we couldn't fulfil the configuration settings.
* Revert 97f5ee4e6a ("Only remount if requested flags differ from
current"), as missing flags do not mean we can skip MS_REMOUNT (in
fact, missing flags are how you indicate a flag needs to be cleared
with mount(2)). The original purpose of the patch was to fix the
userns issue, but as mentioned above the correct mechanism is to do a
fallback mount that copies the lockable flags from statfs(2).
* Improve handling of atime in the fallback case by:
- Correctly handling the returned flags in statfs(2).
- Implement the MNT_LOCK_ATIME checks in our code to ensure we
produce errors rather than silently producing incorrect atime
mounts.
* Improve the tests so we correctly detect all of these contingencies,
including a general "bind-mount atime handling" test to ensure that
the behaviour described here is accurate.
This change also inlines the remount() function -- it was only ever used
for the bind-mount remount case, and its behaviour is very bind-mount
specific.
[1]: https://github.com/util-linux/util-linux/issues/2433
Reverts: 97f5ee4e6a ("Only remount if requested flags differ from current")
Fixes: 50105de1d8 ("Fix failure with rw bind mount of a ro fuse")
Fixes: da780e4d27 ("Fix bind mounts of filesystems with certain options set")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
The runtime spec now allows relative mount dst paths, so remove the
comment saying we will switch this to an error later and change the
error messages to reflect that.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
Burstable CFS controller is introduced in Linux 5.14. This helps with
parallel workloads that might be bursty. They can get throttled even
when their average utilization is under quota. And they may be latency
sensitive at the same time so that throttling them is undesired.
This feature borrows time now against the future underrun, at the cost
of increased interference against the other system users, by introducing
cfs_burst_us into CFS bandwidth control to enact the cap on unused
bandwidth accumulation, which will then used additionally for burst.
The patch adds the support/control for CFS bandwidth burst.
runtime-spec: https://github.com/opencontainers/runtime-spec/pull/1120
Co-authored-by: Akihiro Suda <suda.kyoto@gmail.com>
Co-authored-by: Nadeshiko Manju <me@manjusaka.me>
Signed-off-by: Kailun Qin <kailun.qin@intel.com>
When a hook has failed, the error message looks like this:
> error running hook: error running hook #1: exit status 1, stdout: ...
The two problems here are:
1. it is impossible to know what kind of hook it was;
2. "error running hook" stuttering;
Change that to
> error running createContainer hook #1: exit status 1, stdout: ...
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In the runc state JSON we always use snake_case. This is a no-op change,
but it will cause any existing container state files to be incorrectly
parsed. Luckily, commit fbf183c6f8 ("Add uid and gid mappings to
mounts") has never been in a runc release so we can change this before a
1.2.z release.
Fixes: fbf183c6f8 ("Add uid and gid mappings to mounts")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
These are not exhaustive, but at least confirm that the feature is not
obviously broken (we correctly set the time offsets).
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Fix up a few things that were flagged in the review of the original
timens PR, namely around error handling and validation.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
This reverts commit 881e92a3fd and adjust
the code so the idmap validations are strict.
We now only throw a warning and the container is started just fine.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
"time" namespace was introduced in Linux v5.6
support new time namespace to set boottime and monotonic time offset
Example runtime spec
"timeOffsets": {
"monotonic": {
"secs": 172800,
"nanosecs": 0
},
"boottime": {
"secs": 604800,
"nanosecs": 0
}
}
Signed-off-by: Chethan Suresh <chethan.suresh@sony.com>
Currently bind mounts of filesystems with nodev, nosuid, noexec,
noatime, relatime, strictatime, nodiratime options set fail in rootless
mode if the same options are not set for the bind mount.
For ro filesystems this was resolved by #2570 by remounting again
with ro set.
Follow the same approach for nodev, nosuid, noexec, noatime, relatime,
strictatime, nodiratime but allow to revert back to the old behaviour
via the new `--no-mount-fallback` command line option.
Add a testcase to verify that bind mounts of filesystems with nodev,
nosuid, noexec, noatime options set work in rootless mode.
Add a testcase that mounts a nodev, nosuid, noexec, noatime filesystem
with a ro flag.
Add two further testcases that ensure that the above testcases would
fail if the `--no-mount-fallback` command line option is set.
* contrib/completions/bash/runc:
Add `--no-mount-fallback` command line option for bash completion.
* create.go:
Add `--no-mount-fallback` command line option.
* restore.go:
Add `--no-mount-fallback` command line option.
* run.go:
Add `--no-mount-fallback` command line option.
* libcontainer/configs/config.go:
Add `NoMountFallback` field to the `Config` struct to store
the command line option value.
* libcontainer/specconv/spec_linux.go:
Add `NoMountFallback` field to the `CreateOpts` struct to store
the command line option value and store it in the libcontainer
config.
* utils_linux.go:
Store the command line option value in the `CreateOpts` struct.
* libcontainer/rootfs_linux.go:
In case that `--no-mount-fallback` is not set try to remount the
bind filesystem again with the options nodev, nosuid, noexec,
noatime, relatime, strictatime or nodiratime if they are set on
the source filesystem.
* tests/integration/mounts_sshfs.bats:
Add testcases and rework sshfs setup to allow specifying
different mount options depending on the test case.
Signed-off-by: Ruediger Pluem <ruediger.pluem@vodafone.com>
This was a warning already and it was requested to make this an error
while we will add validation of idmap mounts:
https://github.com/opencontainers/runc/pull/3717#discussion_r1154705318
I've also tested a k8s cluster and the config.json generated by
containerd didn't use any relative paths. I tested one pod, so it was
definitely not an extensive test.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
When someone is using libcontainer to start and kill containers from a
long lived process (i.e. the same process creates and removes the
container), initProcess.wait method is used, which has a kludge to work
around killing containers that do not have their own PID namespace.
The code that checks for own PID namespace is not entirely correct.
To be exact, it does not set sharePidns flag when the host/caller PID
namespace is implicitly used. As a result, the above mentioned kludge
does not work.
Fix the issue, add a test case (which fails without the fix).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
configs package can no longer be built on non-Linux OS, such as Darwin.
When running `GOOS=darwin go build` on the packge, we had the following
errors:
```
./configs/mount.go:34:16: undefined: unix.MountAttr
./configs/mount.go:47:22: undefined: unix.MS_BIND
```
Let's ensure that the linux specific bits are handled in mount_linux.go,
and introduce a _unsupported file, similar to how cgroups file is
handled within the package. This'll facilitate utilization of the pkg
for other projects that care about Darwin.
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
This is aimed at solving the problem of cgroup v2 memory controller
behavior which is not compatible with that of cgroup v1.
In cgroup v1, if the new memory limit being set is lower than the
current usage, setting the new limit fails.
In cgroup v2, same operation succeeds, and the container is OOM killed.
Introduce a new setting, memory.checkBeforeUpdate, and use it to mimic
cgroup v1 behavior.
Note that this is not 100% reliable because of TOCTOU, but this is the
best we can do.
Add some test cases.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Since Go 1.19, godoc recognizes lists, code blocks, headings etc. It
also reformats the sources making it more apparent that these features
are used.
Fix a few places where it misinterpreted the formatting (such as
indented vs unindented), and format the result using the gofumpt
from HEAD, which already incorporates gofmt 1.19 changes.
Some more fixes (and enhancements) might be required.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
List of seccomp flags defined in runtime-spec:
* SECCOMP_FILTER_FLAG_TSYNC
* SECCOMP_FILTER_FLAG_LOG
* SECCOMP_FILTER_FLAG_SPEC_ALLOW
Note that runc does not apply SECCOMP_FILTER_FLAG_TSYNC. It does not
make sense to apply the seccomp filter on only one thread; other threads
will be terminated after exec anyway.
See similar commit in crun:
fefabffa28
Note that SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV (introduced by
https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?id=c2aa2dfef243
in Linux 5.19-rc1) is not added yet because Linux 5.19 is not released
yet.
Signed-off-by: Alban Crequy <albancrequy@microsoft.com>
the struct blockIODevice is used in an exported struct but it is not itself exported rendering that type inaccessible to
outside projects
Signed-off-by: cdoern <cdoern@redhat.com>
1. Fix function docs. In particular, remove the part
which is not true ("verifies that the user isn't trying to set up any
mounts they don't have the rights to do"), and fix the part that
says "that doesn't resolve to root" (which is no longer true since
commit d8b669400a).
2. Replace fmt.Sscanf (which is slow and does lots of allocations)
with strings.TrimPrefix and strconv.Atoi.
3. Add a benchmark for rootlessEUIDMount. Comparing the old and the new
implementations:
name old time/op new time/op delta
RootlessEUIDMount-4 1.01µs ± 2% 0.16µs ± 1% -84.15% (p=0.008 n=5+5)
name old alloc/op new alloc/op delta
RootlessEUIDMount-4 224B ± 0% 80B ± 0% -64.29% (p=0.008 n=5+5)
name old allocs/op new allocs/op delta
RootlessEUIDMount-4 7.00 ± 0% 1.00 ± 0% -85.71% (p=0.008 n=5+5)
Note this code is already tested (in rootless_test.go).
Fixes: d8b669400a
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Don't require CAT or MBA because we don't detect those correctly (we
don't support L2 or L3DATA/L3CODE for example, and in the future
possibly even more). With plain "ClosId mode" we don't really care: we
assign the container to a pre-configured CLOS without trying to do
anything smarter.
Moreover, this was a duplicate/redundant check anyway, as for CAT and
MBA there is another specific sanity check that is done if L3 or MB
is specified in the config.
Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
We only have one implementation of config validator, which is always
used. It makes no sense to have Validator interface.
Having validate.Validator field in Factory does not make sense for all
the same reasons.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Those were added by commit 59c5c3ac0 back in Apr 2015, but AFAICS were
never used and are obsoleted by more generic container hooks (initially
added by commit 05567f2c94 in Sep 2015).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The new mount option "rro" makes the mount point recursively read-only,
by calling `mount_setattr(2)` with `MOUNT_ATTR_RDONLY` and `AT_RECURSIVE`.
https://man7.org/linux/man-pages/man2/mount_setattr.2.html
Requires kernel >= 5.12.
The "rro" option string conforms to the proposal in util-linux/util-linux Issue 1501.
Fix issue 2823
Similary, this commit also adds the following mount options:
- rrw
- r[no]{suid,dev,exec,relatime,atime,strictatime,diratime,symfollow}
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Delegating cgroups to the container enables more complex workloads,
including systemd-based workloads. The OCI runtime-spec was
recently updated to explicitly admit such delegation, through
specification of cgroup ownership semantics:
https://github.com/opencontainers/runtime-spec/pull/1123
Pursuant to the updated OCI runtime-spec, change the ownership of
the container's cgroup directory and particular files therein, when
using cgroups v2 and when the cgroupfs is to be mounted read/write.
As a result of this change, systemd workloads can run in isolated
user namespaces on OpenShift when the sandbox's cgroupfs is mounted
read/write.
It might be possible to implement this feature in other cgroup
managers, but that work is deferred.
Signed-off-by: Fraser Tweedale <ftweedal@redhat.com>