ridmap indicates that the id mapping should be applied recursively (only
really relevant for rbind mount entries), and idmap indicates that it
should not be applied recursively (the default). If no mappings are
specified for the mount, we use the userns configuration of the
container. This matches the behaviour in the currently-unreleased
runtime-spec.
This includes a minor change to the state.json serialisation format, but
because there has been no released version of runc with commit
fbf183c6f8 ("Add uid and gid mappings to mounts"), we can safely make
this change without affecting running containers. Doing it this way
makes it much easier to handle m.IsIDMapped() and to indicate that a
mapping has been specified.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
With the rework of nsexec.c to handle MOUNT_ATTR_IDMAP in our Go code we
can now handle arbitrary mappings without issue, so remove the primary
artificial limit of mappings (must use the same mapping as the
container's userns) and add some tests.
We still only support idmap mounts for bind-mounts because configuring
mappings for other filesystems would require switching our entire mount
machinery to the new mount API. The current design would easily allow
for this but we would need to convert new mount options entirely to the
fsopen/fsconfig/fsmount API. This can be done in the future.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
For userns and timens, the mappings (and offsets, respectively) cannot
be changed after the namespace is first configured. Thus, configuring a
container with a namespace path to join means that you cannot also
provide configuration for said namespace. Previously we would silently
ignore the configuration (and just join the provided path), but we
really should be returning an error (especially when you consider that
the configuration userns mappings are used quite a bit in runc with the
assumption that they are the correct mapping for the userns -- but in
this case they are not).
In the case of userns, the mappings are also required if you _do not_
specify a path, while in the case of the time namespace you can have a
container with a timens but no mappings specified.
It should be noted that the case checking that the user has not
specified a userns path and a userns mapping needs to be handled in
specconv (as opposed to the configuration validator) because with this
patchset we now cache the mappings of path-based userns configurations
and thus the validator can't be sure whether the mapping is a cached
mapping or a user-specified one. So we do the validation in specconv,
and thus the test for this needs to be an integration test.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Our handling for namespace paths with user namespaces has been broken
for a long time. In particular, the need to parse /proc/self/*id_map in
quite a few places meant that we would treat userns configurations that
had a namespace path as if they were a userns configuration without
mappings, resulting in errors.
The primary issue was down to the id translation helper functions, which
could only handle configurations that had explicit mappings. Obviously,
when joining a user namespace we need to map the ids but figuring out
the correct mapping is non-trivial in comparison.
In order to get the mapping, you need to read /proc/<pid>/*id_map of a
process inside the userns -- while most userns paths will be of the form
/proc/<pid>/ns/user (and we have a fast-path for this case), this is not
guaranteed and thus it is necessary to spawn a process inside the
container and read its /proc/<pid>/*id_map files in the general case.
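The fast-path and the id_map parsing could look roughly like the sketch
below. The helper names (fastPathPid, parseIDMap) and the IDMap type are
made up for illustration. The only thing assumed about the kernel
interface is that each id_map line holds three whitespace-separated
numbers (ID inside the namespace, ID outside, length):

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// IDMap is a hypothetical stand-in for one line of an id_map file.
type IDMap struct {
	ContainerID, HostID, Size int64
}

var procUsernsPath = regexp.MustCompile(`^/proc/([0-9]+)/ns/user$`)

// fastPathPid returns the pid if path has the form /proc/<pid>/ns/user,
// in which case /proc/<pid>/*id_map can be read directly.
func fastPathPid(path string) (string, bool) {
	m := procUsernsPath.FindStringSubmatch(path)
	if m == nil {
		return "", false
	}
	return m[1], true
}

// parseIDMap parses the contents of a uid_map or gid_map file.
func parseIDMap(data string) ([]IDMap, error) {
	var maps []IDMap
	for _, line := range strings.Split(data, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 0 {
			continue // skip blank lines
		}
		if len(fields) != 3 {
			return nil, fmt.Errorf("malformed id_map line %q", line)
		}
		var v [3]int64
		for i, f := range fields {
			n, err := strconv.ParseInt(f, 10, 64)
			if err != nil {
				return nil, fmt.Errorf("malformed id_map line %q: %w", line, err)
			}
			v[i] = n
		}
		maps = append(maps, IDMap{ContainerID: v[0], HostID: v[1], Size: v[2]})
	}
	return maps, nil
}

func main() {
	fmt.Println(fastPathPid("/proc/1234/ns/user"))
	fmt.Println(parseIDMap("0 1000 1\n1 100000 65536\n"))
}
```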
As Go does not allow us to spawn a subprocess into a target userns,
we have to use CGo to fork a sub-process which does the setns(2). To be
honest, this is a little dodgy in regards to POSIX signal-safety(7) but
since we do no allocations and we are executing in the forked context
from a Go program (not a C program), it should be okay. The other
alternative would be to do an expensive re-exec (a-la nsexec which would
make several other bits of runc more complicated), or to use nsenter(1)
which might not exist on the system and is less than ideal.
Because we need to logically remap users quite a few times in runc
(including in "runc init", where joining the namespace is not feasible),
we cache the mapping inside the libcontainer config struct. A future
patch will make sure that we stop allowing invalid user configurations
where a mapping is specified as well as a userns path to join.
Finally, add an integration test to make sure we don't regress this again.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Bind-mounts cannot have any filesystem-specific "data" arguments,
because the kernel ignores the data argument for MS_BIND and
MS_BIND|MS_REMOUNT and we cannot safely try to override the flags
because those would affect mounts on the host (these flags affect the
superblock).
It should be noted that there are cases where the filesystem-specified
flags will also be ignored for non-bind-mounts but those are kernel
quirks and there's no real way for us to work around them. And users
wouldn't get any real benefit from us adding guardrails to existing
kernel behaviour.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
The runtime spec now allows relative mount dst paths, so remove the
comment saying we will switch this to an error later and change the
error messages to reflect that.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
These are not exhaustive, but at least confirm that the feature is not
obviously broken (we correctly set the time offsets).
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Fix up a few things that were flagged in the review of the original
timens PR, namely around error handling and validation.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
This reverts commit 881e92a3fd and adjusts
the code so the idmap validations are strict.
We now only throw a warning and the container is started just fine.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
"time" namespace was introduced in Linux v5.6
Support the new time namespace to set boottime and monotonic time offsets.
Example runtime spec
    "timeOffsets": {
        "monotonic": {
            "secs": 172800,
            "nanosecs": 0
        },
        "boottime": {
            "secs": 604800,
            "nanosecs": 0
        }
    }
Signed-off-by: Chethan Suresh <chethan.suresh@sony.com>
This was already a warning, and it was requested that we make this an
error while adding validation of idmap mounts:
https://github.com/opencontainers/runc/pull/3717#discussion_r1154705318
I've also tested a k8s cluster and the config.json generated by
containerd didn't use any relative paths. I tested one pod, so it was
definitely not an extensive test.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
Since Go 1.19, godoc recognizes lists, code blocks, headings etc. It
also reformats the sources making it more apparent that these features
are used.
Fix a few places where it misinterpreted the formatting (such as
indented vs unindented), and format the result using the gofumpt
from HEAD, which already incorporates gofmt 1.19 changes.
Some more fixes (and enhancements) might be required.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. Fix function docs. In particular, remove the part
which is not true ("verifies that the user isn't trying to set up any
mounts they don't have the rights to do"), and fix the part that
says "that doesn't resolve to root" (which is no longer true since
commit d8b669400a).
2. Replace fmt.Sscanf (which is slow and does lots of allocations)
with strings.TrimPrefix and strconv.Atoi.
3. Add a benchmark for rootlessEUIDMount. Comparing the old and the new
implementations:
    name                 old time/op    new time/op    delta
    RootlessEUIDMount-4    1.01µs ± 2%    0.16µs ± 1%  -84.15%  (p=0.008 n=5+5)
    name                 old alloc/op   new alloc/op   delta
    RootlessEUIDMount-4      224B ± 0%       80B ± 0%  -64.29%  (p=0.008 n=5+5)
    name                 old allocs/op  new allocs/op  delta
    RootlessEUIDMount-4      7.00 ± 0%      1.00 ± 0%  -85.71%  (p=0.008 n=5+5)
Note this code is already tested (in rootless_test.go).
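For reference, the strings.TrimPrefix plus strconv.Atoi pattern from
item 2 can be sketched as follows. parseIDOpt is a hypothetical helper
written for this note, not the function from the patch:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseIDOpt extracts the numeric value from an option such as
// "uid=1000". ok is false if opt does not start with prefix.
func parseIDOpt(opt, prefix string) (id int, ok bool, err error) {
	str := strings.TrimPrefix(opt, prefix)
	if len(str) == len(opt) {
		// TrimPrefix returned its input unchanged: prefix not present.
		return 0, false, nil
	}
	id, err = strconv.Atoi(str)
	if err != nil {
		return 0, true, fmt.Errorf("invalid value in option %q: %w", opt, err)
	}
	return id, true, nil
}

func main() {
	for _, opt := range strings.Split("ro,uid=1000,gid=abc", ",") {
		id, ok, err := parseIDOpt(opt, "uid=")
		fmt.Println(opt, id, ok, err)
	}
}
```

Unlike fmt.Sscanf, this needs no format parsing, which is where most of
the time and allocations went.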
Fixes: d8b669400a
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Don't require CAT or MBA because we don't detect those correctly (we
don't support L2 or L3DATA/L3CODE for example, and in the future
possibly even more). With plain "ClosId mode" we don't really care: we
assign the container to a pre-configured CLOS without trying to do
anything smarter.
Moreover, this was a duplicate/redundant check anyway, as for CAT and
MBA there is another specific sanity check that is done if L3 or MB
is specified in the config.
Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
We only have one implementation of the config validator, which is always
used. It makes no sense to have a Validator interface.
Having a validate.Validator field in Factory does not make sense for all
the same reasons.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Runtime spec says:
> sysctl (object, OPTIONAL) allows kernel parameters to be modified at
> runtime for the container. For more information, see the sysctl(8)
> man page.
and sysctl(8) says:
> variable
> The name of a key to read from. An example is
> kernel.ostype. The '/' separator is also accepted in place of a '.'.
Apparently, the runc config validator does not support sysctls with / as a
separator. Fortunately this is a one-line fix.
Add some more test data where / is used as a separator.
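A minimal sketch of the idea, assuming the validator only needs the
dotted form to do its prefix checks. normalizeSysctl and isNetSysctl are
hypothetical names, and this sketch does not deal with keys whose
components themselves contain dots:

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeSysctl rewrites the '/' separator form into the '.' form so
// that later checks only need to handle one spelling.
func normalizeSysctl(name string) string {
	return strings.ReplaceAll(name, "/", ".")
}

// isNetSysctl reports whether name is a net.* sysctl in either spelling.
func isNetSysctl(name string) bool {
	return strings.HasPrefix(normalizeSysctl(name), "net.")
}

func main() {
	for _, name := range []string{"net.ipv4.ip_forward", "net/ipv4/ip_forward", "kernel/ostype"} {
		fmt.Println(name, normalizeSysctl(name), isNetSysctl(name))
	}
}
```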
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
All the errors returned from Validate should tell about a configuration
error. Some were lacking a context, so add it.
While at it, stop misusing fmt.Errorf and logrus.Warnf where the
arguments do not contain %-style formatting.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Handle ClosID parameter of IntelRdt. Makes it possible to use
pre-configured classes/ClosIDs and avoid running out of available IDs
which easily happens with per-container classes.
Remove validator checks for empty L3CacheSchema and MemBwSchema fields
in order to be able to leave them empty, and only specify ClosID for
a pre-configured class.
Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
Replace ioutil.TempDir (mostly) with t.TempDir, which requires no
explicit cleanup.
While at it, fix incorrect usage of os.ModePerm in libcontainer/intelrdt
test. This is supposed to be a mask, not mode bits.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This should result in no change when the error is printed, but make the
errors returned unwrappable, meaning errors.As and errors.Is will work.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commits 1f1e91b1a0 and 2192670a24
added validation for mountpoints to be an absolute path, to match the OCI
specs.
Unfortunately, the old behavior (accepting the path to be a relative path)
has been around for a long time, and although "not according to the spec",
various higher level runtimes rely on this behavior.
While higher level runtimes have been updated to address this requirement,
there will be a transition period before all runtimes are updated to carry
these fixes.
This patch relaxes the validation to generate a WARNING instead of failing,
giving runtimes time to update (while still allowing them to update runc to
the current version, which includes security fixes).
We can remove this exception in a future patch release.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
gofumpt (mvdan.cc/gofumpt) is a fork of gofmt with stricter rules.
Brought to you by
git ls-files \*.go | grep -v ^vendor/ | xargs gofumpt -s -w
Looking at the diff, all these changes make sense.
Also, replace gofmt with gofumpt in golangci.yml.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Add some minimal validation for cgroups. The following checks
are implemented:
- cgroup name and/or prefix (or path) is set;
- for cgroup v1, unified resources are not set;
- for cgroup v2, if memorySwap is set, memory is also set,
and memorySwap > memory.
This makes some invalid configurations fail earlier (before runc init
is started), which is better.
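The three checks can be sketched like this. The Cgroup and Resources
types are simplified stand-ins, and treating "set" as "greater than
zero" is an assumption of this sketch (special values such as -1 are
not handled):

```go
package main

import (
	"errors"
	"fmt"
)

// Resources and Cgroup are hypothetical, trimmed-down config types.
type Resources struct {
	Memory     int64
	MemorySwap int64
	Unified    map[string]string
}

type Cgroup struct {
	Name, Parent, Path string
	Resources          Resources
}

// validateCgroup applies the minimal checks listed above. isV2 says
// whether the host uses cgroup v2.
func validateCgroup(c Cgroup, isV2 bool) error {
	if c.Name == "" && c.Parent == "" && c.Path == "" {
		return errors.New("cgroup: either name/parent or path must be set")
	}
	r := c.Resources
	if !isV2 {
		if len(r.Unified) > 0 {
			return errors.New("cgroup: unified resources require cgroup v2")
		}
		return nil
	}
	// Assumption: a value > 0 means "set".
	if r.MemorySwap > 0 {
		if r.Memory <= 0 {
			return errors.New("cgroup: memorySwap is set but memory is not")
		}
		if r.MemorySwap <= r.Memory {
			return errors.New("cgroup: memorySwap must be greater than memory")
		}
	}
	return nil
}

func main() {
	c := Cgroup{Path: "/mycontainer", Resources: Resources{Memory: 100, MemorySwap: 50}}
	fmt.Println(validateCgroup(c, true))
}
```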
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In case many net.* sysctls are provided, and we're not running
in the host netns, the function keeps repeating the isNetNS check
for every such sysctl. This is a waste of resources.
Do the isNetNS check only once, and only if needed.
Note that using sync.Once() is not really needed here; we could
have used a boolean variable to skip the repeated check, but
it looks more idiomatic that way.
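The pattern looks roughly like the sketch below. validateSysctls and
the isHostNetNS callback are hypothetical, standing in for the real
check:

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// validateSysctls runs the (potentially expensive) isHostNetNS check at
// most once, and only if at least one net.* sysctl is present.
func validateSysctls(names []string, isHostNetNS func() (bool, error)) error {
	var (
		once    sync.Once
		host    bool
		hostErr error
	)
	for _, name := range names {
		if !strings.HasPrefix(name, "net.") {
			continue
		}
		once.Do(func() { host, hostErr = isHostNetNS() })
		if hostErr != nil {
			return hostErr
		}
		if host {
			return fmt.Errorf("sysctl %q is not allowed in the host network namespace", name)
		}
	}
	return nil
}

func main() {
	calls := 0
	probe := func() (bool, error) { calls++; return false, nil }
	err := validateSysctls([]string{"net.ipv4.ip_forward", "net.core.somaxconn"}, probe)
	fmt.Println(err, calls)
}
```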
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
If an nsfs mount (such as /run/docker/netns/xxxx) is provided as
the netns path, the current way of determining whether the path
refers to the host netns does not work.
The proper way to check is to do stat(2) and compare dev_t and
inode fields, which is what this commit does.
This is a minimal fix which does not try to optimize repeated
check in case more than one net.* sysctl is given and there is
no error.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The cgroup namespace can be configured in `config.json` like other
namespaces. Here is an example:
```
"namespaces": [
    {
        "type": "pid"
    },
    {
        "type": "network"
    },
    {
        "type": "ipc"
    },
    {
        "type": "uts"
    },
    {
        "type": "mount"
    },
    {
        "type": "cgroup"
    }
],
```
Note that if you want to run a container that shares a cgroup ns with
another container, it is strongly recommended that you set a proper
`CgroupsPath` for both containers (the second container's cgroup path
must be a subdirectory of the first one's). Otherwise there might be
some unexpected results.
Signed-off-by: Yuanhong Peng <pengyuanhong@huawei.com>
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Memory Bandwidth Allocation (MBA) is a resource allocation sub-feature
of Intel Resource Director Technology (RDT) which is supported on some
Intel Xeon platforms. Intel RDT/MBA provides indirect and approximate
throttle over memory bandwidth for the software. A user controls the
resource by indicating the percentage of maximum memory bandwidth.
Hardware details of Intel RDT/MBA can be found in section 17.18 of
Intel Software Developer Manual:
https://software.intel.com/en-us/articles/intel-sdm
In Linux kernel 4.12 and newer, Intel RDT/MBA is enabled by the kernel
config option CONFIG_INTEL_RDT. If the hardware supports it, the CPU
flags `rdt_a` and `mba` will be set in /proc/cpuinfo.
Intel RDT "resource control" filesystem hierarchy:
mount -t resctrl resctrl /sys/fs/resctrl
tree /sys/fs/resctrl
/sys/fs/resctrl/
|-- info
|   |-- L3
|   |   |-- cbm_mask
|   |   |-- min_cbm_bits
|   |   |-- num_closids
|   |-- MB
|       |-- bandwidth_gran
|       |-- delay_linear
|       |-- min_bandwidth
|       |-- num_closids
|-- ...
|-- schemata
|-- tasks
|-- <container_id>
    |-- ...
    |-- schemata
    |-- tasks
For MBA support for `runc`, we will reuse the infrastructure and code
base of Intel RDT/CAT which was implemented in #1279. We could also make
use of `tasks` and `schemata` configuration for memory bandwidth
resource constraints.
The file `tasks` has a list of tasks that belong to this group (e.g.,
the "<container_id>" group). Tasks can be added to a group by writing the
task ID to the "tasks" file (which will automatically remove them from
the previous group to which they belonged). New tasks created by
fork(2) and clone(2) are added to the same group as their parent.
The file `schemata` has a list of all the resources available to this
group. Each resource (L3 cache, memory bandwidth) has its own line and
format.
Memory bandwidth schema:
It has allocation values for memory bandwidth on each socket, which
contains L3 cache id and memory bandwidth percentage.
Format: "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..."
The minimum bandwidth percentage value for each CPU model is predefined
and can be looked up through "info/MB/min_bandwidth". The bandwidth
granularity that is allocated is also dependent on the CPU model and
can be looked up at "info/MB/bandwidth_gran". The available bandwidth
control steps are: min_bw + N * bw_gran. Intermediate values are
rounded to the next control step available on the hardware.
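To make the control-step arithmetic concrete, here is a small sketch.
It reads "rounded to the next control step" as rounding up to the
nearest min_bw + N * bw_gran, which is an interpretation and not taken
from the kernel source. nextControlStep is a hypothetical helper, and
range validation is left out:

```go
package main

import "fmt"

// nextControlStep rounds a requested bandwidth percentage up to the
// nearest value of the form minBW + N*gran (N >= 0).
// Assumption of this sketch: requests below minBW are raised to minBW.
func nextControlStep(requested, minBW, gran int) int {
	if requested <= minBW {
		return minBW
	}
	steps := (requested - minBW + gran - 1) / gran // ceiling division
	return minBW + steps*gran
}

func main() {
	// With min_bandwidth = 10 and bandwidth_gran = 10, the available
	// steps are 10, 20, 30, ...
	for _, req := range []int{10, 20, 25, 61} {
		fmt.Println(req, nextControlStep(req, 10, 10))
	}
}
```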
For more information about Intel RDT kernel interface:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt
An example for runc:
Consider a two-socket machine with two L3 caches where the minimum
memory bandwidth is 10% with a memory bandwidth granularity of 10%.
Tasks inside the container may use a maximum memory bandwidth of 20%
on socket 0 and 70% on socket 1.
    "linux": {
        "intelRdt": {
            "memBwSchema": "MB:0=20;1=70"
        }
    }
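For illustration, a schema string in the "MB:<cache_id>=bandwidth;..."
format can be parsed along these lines. parseMemBwSchema is a
hypothetical helper written for this note and is not part of runc:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMemBwSchema parses a string such as "MB:0=20;1=70" into a map
// from cache id to bandwidth percentage.
func parseMemBwSchema(s string) (map[int]int, error) {
	if !strings.HasPrefix(s, "MB:") {
		return nil, fmt.Errorf("schema %q does not start with \"MB:\"", s)
	}
	out := make(map[int]int)
	for _, entry := range strings.Split(strings.TrimPrefix(s, "MB:"), ";") {
		kv := strings.SplitN(entry, "=", 2)
		if len(kv) != 2 {
			return nil, fmt.Errorf("malformed entry %q", entry)
		}
		id, err := strconv.Atoi(kv[0])
		if err != nil {
			return nil, fmt.Errorf("malformed cache id in %q: %w", entry, err)
		}
		bw, err := strconv.Atoi(kv[1])
		if err != nil {
			return nil, fmt.Errorf("malformed bandwidth in %q: %w", entry, err)
		}
		out[id] = bw
	}
	return out, nil
}

func main() {
	fmt.Println(parseMemBwSchema("MB:0=20;1=70"))
}
```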
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
This PR decomposes `libcontainer/configs.Config.Rootless bool` into `RootlessEUID bool` and
`RootlessCgroups bool`, so as to make "runc-in-userns" more compatible with "rootful" runc.
`RootlessEUID` denotes that runc is being executed as a non-root user (euid != 0) in
the current user namespace. `RootlessEUID` is almost identical to the former `Rootless`
except for the cgroups handling.
`RootlessCgroups` denotes that runc is unlikely to have full access to cgroups.
`RootlessCgroups` is set to false if runc is executed as the root (euid == 0) in the initial namespace.
Otherwise `RootlessCgroups` is set to true.
(Hint: if `RootlessEUID` is true, `RootlessCgroups` becomes true as well)
When runc is executed as the root (euid == 0) in a user namespace (e.g. by Docker-in-LXD, Podman, Usernetes),
`RootlessEUID` is set to false but `RootlessCgroups` is set to true.
So, "runc-in-userns" behaves almost the same as "rootful" runc except that cgroups errors are ignored.
This PR does not have any impact on CLI flags and `state.json`.
Note about CLI:
* Now `runc --rootless=(auto|true|false)` CLI flag is only used for setting `RootlessCgroups`.
* Now `runc spec --rootless` is only required when `RootlessEUID` is set to true.
For runc-in-userns, `runc spec` without `--rootless` should work, when sufficient numbers of
UID/GID are mapped.
Note about `$XDG_RUNTIME_DIR` (e.g. `/run/user/1000`):
* `$XDG_RUNTIME_DIR` is ignored if runc is being executed as the root (euid == 0) in the initial namespace, for backward compatibility.
(`/run/runc` is used)
* If runc is executed as the root (euid == 0) in a user namespace, `$XDG_RUNTIME_DIR` is honored if `$USER != "" && $USER != "root"`.
  This allows unprivileged users to execute runc as the root in userns, without mounting writable `/run/runc`.
Note about `state.json`:
* `rootless` is set to true when `RootlessEUID == true && RootlessCgroups == true`.
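The default derivation of the two flags, and of the `rootless` value in
`state.json`, can be summarised in a short sketch. The function names
are hypothetical, and the `--rootless=(true|false)` override of
`RootlessCgroups` is not modelled:

```go
package main

import "fmt"

// rootlessFlags derives both flags from the effective uid and from
// whether runc runs in the initial user namespace.
func rootlessFlags(euid int, inInitialUserNS bool) (rootlessEUID, rootlessCgroups bool) {
	rootlessEUID = euid != 0
	// Only real root in the initial namespace is assumed to have full
	// access to cgroups.
	rootlessCgroups = !(euid == 0 && inInitialUserNS)
	return rootlessEUID, rootlessCgroups
}

// stateRootless mirrors the rule for the `rootless` field in state.json.
func stateRootless(rootlessEUID, rootlessCgroups bool) bool {
	return rootlessEUID && rootlessCgroups
}

func main() {
	cases := []struct {
		name      string
		euid      int
		initialNS bool
	}{
		{"rootful", 0, true},
		{"runc-in-userns", 0, false},
		{"non-root user", 1000, true},
	}
	for _, c := range cases {
		e, cg := rootlessFlags(c.euid, c.initialNS)
		fmt.Println(c.name, e, cg, stateRootless(e, cg))
	}
}
```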
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
These sysctls are namespaced by CLONE_NEWUTS, and we need to use
"kernel.domainname" if we want users to be able to set an NIS domainname
on Linux. However we disallow "kernel.hostname" because it would
conflict with the "hostname" field and cause confusion (but we include a
helpful message to make it clearer to the user).
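A minimal sketch of the resulting behaviour. checkUTSSysctl is a
hypothetical helper, the error texts are illustrative only, and
requiring a new UTS namespace for "kernel.domainname" is an assumption
drawn from the sysctl being namespaced by CLONE_NEWUTS:

```go
package main

import (
	"errors"
	"fmt"
)

// checkUTSSysctl decides whether a UTS-namespaced sysctl may be set.
func checkUTSSysctl(name string, hasUTSNamespace bool) error {
	switch name {
	case "kernel.domainname":
		// Assumption: without a new UTS namespace this would change
		// the host's domainname, so reject it.
		if !hasUTSNamespace {
			return fmt.Errorf("sysctl %q is not allowed without a new UTS namespace", name)
		}
		return nil
	case "kernel.hostname":
		return errors.New(`sysctl "kernel.hostname" is not allowed: use the "hostname" field instead`)
	}
	return fmt.Errorf("sysctl %q is not a UTS sysctl handled by this sketch", name)
}

func main() {
	fmt.Println(checkUTSSysctl("kernel.domainname", true))
	fmt.Println(checkUTSSysctl("kernel.hostname", true))
}
```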
Signed-off-by: Aleksa Sarai <asarai@suse.de>
When running in a new userNS as root, don't require a mapping to be
present in the configuration file. We are already skipping the test
for a new userns to be present.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>