ridmap indicates that the id mapping should be applied recursively (only
really relevant for rbind mount entries), and idmap indicates that it
should not be applied recursively (the default). If no mappings are
specified for the mount, we use the userns configuration of the
container. This matches the behaviour in the currently-unreleased
runtime-spec.
This includes a minor change to the state.json serialisation format, but
because there has been no released version of runc with commit
fbf183c6f8 ("Add uid and gid mappings to mounts"), we can safely make
this change without affecting running containers. Doing it this way
makes it much easier to handle m.IsIDMapped() and indicating that a
mapping has been specified.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
With the rework of nsexec.c to handle MOUNT_ATTR_IDMAP in our Go code we
can now handle arbitrary mappings without issue, so remove the primary
artificial limit of mappings (must use the same mapping as the
container's userns) and add some tests.
We still only support idmap mounts for bind-mounts because configuring
mappings for other filesystems would require switching our entire mount
machinery to the new mount API. The current design would easily allow
for this but we would need to convert new mount options entirely to the
fsopen/fsconfig/fsmount API. This can be done in the future.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
For userns and timens, the mappings (and offsets, respectively) cannot
be changed after the namespace is first configured. Thus, configuring a
container with a namespace path to join means that you cannot also
provide configuration for said namespace. Previously we would silently
ignore the configuration (and just join the provided path), but we
really should be returning an error (especially when you consider that
the configuration userns mappings are used quite a bit in runc with the
assumption that they are the correct mapping for the userns -- but in
this case they are not).
In the case of userns, the mappings are also required if you _do not_
specify a path, while in the case of the time namespace you can have a
container with a timens but no mappings specified.
It should be noted that the case checking that the user has not
specified a userns path and a userns mapping needs to be handled in
specconv (as opposed to the configuration validator) because with this
patchset we now cache the mappings of path-based userns configurations
and thus the validator can't be sure whether the mapping is a cached
mapping or a user-specified one. So we do the validation in specconv,
and thus the test for this needs to be an integration test.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Bind-mounts cannot have any filesystem-specific "data" arguments,
because the kernel ignores the data argument for MS_BIND and
MS_BIND|MS_REMOUNT and we cannot safely try to override the flags
because those would affect mounts on the host (these flags affect the
superblock).
It should be noted that there are cases where the filesystem-specified
flags will also be ignored for non-bind-mounts but those are kernel
quirks and there's no real way for us to work around them. And users
wouldn't get any real benefit from us adding guardrails to existing
kernel behaviour.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
These are not exhaustive, but at least confirm that the feature is not
obviously broken (we correctly set the time offsets).
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
This reverts commit 881e92a3fd and adjust
the code so the idmap validations are strict.
We now only throw a warning and the container is started just fine.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
This was a warning already and it was requested to make this an error
while we will add validation of idmap mounts:
https://github.com/opencontainers/runc/pull/3717#discussion_r1154705318
I've also tested a k8s cluster and the config.json generated by
containerd didn't use any relative paths. I tested one pod, so it was
definitely not an extensive test.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
We only have one implementation of config validator, which is always
used. It makes no sense to have Validator interface.
Having validate.Validator field in Factory does not make sense for all
the same reasons.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Runtime spec says:
> sysctl (object, OPTIONAL) allows kernel parameters to be modified at
> runtime for the container. For more information, see the sysctl(8)
> man page.
and sysctl(8) says:
> variable
> The name of a key to read from. An example is
> kernel.ostype. The '/' separator is also accepted in place of a '.'.
Apparently, runc config validator do not support sysctls with / as a
separator. Fortunately this is a one-line fix.
Add some more test data where / is used as a separator.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Replace ioutil.TempDir (mostly) with t.TempDir, which require no
explicit cleanup.
While at it, fix incorrect usage of os.ModePerm in libcontainer/intelrdt
test. This is supposed to be a mask, not mode bits.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commits 1f1e91b1a0 and 2192670a24
added validation for mountpoints to be an absolute path, to match the OCI
specs.
Unfortunately, the old behavior (accepting the path to be a relative path)
has been around for a long time, and although "not according to the spec",
various higher level runtimes rely on this behavior.
While higher level runtime have been updated to address this requirement,
there will be a transition period before all runtimes are updated to carry
these fixes.
This patch relaxes the validation, to generate a WARNING instead of failing,
allowing runtimes to update (but allowing them to update runc to the current
version, which includes security fixes).
We can remove this exception in a future patch release.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Previously we only tested failures, which causes us to miss issues where
setting sysctls would *always* fail.
Signed-off-by: Aleksa Sarai <asarai@suse.de>