Commit Graph

162 Commits

Author SHA1 Message Date
Kir Kolyshkin
a56f85f87b libct/*: switch from configs to cgroups
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-12-11 19:08:40 -08:00
Kir Kolyshkin
b1449fd510 libct: use Namespaces.IsPrivate more
In these cases, this is exactly what we want to find out.

Slightly improves performance and readability.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-09-17 22:49:29 -07:00
Sebastiaan van Stijn
30b530ca94 libct/userns: split userns detection from internal userns code
Commit 4316df8b53 isolated RunningInUserNS
to a separate package to make it easier to consume without bringing in
additional dependencies, and with the potential to move it separate in
a similar fashion as libcontainer/user was moved to a separate module
in commit ca32014adb. While RunningInUserNS
is fairly trivial to implement, it (or variants of this utility) is used
in many codebases, and moving to a separate module could consolidate
those implementations, as well as making it easier to consume without
large dependency trees (when being a package as part of a larger code
base).

Commit 1912d5988b and follow-ups introduced
cgo code into the userns package, and code introduced in those commits
are not intended for external use, therefore complicating the potential
of moving the userns package separate.

This commit moves the new code to a separate package; some of this code
was included in v1.1.11 and up, but I could not find external consumers
of `GetUserNamespaceMappings` and `IsSameMapping`. The `Mapping` and
`Handles` types (added in ba0b5e2698) only
exist in main and in non-stable releases (v1.2.0-rc.x), so don't need
an alias / deprecation.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2024-06-30 20:06:30 +02:00
utam0k
bfbd0305ba Add I/O priority
Signed-off-by: utam0k <k0ma@utam0k.jp>
2024-03-30 22:31:54 +09:00
dependabot[bot]
606251ab33 build(deps): bump github.com/opencontainers/runtime-spec
Bumps [github.com/opencontainers/runtime-spec](https://github.com/opencontainers/runtime-spec) from 1.1.1-0.20230823135140-4fec88fd00a4 to 1.2.0.
- [Release notes](https://github.com/opencontainers/runtime-spec/releases)
- [Changelog](https://github.com/opencontainers/runtime-spec/blob/main/ChangeLog)
- [Commits](https://github.com/opencontainers/runtime-spec/commits/v1.2.0)

---
updated-dependencies:
- dependency-name: github.com/opencontainers/runtime-spec
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2024-03-07 14:43:33 +09:00
lfbzhm
371ff9c5e7 Merge pull request #3985 from cyphar/idmap-generic
libcontainer: remove all mount logic from nsexec
2023-12-18 13:10:45 +08:00
Aleksa Sarai
482e56379a configs: make id mappings int64 to better handle 32-bit
Using ints for all of our mapping structures means that a 32-bit binary
errors out when trying to parse /proc/self/*id_map:

  failed to cache mappings for userns: failed to parse uid_map of userns /proc/1/ns/user:
  parsing id map failed: invalid format in line "         0          0 4294967295": integer overflow on token 4294967295

This issue was unearthed by commit 1912d5988b ("*: actually support
joining a userns with a new container") but the underlying issue has
been present since the docker/libcontainer days.

In theory, switching to uint32 (to match the spec) instead of int64
would also work, but keeping everything signed seems much less
error-prone. It's also important to note that a mapping might be too
large for an int on 32-bit, so we detect this during the mapping.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-12-14 12:14:32 +11:00
Aleksa Sarai
3b57e45cbf mount: add support for ridmap and idmap
ridmap indicates that the id mapping should be applied recursively (only
really relevant for rbind mount entries), and idmap indicates that it
should not be applied recursively (the default). If no mappings are
specified for the mount, we use the userns configuration of the
container. This matches the behaviour in the currently-unreleased
runtime-spec.

This includes a minor change to the state.json serialisation format, but
because there has been no released version of runc with commit
fbf183c6f8 ("Add uid and gid mappings to mounts"), we can safely make
this change without affecting running containers. Doing it this way
makes it much easier to handle m.IsIDMapped() and indicating that a
mapping has been specified.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-12-14 11:36:42 +11:00
Aleksa Sarai
7795ca4668 specconv: handle recursive attribute clearing more consistently
If a user specifies a configuration like "rro, rrw", we should have
similar behaviour to "ro, rw" where we clear the previous flags so that
the last specified flag takes precendence.

Fixes: 382eba4354 ("Support recursive mount attrs ("rro", "rnosuid", "rnodev", ...)")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-12-14 11:36:42 +11:00
Aleksa Sarai
ebcef3e651 specconv: temporarily allow userns path and mapping if they match
It turns out that the error added in commit 09822c3da8 ("configs:
disallow ambiguous userns and timens configurations") causes issues with
containerd and CRIO because they pass both userns mappings and a userns
path.

These configurations are broken, but to avoid the regression in this one
case, output a warning to tell the user that the configuration is
incorrect but we will continue to use it if and only if the configured
mappings are identical to the mappings of the provided namespace.

Fixes: 09822c3da8 ("configs: disallow ambiguous userns and timens configurations")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-12-10 20:49:43 +11:00
Aleksa Sarai
09822c3da8 configs: disallow ambiguous userns and timens configurations
For userns and timens, the mappings (and offsets, respectively) cannot
be changed after the namespace is first configured. Thus, configuring a
container with a namespace path to join means that you cannot also
provide configuration for said namespace. Previously we would silently
ignore the configuration (and just join the provided path), but we
really should be returning an error (especially when you consider that
the configuration userns mappings are used quite a bit in runc with the
assumption that they are the correct mapping for the userns -- but in
this case they are not).

In the case of userns, the mappings are also required if you _do not_
specify a path, while in the case of the time namespace you can have a
container with a timens but no mappings specified.

It should be noted that the case checking that the user has not
specified a userns path and a userns mapping needs to be handled in
specconv (as opposed to the configuration validator) because with this
patchset we now cache the mappings of path-based userns configurations
and thus the validator can't be sure whether the mapping is a cached
mapping or a user-specified one. So we do the validation in specconv,
and thus the test for this needs to be an integration test.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-12-05 17:46:09 +11:00
Aleksa Sarai
1912d5988b *: actually support joining a userns with a new container
Our handling for name space paths with user namespaces has been broken
for a long time. In particular, the need to parse /proc/self/*id_map in
quite a few places meant that we would treat userns configurations that
had a namespace path as if they were a userns configuration without
mappings, resulting in errors.

The primary issue was down to the id translation helper functions, which
could only handle configurations that had explicit mappings. Obviously,
when joining a user namespace we need to map the ids but figuring out
the correct mapping is non-trivial in comparison.

In order to get the mapping, you need to read /proc/<pid>/*id_map of a
process inside the userns -- while most userns paths will be of the form
/proc/<pid>/ns/user (and we have a fast-path for this case), this is not
guaranteed and thus it is necessary to spawn a process inside the
container and read its /proc/<pid>/*id_map files in the general case.

As Go does not allow us spawn a subprocess into a target userns,
we have to use CGo to fork a sub-process which does the setns(2). To be
honest, this is a little dodgy in regards to POSIX signal-safety(7) but
since we do no allocations and we are executing in the forked context
from a Go program (not a C program), it should be okay. The other
alternative would be to do an expensive re-exec (a-la nsexec which would
make several other bits of runc more complicated), or to use nsenter(1)
which might not exist on the system and is less than ideal.

Because we need to logically remap users quite a few times in runc
(including in "runc init", where joining the namespace is not feasable),
we cache the mapping inside the libcontainer config struct. A future
patch will make sure that we stop allow invalid user configurations
where a mapping is specified as well as a userns path to join.

Finally, add an integration test to make sure we don't regress this again.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-12-05 17:46:08 +11:00
Zheao.Li
98511bb40e linux: Support setting execution domain via linux personality
carry #3126

Co-authored-by: Aditya R <arajan@redhat.com>
Signed-off-by: Zheao.Li <me@manjusaka.me>
2023-10-27 19:33:37 +08:00
Aleksa Sarai
7c71a22705 rootfs: remove --no-mount-fallback and finally fix MS_REMOUNT
The original reasoning for this option was to avoid having mount options
be overwritten by runc. However, adding command-line arguments has
historically been a bad idea because it forces strict-runc-compatible
OCI runtimes to copy out-of-spec features directly from runc and these
flags are usually quite difficult to enable by users when using runc
through several layers of engines and orchestrators.

A far more preferable solution is to have a heuristic which detects
whether copying the original mount's mount options would override an
explicit mount option specified by the user. In this case, we should
return an error. You only end up in this path in the userns case, if you
have a bind-mount source with locked flags.

During the course of writing this patch, I discovered that several
aspects of our handling of flags for bind-mounts left much to be
desired. We have completely botched the handling of explicitly cleared
flags since commit 97f5ee4e6a ("Only remount if requested flags differ
from current"), with our behaviour only becoming increasingly more weird
with 50105de1d8 ("Fix failure with rw bind mount of a ro fuse") and
da780e4d27 ("Fix bind mounts of filesystems with certain options
set"). In short, we would only clear flags explicitly request by the
user purely by chance, in ways that it really should've been reported to
us by now. The most egregious is that mounts explicitly marked "rw" were
actually mounted "ro" if the bind-mount source was "ro" and no other
special flags were included. In addition, our handling of atime was
completely broken -- mostly due to how subtle the semantics of atime are
on Linux.

Unfortunately, while the runtime-spec requires us to implement
mount(8)'s behaviour, several aspects of the util-linux mount(8)'s
behaviour are broken and thus copying them makes little sense. Since the
runtime-spec behaviour for this case (should mount options for a "bind"
mount use the "mount --bind -o ..." or "mount --bind -o remount,..."
semantics? Is the fallback code we have for userns actually
spec-compliant?) and the mount(8) behaviour (see [1]) are not
well-defined, this commit simply fixes the most obvious aspects of the
behaviour that are broken while keeping the current spirit of the
implementation.

NOTE: The handling of atime in the base case is left for a future PR to
deal with. This means that the atime of the source mount will be
silently left alone unless the fallback path needs to be taken, and any
flags not explicitly set will be cleared in the base case. Whether we
should always be operating as "mount --bind -o remount,..." (where we
default to the original mount source flags) is a topic for a separate PR
and (probably) associated runtime-spec PR.

So, to resolve this:

* We store which flags were explicitly requested to be cleared by the
  user, so that we can detect whether the userns fallback path would end
  up setting a flag the user explicitly wished to clear. If so, we
  return an error because we couldn't fulfil the configuration settings.

* Revert 97f5ee4e6a ("Only remount if requested flags differ from
  current"), as missing flags do not mean we can skip MS_REMOUNT (in
  fact, missing flags are how you indicate a flag needs to be cleared
  with mount(2)). The original purpose of the patch was to fix the
  userns issue, but as mentioned above the correct mechanism is to do a
  fallback mount that copies the lockable flags from statfs(2).

* Improve handling of atime in the fallback case by:
    - Correctly handling the returned flags in statfs(2).
    - Implement the MNT_LOCK_ATIME checks in our code to ensure we
      produce errors rather than silently producing incorrect atime
      mounts.

* Improve the tests so we correctly detect all of these contingencies,
  including a general "bind-mount atime handling" test to ensure that
  the behaviour described here is accurate.

This change also inlines the remount() function -- it was only ever used
for the bind-mount remount case, and its behaviour is very bind-mount
specific.

[1]: https://github.com/util-linux/util-linux/issues/2433

Reverts: 97f5ee4e6a ("Only remount if requested flags differ from current")
Fixes: 50105de1d8 ("Fix failure with rw bind mount of a ro fuse")
Fixes: da780e4d27 ("Fix bind mounts of filesystems with certain options set")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-10-24 17:28:25 +11:00
utam0k
770728e16e Support process.scheduler
Spec: https://github.com/opencontainers/runtime-spec/pull/1188
Fix: https://github.com/opencontainers/runc/issues/3895

Co-authored-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: utam0k <k0ma@utam0k.jp>
Signed-off-by: lifubang <lifubang@acmcoder.com>
2023-10-04 15:53:18 +08:00
Eng Zer Jun
c378602bf6 libct/specconv: remove redundant nil check
From the Go specification:

  "1. For a nil slice, the number of iterations is 0." [1]

Therefore, an additional nil check for before the loop is unnecessary.

[1]: https://go.dev/ref/spec#For_range

Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
2023-09-08 23:40:51 +08:00
Kailun Qin
e1584831b6 libct/cg: add CFS bandwidth burst for CPU
Burstable CFS controller is introduced in Linux 5.14. This helps with
parallel workloads that might be bursty. They can get throttled even
when their average utilization is under quota. And they may be latency
sensitive at the same time so that throttling them is undesired.

This feature borrows time now against the future underrun, at the cost
of increased interference against the other system users, by introducing
cfs_burst_us into CFS bandwidth control to enact the cap on unused
bandwidth accumulation, which will then used additionally for burst.

The patch adds the support/control for CFS bandwidth burst.

runtime-spec: https://github.com/opencontainers/runtime-spec/pull/1120

Co-authored-by: Akihiro Suda <suda.kyoto@gmail.com>
Co-authored-by: Nadeshiko Manju <me@manjusaka.me>
Signed-off-by: Kailun Qin <kailun.qin@intel.com>
2023-09-06 23:23:30 +08:00
Kir Kolyshkin
f62f0bdfbf Remove nolint annotations for unix errno comparisons
golangci-lint v1.54.2 comes with errorlint v1.4.4, which contains
the fix [1] whitelisting all errno comparisons for errors coming from
x/sys/unix.

Thus, these annotations are no longer necessary. Hooray!

[1] https://github.com/polyfloyd/go-errorlint/pull/47
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-08-24 17:28:10 -07:00
Aleksa Sarai
9acfd7b1a3 timens: minor cleanups
Fix up a few things that were flagged in the review of the original
timens PR, namely around error handling and validation.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-08-10 18:59:55 +10:00
Chethan Suresh
ebc2e7c435 Support time namespace
"time" namespace was introduced in Linux v5.6
support new time namespace to set boottime and monotonic time offset

Example runtime spec

"timeOffsets": {
    "monotonic": {
        "secs": 172800,
        "nanosecs": 0
    },
    "boottime": {
        "secs": 604800,
        "nanosecs": 0
    }
}

Signed-off-by: Chethan Suresh <chethan.suresh@sony.com>
2023-08-03 10:12:01 +05:30
Ruediger Pluem
da780e4d27 Fix bind mounts of filesystems with certain options set
Currently bind mounts of filesystems with nodev, nosuid, noexec,
noatime, relatime, strictatime, nodiratime options set fail in rootless
mode if the same options are not set for the bind mount.
For ro filesystems this was resolved by #2570 by remounting again
with ro set.

Follow the same approach for nodev, nosuid, noexec, noatime, relatime,
strictatime, nodiratime but allow to revert back to the old behaviour
via the new `--no-mount-fallback` command line option.

Add a testcase to verify that bind mounts of filesystems with nodev,
nosuid, noexec, noatime options set work in rootless mode.
Add a testcase that mounts a nodev, nosuid, noexec, noatime filesystem
with a ro flag.
Add two further testcases that ensure that the above testcases would
fail if the `--no-mount-fallback` command line option is set.

* contrib/completions/bash/runc:
      Add `--no-mount-fallback` command line option for bash completion.

* create.go:
      Add `--no-mount-fallback` command line option.

* restore.go:
      Add `--no-mount-fallback` command line option.

* run.go:
      Add `--no-mount-fallback` command line option.

* libcontainer/configs/config.go:
      Add `NoMountFallback` field to the `Config` struct to store
      the command line option value.

* libcontainer/specconv/spec_linux.go:
      Add `NoMountFallback` field to the `CreateOpts` struct to store
      the command line option value and store it in the libcontainer
      config.

* utils_linux.go:
      Store the command line option value in the `CreateOpts` struct.

* libcontainer/rootfs_linux.go:
      In case that `--no-mount-fallback` is not set try to remount the
      bind filesystem again with the options nodev, nosuid, noexec,
      noatime, relatime, strictatime or nodiratime if they are set on
      the source filesystem.

* tests/integration/mounts_sshfs.bats:
      Add testcases and rework sshfs setup to allow specifying
      different mount options depending on the test case.

Signed-off-by: Ruediger Pluem <ruediger.pluem@vodafone.com>
2023-07-28 16:32:02 -07:00
Francis Laniel
c47f58c4e9 Capitalize [UG]idMappings as [UG]IDMappings
Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com>
2023-07-21 13:55:34 +02:00
Rodrigo Campos
4b668a8224 Switch setupUserNamespace() to use the toConfigIDMap() helper
We just added this helper for other parts of the code, let's switch this
function to use the helper too.

Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
2023-07-11 16:17:48 +02:00
Rodrigo Campos
fbf183c6f8 Add uid and gid mappings to mounts
Co-authored-by: Francis Laniel <flaniel@linux.microsoft.com>
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
2023-07-11 16:17:48 +02:00
utam0k
d9230602e9 Implement to set a domainname
opencontainers/runtime-spec#1156

Signed-off-by: utam0k <k0ma@utam0k.jp>
2023-04-12 13:31:20 +00:00
Akihiro Suda
92a4ccb84e specconv: avoid mapping "acl" to MS_POSIXACL
From caaef1ba8c :

> In fact SB_POSIXACL is an internal flag, and accepting MS_POSIXACL on
> the mount(2) interface is possibly a bug.

Fix issue 3738

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2023-02-11 22:50:41 +09:00
wineway
81c379fa8b support SCHED_IDLE for runc cgroupfs
Signed-off-by: wineway <wangyuweihx@gmail.com>
2023-01-31 15:19:05 +08:00
Kir Kolyshkin
ac04154f0b seccomp: set SPEC_ALLOW by default
If no seccomps flags are set in OCI runtime spec (not even the empty
set), set SPEC_ALLOW as the default (if it's supported).

Otherwise, use the flags as they are set (that includes no flags for
empty seccomp.Flags array).

This mimics the crun behavior, and makes runc seccomp performance on par
with crun.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-11-29 17:24:32 -08:00
Kir Kolyshkin
076745a40f runc features: add seccomp filter flags
Amend runc features to print seccomp flags. Two set of flags are added:
 * known flags are those that this version of runc is aware of;
 * supported flags are those that can be set; normally, this is the same
   set as known flags, but due to older version of kernel and/or
   libseccomp, some known flags might be unsupported.

This commit also consolidates three different switch statements dealing
with flags into one, in func setFlag. A note is added to this function
telling what else to look for when adding new flags.

Unfortunately, it also adds a list of known flags, that should be
kept in sync with the switch statement.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-11-29 17:24:32 -08:00
Kir Kolyshkin
6462e9de67 runc update: implement memory.checkBeforeUpdate
This is aimed at solving the problem of cgroup v2 memory controller
behavior which is not compatible with that of cgroup v1.

In cgroup v1, if the new memory limit being set is lower than the
current usage, setting the new limit fails.

In cgroup v2, same operation succeeds, and the container is OOM killed.

Introduce a new setting, memory.checkBeforeUpdate, and use it to mimic
cgroup v1 behavior.

Note that this is not 100% reliable because of TOCTOU, but this is the
best we can do.

Add some test cases.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-11-02 17:15:26 -07:00
Kir Kolyshkin
45cc290f02 libct: fixes for godoc 1.19
Since Go 1.19, godoc recognizes lists, code blocks, headings etc. It
also reformats the sources making it more apparent that these features
are used.

Fix a few places where it misinterpreted the formatting (such as
indented vs unindented), and format the result using the gofumpt
from HEAD, which already incorporates gofmt 1.19 changes.

Some more fixes (and enhancements) might be required.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-08-16 09:53:54 -07:00
Alban Crequy
58ea21daef seccomp: add support for flags
List of seccomp flags defined in runtime-spec:
* SECCOMP_FILTER_FLAG_TSYNC
* SECCOMP_FILTER_FLAG_LOG
* SECCOMP_FILTER_FLAG_SPEC_ALLOW

Note that runc does not apply SECCOMP_FILTER_FLAG_TSYNC. It does not
make sense to apply the seccomp filter on only one thread; other threads
will be terminated after exec anyway.

See similar commit in crun:
fefabffa28

Note that SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV (introduced by
https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?id=c2aa2dfef243
in Linux 5.19-rc1) is not added yet because Linux 5.19 is not released
yet.

Signed-off-by: Alban Crequy <albancrequy@microsoft.com>
2022-07-28 16:25:26 +02:00
wineway
d160116055 libct: use unix.Getwd instead of os.Getwd to avoid symlink
Signed-off-by: wineway <wangyuweihx@gmail.com>
2022-05-11 17:25:34 -07:00
Kir Kolyshkin
2ce40b6ad7 Remove tun/tap from the default device rules
Looking through git blame, this was added by commit 9fac18329
aka "Initial commit of runc binary", most probably by mistake.

Obviously, a container should not have access to tun/tap device, unless
it is explicitly specified in configuration.

Now, removing this might create a compatibility issue, but I see no
other choice.

Aside from the obvious misconfiguration, this should also fix the
annoying

> Apr 26 03:46:56 foo.bar systemd[1]: Couldn't stat device /dev/char/10:200: No such file or directory

messages from systemd on every container start, when runc uses systemd
cgroup driver, and the system runs an old (< v240) version of systemd
(the message was presumably eliminated by [1]).

[1] d5aecba6e0

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-05-04 15:38:58 -07:00
Masahiro Yamada
5222928650 libct/specconv: use a local variable in CreateCgroupConfig()
Use r instead of spec.Linux.Resources to be consistent throughout
this code hunk.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2022-03-29 22:48:06 +09:00
Kir Kolyshkin
376c988618 libct/specconv: improve checkPropertyName
Commit 029b73c1b replaced a regular expression with code checking the
characters. Despite what the comment said about ASCII, the check was
performed rune by rune, not byte by byte.

Note the check was still correct, basically comparing int32's, but the
byte by byte way is a tad faster and more straightforward. The change
also fixes the issue of a misleading comment.

Benchmark before/after shows a modest improvement:

name                 old time/op    new time/op    delta
CheckPropertyName-4     164ns ± 2%     123ns ± 2%  -24.73%  (p=0.029 n=4+4)

name                 old alloc/op   new alloc/op   delta
CheckPropertyName-4     96.0B ± 0%     64.0B ± 0%  -33.33%  (p=0.029 n=4+4)

name                 old allocs/op  new allocs/op  delta
CheckPropertyName-4      6.00 ± 0%      4.00 ± 0%  -33.33%  (p=0.029 n=4+4)

Fixes: 029b73c1b
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-02-01 11:47:31 -08:00
Kir Kolyshkin
c45eed9a4a libct/specconv: rm empty key from mountPropagationMapping
It is a tad cleaner this way.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-01-07 11:22:22 -08:00
Akihiro Suda
382eba4354 Support recursive mount attrs ("rro", "rnosuid", "rnodev", ...)
The new mount option "rro" makes the mount point recursively read-only,
by calling `mount_setattr(2)` with `MOUNT_ATTR_RDONLY` and `AT_RECURSIVE`.
https://man7.org/linux/man-pages/man2/mount_setattr.2.html

Requires kernel >= 5.12.

The "rro" option string conforms to the proposal in util-linux/util-linux Issue 1501.

Fix issue 2823

Similary, this commit also adds the following mount options:
- rrw
- r[no]{suid,dev,exec,relatime,atime,strictatime,diratime,symfollow}

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2021-12-07 17:39:57 +09:00
Akihiro Suda
ba935a51a6 Support nosymfollow mount option (kernel 5.10)
See MS_NOSYMFOLLOW in mount(2)

https://man7.org/linux/man-pages/man2/mount.2.html

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2021-12-07 17:33:49 +09:00
Aleksa Sarai
cdce249635 merge branch 'pr-3057'
Fraser Tweedale (1):
  chown cgroup to process uid in container namespace

LGTMs: kolyshkin cyphar
Closes #3057
2021-12-07 17:06:19 +11:00
Akihiro Suda
520702dac5 Add runc features command
Fix issue 3274

See `types/features/features.go`.

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2021-11-30 16:40:39 +09:00
Fraser Tweedale
35d20c4e0b chown cgroup to process uid in container namespace
Delegating cgroups to the container enables more complex workloads,
including systemd-based workloads.  The OCI runtime-spec was
recently updated to explicitly admit such delegation, through
specification of cgroup ownership semantics:

  https://github.com/opencontainers/runtime-spec/pull/1123

Pursuant to the updated OCI runtime-spec, change the ownership of
the container's cgroup directory and particular files therein, when
using cgroups v2 and when the cgroupfs is to be mounted read/write.

As a result of this change, systemd workloads can run in isolated
user namespaces on OpenShift when the sandbox's cgroupfs is mounted
read/write.

It might be possible to implement this feature in other cgroup
managers, but that work is deferred.

Signed-off-by: Fraser Tweedale <ftweedal@redhat.com>
2021-11-30 08:52:59 +10:00
Aleksa Sarai
dde509df4e specconv: do not permit null bytes in mount fields
Using null bytes as control characters for sending strings via netlink
opens us up to a user explicitly putting a null byte in a mount string
(which JSON will happily let you do) and then causing us to open a mount
path different to the one expected.

In practice this is more of an issue in an environment such as
Kubernetes where you may have path-based access control policies (which
are more susceptible to these kinds of flaws).

Found by Google Project Zero.

Fixes: 9c444070ec ("Open bind mount sources from the host userns")
Reported-by: Felix Wilhelm <fwilhelm@google.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2021-11-19 11:41:05 +11:00
Kir Kolyshkin
643f8a2b40 libct/specconv: nits
1. Decapitalize errors.
2. Rename isValidName to checkPropertyName.
3. Make it return a specific error.

Suggested-by: Sebastiaan van Stijn <github@gone.nl>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-11-17 17:32:28 -08:00
Kir Kolyshkin
029b73c1b0 libct/spec: replace isValidName regex with a function
Also, add a simple test and a benchmark (just out of sheer curiosity).

Benchmark results:

name           old time/op  new time/op  delta
IsValidName-4   540ns ± 3%    45ns ± 1%  -91.76%  (p=0.008 n=5+5)

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-11-12 20:23:49 -08:00
Kir Kolyshkin
6907becaf9 libct/specconv: remove isSecSuffix regex
Commit 1cd71dfd7 added isSecSuffix, but the same thing can be done
easily without a regex. This is faster and saves some init time and
memory.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-11-12 20:23:47 -08:00
Kir Kolyshkin
37c5fd554e libct/specconv: make parseMountOptions return Mount
parseMountOption already returns way too many values, making the code
kind of hard to read.

Since all of the return values are used as is to populate the fields of
configs.Mount, let's change it to return (semi-)populated *configs.Mount
instead.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-11-12 20:23:30 -08:00
Kir Kolyshkin
2c3792baf9 libct/specconv: make mountFlags and extensionFlags global
This makes the repeated calls to parseMountOptions faster,
and decreases the amount of garbage to collect.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-11-12 20:16:04 -08:00
Kir Kolyshkin
81586e1935 libct/specconv: reuse mountPropagationMapping in parseMountOptions
These two maps are the same, except that mountPropagationMapping
has an extra element with key of "" and value of 0. Since the
code already checks for f != 0, this extra element is not a problem.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-11-12 20:15:54 -08:00
Kir Kolyshkin
8fe1e8bf8c libct/specconv: rm some init allocations
Eliminate some of these allocations when starting runc:

> init github.com/opencontainers/runc/libcontainer/specconv @10 ms, 0.11 ms clock, 5408 bytes, 70 allocs

Most of this (4K) is the two regexes, which are left intact for now.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-11-12 20:15:37 -08:00