Commit Graph

249 Commits

Author SHA1 Message Date
Kir Kolyshkin
606257c6e1 Bump golangci-lint to v1.60, fix new warnings
The warnings fixed were:

libcontainer/configs/config_test.go:205:12: printf: non-constant format string in call to (*testing.common).Errorf (govet)
		t.Errorf(fmt.Sprintf("Expected error to not occur but it was %+v", err))
		         ^
libcontainer/cgroups/fs/blkio_test.go:481:13: printf: non-constant format string in call to (*testing.common).Errorf (govet)
			t.Errorf(fmt.Sprintf("test case '%s' failed unexpectedly: %s", testCase.desc, err))
			         ^
libcontainer/cgroups/fs/blkio_test.go:595:13: printf: non-constant format string in call to (*testing.common).Errorf (govet)
			t.Errorf(fmt.Sprintf("test case '%s' failed unexpectedly: %s", testCase.desc, err))
			         ^

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-08-14 20:39:15 +08:00
Sebastiaan van Stijn
c14213399a remove pre-go1.17 build-tags
Removed pre-go1.17 build-tags with go fix;

    go fix -mod=readonly ./...

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2024-06-29 15:45:25 +02:00
utam0k
bfbd0305ba Add I/O priority
Signed-off-by: utam0k <k0ma@utam0k.jp>
2024-03-30 22:31:54 +09:00
lengrongfu
68438ba272 fix scheduler validate
Signed-off-by: lengrongfu <lenronfu@gmail.com>
2024-01-05 09:50:41 +08:00
lfbzhm
371ff9c5e7 Merge pull request #3985 from cyphar/idmap-generic
libcontainer: remove all mount logic from nsexec
2023-12-18 13:10:45 +08:00
Aleksa Sarai
482e56379a configs: make id mappings int64 to better handle 32-bit
Using ints for all of our mapping structures means that a 32-bit binary
errors out when trying to parse /proc/self/*id_map:

  failed to cache mappings for userns: failed to parse uid_map of userns /proc/1/ns/user:
  parsing id map failed: invalid format in line "         0          0 4294967295": integer overflow on token 4294967295

This issue was unearthed by commit 1912d5988b ("*: actually support
joining a userns with a new container") but the underlying issue has
been present since the docker/libcontainer days.

In theory, switching to uint32 (to match the spec) instead of int64
would also work, but keeping everything signed seems much less
error-prone. It's also important to note that a mapping might be too
large for an int on 32-bit, so we detect this during the mapping.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-12-14 12:14:32 +11:00
Aleksa Sarai
3b57e45cbf mount: add support for ridmap and idmap
ridmap indicates that the id mapping should be applied recursively (only
really relevant for rbind mount entries), and idmap indicates that it
should not be applied recursively (the default). If no mappings are
specified for the mount, we use the userns configuration of the
container. This matches the behaviour in the currently-unreleased
runtime-spec.

This includes a minor change to the state.json serialisation format, but
because there has been no released version of runc with commit
fbf183c6f8 ("Add uid and gid mappings to mounts"), we can safely make
this change without affecting running containers. Doing it this way
makes it much easier to handle m.IsIDMapped() and indicating that a
mapping has been specified.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-12-14 11:36:42 +11:00
Aleksa Sarai
8e8b136c49 tree-wide: use /proc/thread-self for thread-local state
With the idmap work, we will have a tainted Go thread in our
thread-group that has a different mount namespace to the other threads.
It seems that (due to some bad luck) the Go scheduler tends to make this
thread the thread-group leader in our tests, which results in very
baffling failures where /proc/self/mountinfo produces gibberish results.

In order to avoid this, switch to using /proc/thread-self for everything
that is thread-local. This primarily includes switching all file
descriptor paths (CLONE_FS), all of the places that check the current
cgroup (technically we never will run a single runc thread in a separate
cgroup, but better to be safe than sorry), and the aforementioned
mountinfo code. We don't need to do anything for the following because
the results we need aren't thread-local:

 * Checks that certain namespaces are supported by stat(2)ing
   /proc/self/ns/...

 * /proc/self/exe and /proc/self/cmdline are not thread-local.

 * While threads can be in different cgroups, we do not do this for the
   runc binary (or libcontainer) and thus we do not need to switch to
   the thread-local version of /proc/self/cgroups.

 * All of the CLONE_NEWUSER files are not thread-local because you
   cannot set the usernamespace of a single thread (setns(CLONE_NEWUSER)
   is blocked for multi-threaded programs).

Note that we have to use runtime.LockOSThread when we have an open
handle to a tid-specific procfs file that we are operating on multiple
times. Go can reschedule us such that we are running on a different
thread and then kill the original thread (causing -ENOENT or similarly
confusing errors). This is not strictly necessary for most usages of
/proc/thread-self (such as using /proc/thread-self/fd/$n directly) since
only operating on the actual inodes associated with the tid requires
this locking, but because of the pre-3.17 fallback for CentOS, we have
to do this in most cases.

In addition, CentOS's kernel is too old for /proc/thread-self, which
requires us to emulate it -- however in rootfs_linux.go, we are in the
container pid namespace but /proc is the host's procfs. This leads to
the incredibly frustrating situation where there is no way (on pre-4.1
Linux) to figure out which /proc/self/task/... entry refers to the
current tid. We can just use /proc/self in this case.

Yes this is all pretty ugly. I also wish it wasn't necessary.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-12-14 11:36:41 +11:00
Aleksa Sarai
5ae88daf06 idmap: allow arbitrary idmap mounts regardless of userns configuration
With the rework of nsexec.c to handle MOUNT_ATTR_IDMAP in our Go code we
can now handle arbitrary mappings without issue, so remove the primary
artificial limit of mappings (must use the same mapping as the
container's userns) and add some tests.

We still only support idmap mounts for bind-mounts because configuring
mappings for other filesystems would require switching our entire mount
machinery to the new mount API. The current design would easily allow
for this but we would need to convert new mount options entirely to the
fsopen/fsconfig/fsmount API. This can be done in the future.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-12-14 11:36:41 +11:00
Aleksa Sarai
09822c3da8 configs: disallow ambiguous userns and timens configurations
For userns and timens, the mappings (and offsets, respectively) cannot
be changed after the namespace is first configured. Thus, configuring a
container with a namespace path to join means that you cannot also
provide configuration for said namespace. Previously we would silently
ignore the configuration (and just join the provided path), but we
really should be returning an error (especially when you consider that
the configuration userns mappings are used quite a bit in runc with the
assumption that they are the correct mapping for the userns -- but in
this case they are not).

In the case of userns, the mappings are also required if you _do not_
specify a path, while in the case of the time namespace you can have a
container with a timens but no mappings specified.

It should be noted that the case checking that the user has not
specified a userns path and a userns mapping needs to be handled in
specconv (as opposed to the configuration validator) because with this
patchset we now cache the mappings of path-based userns configurations
and thus the validator can't be sure whether the mapping is a cached
mapping or a user-specified one. So we do the validation in specconv,
and thus the test for this needs to be an integration test.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-12-05 17:46:09 +11:00
Aleksa Sarai
3bab7e9223 configs: clean up error messages for Host[UG]ID
If a user has misconfigured their userns mappings, they need to know
which id specifically is not mapped. There's no need to be vague.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-12-05 17:46:09 +11:00
Aleksa Sarai
1912d5988b *: actually support joining a userns with a new container
Our handling for name space paths with user namespaces has been broken
for a long time. In particular, the need to parse /proc/self/*id_map in
quite a few places meant that we would treat userns configurations that
had a namespace path as if they were a userns configuration without
mappings, resulting in errors.

The primary issue was down to the id translation helper functions, which
could only handle configurations that had explicit mappings. Obviously,
when joining a user namespace we need to map the ids but figuring out
the correct mapping is non-trivial in comparison.

In order to get the mapping, you need to read /proc/<pid>/*id_map of a
process inside the userns -- while most userns paths will be of the form
/proc/<pid>/ns/user (and we have a fast-path for this case), this is not
guaranteed and thus it is necessary to spawn a process inside the
container and read its /proc/<pid>/*id_map files in the general case.

As Go does not allow us spawn a subprocess into a target userns,
we have to use CGo to fork a sub-process which does the setns(2). To be
honest, this is a little dodgy in regards to POSIX signal-safety(7) but
since we do no allocations and we are executing in the forked context
from a Go program (not a C program), it should be okay. The other
alternative would be to do an expensive re-exec (a-la nsexec which would
make several other bits of runc more complicated), or to use nsenter(1)
which might not exist on the system and is less than ideal.

Because we need to logically remap users quite a few times in runc
(including in "runc init", where joining the namespace is not feasable),
we cache the mapping inside the libcontainer config struct. A future
patch will make sure that we stop allow invalid user configurations
where a mapping is specified as well as a userns path to join.

Finally, add an integration test to make sure we don't regress this again.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-12-05 17:46:08 +11:00
Aleksa Sarai
669f4dbef8 configs: validate: add validation for bind-mount fsflags
Bind-mounts cannot have any filesystem-specific "data" arguments,
because the kernel ignores the data argument for MS_BIND and
MS_BIND|MS_REMOUNT and we cannot safely try to override the flags
because those would affect mounts on the host (these flags affect the
superblock).

It should be noted that there are cases where the filesystem-specified
flags will also be ignored for non-bind-mounts but those are kernel
quirks and there's no real way for us to work around them. And users
wouldn't get any real benefit from us adding guardrails to existing
kernel behaviour.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-11-02 07:50:03 +11:00
Rodrigo Campos
4bf8b55594 libct: Remove old comment
We changed it in PR:
	https://github.com/opencontainers/runtime-spec/pull/1225

But we missed to remove this comment.

Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
2023-11-01 12:48:42 +01:00
Zheao.Li
98511bb40e linux: Support setting execution domain via linux personality
carry #3126

Co-authored-by: Aditya R <arajan@redhat.com>
Signed-off-by: Zheao.Li <me@manjusaka.me>
2023-10-27 19:33:37 +08:00
Aleksa Sarai
7c71a22705 rootfs: remove --no-mount-fallback and finally fix MS_REMOUNT
The original reasoning for this option was to avoid having mount options
be overwritten by runc. However, adding command-line arguments has
historically been a bad idea because it forces strict-runc-compatible
OCI runtimes to copy out-of-spec features directly from runc and these
flags are usually quite difficult to enable by users when using runc
through several layers of engines and orchestrators.

A far more preferable solution is to have a heuristic which detects
whether copying the original mount's mount options would override an
explicit mount option specified by the user. In this case, we should
return an error. You only end up in this path in the userns case, if you
have a bind-mount source with locked flags.

During the course of writing this patch, I discovered that several
aspects of our handling of flags for bind-mounts left much to be
desired. We have completely botched the handling of explicitly cleared
flags since commit 97f5ee4e6a ("Only remount if requested flags differ
from current"), with our behaviour only becoming increasingly more weird
with 50105de1d8 ("Fix failure with rw bind mount of a ro fuse") and
da780e4d27 ("Fix bind mounts of filesystems with certain options
set"). In short, we would only clear flags explicitly request by the
user purely by chance, in ways that it really should've been reported to
us by now. The most egregious is that mounts explicitly marked "rw" were
actually mounted "ro" if the bind-mount source was "ro" and no other
special flags were included. In addition, our handling of atime was
completely broken -- mostly due to how subtle the semantics of atime are
on Linux.

Unfortunately, while the runtime-spec requires us to implement
mount(8)'s behaviour, several aspects of the util-linux mount(8)'s
behaviour are broken and thus copying them makes little sense. Since the
runtime-spec behaviour for this case (should mount options for a "bind"
mount use the "mount --bind -o ..." or "mount --bind -o remount,..."
semantics? Is the fallback code we have for userns actually
spec-compliant?) and the mount(8) behaviour (see [1]) are not
well-defined, this commit simply fixes the most obvious aspects of the
behaviour that are broken while keeping the current spirit of the
implementation.

NOTE: The handling of atime in the base case is left for a future PR to
deal with. This means that the atime of the source mount will be
silently left alone unless the fallback path needs to be taken, and any
flags not explicitly set will be cleared in the base case. Whether we
should always be operating as "mount --bind -o remount,..." (where we
default to the original mount source flags) is a topic for a separate PR
and (probably) associated runtime-spec PR.

So, to resolve this:

* We store which flags were explicitly requested to be cleared by the
  user, so that we can detect whether the userns fallback path would end
  up setting a flag the user explicitly wished to clear. If so, we
  return an error because we couldn't fulfil the configuration settings.

* Revert 97f5ee4e6a ("Only remount if requested flags differ from
  current"), as missing flags do not mean we can skip MS_REMOUNT (in
  fact, missing flags are how you indicate a flag needs to be cleared
  with mount(2)). The original purpose of the patch was to fix the
  userns issue, but as mentioned above the correct mechanism is to do a
  fallback mount that copies the lockable flags from statfs(2).

* Improve handling of atime in the fallback case by:
    - Correctly handling the returned flags in statfs(2).
    - Implement the MNT_LOCK_ATIME checks in our code to ensure we
      produce errors rather than silently producing incorrect atime
      mounts.

* Improve the tests so we correctly detect all of these contingencies,
  including a general "bind-mount atime handling" test to ensure that
  the behaviour described here is accurate.

This change also inlines the remount() function -- it was only ever used
for the bind-mount remount case, and its behaviour is very bind-mount
specific.

[1]: https://github.com/util-linux/util-linux/issues/2433

Reverts: 97f5ee4e6a ("Only remount if requested flags differ from current")
Fixes: 50105de1d8 ("Fix failure with rw bind mount of a ro fuse")
Fixes: da780e4d27 ("Fix bind mounts of filesystems with certain options set")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-10-24 17:28:25 +11:00
utam0k
770728e16e Support process.scheduler
Spec: https://github.com/opencontainers/runtime-spec/pull/1188
Fix: https://github.com/opencontainers/runc/issues/3895

Co-authored-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: utam0k <k0ma@utam0k.jp>
Signed-off-by: lifubang <lifubang@acmcoder.com>
2023-10-04 15:53:18 +08:00
Rodrigo Campos
b17c6f237d validator: Relax warning for not abs mount dst path
The runtime spec now allows relative mount dst paths, so remove the
comment saying we will switch this to an error later and change the
error messages to reflect that.

Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
2023-09-11 16:02:41 +02:00
Kailun Qin
e1584831b6 libct/cg: add CFS bandwidth burst for CPU
Burstable CFS controller is introduced in Linux 5.14. This helps with
parallel workloads that might be bursty. They can get throttled even
when their average utilization is under quota. And they may be latency
sensitive at the same time so that throttling them is undesired.

This feature borrows time now against the future underrun, at the cost
of increased interference against the other system users, by introducing
cfs_burst_us into CFS bandwidth control to enact the cap on unused
bandwidth accumulation, which will then used additionally for burst.

The patch adds the support/control for CFS bandwidth burst.

runtime-spec: https://github.com/opencontainers/runtime-spec/pull/1120

Co-authored-by: Akihiro Suda <suda.kyoto@gmail.com>
Co-authored-by: Nadeshiko Manju <me@manjusaka.me>
Signed-off-by: Kailun Qin <kailun.qin@intel.com>
2023-09-06 23:23:30 +08:00
Kir Kolyshkin
6a4870e4ac libct: better errors for hooks
When a hook has failed, the error message looks like this:

> error running hook: error running hook #1: exit status 1, stdout: ...

The two problems here are:
1. it is impossible to know what kind of hook it was;
2. "error running hook" stuttering;

Change that to

> error running createContainer hook #1: exit status 1, stdout: ...

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-08-24 19:44:05 -07:00
Aleksa Sarai
1f25724a96 configs: fix idmapped mounts json field names
In the runc state JSON we always use snake_case. This is a no-op change,
but it will cause any existing container state files to be incorrectly
parsed. Luckily, commit fbf183c6f8 ("Add uid and gid mappings to
mounts") has never been in a runc release so we can change this before a
1.2.z release.

Fixes: fbf183c6f8 ("Add uid and gid mappings to mounts")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-08-15 19:54:24 -07:00
Aleksa Sarai
aa5f4c1137 tests: add several timens tests
These are not exhaustive, but at least confirm that the feature is not
obviously broken (we correctly set the time offsets).

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-08-10 19:01:31 +10:00
Aleksa Sarai
9acfd7b1a3 timens: minor cleanups
Fix up a few things that were flagged in the review of the original
timens PR, namely around error handling and validation.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-08-10 18:59:55 +10:00
Aleksa Sarai
0866112e81 merge #3876 into opencontainers/runc:main
Chethan Suresh (1):
  Support time namespace

LGTMs: kolyskin cyphar
Closes #3876
2023-08-10 18:27:17 +10:00
Rodrigo Campos
19d26a6596 Revert "libct/validator: Error out on non-abs paths"
This reverts commit 881e92a3fd and adjust
the code so the idmap validations are strict.

We now only throw a warning and the container is started just fine.

Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
2023-08-08 13:45:31 +02:00
Chethan Suresh
ebc2e7c435 Support time namespace
"time" namespace was introduced in Linux v5.6
support new time namespace to set boottime and monotonic time offset

Example runtime spec

"timeOffsets": {
    "monotonic": {
        "secs": 172800,
        "nanosecs": 0
    },
    "boottime": {
        "secs": 604800,
        "nanosecs": 0
    }
}

Signed-off-by: Chethan Suresh <chethan.suresh@sony.com>
2023-08-03 10:12:01 +05:30
Ruediger Pluem
da780e4d27 Fix bind mounts of filesystems with certain options set
Currently bind mounts of filesystems with nodev, nosuid, noexec,
noatime, relatime, strictatime, nodiratime options set fail in rootless
mode if the same options are not set for the bind mount.
For ro filesystems this was resolved by #2570 by remounting again
with ro set.

Follow the same approach for nodev, nosuid, noexec, noatime, relatime,
strictatime, nodiratime but allow to revert back to the old behaviour
via the new `--no-mount-fallback` command line option.

Add a testcase to verify that bind mounts of filesystems with nodev,
nosuid, noexec, noatime options set work in rootless mode.
Add a testcase that mounts a nodev, nosuid, noexec, noatime filesystem
with a ro flag.
Add two further testcases that ensure that the above testcases would
fail if the `--no-mount-fallback` command line option is set.

* contrib/completions/bash/runc:
      Add `--no-mount-fallback` command line option for bash completion.

* create.go:
      Add `--no-mount-fallback` command line option.

* restore.go:
      Add `--no-mount-fallback` command line option.

* run.go:
      Add `--no-mount-fallback` command line option.

* libcontainer/configs/config.go:
      Add `NoMountFallback` field to the `Config` struct to store
      the command line option value.

* libcontainer/specconv/spec_linux.go:
      Add `NoMountFallback` field to the `CreateOpts` struct to store
      the command line option value and store it in the libcontainer
      config.

* utils_linux.go:
      Store the command line option value in the `CreateOpts` struct.

* libcontainer/rootfs_linux.go:
      In case that `--no-mount-fallback` is not set try to remount the
      bind filesystem again with the options nodev, nosuid, noexec,
      noatime, relatime, strictatime or nodiratime if they are set on
      the source filesystem.

* tests/integration/mounts_sshfs.bats:
      Add testcases and rework sshfs setup to allow specifying
      different mount options depending on the test case.

Signed-off-by: Ruediger Pluem <ruediger.pluem@vodafone.com>
2023-07-28 16:32:02 -07:00
lifubang
c875ea8529 use the length of UIDMappings/GIDMappings to check whether empty or not
Signed-off-by: lifubang <lifubang@acmcoder.com>
2023-07-23 11:59:18 +08:00
Francis Laniel
c47f58c4e9 Capitalize [UG]idMappings as [UG]IDMappings
Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com>
2023-07-21 13:55:34 +02:00
Rodrigo Campos
fbf183c6f8 Add uid and gid mappings to mounts
Co-authored-by: Francis Laniel <flaniel@linux.microsoft.com>
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
2023-07-11 16:17:48 +02:00
Rodrigo Campos
881e92a3fd libct/validator: Error out on non-abs paths
This was a warning already and it was requested to make this an error
while we will add validation of idmap mounts:
	https://github.com/opencontainers/runc/pull/3717#discussion_r1154705318

I've also tested a k8s cluster and the config.json generated by
containerd didn't use any relative paths. I tested one pod, so it was
definitely not an extensive test.

Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
2023-07-07 12:00:33 +02:00
Kir Kolyshkin
2a7dcbbb40 libct: fix shared pidns detection
When someone is using libcontainer to start and kill containers from a
long lived process (i.e. the same process creates and removes the
container), initProcess.wait method is used, which has a kludge to work
around killing containers that do not have their own PID namespace.

The code that checks for own PID namespace is not entirely correct.
To be exact, it does not set sharePidns flag when the host/caller PID
namespace is implicitly used. As a result, the above mentioned kludge
does not work.

Fix the issue, add a test case (which fails without the fix).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-06-08 09:23:29 -07:00
utam0k
d9230602e9 Implement to set a domainname
opencontainers/runtime-spec#1156

Signed-off-by: utam0k <k0ma@utam0k.jp>
2023-04-12 13:31:20 +00:00
wineway
81c379fa8b support SCHED_IDLE for runc cgroupfs
Signed-off-by: wineway <wangyuweihx@gmail.com>
2023-01-31 15:19:05 +08:00
Eric Ernst
e29e57b5fc libcontainer: configs: ensure can build on darwin
configs package can no longer be built on non-Linux OS, such as Darwin.

When running `GOOS=darwin go build` on the packge, we had the following
errors:
```
./configs/mount.go:34:16: undefined: unix.MountAttr
./configs/mount.go:47:22: undefined: unix.MS_BIND
```

Let's ensure that the linux specific bits are handled in mount_linux.go,
and introduce a _unsupported file, similar to how cgroups file is
handled within the package. This'll facilitate utilization of the pkg
for other projects that care about Darwin.

Signed-off-by: Eric Ernst <eric_ernst@apple.com>
2023-01-17 15:57:08 -08:00
Kir Kolyshkin
6462e9de67 runc update: implement memory.checkBeforeUpdate
This is aimed at solving the problem of cgroup v2 memory controller
behavior which is not compatible with that of cgroup v1.

In cgroup v1, if the new memory limit being set is lower than the
current usage, setting the new limit fails.

In cgroup v2, same operation succeeds, and the container is OOM killed.

Introduce a new setting, memory.checkBeforeUpdate, and use it to mimic
cgroup v1 behavior.

Note that this is not 100% reliable because of TOCTOU, but this is the
best we can do.

Add some test cases.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-11-02 17:15:26 -07:00
Kir Kolyshkin
45cc290f02 libct: fixes for godoc 1.19
Since Go 1.19, godoc recognizes lists, code blocks, headings etc. It
also reformats the sources making it more apparent that these features
are used.

Fix a few places where it misinterpreted the formatting (such as
indented vs unindented), and format the result using the gofumpt
from HEAD, which already incorporates gofmt 1.19 changes.

Some more fixes (and enhancements) might be required.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-08-16 09:53:54 -07:00
Alban Crequy
58ea21daef seccomp: add support for flags
List of seccomp flags defined in runtime-spec:
* SECCOMP_FILTER_FLAG_TSYNC
* SECCOMP_FILTER_FLAG_LOG
* SECCOMP_FILTER_FLAG_SPEC_ALLOW

Note that runc does not apply SECCOMP_FILTER_FLAG_TSYNC. It does not
make sense to apply the seccomp filter on only one thread; other threads
will be terminated after exec anyway.

See similar commit in crun:
fefabffa28

Note that SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV (introduced by
https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?id=c2aa2dfef243
in Linux 5.19-rc1) is not added yet because Linux 5.19 is not released
yet.

Signed-off-by: Alban Crequy <albancrequy@microsoft.com>
2022-07-28 16:25:26 +02:00
cdoern
c0be1aa2d1 export blockIODevice
the struct blockIODevice is used in an exported struct but it is not itself exported rendering that type inaccessible to
outside projects

Signed-off-by: cdoern <cdoern@redhat.com>
2022-06-13 13:40:39 -04:00
Kir Kolyshkin
48006d0007 libct/configs/validate: rootlessEUIDMount: speedup
1. Fix function docs. In particular, remove the part
   which is not true ("verifies that the user isn't trying to set up any
   mounts they don't have the rights to do"), and fix the part that
   says "that doesn't resolve to root" (which is no longer true since
   commit d8b669400a).

2. Replace fmt.Sscanf (which is slow and does lots of allocations)
   with strings.TrimPrefix and strconv.Atoi.

3. Add a benchmark for rootlessEUIDMount. Comparing the old and the new
   implementations:

	name                 old time/op    new time/op    delta
	RootlessEUIDMount-4    1.01µs ± 2%    0.16µs ± 1%  -84.15%  (p=0.008 n=5+5)

	name                 old alloc/op   new alloc/op   delta
	RootlessEUIDMount-4      224B ± 0%       80B ± 0%  -64.29%  (p=0.008 n=5+5)

	name                 old allocs/op  new allocs/op  delta
	RootlessEUIDMount-4      7.00 ± 0%      1.00 ± 0%  -85.71%  (p=0.008 n=5+5)

Note this code is already tested (in rootless_test.go).

Fixes: d8b669400a
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-03-17 13:39:55 -07:00
Markus Lehtonen
1d5c331042 configs/validate: looser validation for RDT
Don't require CAT or MBA because we don't detect those correctly (we
don't support L2 or L3DATA/L3CODE for example, and in the future
possibly even more). With plain "ClosId mode" we don't really care: we
assign the container to a pre-configured CLOS without trying to do
anything smarter.

Moreover, this was a duplicate/redundant check anyway, as for CAT and
MBA there is another specific sanity check that is done if L3 or MB
is specified in the config.

Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
2022-02-18 16:24:50 +02:00
Kir Kolyshkin
0d21515038 libct: remove Validator interface
We only have one implementation of config validator, which is always
used. It makes no sense to have Validator interface.

Having validate.Validator field in Factory does not make sense for all
the same reasons.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-02-03 11:40:29 -08:00
Sebastiaan van Stijn
7173c53906 Merge pull request #3350 from kolyshkin/mount-cmds
libct: Mount: rm {Pre,Post}mountCmds
2022-02-03 10:12:56 +01:00
Kir Kolyshkin
1b14d97484 libct/configs: rm Windows TODO
It's clear at this point that runc won't support Windows.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-01-26 18:59:47 -08:00
Kir Kolyshkin
0fec1c2d8c libct: Mount: rm {Pre,Post}mountCmds
Those were added by commit 59c5c3ac0 back in Apr 2015, but AFAICS were
never used and are obsoleted by more generic container hooks (initially
added by commit 05567f2c94 in Sep 2015).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-01-26 15:51:55 -08:00
Akihiro Suda
382eba4354 Support recursive mount attrs ("rro", "rnosuid", "rnodev", ...)
The new mount option "rro" makes the mount point recursively read-only,
by calling `mount_setattr(2)` with `MOUNT_ATTR_RDONLY` and `AT_RECURSIVE`.
https://man7.org/linux/man-pages/man2/mount_setattr.2.html

Requires kernel >= 5.12.

The "rro" option string conforms to the proposal in util-linux/util-linux Issue 1501.

Fix issue 2823

Similary, this commit also adds the following mount options:
- rrw
- r[no]{suid,dev,exec,relatime,atime,strictatime,diratime,symfollow}

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2021-12-07 17:39:57 +09:00
Aleksa Sarai
cdce249635 merge branch 'pr-3057'
Fraser Tweedale (1):
  chown cgroup to process uid in container namespace

LGTMs: kolyshkin cyphar
Closes #3057
2021-12-07 17:06:19 +11:00
Akihiro Suda
520702dac5 Add runc features command
Fix issue 3274

See `types/features/features.go`.

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2021-11-30 16:40:39 +09:00
Fraser Tweedale
35d20c4e0b chown cgroup to process uid in container namespace
Delegating cgroups to the container enables more complex workloads,
including systemd-based workloads.  The OCI runtime-spec was
recently updated to explicitly admit such delegation, through
specification of cgroup ownership semantics:

  https://github.com/opencontainers/runtime-spec/pull/1123

Pursuant to the updated OCI runtime-spec, change the ownership of
the container's cgroup directory and particular files therein, when
using cgroups v2 and when the cgroupfs is to be mounted read/write.

As a result of this change, systemd workloads can run in isolated
user namespaces on OpenShift when the sandbox's cgroupfs is mounted
read/write.

It might be possible to implement this feature in other cgroup
managers, but that work is deferred.

Signed-off-by: Fraser Tweedale <ftweedal@redhat.com>
2021-11-30 08:52:59 +10:00
Mengjiao Liu
a9bb11ec3c Fix the conversion of sysctl variable dots and slashes
Signed-off-by: Mengjiao Liu <mengjiao.liu@daocloud.io>
2021-11-04 11:45:15 +08:00