Commit Graph

123 Commits

Author SHA1 Message Date
Aleksa Sarai
cdce249635 merge branch 'pr-3057'
Fraser Tweedale (1):
  chown cgroup to process uid in container namespace

LGTMs: kolyshkin cyphar
Closes #3057
2021-12-07 17:06:19 +11:00
Akihiro Suda
520702dac5 Add runc features command
Fix issue 3274

See `types/features/features.go`.

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2021-11-30 16:40:39 +09:00
Fraser Tweedale
35d20c4e0b chown cgroup to process uid in container namespace
Delegating cgroups to the container enables more complex workloads,
including systemd-based workloads.  The OCI runtime-spec was
recently updated to explicitly admit such delegation, through
specification of cgroup ownership semantics:

  https://github.com/opencontainers/runtime-spec/pull/1123

Pursuant to the updated OCI runtime-spec, change the ownership of
the container's cgroup directory and particular files therein, when
using cgroups v2 and when the cgroupfs is to be mounted read/write.

As a result of this change, systemd workloads can run in isolated
user namespaces on OpenShift when the sandbox's cgroupfs is mounted
read/write.

It might be possible to implement this feature in other cgroup
managers, but that work is deferred.

Signed-off-by: Fraser Tweedale <ftweedal@redhat.com>
2021-11-30 08:52:59 +10:00
Aleksa Sarai
dde509df4e specconv: do not permit null bytes in mount fields
Using null bytes as control characters for sending strings via netlink
opens us up to a user explicitly putting a null byte in a mount string
(which JSON will happily let you do) and then causing us to open a mount
path different to the one expected.

In practice this is more of an issue in an environment such as
Kubernetes where you may have path-based access control policies (which
are more susceptible to these kinds of flaws).

Found by Google Project Zero.

Fixes: 9c444070ec ("Open bind mount sources from the host userns")
Reported-by: Felix Wilhelm <fwilhelm@google.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2021-11-19 11:41:05 +11:00
Kir Kolyshkin
643f8a2b40 libct/specconv: nits
1. Decapitalize errors.
2. Rename isValidName to checkPropertyName.
3. Make it return a specific error.

Suggested-by: Sebastiaan van Stijn <github@gone.nl>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-11-17 17:32:28 -08:00
Kir Kolyshkin
029b73c1b0 libct/spec: replace isValidName regex with a function
Also, add a simple test and a benchmark (just out of sheer curiosity).

Benchmark results:

name           old time/op  new time/op  delta
IsValidName-4   540ns ± 3%    45ns ± 1%  -91.76%  (p=0.008 n=5+5)

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-11-12 20:23:49 -08:00
Kir Kolyshkin
6907becaf9 libct/specconv: remove isSecSuffix regex
Commit 1cd71dfd7 added isSecSuffix, but the same thing can be done
easily without a regex. This is faster and saves some init time and
memory.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-11-12 20:23:47 -08:00
Kir Kolyshkin
37c5fd554e libct/specconv: make parseMountOptions return Mount
parseMountOption already returns way too many values, making the code
kind of hard to read.

Since all of the return values are used as is to populate the fields of
configs.Mount, let's change it to return (semi-)populated *configs.Mount
instead.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-11-12 20:23:30 -08:00
Kir Kolyshkin
2c3792baf9 libct/specconv: make mountFlags and extensionFlags global
This makes the repeated calls to parseMountOptions faster,
and decreases the amount of garbage to collect.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-11-12 20:16:04 -08:00
Kir Kolyshkin
81586e1935 libct/specconv: reuse mountPropagationMapping in parseMountOptions
These two maps are the same, except that mountPropagationMapping
has an extra element with key of "" and value of 0. Since the
code already checks for f != 0, this extra element is not a problem.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-11-12 20:15:54 -08:00
Kir Kolyshkin
8fe1e8bf8c libct/specconv: rm some init allocations
Eliminate some of these allocations when starting runc:

> init github.com/opencontainers/runc/libcontainer/specconv @10 ms, 0.11 ms clock, 5408 bytes, 70 allocs

Most of this (4K) is the two regexes, which are left intact for now.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-11-12 20:15:37 -08:00
Neil Johnson
2e0ceaa935 fix createDevices when no Linux section
Signed-off-by: Neil Johnson <najohnsn@us.ibm.com>
2021-10-04 17:37:19 -04:00
Kir Kolyshkin
097c6d7425 libct/cg: simplify getting cgroup manager
1. Make Rootless and Systemd flags part of config.Cgroups.

2. Make all cgroup managers (not just fs2) return error (so it can do
   more initialization -- added by the following commits).

3. Replace complicated cgroup manager instantiation in factory_linux
   by a single (and simple) libcontainer/cgroups/manager.New() function.

4. getUnifiedPath is simplified to check that only a single path is
   supplied (rather than checking that other paths, if supplied,
   are the same).

[v2: can't -> cannot]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-09-23 09:11:44 -07:00
Alban Crequy
2b025c0173 Implement Seccomp Notify
This commit implements support for the SCMP_ACT_NOTIFY action. It
requires libseccomp-2.5.0 to work but runc still works with older
libseccomp if the seccomp policy does not use the SCMP_ACT_NOTIFY
action.

A new synchronization step between runc[INIT] and runc run is introduced
to pass the seccomp fd. runc run fetches the seccomp fd with pidfd_get
from the runc[INIT] process and sends it to the seccomp agent using
SCM_RIGHTS.

As suggested by @kolyshkin, we also make writeSync() a wrapper of
writeSyncWithFd() and wrap the error there. To avoid pointless errors,
we made some existing code paths just return the error instead of
re-wrapping it. If we don't do it, error will look like:

	writing syncT <act>: writing syncT: <err>

By adjusting the code path, now they just look like this
	writing syncT <act>: <err>

Signed-off-by: Alban Crequy <alban@kinvolk.io>
Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>
2021-09-07 13:04:24 +02:00
Kir Kolyshkin
9ff64c3d97 *: rm redundant linux build tag
For files that end with _linux.go or _linux_test.go, there is no need to
specify linux build tag, as it is assumed from the file name.

In addition, rename libcontainer/notify_linux_v2.go -> libcontainer/notify_v2_linux.go
for the file name to make sense.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-08-30 20:15:00 -07:00
Kir Kolyshkin
3c7db3827c Merge pull request #2883 from flouthoc/master
Add support for rdma cgroup introduced in Linux Kernel 4.11
2021-08-30 20:02:04 -07:00
flouthoc
b3d14488b5 Add support for rdma cgroup introduced in Linux Kernel 4.11
Signed-off-by: Aditya Rajan <flouthoc.git@gmail.com>
2021-08-23 12:25:33 +05:30
Markus Lehtonen
17e3b41dd0 libcontainer/intelrdt: support ClosID parameter
Handle ClosID parameter of IntelRdt. Makes it possible to use
pre-configured classes/ClosIDs and avoid running out of available IDs
which easily happens with per-container classes.

Remove validator checks for empty L3CacheSchema and MemBwSchema fields
in order to be able to leave them empty, and only specify ClosID for
a pre-configured class.

Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
2021-08-09 15:58:03 +03:00
Kir Kolyshkin
82498e3d77 libct/specconf: remove unneeded checks
In cases we have something like

	if y != "" {
		x = y
	}

where both x and y are strings, and x was not set before,
it makes no sense to have a condition, as such code is
equivalent to mere

	x = y

Simplify such cases by removing "if".

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-28 12:45:28 -07:00
Kir Kolyshkin
7be93a66b9 *: fmt.Errorf: use %w when appropriate
This should result in no change when the error is printed, but make the
errors returned unwrappable, meaning errors.As and errors.Is will work.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-22 16:09:47 -07:00
Kir Kolyshkin
627a06ad92 Replace fmt.Errorf w/o %-style to errors.New
Using fmt.Errorf for errors that do not have %-style formatting
directives is an overkill. Switch to errors.New.

Found by

	git grep fmt.Errorf | grep -v ^vendor | grep -v '%'

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-22 11:42:07 -07:00
Sebastiaan van Stijn
b31a9340f9 libcontainer: relax validation for absolute paths
Commits 1f1e91b1a0 and 2192670a24
added validation for mountpoints to be an absolute path, to match the OCI
specs.

Unfortunately, the old behavior (accepting the path to be a relative path)
has been around for a long time, and although "not according to the spec",
various higher level runtimes rely on this behavior.

While higher level runtime have been updated to address this requirement,
there will be a transition period before all runtimes are updated to carry
these fixes.

This patch relaxes the validation, to generate a WARNING instead of failing,
allowing runtimes to update (but allowing them to update runc to the current
version, which includes security fixes).

We can remove this exception in a future patch release.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-06-09 13:20:28 +02:00
Kir Kolyshkin
e6048715e4 Use gofumpt to format code
gofumpt (mvdan.cc/gofumpt) is a fork of gofmt with stricter rules.

Brought to you by

	git ls-files \*.go | grep -v ^vendor/ | xargs gofumpt -s -w

Looking at the diff, all these changes make sense.

Also, replace gofmt with gofumpt in golangci.yml.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-01 12:17:27 -07:00
Giuseppe Scrivano
c61f606254 libcontainer: honor seccomp defaultErrnoRet
https://github.com/opencontainers/runtime-spec/pull/1087 added support
for defaultErrnoRet to the OCI runtime specs.

If a defaultErrnoRet is specified, disable patching the generated
libseccomp cBPF.

Closes: https://github.com/opencontainers/runc/issues/2943

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2021-05-17 09:23:32 +02:00
Kir Kolyshkin
1f1e91b1a0 libct/specconv: check mount destination is absolute
Per OCI runtime spec, mount destination MUST be absolute. Let's check
that and return an error if not.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-04-20 11:26:16 -07:00
Qiang Huang
2d38476c96 Merge pull request #2840 from kolyshkin/ignore-kmem
Ignore kernel memory settings
2021-04-13 09:44:14 +08:00
Kir Kolyshkin
52390d6804 Ignore kernel memory settings
This is somewhat radical approach to deal with kernel memory.

Per-cgroup kernel memory limiting was always problematic. A few
examples:

 - older kernels had bugs and were even oopsing sometimes (best example
   is RHEL7 kernel);
 - kernel is unable to reclaim the kernel memory so once the limit is
   hit a cgroup is toasted;
 - some kernel memory allocations don't allow failing.

In addition to that,

 - users don't have a clue about how to set kernel memory limits
   (as the concept is much more complicated than e.g. [user] memory);
 - different kernels might have different kernel memory usage,
   which is sort of unexpected;
 - cgroup v2 do not have a [dedicated] kmem limit knob, and thus
   runc silently ignores kernel memory limits for v2;
 - kernel v5.4 made cgroup v1 kmem.limit obsoleted (see
   https://github.com/torvalds/linux/commit/0158115f702b).

In view of all this, and as the runtime-spec lists memory.kernel
and memory.kernelTCP as OPTIONAL, let's ignore kernel memory
limits (for cgroup v1, same as we're already doing for v2).

This should result in less bugs and better user experience.

The only bad side effect from it might be that stat can show kernel
memory usage as 0 (since the accounting is not enabled).

[v2: add a warning in specconv that limits are ignored]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-04-12 12:18:11 -07:00
Kir Kolyshkin
27bb1bd5ea libct/specconv/CreateCgroupConfig: don't set c.Parent default
c.Parent is only used by systemd cgroup drivers, and both v1 and v2
drivers do have code to set the default if it is empty, so setting
it here is redundant.

In addition, in case of cgroup v2 rootless container setting it here
is harmful as the default should be user.slice not system.slice.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-04-01 19:50:37 -07:00
Kir Kolyshkin
f0dec0b4bf libct/specconv/CreateCgroupConfig: nit
Do not call libcontainerUtils.CleanPath in case its result will
not be used.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-04-01 19:33:01 -07:00
Iceber Gu
fa52df9493 libcontainer: fix the file mode of the device
Signed-off-by: Iceber Gu <wei.cai-nat@daocloud.io>
2021-02-17 15:08:22 +08:00
Aleksa Sarai
7a8d7162f9 seccomp: prepend -ENOSYS stub to all filters
Having -EPERM is the default was a fairly significant mistake from a
future-proofing standpoint in that it makes any new syscall return a
non-ignorable error (from glibc's point of view). We need to correct
this now because faccessat2(2) is something glibc critically needs to
have support for, but they're blocked on container runtimes because we
return -EPERM unconditionally (leading to confusion in glibc). This is
also a problem we're probably going to keep running into in the future.

Unfortunately there are several issues which stop us from having a clean
solution to this problem:

 1. libseccomp has several limitations which require us to emulate
    behaviour we want:

    a. We cannot do logic based on syscall number, meaning we cannot
       specify a "largest known syscall number";
    b. libseccomp doesn't know in which kernel version a syscall was
       added, and has no API for "minimum kernel version" so we cannot
       simply ask libseccomp to generate sane -ENOSYS rules for us.
    c. Additional seccomp rules for the same syscall are not treated as
       distinct rules -- if rules overlap, seccomp will merge them. This
       means we cannot add per-syscall -EPERM fallbacks;
    d. There is no inverse operation for SCMP_CMP_MASKED_EQ;
    e. libseccomp does not allow you to specify multiple rules for a
       single argument, making it impossible to invert OR rules for
       arguments.

 2. The runtime-spec does not have any way of specifying:

    a. The errno for the default action;
    b. The minimum kernel version or "newest syscall at time of profile
       creation"; nor
    c. Which syscalls were intentionally excluded from the allow list
       (weird syscalls that are no longer used were excluded entirely,
       but Docker et al expect those syscalls to get EPERM not ENOSYS).

 3. Certain syscalls should not return -ENOSYS (especially only for
    certain argument combinations) because this could also trigger glibc
    confusion. This means we have to return -EPERM for certain syscalls
    but not as a global default.

 4. There is not an obvious (and reasonable) upper limit to syscall
    numbers, so we cannot create a set of rules for each syscall above
    the largest syscall number in libseccomp. This means we must handle
    inverse rules as described below.

 5. Any syscall can be specified multiple times, which can make
    generation of hotfix rules much harder.

As a result, we have to work around all of these things by coming up
with a heuristic to stop the bleeding. In the future we could hopefully
improve the situation in the runtime-spec and libseccomp.

The solution applied here is to prepend a "stub" filter which returns
-ENOSYS if the requested syscall has a larger syscall number than any
syscall mentioned in the filter. The reason for this specific rule is
that syscall numbers are (roughly) allocated sequentially and thus newer
syscalls will (usually) have a larger syscall number -- thus causing our
filters to produce -ENOSYS if the filter was written before the syscall
existed.

Sadly this is not a perfect solution because syscalls can be added
out-of-order and the syscall table can contain holes for several
releases. Unfortuntely we do not have a nicer solution at the moment
because there is no library which provides information about which Linux
version a syscall was introduced in. Until that exists, this workaround
will have to be good enough.

The above behaviour only happens if the default action is a blocking
action (in other words it is not SCMP_ACT_LOG or SCMP_ACT_ALLOW). If the
default action is permissive then we don't do any patching.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2021-01-28 23:11:22 +11:00
Sebastiaan van Stijn
4fc2de77e9 libcontainer/devices: remove "Device" prefix from types
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-12-01 11:11:23 +01:00
Sebastiaan van Stijn
677baf22d2 libcontainer: isolate libcontainer/devices
Move the Device-related types to libcontainer/devices, so that
the package can be used in isolation. Aliases have been created
in libcontainer/configs for backward compatibility.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-12-01 11:11:21 +01:00
Ashok Pon Kumar
fcf210d631 Fix goreport warnings of ineffassign and misspell
Signed-off-by: Ashok Pon Kumar <ashokponkumar@gmail.com>
2020-10-02 09:03:45 +05:30
Kenta Tada
3d5dec2f44 libcontainer: remove the unused variable from spec
Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>
2020-10-01 12:29:48 +09:00
Sebastiaan van Stijn
8bf216728c use string-concatenation instead of sprintf for simple cases
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-09-30 10:51:59 +02:00
Kir Kolyshkin
b006f4a180 libct/cgroups: support Cgroups.Resources.Unified
Add support for unified resource map (as per [1]), and add some test
cases for the new functionality.

[1] https://github.com/opencontainers/runtime-spec/pull/1040

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-09-24 15:29:35 -07:00
Giuseppe Scrivano
a63f99fcc5 Add support for umask
Signed-off-by: Ashley Cui <acui@redhat.com>
2020-08-20 11:39:43 -04:00
Cesar Talledo
0709202da7 Remove runc default devices that overlap with spec devices.
Runc has a set of default devices that it includes in Linux containers
(e.g., /dev/null, /dev/random, /dev/tty, etc.)

However if the container's OCI spec includes all or a subset of those same devices,
runc is currently not detecting the redundancy, causing it to create a lib
container config that has redundant device configurations.

This causes a failure in rootless mode, in particular when the /dev/tty device
has a redundant config:

container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: rootfs_linux.go:70: creating device nodes caused: open /tmp/busyboxtest/rootfs/dev/tty: no such device or address"

The reason this fails in rootless mode only is that in this case runc sets up
/dev/tty not by doing mknod (it's not allowed within a user-ns) but rather by
creating a regular file under /dev/tty and bind-mounting the host's /dev/tty to
the container's /dev/tty. When this operation is done redundantly, it fails the
second time.

This change fixes this problem by ensuring runc checks for redundant devices
between the OCI spec it receives and the default devices it configures. If
a redundant device is detected, the OCI spec takes priority.

The change adds both a unit test and an integration test to verify the
behavior. Without this fix, this new integration test fails as shown above.

Signed-off-by: Cesar Talledo <ctalledo@nestybox.com>
2020-08-07 16:46:15 -07:00
Renaud Gaubert
ccdd75760c Add the CreateRuntime, CreateContainer and StartContainer Hooks
Signed-off-by: Renaud Gaubert <rgaubert@nvidia.com>
2020-06-17 02:10:00 +00:00
Kir Kolyshkin
4189cb65f8 cgroups: remove cgroup.Resources.CpuMax
This (and the converting function) is only used by one of the four
cgroup drivers. The other three do some checking and conversion in
place, so let the fs2 do the same.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-09 17:15:38 -07:00
Giuseppe Scrivano
41aa19662b libcontainer: honor seccomp errnoRet
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2020-05-20 09:11:55 +02:00
Aleksa Sarai
24388be71e configs: use different types for .Devices and .Resources.Devices
Making them the same type is simply confusing, but also means that you
could accidentally use one in the wrong context. This eliminates that
problem. This also includes a whole bunch of cleanups for the types
within DeviceRule, so that they can be used more ergonomically.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:38:45 +10:00
Aleksa Sarai
60e21ec26e specconv: remove default /dev/console access
/dev/console is a host resouce which gives a bunch of permissions that
we really shouldn't be giving to containers, not to mention that
/dev/console in containers is actually /dev/pts/$n. Drop this since
arguably this is a fairly scary thing to allow...

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:38:45 +10:00
Aleksa Sarai
b2bec9806f cgroup: devices: eradicate the Allow/Deny lists
These lists have been in the codebase for a very long time, and have
been unused for a large portion of that time -- specconv doesn't
generate them and the only user of these flags has been tests (which
doesn't inspire much confidence).

In addition, we had an incorrect implementation of a white-list policy.
This wasn't exploitable because all of our users explicitly specify
"deny all" as the first rule, but it was a pretty glaring issue that
came from the "feature" that users can select whether they prefer a
white- or black- list. Fix this by always writing a deny-all rule (which
is what our users were doing anyway, to work around this bug).

This is one of many changes needed to clean up the devices cgroup code.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:38:45 +10:00
Pradyumna Agrawal
4aa9101477 Honor spec.Process.NoNewPrivileges in specconv.CreateLibcontainerConfig
The change ensures that the passed in value of NoNewPrivileges under spec.Process
is reflected in the container config generated by specconv.CreateLibcontainerConfig

Closes #2397

Signed-off-by: Pradyumna Agrawal <pradyumnaa@vmware.com>
2020-05-11 13:38:14 -07:00
lifubang
d2a9c5da37 using default allowed devices when linux resources is null
Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-04-16 11:40:44 +08:00
Akihiro Suda
cc183ca662 Merge pull request #2242 from AkihiroSuda/vendor-systemd
vendor: update go-systemd and godbus
2020-03-25 02:40:22 +09:00
Akihiro Suda
492d525e55 vendor: update go-systemd and godbus
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-03-16 13:26:03 +09:00
Akihiro Suda
aa269315a4 cgroup2: add CpuMax conversion
Fix #2243

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-03-13 02:58:39 +09:00