Commit Graph

376 Commits

Author SHA1 Message Date
Mrunal Patel
8772c4dd2f Merge pull request #3109 from kolyshkin/seccomp
seccomp: skip redundant rules
2021-08-03 23:09:00 -04:00
Kir Kolyshkin
60e02b4b25 runc exec: fail with exit code of 255
Currently there's no way to distinguish between the two cases:
 - runc exec failed;
 - the command executed returned 1.

This was possible before commit 8477638aab, as runc exec exited with
the code of 255 if exec itself has failed. The code of 255 is the same
convention as used by e.g. ssh.

Re-introduce the feature, document it, and add some tests so it won't be
broken again.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-07-28 13:46:13 -07:00
Kir Kolyshkin
5dd92fd9b4 libct/seccomp: skip redundant rules
This fixes using runc with podman on my system (Fedora 34).

> $ podman --runtime `pwd`/runc run --rm --memory 4M fedora echo it works
> Error: unable to start container process: error adding seccomp filter rule for syscall bdflush: permission denied: OCI permission denied

The problem is, libseccomp returns EPERM when a redundant rule (i.e. the
rule with the same action as the default one) is added, and podman (on
my machine) sets the following rules in config.json:

    <....>
    "seccomp": {
      "defaultAction": "SCMP_ACT_ERRNO",
      "architectures": [
        "SCMP_ARCH_X86_64",
        "SCMP_ARCH_X86",
        "SCMP_ARCH_X32"
      ],
      "syscalls": [
        {
          "names": [
            "bdflush",
            "io_pgetevents",
            <....>
          ],
          "action": "SCMP_ACT_ERRNO",
          "errnoRet": 1
        },
        <....>

(Note that defaultErrnoRet is not set, but it defaults to 1).

With this commit, it works:

> $ podman --runtime `pwd`/runc run --memory 4M fedora echo it works
> it works

Add an integration test (that fails without the fix).

Similar crun commit:
 * https://github.com/containers/crun/commit/08229f3fb904c5ea19a7d9

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-07-27 00:04:59 -07:00
Adrian Reber
9f656dbb11 Do not use Vagrant for CentOS 7/8
As Cirrus CI does not provide a real terminal this uses the same
'ssh -tt' workaround as the Vagrant setup. This sets up the
CentOS 7 and 8 to allow SSH as root to localhost so that we can run
all the tests via 'ssh -tt'.

Not going through vagrant reduces CI times for CentOS 7 and 8 from 6
minutes to 4 minutes.

Signed-off-by: Adrian Reber <areber@redhat.com>
2021-07-23 09:23:23 +02:00
Kir Kolyshkin
d448016486 tests/rootless.sh: fixup for "update rt" test
Without this, the test case fails with

> Writing 1000000 to /sys/fs/cgroup/cpu,cpuacct/runc-cgroups-integration-test/cpu.rt_period_us
> /tmp/bats-run-106836/bats.116418.src: line 548: /sys/fs/cgroup/cpu,cpuacct/runc-cgroups-integration-test/cpu.rt_period_us: Permission denied

Since we do not currently have a setup to test this, this went
unnoticed (can be seen in RHEL8 though).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
2021-07-23 09:23:23 +02:00
Kir Kolyshkin
86af524866 tests/int: fix "update rt period and runtime" for rootless
Since commit f09a3e1b8d, the value passed on to read starts with
a slash, resulting in the first element of the array to be empty.

As a result, the test tries to write to the top-level cgroup, which
fails when rootless:

> # Writing 1000000 to /sys/fs/cgroup/cpu,cpuacct//cpu.rt_period_us
> # /tmp/bats-run-106184/bats.115768.src: line 548: /sys/fs/cgroup/cpu,cpuacct//cpu.rt_period_us: Permission denied

To fix, remove the leading slash.

An alternative fix would be to do "for ((i = 1;" instead of "i = 0", but
that seems less readable.

Fixes: f09a3e1b8d
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-07-23 09:23:23 +02:00
Aleksa Sarai
2c01cec8ac merge branch 'pr-3067'
Odin Ugedal (2):
  libct/cg/sd: Don't freeze cgroup on cgroup v2 Set
  Update device update tests

LGTMs: kolyshkin mrunalp cyphar
Closes #3067
2021-07-13 12:59:29 +10:00
Odin Ugedal
d41a273dae Update device update tests
Run device update tests on cgroup v2, and add a test verifying that we
don't allow access to devices when we don't intend to.

Signed-off-by: Odin Ugedal <odin@uged.al>
2021-07-07 22:44:08 +02:00
Odin Ugedal
6be088d69d tests/int/dev: add CAP_SYSLOG to /dev/kmsg tests
Add CAP_SYSLOG to ensure that /dev/kmsg can be accesses on systems where
the sysctl kernel.dmesg_restrict = 1.

Signed-off-by: Odin Ugedal <odin@uged.al>
2021-07-07 15:44:16 +02:00
Akihiro Suda
5547b5774f Merge pull request #3033 from kolyshkin/rm-own-errors
libcontainer: rm own error system
2021-07-01 13:47:27 +09:00
Kir Kolyshkin
1bbeadae72 tests/int/no_pivot: fix for new kernels
The test is failing like this:

	not ok 70 runc run --no-pivot must not expose bare /proc
	# (in test file tests/integration/no_pivot.bats, line 20)
	#   `[[ "$output" == *"mount: permission denied"* ]]' failed
	# runc spec (status=0):
	#
	# runc run --no-pivot test_no_pivot (status=1):
	# unshare: write error: Operation not permitted

Apparently, a recent kernel commit db2e718a47984b9d prevents
root from doing unshare -r unless it has CAP_SETFPCAP.

Add the capability for this specific test.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-29 13:31:54 -07:00
Kir Kolyshkin
70fdc0573d Revert "checkpoint: resolve symlink for external bind mount"
This reverts commit da22625f69
(PR 2902).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-24 20:52:43 -07:00
Mrunal Patel
1079288bef Merge pull request #2902 from liusdu/checkpoint
checkpoint: resolve symlink for external bind mount
2021-06-24 22:52:02 -04:00
Kir Kolyshkin
e918d02139 libcontainer: rm own error system
This removes libcontainer's own error wrapping system, consisting of a
few types and functions, aimed at typization, wrapping and unwrapping
of errors, as well as saving error stack traces.

Since Go 1.13 now provides its own error wrapping mechanism and a few
related functions, it makes sense to switch to it.

While doing that, improve some error messages so that they start
with "error", "unable to", or "can't".

A few things that are worth mentioning:

1. We lose stack traces (which were never shown anyway).

2. Users of libcontainer that relied on particular errors (like
   ContainerNotExists) need to switch to using errors.Is with
   the new errors defined in error.go.

3. encoding/json is unable to unmarshal the built-in error type,
   so we have to introduce initError and wrap the errors into it
   (basically passing the error as a string). This is the same
   as it was before, just a tad simpler (actually the initError
   is a type that got removed in commit afa844311; also suddenly
   ierr variable name makes sense now).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-24 10:21:04 -07:00
Kir Kolyshkin
2916817278 tests/int/cgroups: add test for bfq per-device weight
This works for both cgroup v1 and v2.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-16 04:51:40 -07:00
Mrunal Patel
a3813ca249 Merge pull request #3000 from kolyshkin/test-dev-update
libct/int: add device update test
2021-06-11 14:36:49 -04:00
Kir Kolyshkin
bd8e070109 libct/cg/sd: fix "SkipDevices" handling
1. The meaning of SkipDevices is what it is -- do not set any
   device-related options.

2. Reverts the part of commit 108ee85b82 which skipped the freeze
   when the SkipDevices is set. Apparently, the freeze is needed on
   update even if no Device* properties are being set.

3. Add "runc update" to "runc run [device cgroup deny]" test.

Fixes: 752e7a8249
Fixes: 108ee85b82
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-10 12:01:21 -07:00
Kir Kolyshkin
508f5bf665 libct/int: add device update test
... and remove the one from tests/integration.

The idea is similar to the one for the test case being removed -- try
updating device rules many times to make sure we are not leaking eBPF
programs after every update/Set(). This is better though as we can
really change the device rules every time (which "runc update" can't)
and check that the rule is applied.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-05 16:11:29 -07:00
Kir Kolyshkin
f5a2c9cced tests/int/dev: only call lsblk once
The "runc run [device cgroup allow rm block device]" test calls lsblk
three times to get device name, minor and major number. This creates a
potential problem when the devices are changed between the calls.

Simplify the code by using bash read together with IFS (as there's no
way to have lsblk output MAJOR:MINOR pair without a semicolon).

Note that head -n 1 is not needed as read already reads a single line.

[v2: don't use PATH as CentOS7's lsblk does not support it.]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-05-25 12:21:42 -07:00
Aleksa Sarai
00119c85a9 integration: add repeated "runc update" test
This is to ensure that we aren't leaking eBPF programs after "runc
update". Unfortunately we cannot directly test the behaviour of cgroup
program updates in an integration test because "runc update" doesn't
support that behaviour at the moment.

So instead we rely on the fact that each "runc update" implicitly
triggers the devices rules to be updated. Without the previous patches
applied, this new test will fail with errors (on cgroupv2 systems).

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2021-05-23 17:55:24 +10:00
Giuseppe Scrivano
c61f606254 libcontainer: honor seccomp defaultErrnoRet
https://github.com/opencontainers/runtime-spec/pull/1087 added support
for defaultErrnoRet to the OCI runtime specs.

If a defaultErrnoRet is specified, disable patching the generated
libseccomp cBPF.

Closes: https://github.com/opencontainers/runc/issues/2943

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2021-05-17 09:23:32 +02:00
Aleksa Sarai
62250c13fe merge branch 'pr-2944'
Kir Kolyshkin (8):
  tests/int/helpers: rm old code
  libct/cg/sd/v2: call initPath from Path
  tests/int/update.bats: don't set cpuset in setup
  tests/int/helpers: generalize require cgroups_freezer
  tests/int: enable/use requires cgroups_<ctrl>
  tests/int/cgroups: don't check for hugetlb
  tests/int: run rootless_cgroup tests for v2+systemd
  Vagrantfile.fedora: set Delegate=yes

LGTMs: AkihiroSuda cyphar
Closes #2944
2021-05-10 11:16:11 +10:00
Kir Kolyshkin
ac70a9a1b2 tests/int: run rootless_cgroup tests for v2+systemd
Before this commit, "require rootless_cgroup" feature check required "cgroup"
to be present in ROOTLESS_FEATURES. The idea of the requirement, though, is
to ensure that rootless runc can manage cgroups.

In case of systemd + cgroup v2, rootless runc can manage cgroups,
thanks to systemd delegation, so modify the feature check accordingly.

Next, convert (simplify) some of the existing users to the modified check.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-05-07 22:07:27 -07:00
Kir Kolyshkin
601cf5825f tests/int/cgroups: don't check for hugetlb
Systemd is not able to delegate hugetlb controller, and it is needed for
cgroup v2 + systemd + rootless case (which is currently skipped because
of "requires rootless_cgroup", but will be enabled by a later commit).

The failure being fixed looks like this:

> not ok 4 runc create (limits + cgrouppath + permission on the cgroup dir) succeeds
> # (from function `check_cgroup_value' in file /vagrant/tests/integration/helpers.bash, line 188,
> #  in test file /vagrant/tests/integration/cgroups.bats, line 53)
> #   `check_cgroup_value "cgroup.controllers" "$(cat /sys/fs/cgroup/user.slice/user-$(id -u).slice/cgroup.controllers)"' failed
> # <....>
> # /sys/fs/cgroup/user.slice/user-2000.slice/user@2000.service/machine.slice/runc-cgroups-integration-test-20341.scope/cgroup.controllers
> # current cpuset cpu io memory pids !? cpuset cpu io memory hugetlb pids

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-05-07 22:07:25 -07:00
Kir Kolyshkin
40b9791916 tests/int: enable/use requires cgroups_<ctrl>
1. In case of cgroup v2 + systemd + rootless, get the list of controllers from
   the own cgroup, rather than from the root one (as systemd cgroup delegation
   is required in this case, and it might not be set or fully working).

2. Use "requires cgroups_<controller>" in tests that need those
   controllers.

Tested on Fedora 31 (which does not support delegation of cpuset controller).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-05-07 22:07:08 -07:00
Kir Kolyshkin
44fcbfd68f tests/int/helpers: generalize require cgroups_freezer
With this, something like cgroups_pids can be used (to be added
by the next commit).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-05-07 22:06:13 -07:00
Kir Kolyshkin
353f2ad193 tests/int/update.bats: don't set cpuset in setup
Setting cpuset.cpus requires cpuset controller, which might
not be available in case of cgroup v2 + systemd < 244.

Rather than require cpuset support from every test case, do not set the initial
cpuset value from setup(), and do not check it.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-05-07 22:06:08 -07:00
Kir Kolyshkin
0ed1f8022d tests/int/helpers: rm old code
This was added by commit ca4f427af, together with set_resources_limit,
and was needed for the latter, since sed was used to update resources.

Now when we switched to jq in commit 79fe41d3c1, this kludge is no
longer needed.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-05-07 22:05:49 -07:00
Qiang Huang
fc8a0d97a9 Merge pull request #2913 from liusdu/interative
tiny fix iterative checkpoint test case
2021-05-08 10:16:30 +08:00
Liu Hua
da22625f69 checkpoint: resolve symlink for external bind mount
runc resolves symlink before doing bind mount. So
we should save original path while formatting CriuReq for
checkpoint.

Signed-off-by: Liu Hua <weldonliu@tencent.com>
2021-04-25 09:50:00 +08:00
Liu Hua
23e3794d9c checkpoint: validate parent path
`--parent-path` needs to be relative path of `--image-path`,
and points to previous checkpoint directory.

Signed-off-by: Liu Hua <weldonliu@tencent.com>
2021-04-23 13:20:08 +08:00
Mrunal Patel
42a18e7f02 Merge pull request #2818 from kolyshkin/rootless-cgroup2-mount
Fix cgroup2 mount for rootless case
2021-04-22 17:04:19 -07:00
Mrunal Patel
b69bd53648 Merge pull request #2869 from dqminh/fix-oss-fuzz
Fix oss-fuzz build
2021-04-20 19:07:52 -07:00
Kir Kolyshkin
0216716c18 tests/int: add a case for cgroupv2 mount
This checks that in-container view of /sys/fs/cgroup does not
contain any extra cgroups (which was the case for rootless
before the previous commit).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-04-20 12:37:54 -07:00
Kir Kolyshkin
5ffcc5683e tests/int: use bfq test with rootless
As the rootless cgroup2 mount is now fixed, we can enable this test
for rootless as well.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-04-20 12:37:39 -07:00
Liu Hua
aa62272382 tiny fix iterative checkpoint test case
Criu creates symlink named "parent" under image-path while doing
interactive checkpoint, without checking wether symlink points to
right place. This patch corrects related test case.

Signed-off-by: Liu Hua <weldonliu@tencent.com>
2021-04-19 20:04:52 +08:00
Kir Kolyshkin
31dd1e499b tests/int: add rootless + host pidns test case
For the fix, see previous commit. Without the fix, this test case fails:

> container_linux.go:380: starting container process caused:
> process_linux.go:545: container init caused: readonly path /proc/bus:
> operation not permitted

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-04-14 17:35:15 -07:00
Qiang Huang
2d38476c96 Merge pull request #2840 from kolyshkin/ignore-kmem
Ignore kernel memory settings
2021-04-13 09:44:14 +08:00
Kir Kolyshkin
52390d6804 Ignore kernel memory settings
This is somewhat radical approach to deal with kernel memory.

Per-cgroup kernel memory limiting was always problematic. A few
examples:

 - older kernels had bugs and were even oopsing sometimes (best example
   is RHEL7 kernel);
 - kernel is unable to reclaim the kernel memory so once the limit is
   hit a cgroup is toasted;
 - some kernel memory allocations don't allow failing.

In addition to that,

 - users don't have a clue about how to set kernel memory limits
   (as the concept is much more complicated than e.g. [user] memory);
 - different kernels might have different kernel memory usage,
   which is sort of unexpected;
 - cgroup v2 do not have a [dedicated] kmem limit knob, and thus
   runc silently ignores kernel memory limits for v2;
 - kernel v5.4 made cgroup v1 kmem.limit obsoleted (see
   https://github.com/torvalds/linux/commit/0158115f702b).

In view of all this, and as the runtime-spec lists memory.kernel
and memory.kernelTCP as OPTIONAL, let's ignore kernel memory
limits (for cgroup v1, same as we're already doing for v2).

This should result in less bugs and better user experience.

The only bad side effect from it might be that stat can show kernel
memory usage as 0 (since the accounting is not enabled).

[v2: add a warning in specconv that limits are ignored]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-04-12 12:18:11 -07:00
Aleksa Sarai
64bb59f592 nsenter: improve debug logging
In order to make 'runc --debug' actually useful for debugging nsexec
bugs, provide information about all the internal operations when in
debug mode.

[@kolyshkin: rebasing; fix formatting via indent for make validate to pass]

Signed-off-by: Aleksa Sarai <asarai@suse.de>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-04-08 09:54:43 -07:00
Peter Hunt
6ce2d63a5d libct/init_linux: retry chdir to fix EPERM
Alas, the EPERM on chdir saga continues...

Unfortunately, the there were two releases between when 5e0e67d76c  was released
and when the workaround https://github.com/opencontainers/runc/pull/2712 was added.

Between this, folks started relying on the ability to have a workdir that the container user doesn't have access to.

Since this case was previously valid, we should continue support for it.

Now, we retry the chdir:
Once at the top of the function (to catch cases where the runc user has access, but container user does not)
and once after we setup user (to catch cases where the container user has access, and the runc user does not)

Add a test case for this as well.

Signed-off-by: Peter Hunt <pehunt@redhat.com>
2021-04-06 14:51:43 -04:00
Akihiro Suda
c453f1a523 Merge pull request #2881 from kolyshkin/test-rand-cg
tests/int: some refactoring, fix a flake
2021-04-06 13:27:19 +09:00
Shengjing Zhu
c5029c001d tests: fix hello-world tarball name in testdata for arm64
Signed-off-by: Shengjing Zhu <zhsj@debian.org>
2021-04-05 23:55:27 +08:00
Kir Kolyshkin
36fe3cc28c tests/int/cpt: fix lazy-pages flakiness
"checkpoint --lazy-pages and restore" test sometimes fails on restore
in our CI on Fedora 33 when systemd cgroup driver is used:

> (00.076104) Error (compel/src/lib/infect.c:1513): Task 48521 is in unexpected state: f7f
> (00.076122) Error (compel/src/lib/infect.c:1520): Task stopped with 15: Terminated
> ...
> (00.078246) Error (criu/cr-restore.c:2483): Restoring FAILED.

I think what happens is

1. The test runs runc checkpoint in lazy-pages mode in background.
2. The test runs criu lazy-pages in background.
3. The test runs runc restore.

Now, all three are working in together: criu restore restores, criu
lazy-pages listens for page faults on a uffd and fetch missing pages
from runc checkpoint, who serves those pages.

At some point criu lazy-pages decides to fetch the rest of the pages,
and once it's done it exits, and runc checkpoint, as there are no more
pages to serve, exits too.

At the end of runc checkpoint the container is removed (see "defer
destroy(container)" in checkpoint.go. This involves a call to
cgroupManager.Destroy, which, in case systemd manager is used,
calls stopUnit, which makes systemd to not just remove the unit,
but also send SIGTERM to its processes, if there are any.

As the container is being restored into the same systemd unit,
sometimes this results in sending SIGTERM to a process which
criu restores, and thus restoring fails.

The remedy here is to change the name of systemd unit to which the
container is restored.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-04-01 12:57:53 -07:00
Kir Kolyshkin
0e0890027c tests/int/checkpoint: close lazy_r fd
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-04-01 12:57:32 -07:00
Kir Kolyshkin
b09030a5f4 tests/int/checkpoint: close fds in check_pipes
In check_pipes, make sure we
 - close all fds we opened in setup_pipes;
 - check that runc stderr is empty (and fail if it's not).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-04-01 12:57:27 -07:00
Kir Kolyshkin
e63df1e6f0 tests/int: really randomize cgroup/unit names
Commit 41670e21f0 added some randomization to cgroup paths
and (if systemd cgroup driver is used) systemd unit names,
but the randomization was per bats instance, not per test.

Fix this by refactoring init_cgroups_path/set_cgroups_path
(moving variable/random part to set_cgroups_path).

NOTE though that the randomization is only performed for those tests
that explicitly call set_cgroups_path.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-04-01 12:19:22 -07:00
Kir Kolyshkin
6e4c5b6e85 tests/int/cgroups: don't use BUSYBOX_BUNDLE
Commit 41670e21f removed BUSYBOX_BUNDLE env var, but c3ffd2ef81
was developed before 41670e21f was merged.

Everything still works because now BUSYBOX_BUNDLE has no value.
Nevertheless, let's remove it to avoid confusion.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-04-01 11:44:53 -07:00
Kir Kolyshkin
bed4d89f57 Merge pull request #2807 from kolyshkin/google-golang-protobuf
go.mod, libct: switch to google.golang.org/protobuf
2021-03-31 20:34:16 -07:00
Kir Kolyshkin
f09a3e1b8d tests/int: don't set/use CGROUP_XXX variables
Helper function init_cgroup_paths sets two sets of cgroup path variables
for cgroup v1 case (below XXX is cgroup controller name, e.g. MEMORY):

1. CGROUP_XXX_BASE_PATH -- path to XXX controller mount point
   (e.g. CGROUP_MEMORY_BASE_PATH=/sys/fs/cgroup/memory);

2. CGROUP_XXX -- path to the particular container XXX controller cgroup
   (e.g. CGROUP_MEMORY=/sys/fs/cgroup/memory/runc-cgroups-integration-test/test-cgroup).

The second set of variables is mostly used by check_cgroup_value(),
with only two exceptions:
 - CGROUP_CPU in @test "update rt period and runtime";
 - few CGROUP_XXX in @test "runc delete --force in cgroupv1 with
   subcgroups".

Remove these variables, as their values are not used much
and are easy to get (as can be seen in modified test cases).

While at it, mark some variables as local.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-03-31 20:24:46 -07:00