Currently there's no way to distinguish between the two cases:
- runc exec failed;
- the command executed returned 1.
This was possible before commit 8477638aab, as runc exec exited with
the code of 255 if exec itself has failed. The code of 255 is the same
convention as used by e.g. ssh.
Re-introduce the feature, document it, and add some tests so it won't be
broken again.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This fixes using runc with podman on my system (Fedora 34).
> $ podman --runtime `pwd`/runc run --rm --memory 4M fedora echo it works
> Error: unable to start container process: error adding seccomp filter rule for syscall bdflush: permission denied: OCI permission denied
The problem is, libseccomp returns EPERM when a redundant rule (i.e. the
rule with the same action as the default one) is added, and podman (on
my machine) sets the following rules in config.json:
<....>
"seccomp": {
"defaultAction": "SCMP_ACT_ERRNO",
"architectures": [
"SCMP_ARCH_X86_64",
"SCMP_ARCH_X86",
"SCMP_ARCH_X32"
],
"syscalls": [
{
"names": [
"bdflush",
"io_pgetevents",
<....>
],
"action": "SCMP_ACT_ERRNO",
"errnoRet": 1
},
<....>
(Note that defaultErrnoRet is not set, but it defaults to 1).
With this commit, it works:
> $ podman --runtime `pwd`/runc run --memory 4M fedora echo it works
> it works
Add an integration test (that fails without the fix).
Similar crun commit:
* https://github.com/containers/crun/commit/08229f3fb904c5ea19a7d9
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
As Cirrus CI does not provide a real terminal this uses the same
'ssh -tt' workaround as the Vagrant setup. This sets up the
CentOS 7 and 8 to allow SSH as root to localhost so that we can run
all the tests via 'ssh -tt'.
Not going through vagrant reduces CI times for CentOS 7 and 8 from 6
minutes to 4 minutes.
Signed-off-by: Adrian Reber <areber@redhat.com>
Without this, the test case fails with
> Writing 1000000 to /sys/fs/cgroup/cpu,cpuacct/runc-cgroups-integration-test/cpu.rt_period_us
> /tmp/bats-run-106836/bats.116418.src: line 548: /sys/fs/cgroup/cpu,cpuacct/runc-cgroups-integration-test/cpu.rt_period_us: Permission denied
Since we do not currently have a setup to test this, this went
unnoticed (can be seen in RHEL8 though).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
Since commit f09a3e1b8d, the value passed on to read starts with
a slash, resulting in the first element of the array to be empty.
As a result, the test tries to write to the top-level cgroup, which
fails when rootless:
> # Writing 1000000 to /sys/fs/cgroup/cpu,cpuacct//cpu.rt_period_us
> # /tmp/bats-run-106184/bats.115768.src: line 548: /sys/fs/cgroup/cpu,cpuacct//cpu.rt_period_us: Permission denied
To fix, remove the leading slash.
An alternative fix would be to do "for ((i = 1;" instead of "i = 0", but
that seems less readable.
Fixes: f09a3e1b8d
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Run device update tests on cgroup v2, and add a test verifying that we
don't allow access to devices when we don't intend to.
Signed-off-by: Odin Ugedal <odin@uged.al>
Add CAP_SYSLOG to ensure that /dev/kmsg can be accesses on systems where
the sysctl kernel.dmesg_restrict = 1.
Signed-off-by: Odin Ugedal <odin@uged.al>
The test is failing like this:
not ok 70 runc run --no-pivot must not expose bare /proc
# (in test file tests/integration/no_pivot.bats, line 20)
# `[[ "$output" == *"mount: permission denied"* ]]' failed
# runc spec (status=0):
#
# runc run --no-pivot test_no_pivot (status=1):
# unshare: write error: Operation not permitted
Apparently, a recent kernel commit db2e718a47984b9d prevents
root from doing unshare -r unless it has CAP_SETFPCAP.
Add the capability for this specific test.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This removes libcontainer's own error wrapping system, consisting of a
few types and functions, aimed at typization, wrapping and unwrapping
of errors, as well as saving error stack traces.
Since Go 1.13 now provides its own error wrapping mechanism and a few
related functions, it makes sense to switch to it.
While doing that, improve some error messages so that they start
with "error", "unable to", or "can't".
A few things that are worth mentioning:
1. We lose stack traces (which were never shown anyway).
2. Users of libcontainer that relied on particular errors (like
ContainerNotExists) need to switch to using errors.Is with
the new errors defined in error.go.
3. encoding/json is unable to unmarshal the built-in error type,
so we have to introduce initError and wrap the errors into it
(basically passing the error as a string). This is the same
as it was before, just a tad simpler (actually the initError
is a type that got removed in commit afa844311; also suddenly
ierr variable name makes sense now).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. The meaning of SkipDevices is what it is -- do not set any
device-related options.
2. Reverts the part of commit 108ee85b82 which skipped the freeze
when the SkipDevices is set. Apparently, the freeze is needed on
update even if no Device* properties are being set.
3. Add "runc update" to "runc run [device cgroup deny]" test.
Fixes: 752e7a8249
Fixes: 108ee85b82
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
... and remove the one from tests/integration.
The idea is similar to the one for the test case being removed -- try
updating device rules many times to make sure we are not leaking eBPF
programs after every update/Set(). This is better though as we can
really change the device rules every time (which "runc update" can't)
and check that the rule is applied.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The "runc run [device cgroup allow rm block device]" test calls lsblk
three times to get device name, minor and major number. This creates a
potential problem when the devices are changed between the calls.
Simplify the code by using bash read together with IFS (as there's no
way to have lsblk output MAJOR:MINOR pair without a semicolon).
Note that head -n 1 is not needed as read already reads a single line.
[v2: don't use PATH as CentOS7's lsblk does not support it.]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This is to ensure that we aren't leaking eBPF programs after "runc
update". Unfortunately we cannot directly test the behaviour of cgroup
program updates in an integration test because "runc update" doesn't
support that behaviour at the moment.
So instead we rely on the fact that each "runc update" implicitly
triggers the devices rules to be updated. Without the previous patches
applied, this new test will fail with errors (on cgroupv2 systems).
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Kir Kolyshkin (8):
tests/int/helpers: rm old code
libct/cg/sd/v2: call initPath from Path
tests/int/update.bats: don't set cpuset in setup
tests/int/helpers: generalize require cgroups_freezer
tests/int: enable/use requires cgroups_<ctrl>
tests/int/cgroups: don't check for hugetlb
tests/int: run rootless_cgroup tests for v2+systemd
Vagrantfile.fedora: set Delegate=yes
LGTMs: AkihiroSuda cyphar
Closes#2944
Before this commit, "require rootless_cgroup" feature check required "cgroup"
to be present in ROOTLESS_FEATURES. The idea of the requirement, though, is
to ensure that rootless runc can manage cgroups.
In case of systemd + cgroup v2, rootless runc can manage cgroups,
thanks to systemd delegation, so modify the feature check accordingly.
Next, convert (simplify) some of the existing users to the modified check.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Systemd is not able to delegate hugetlb controller, and it is needed for
cgroup v2 + systemd + rootless case (which is currently skipped because
of "requires rootless_cgroup", but will be enabled by a later commit).
The failure being fixed looks like this:
> not ok 4 runc create (limits + cgrouppath + permission on the cgroup dir) succeeds
> # (from function `check_cgroup_value' in file /vagrant/tests/integration/helpers.bash, line 188,
> # in test file /vagrant/tests/integration/cgroups.bats, line 53)
> # `check_cgroup_value "cgroup.controllers" "$(cat /sys/fs/cgroup/user.slice/user-$(id -u).slice/cgroup.controllers)"' failed
> # <....>
> # /sys/fs/cgroup/user.slice/user-2000.slice/user@2000.service/machine.slice/runc-cgroups-integration-test-20341.scope/cgroup.controllers
> # current cpuset cpu io memory pids !? cpuset cpu io memory hugetlb pids
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. In case of cgroup v2 + systemd + rootless, get the list of controllers from
the own cgroup, rather than from the root one (as systemd cgroup delegation
is required in this case, and it might not be set or fully working).
2. Use "requires cgroups_<controller>" in tests that need those
controllers.
Tested on Fedora 31 (which does not support delegation of cpuset controller).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Setting cpuset.cpus requires cpuset controller, which might
not be available in case of cgroup v2 + systemd < 244.
Rather than require cpuset support from every test case, do not set the initial
cpuset value from setup(), and do not check it.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This was added by commit ca4f427af, together with set_resources_limit,
and was needed for the latter, since sed was used to update resources.
Now when we switched to jq in commit 79fe41d3c1, this kludge is no
longer needed.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
runc resolves symlink before doing bind mount. So
we should save original path while formatting CriuReq for
checkpoint.
Signed-off-by: Liu Hua <weldonliu@tencent.com>
`--parent-path` needs to be relative path of `--image-path`,
and points to previous checkpoint directory.
Signed-off-by: Liu Hua <weldonliu@tencent.com>
This checks that in-container view of /sys/fs/cgroup does not
contain any extra cgroups (which was the case for rootless
before the previous commit).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Criu creates symlink named "parent" under image-path while doing
interactive checkpoint, without checking wether symlink points to
right place. This patch corrects related test case.
Signed-off-by: Liu Hua <weldonliu@tencent.com>
For the fix, see previous commit. Without the fix, this test case fails:
> container_linux.go:380: starting container process caused:
> process_linux.go:545: container init caused: readonly path /proc/bus:
> operation not permitted
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This is somewhat radical approach to deal with kernel memory.
Per-cgroup kernel memory limiting was always problematic. A few
examples:
- older kernels had bugs and were even oopsing sometimes (best example
is RHEL7 kernel);
- kernel is unable to reclaim the kernel memory so once the limit is
hit a cgroup is toasted;
- some kernel memory allocations don't allow failing.
In addition to that,
- users don't have a clue about how to set kernel memory limits
(as the concept is much more complicated than e.g. [user] memory);
- different kernels might have different kernel memory usage,
which is sort of unexpected;
- cgroup v2 do not have a [dedicated] kmem limit knob, and thus
runc silently ignores kernel memory limits for v2;
- kernel v5.4 made cgroup v1 kmem.limit obsoleted (see
https://github.com/torvalds/linux/commit/0158115f702b).
In view of all this, and as the runtime-spec lists memory.kernel
and memory.kernelTCP as OPTIONAL, let's ignore kernel memory
limits (for cgroup v1, same as we're already doing for v2).
This should result in less bugs and better user experience.
The only bad side effect from it might be that stat can show kernel
memory usage as 0 (since the accounting is not enabled).
[v2: add a warning in specconv that limits are ignored]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In order to make 'runc --debug' actually useful for debugging nsexec
bugs, provide information about all the internal operations when in
debug mode.
[@kolyshkin: rebasing; fix formatting via indent for make validate to pass]
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Alas, the EPERM on chdir saga continues...
Unfortunately, the there were two releases between when 5e0e67d76c was released
and when the workaround https://github.com/opencontainers/runc/pull/2712 was added.
Between this, folks started relying on the ability to have a workdir that the container user doesn't have access to.
Since this case was previously valid, we should continue support for it.
Now, we retry the chdir:
Once at the top of the function (to catch cases where the runc user has access, but container user does not)
and once after we setup user (to catch cases where the container user has access, and the runc user does not)
Add a test case for this as well.
Signed-off-by: Peter Hunt <pehunt@redhat.com>
"checkpoint --lazy-pages and restore" test sometimes fails on restore
in our CI on Fedora 33 when systemd cgroup driver is used:
> (00.076104) Error (compel/src/lib/infect.c:1513): Task 48521 is in unexpected state: f7f
> (00.076122) Error (compel/src/lib/infect.c:1520): Task stopped with 15: Terminated
> ...
> (00.078246) Error (criu/cr-restore.c:2483): Restoring FAILED.
I think what happens is
1. The test runs runc checkpoint in lazy-pages mode in background.
2. The test runs criu lazy-pages in background.
3. The test runs runc restore.
Now, all three are working in together: criu restore restores, criu
lazy-pages listens for page faults on a uffd and fetch missing pages
from runc checkpoint, who serves those pages.
At some point criu lazy-pages decides to fetch the rest of the pages,
and once it's done it exits, and runc checkpoint, as there are no more
pages to serve, exits too.
At the end of runc checkpoint the container is removed (see "defer
destroy(container)" in checkpoint.go. This involves a call to
cgroupManager.Destroy, which, in case systemd manager is used,
calls stopUnit, which makes systemd to not just remove the unit,
but also send SIGTERM to its processes, if there are any.
As the container is being restored into the same systemd unit,
sometimes this results in sending SIGTERM to a process which
criu restores, and thus restoring fails.
The remedy here is to change the name of systemd unit to which the
container is restored.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In check_pipes, make sure we
- close all fds we opened in setup_pipes;
- check that runc stderr is empty (and fail if it's not).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit 41670e21f0 added some randomization to cgroup paths
and (if systemd cgroup driver is used) systemd unit names,
but the randomization was per bats instance, not per test.
Fix this by refactoring init_cgroups_path/set_cgroups_path
(moving variable/random part to set_cgroups_path).
NOTE though that the randomization is only performed for those tests
that explicitly call set_cgroups_path.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit 41670e21f removed BUSYBOX_BUNDLE env var, but c3ffd2ef81
was developed before 41670e21f was merged.
Everything still works because now BUSYBOX_BUNDLE has no value.
Nevertheless, let's remove it to avoid confusion.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Helper function init_cgroup_paths sets two sets of cgroup path variables
for cgroup v1 case (below XXX is cgroup controller name, e.g. MEMORY):
1. CGROUP_XXX_BASE_PATH -- path to XXX controller mount point
(e.g. CGROUP_MEMORY_BASE_PATH=/sys/fs/cgroup/memory);
2. CGROUP_XXX -- path to the particular container XXX controller cgroup
(e.g. CGROUP_MEMORY=/sys/fs/cgroup/memory/runc-cgroups-integration-test/test-cgroup).
The second set of variables is mostly used by check_cgroup_value(),
with only two exceptions:
- CGROUP_CPU in @test "update rt period and runtime";
- few CGROUP_XXX in @test "runc delete --force in cgroupv1 with
subcgroups".
Remove these variables, as their values are not used much
and are easy to get (as can be seen in modified test cases).
While at it, mark some variables as local.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>