Instead of generating a list of tmpfs mounts and having a special function
to check whether the path is in the list, let's go over the list of
mounts directly. This simplifies the code and improves readability.
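
As a rough illustration of the idea (not the actual runc code), the check
can walk the mount list directly. The Mount type, the underTmpfs name, and
the matching rule (path at or below a tmpfs mount point) are all
assumptions made for this sketch:

```go
package main

import (
	"fmt"
	"strings"
)

// Mount is a hypothetical, simplified stand-in for a container mount entry.
type Mount struct {
	Destination string
	Device      string
}

// underTmpfs reports whether path is at or below any tmpfs mount in mounts.
// It iterates over the mounts directly, so no precomputed list of tmpfs
// destinations and no separate lookup helper are needed.
func underTmpfs(mounts []Mount, path string) bool {
	for _, m := range mounts {
		if m.Device != "tmpfs" {
			continue
		}
		prefix := strings.TrimSuffix(m.Destination, "/") + "/"
		if path == m.Destination || strings.HasPrefix(path, prefix) {
			return true
		}
	}
	return false
}

func main() {
	mounts := []Mount{
		{Destination: "/proc", Device: "proc"},
		{Destination: "/run", Device: "tmpfs"},
	}
	fmt.Println(underTmpfs(mounts, "/run/lock"))
	fmt.Println(underTmpfs(mounts, "/runner"))
}
```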
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit ce3cd4234c)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Since its code is now trivial, and it is only called from a single
place, it does not make sense to have it as a separate function.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit f91fbd34d9)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
It makes sense to ignore cgroup mounts much earlier in the code,
saving some time on unnecessary operations.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit b8aa5481db)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. Replace the big "if !" block with the if block and continue,
simplifying the code flow.
2. Move comments closer to the code, improving readability.
This commit is best reviewed with --ignore-all-space or similar.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit 0c93d41c65)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In certain deployments, it's possible for runc to be spawned by a
process with a restrictive cpumask (such as from a systemd unit with
CPUAffinity=... configured) which will be inherited by runc and thus the
container process by default.
The cpuset cgroup used to reconfigure the cpumask automatically for
joining processes, but kernel commit da019032819a ("sched: Enforce user
requested affinity") changed this behaviour in Linux 6.2.
The solution is to try to emulate the expected behaviour by resetting
our cpumask to correspond with the configured cpuset (in the case of
"runc exec", if the user did not configure an alternative one). Normally
we would have to parse /proc/stat and /sys/fs/cgroup, but luckily
sched_setaffinity(2) will transparently convert an all-set cpumask (even
if it has more entries than the number of CPUs on the system) to the
correct value for our usecase.
For some reason, in our CI it seems that rootless --systemd-cgroup
results in the cpuset (presumably temporarily?) being configured such
that sched_setaffinity(2) will allow the full set of CPUs. For this
particular case, all we care about is that it is different to the
original set, so include some special-casing (but we should probably
investigate this further...).
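
A minimal sketch of the "all-set cpumask" trick, assuming Linux and using
only the Go standard library. The cpuMask type and the fullMask and
resetAffinity names are hypothetical, and this is not the code runc
actually uses:

```go
package main

import (
	"fmt"
	"runtime"
	"syscall"
	"unsafe"
)

// cpuMask is a fixed-size CPU bitmask of 1024 bits. The size is an
// assumption for this sketch; it may exceed the number of CPUs present.
type cpuMask [1024 / 64]uint64

// fullMask returns a mask with every bit set.
func fullMask() cpuMask {
	var m cpuMask
	for i := range m {
		m[i] = ^uint64(0)
	}
	return m
}

// resetAffinity calls sched_setaffinity(2) for the calling thread with an
// all-set mask and leaves it to the kernel to narrow the mask down to the
// CPUs the thread is actually allowed to use.
func resetAffinity() error {
	mask := fullMask()
	_, _, errno := syscall.RawSyscall(syscall.SYS_SCHED_SETAFFINITY,
		0, unsafe.Sizeof(mask), uintptr(unsafe.Pointer(&mask)))
	if errno != 0 {
		return errno
	}
	return nil
}

func main() {
	// The affinity is per-thread, so pin the goroutine to its OS thread.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	if err := resetAffinity(); err != nil {
		fmt.Println("sched_setaffinity failed:", err)
		return
	}
	fmt.Println("affinity reset requested")
}
```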
Reported-by: ningmingxiao <ning.mingxiao@zte.com.cn>
Reported-by: Martin Sivak <msivak@redhat.com>
Reported-by: Peter Hunt <pehunt@redhat.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
(Cherry-pick of commit 121192ade6c55f949d32ba486219e2b1d86898b2.)
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Sometimes we need to run runc through some wrapper (like nohup), but
because "__runc" and "runc" are bash functions in our test suite this
doesn't work trivially -- and you cannot just pass "$RUNC" because you
need to set --root for rootless tests.
So create a setup_runc_cmdline helper which sets $RUNC_CMDLINE to the
beginning cmdline used by __runc (and switch __runc to use that).
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
(Cherry-pick of commit d1f6acfab06e6f5eb15b7edfaa704f50907907b1.)
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
"runc" was a special wrapper around bats's "run" which output some very
useful diagnostic information to the bats log, but this was not usable
for other commands. So let's make it a more generic helper that we can
use for other commands.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
(Cherry-pick of commit ea385de40c9a006737399bc72918a19e5d038736.)
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
The issue on arm [1] is now fixed, so let's get back to using the
packaged criu version for most of the CI matrix.
This reverts commit 105674844e
("ci: use criu built from source on gha arm").
[1]: https://github.com/checkpoint-restore/criu/issues/2709
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(Cherry-picked from commit 96f4a90a6b1ca9e3f2011ebaeffb7dc52db2ca32.)
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Currently, the criu package from the opensuse build farm times out on GHA arm,
so let's only use criu-dev (i.e. compiled from source on CI machine).
Once this is fixed, this patch can be reverted.
Related to criu issue 2709.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(Cherry-picked from commit 105674844eaaf24bf14135ef0c64703e511882ab.)
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Since GHA now provides ARM, we can switch away from actuated.
Many thanks to @alexellis (@self-actuated) for being the sponsor of this
project.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(Cherry-picked from commit 1cf096803abb770c414ce0a1e2e0be283b09001d.)
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Prevent --l3-cache-schema from clearing the intel_rdt.memBwSchema state
and --mem-bw-schema clearing l3_cache_schema, respectively.
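
The intended behaviour can be sketched as "only overwrite what was
explicitly given". The IntelRdt struct and applyRdtUpdate function below
are hypothetical simplifications, not the actual runc types:

```go
package main

import "fmt"

// IntelRdt is a hypothetical, reduced version of the RDT configuration.
type IntelRdt struct {
	L3CacheSchema string
	MemBwSchema   string
}

// applyRdtUpdate returns cur with only the explicitly provided schemas
// replaced. An empty string means "flag not given", so the corresponding
// field keeps its current value instead of being cleared.
func applyRdtUpdate(cur IntelRdt, l3CacheSchema, memBwSchema string) IntelRdt {
	if l3CacheSchema != "" {
		cur.L3CacheSchema = l3CacheSchema
	}
	if memBwSchema != "" {
		cur.MemBwSchema = memBwSchema
	}
	return cur
}

func main() {
	cur := IntelRdt{L3CacheSchema: "L3:0=f", MemBwSchema: "MB:0=50"}
	// Only the L3 schema is given; the memory bandwidth schema is kept.
	fmt.Printf("%+v\n", applyRdtUpdate(cur, "L3:0=ff", ""))
}
```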
Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
(cherry picked from commit 57b6a317bb)
This was added in 2ee9cbbd12 ("It's /proc/stat, not /proc/stats") with
no actual justification, and doesn't really make much sense on further
inspection:
* /proc/net is a symlink to "self/net", which means that /proc/net/dev
is a per-process file, and so overmounting it would only affect pid1.
Any other program that cares about /proc/net/dev would see their own
process's configuration, and unprivileged processes wouldn't be able
to see /proc/1/... data anyway.
In addition, the fact that this is a symlink means that runc will
deny the overmount because /proc/1/net/dev is not in the proc
overmount allowlist. This means that this has not worked for many
years, and probably never worked in the first place.
* /proc/self/net is already namespaced with network namespaces, so the
primary argument for allowing /proc overmounts (lxcfs-like masking of
procfs files to emulate namespacing for files that are not properly
namespaced for containers -- such as /proc/cpuinfo) is moot.
It goes without saying that lxcfs has never overmounted
/proc/self/net/... files, so the general "because lxcfs"
justification doesn't hold water either.
* The kernel has slowly been moving towards blocking overmounts in
/proc/self/. Linux 6.12 blocked overmounts for fd, fdinfo, and
map_files; future Linux versions will probably end up blocking
everything under /proc/self/.
Fixes: 2ee9cbbd12 ("It's /proc/stat, not /proc/stats")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
(cherry-picked from commit 3620185d06b79da836559b75161027c6273fff7b.)
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
The dmem controller was added in kernel v6.13 and is now enabled in
Fedora 42 kernels. Yet, systemd is not aware of dmem.
This fixes the test case failure on Fedora.
For the initial test case, see commit 27515719.
For earlier commits similar to this one, see
commits 601cf582, 05272718, e83ca519.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit b3432118ed)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
While debugging an issue involving failing mounts, I discovered that
just returning the plain mount error message when we are in the fallback
code for handling locked mounts leads to unnecessary confusion.
It also doesn't help that podman currently forcefully sets "rw" on
mounts, which means that rootless containers are likely to hit the
locked mounts issue fairly often.
So we should improve our error messages to explain why the mount is
failing in the locked flags case.
Fixes: 7c71a22705 ("rootfs: remove --no-mount-fallback and finally fix MS_REMOUNT")
(cherry picked from commit 58c3ab77b0)
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
When reading mount errors, it is quite hard to make sense of mount flags
in their hex form. As this is the error path, the minor performance
impact of constructing a string is probably not worth hyper-optimising.
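
For illustration only, a flag stringifier along these lines could look
like the sketch below. The flagsString name and the small flag table are
assumptions; only four well-known mount(2) flags are covered, and any
unknown bits are kept in hex:

```go
package main

import (
	"fmt"
	"strings"
)

// A few mount(2) flag bits with their Linux values.
const (
	msRdonly = 0x1
	msNosuid = 0x2
	msNodev  = 0x4
	msNoexec = 0x8
)

// flagNames maps known flag bits to their symbolic names, in bit order.
var flagNames = []struct {
	bit  uintptr
	name string
}{
	{msRdonly, "MS_RDONLY"},
	{msNosuid, "MS_NOSUID"},
	{msNodev, "MS_NODEV"},
	{msNoexec, "MS_NOEXEC"},
}

// flagsString renders flags as "NAME|NAME|0x<rest>", where <rest> holds
// any bits that are not in flagNames.
func flagsString(flags uintptr) string {
	var parts []string
	for _, f := range flagNames {
		if flags&f.bit != 0 {
			parts = append(parts, f.name)
			flags &^= f.bit
		}
	}
	if flags != 0 {
		parts = append(parts, fmt.Sprintf("0x%x", flags))
	}
	if len(parts) == 0 {
		return "0"
	}
	return strings.Join(parts, "|")
}

func main() {
	fmt.Println(flagsString(msNosuid | msNodev | msNoexec))
}
```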
(cherry pick from commit 30302a2850)
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
For some reason, ssh-keygen is unable to write to /root even as root on
AlmaLinux 8:
# id
uid=0(root) gid=0(root) groups=0(root) context=system_u:system_r:initrc_t:s0
# id -Z
ls -ld /root
# ssh-keygen -t ecdsa -N "" -f /root/rootless.key || cat /var/log/audit/audit.log
Saving key "/root/rootless.key" failed: Permission denied
The audit.log shows:
> type=AVC msg=audit(1744834995.352:546): avc: denied { dac_override } for pid=13471 comm="ssh-keygen" capability=1 scontext=system_u:system_r:ssh_keygen_t:s0 tcontext=system_u:system_r:ssh_keygen_t:s0 tclass=capability permissive=0
> type=SYSCALL msg=audit(1744834995.352:546): arch=c000003e syscall=257 success=no exit=-13 a0=ffffff9c a1=5641c7587520 a2=241 a3=180 items=0 ppid=4978 pid=13471 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="ssh-keygen" exe="/usr/bin/ssh-keygen" subj=system_u:system_r:ssh_keygen_t:s0 key=(null) ARCH=x86_64 SYSCALL=openat AUID="unset" UID="root" GID="root" EUID="root" SUID="root" FSUID="root" EGID="root" SGID="root" FSGID="root"
A workaround is to use /root/.ssh directory instead of just /root.
While at it, let's unify rootless user and key setup into a single place.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit 87ae2f8466)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
We are seeing a ton of flakes on the almalinux-8 CI job, all caused by criu's
inability to freeze a cgroup. This was worked around in criu [1], but
obviously we can't rely on a distro vendor to update the package.
Let's use a copr (thanks to Adrian Reber!)
[1]: https://github.com/checkpoint-restore/criu/pull/2545
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit b520f750ef)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This makes the code more robust and allows removing the
"shellcheck disable=SC2086" annotation.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit 8e653e40c6)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. There is no need to have -p option in mkdir here, since
/home/rootless was already created by useradd above.
2. When there is no -p, there is no need to suppress the shellcheck
warning (which looked like this):
> In script/setup_host_fedora.sh line 21:
> mkdir -m 0700 -p /home/rootless/.ssh
> ^-- SC2174 (warning): When used with -p, -m only applies to the deepest directory.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit a76a1361b4)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Those are no longer needed with shellcheck v0.10.0 (possibly with an
earlier version, too, but I am too lazy to check that).
While at it, fix a typo in the comment.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit af386d1df1)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This issue was originally reported in podman PR 25792.
When calling runc pause/unpause for an ordinary user, podman does not
provide the --systemd-cgroup option, and shouldUseRootlessCgroupManager
returns true. This results in a warning:
$ podman pause sleeper
WARN[0000] runc pause may fail if you don't have the full access to cgroups
sleeper
Actually, it does not make sense to call shouldUseRootlessCgroupManager
at this point, because we already know if we're rootless or not, from
the container state.json (same for systemd).
Also, the busctl binary is not available in this context either, so
shouldUseRootlessCgroupManager would not work properly.
Finally, it doesn't really matter if we use systemd or not, because we
use fs/fs2 manager to freeze/unfreeze, and it will return something like
EPERM (or tell that cgroups is not configured, for a true rootless
container).
So, let's only print the warning after pause/unpause failed,
if the error returned looks like a permission error.
Same applies to "runc ps".
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit c5ab4b6e30)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This is to simplify code review for the next commit.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit fda034c9ec)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Since v3.14, CRIU always restores processes into a time namespace to
prevent backward jumps of monotonic and boottime clocks. This change
updates the container configuration to ensure that `runc exec` launches
new processes within the container's time namespace.
Fixes#2610
Signed-off-by: Andrei Vagin <avagin@gmail.com>
(cherry picked from commit b68cbdff34)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In a nutshell:
- use git-core instead of git;
- do not install weak deps;
- do not install docs.
This results in fewer packages to install:
- 25 instead of 72 for almalinux-8
- 24 instead of 90 for almalinux-9
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit 1d9bea5378)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
- Unlike proprietary Vagrant, Lima remains an open source project
- GHA now natively supports nested virt on Linux runners
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
(cherry picked from commit 135552e5e4)
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>