33 Commits

Author SHA1 Message Date
Kir Kolyshkin
6ede591761 internal/systemd: simplify
Remove unused code and an unused argument from ActivationFiles,
and simplify its usage.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-12-08 15:34:58 -08:00
Kir Kolyshkin
ba9e60f7a8 Remove crypto/tls dependency
It appears that when we import github.com/coreos/go-systemd/activation,
it brings in the whole crypto/tls package (which is not used by runc
directly or indirectly), making the runc binary size larger and
potentially creating issues with FIPS compliance.

Let's copy the code of the function we use from go-systemd/activation
to avoid that.

The space savings are:

$ size runc.before runc.after
   text	   data	    bss	    dec	    hex	filename
7101084	5049593	 271560	12422237	 bd8c5d	runc.before
6508796	4623281	 229128	11361205	 ad5bb5	runc.after

Reported-by: Dimitri John Ledkov <dimitri.ledkov@surgut.co.uk>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-12-08 15:31:42 -08:00
Akihiro Suda
64c3c8eea6 Merge pull request #4994 from kolyshkin/gofumpt-extra
Enable gofumpt extra rules
2025-11-28 09:30:57 +09:00
Aleksa Sarai
195e9551e4 pathrs: add MkdirAllParentInRoot helper
While CreateInRoot supports hallucinating the target path, we do not use
it directly when constructing device inode targets because we need to
have different handling for mknod and bind-mounts.

The solution is to simply have a more generic MkdirAllParentInRoot
helper that MkdirAll's the parent directory of the target path and then
allows the caller to create the trailing component however they like.
(This can be used by CreateInRoot internally as well!)

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-26 21:04:05 +11:00
Aleksa Sarai
cfb74326be pathrs: add "hallucination" helpers for SecureJoin magic
In order to maintain compatibility with previous releases of runc (which
permitted dangling symlinks as path components by permitting
non-existent path components to be treated like real directories) we
have to first do SecureJoin to construct a target path that is
compatible with the old behaviour but has all dangling symlinks (or
other invalid paths like ".." components after non-existent directories)
removed.

This is effectively a more generic version of commit 3f925525b4
("rootfs: re-allow dangling symlinks in mount targets") and will let us
remove the need for open-coding SecureJoin workarounds.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-26 21:04:05 +11:00
Aleksa Sarai
20c5a8ec4a pathrs: rename MkdirAllInRootOpen -> MkdirAllInRoot
Now that MkdirAllInRoot has been removed, we can make MkdirAllInRootOpen
less wordy by renaming it to MkdirAllInRoot. This is a non-functional
change.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-26 21:04:04 +11:00
Aleksa Sarai
9dbd37e06f libct: switch final WithProcfd users to WithProcfdFile
This probably should've been done as part of commit d40b3439a9
("rootfs: switch to fd-based handling of mountpoint targets") but it
seems I missed them when doing the rest of the conversions.

This also lets us remove utils.WithProcfd entirely, as well as
pathrs.MkdirAllInRoot. Unfortunately, WithProcfd was exposed in the
externally-importable "libcontainer/utils" package and so we need to
have a deprecation notice to remove it in runc 1.5.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-26 21:03:30 +11:00
Aleksa Sarai
42a1e19d67 libcontainer: move CleanPath and StripRoot to internal/pathrs
These helpers will be needed for the compatibility code added in future
patches in this series, but because "internal/pathrs" is imported by
"libcontainer/utils" we need to move them so that we can avoid circular
dependencies.

Because the old functions were in a non-internal package it is possible
some downstreams use them, so add some wrappers but mark them as
deprecated.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-26 21:03:29 +11:00
Kir Kolyshkin
67840cce4b Enable gofumpt extra rules
Commit b2f8a74d "clothed" the naked return as inflicted by gofumpt
v0.9.0. Since gofumpt v0.9.2 this rule was moved to "extra" category,
not enabled by default. The only other "extra" rule is to group adjacent
parameters with the same type, which also makes sense.

Enable gofumpt "extra" rules, and reformat the code accordingly.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-11-10 13:18:45 -08:00
Aleksa Sarai
a0e809a8ba libct: switch to unix.SetMemPolicy wrapper
This is mostly a mechanical change, but we also need to change some
types to match the "mode int" argument that golang.org/x/sys/unix
decided to use.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-10 16:03:02 +11:00
Aleksa Sarai
96f1962f91 deps: update to github.com/opencontainers/selinux@v0.13.0
This new version includes the fixes for CVE-2025-52881, so we can remove
the internal/third_party copy of the library we added in commit
ed6b1693b8 ("selinux: use safe procfs API for labels") as well as the
"replace" directive in go.mod (which is problematic for "go get"
installs).

Fixes: ed6b1693b8 ("selinux: use safe procfs API for labels")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-08 02:14:38 +11:00
Aleksa Sarai
a41366e740 openat2: improve resilience on busy systems
Previously, we would see a ~3% failure rate when starting containers
with mounts that contain ".." (which can trigger -EAGAIN). To counteract
this, filepath-securejoin v0.5.1 includes a bump of the internal retry
limit from 32 to 128, which lowers the failure rate to 0.12%.

However, there is still a risk of spurious failure on regular systems.
In order to try to provide more resilience (while avoiding DoS attacks),
this patch also includes an additional retry loop that terminates based
on a deadline rather than retry count. The deadline is 2ms, as my
testing found that ~800us for a single pathrs operation was the longest
latency due to -EAGAIN retries, and that was an outlier compared to the
more common ~400us latencies -- so 2ms should be more than enough for
any real system.

The failure rates above were based on more than 50k runs of runc with an
attack script (from libpathrs) running a rename attack on all cores of a
16-core system, which is arguably a worst-case but heavily utilised
servers could likely approach similar results.

Tested-by: Phil Estes <estesp@gmail.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-05 18:57:51 +11:00
Aleksa Sarai
ed6b1693b8 selinux: use safe procfs API for labels
Due to the sensitive nature of these fixes, it was not possible to
submit these upstream and vendor the upstream library. Instead, this
patch uses a fork of github.com/opencontainers/selinux, branched at
commit opencontainers/selinux@879a755db5.

In order to permit downstreams to build with this patched version, a
snapshot of the forked version has been included in
internal/third_party/selinux. Note that since we use "go mod vendor",
the patched code is usable even without being "go get"-able. Once the
embargo for this issue is lifted we can submit the patches upstream and
switch back to a proper upstream go.mod entry.

Also, this requires us to temporarily disable the CI job we have that
disallows "replace" directives.

Fixes: GHSA-cgrx-mc8f-2prm CVE-2025-52881
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-01 21:24:06 +11:00
Aleksa Sarai
d40b3439a9 rootfs: switch to fd-based handling of mountpoint targets
An attacker could race with us during mount configuration in order to
trick us into mounting over an unexpected path. This would bypass
checkProcMount() and would allow for security profiles to be left
unapplied by mounting over /proc/self/attr/... (or even more serious
outcomes such as killing the entire system by tricking runc into writing
strings to /proc/sysrq-trigger).

This is a larger issue with our current mount infrastructure, and the
ideal solution would be to rewrite it all to be fd-based (which would
also allow us to support the "new" mount API, which also avoids a bunch
of other issues with mount(8)). However, such a rewrite is not really
workable as a security fix, so this patch is a bit of a compromise
approach to fix the issue while also moving us a bit towards that
eventual end-goal.

The core issue in CVE-2025-52881 is that we currently use the (insecure)
SecureJoin to re-resolve mountpoint target paths multiple times during
mounting. Rather than generating a string from createMountpoint(), we
instead open an *os.File handle to the target mountpoint directly and
then operate on that handle. This will make it easier to remove
utils.WithProcfd() and rework mountViaFds() in the future.

The only real issue we need to work around is that we need to re-open
the mount target after doing the mount in order to get a handle to the
mountpoint -- pathrs.Reopen() doesn't work in this case (it just
re-opens the inode under the mountpoint) so we need to do a naive
re-open using the full path. Note that if we used move_mount(2) this
wouldn't be a problem because we would have a handle to the mountpoint
itself.

Note that this is still somewhat of a temporary solution -- ideally
mountViaFds would use *os.File directly to let us avoid some other
issues with using bare /proc/... paths, as well as also letting us more
easily use the new mount API on modern kernels.

Fixes: GHSA-cgrx-mc8f-2prm CVE-2025-52881
Co-developed-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-01 21:24:06 +11:00
Aleksa Sarai
77d217c7c3 init: write sysctls using safe procfs API
sysctls could in principle also be used as a write gadget for arbitrary
procfs files. As this requires getting a non-subset=pid /proc handle we
amortise this by only allocating a single procfs handle for all sysctl
writes.

Fixes: GHSA-cgrx-mc8f-2prm CVE-2025-52881
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-01 21:24:05 +11:00
Aleksa Sarai
ff6fe13246 utils: use safe procfs for /proc/self/fd loop code
From a safety perspective this might not be strictly required, but it
paves the way for us to remove utils.ProcThreadSelf.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-01 21:24:04 +11:00
Aleksa Sarai
01de9d65dc rootfs: avoid using os.Create for new device inodes
If an attacker were to make the target of a device inode creation be a
symlink to some host path, os.Create would happily truncate the target
which could lead to all sorts of issues. This issue is probably not as
exploitable because device inodes are usually only bind-mounted for
rootless containers, which cannot overwrite important host files (though
user files would still be up for grabs).

The regular inode creation logic could also theoretically be tricked
into changing the access mode and ownership of host files if the
newly-created device inode was swapped with a symlink to a host path.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-01 21:24:04 +11:00
Aleksa Sarai
77889b56db internal: add wrappers for securejoin.Proc*
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-01 21:24:04 +11:00
Aleksa Sarai
44a0fcf685 go.mod: update to github.com/cyphar/filepath-securejoin@v0.5.0
In order to avoid lint errors due to the deprecation of the top-level
securejoin methods ported from libpathrs, we need to adjust
internal/pathrs to use the new pathrs-lite subpackage instead.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-01 21:24:03 +11:00
Aleksa Sarai
531ef794e4 console: use TIOCGPTPEER when allocating peer PTY
When opening the peer end of a pty, the old kernel API required us to
open /dev/pts/$num inside the container (at least since we fixed console
handling many years ago in commit 244c9fc426 ("*: console rewrite")).

The problem is that in a hostile container it is possible for
/dev/pts/$num to be an attacker-controlled symlink that runc can be
tricked into resolving when doing bind-mounts. This allows the attacker
to (among other things) persist /proc/... entries that are later masked
by runc, allowing an attacker to escape through the kernel.core_pattern
sysctl (/proc/sys/kernel/core_pattern). This is the original issue
reported by Lei Wang and Li Fu Bang in CVE-2025-52565.

However, it should be noted that this is not entirely a newly-discovered
problem. Way back in Linux 4.13 (2017), I added the TIOCGPTPEER ioctl,
which allows us to get a pty peer without touching the /dev/pts inside
the container. The original threat model was around an attacker
replacing /dev/pts/$n or /dev/pts/ptmx with some malicious inode (a DoS
inode, or possibly a PTY they wanted a confused deputy to operate on).
Unfortunately, there was no practical way for runc to cache a safe
O_PATH handle to /dev/pts/ptmx (unlike other runtimes like LXC, which
switched to TIOCGPTPEER way back in 2017). Since it wasn't clear how we
could protect against the main attack TIOCGPTPEER was meant to protect
against, we never switched to it (even though I implemented it
specifically to harden container runtimes).

Unfortunately, it turns out that mount *sources* are a threat we didn't
fully consider. Since TIOCGPTPEER already solves this problem entirely
for us in a race free way, we should just use that. In a later patch, we
will add some hardening for /dev/pts/$num opening to maintain support
for very old kernels (Linux 4.13 is very old at this point, but RHEL 7
is still kicking and is stuck on Linux 3.10).

Fixes: GHSA-qw9x-cqr3-wc7r CVE-2025-52565
Reported-by: Lei Wang <ssst0n3@gmail.com> (CVE-2025-52565)
Reported-by: lfbzhm <lifubang@acmcoder.com> (CVE-2025-52565)
Reported-by: Aleksa Sarai <cyphar@cyphar.com> (TIOCGPTPEER)
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-01 21:24:03 +11:00
Aleksa Sarai
ff94f9991b *: switch to safer securejoin.Reopen
filepath-securejoin v0.3 gave us a much safer re-open primitive, we
should use it to avoid any theoretical attacks. Rather than using it
directly, add a small pathrs wrapper to make libpathrs migrations in the
future easier...

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-01 21:24:02 +11:00
Aleksa Sarai
6fc1914491 internal: move utils.MkdirAllInRoot to internal/pathrs
We will have more wrappers around filepath-securejoin, and so move them
to their own specific package so that we can eventually use libpathrs
fairly cleanly (by swapping out the implementation).

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-01 21:24:02 +11:00
Aleksa Sarai
db19bbed53 internal/sys: add VerifyInode helper
This will be used by a few security fixes later in this
patchset. The need to verify what kind of inode we are operating on in a
race-free way turns out to be quite a common pattern...

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-01 21:24:01 +11:00
Aleksa Sarai
a672a5f36c merge #4726 into opencontainers/runc:main
Antti Kervinen (1):
  Add memory policy support

LGTMs: lifubang AkihiroSuda cyphar
2025-10-08 05:18:13 +11:00
Antti Kervinen
eda7bdf80c Add memory policy support
Implement support for Linux memory policy in OCI spec PR:
https://github.com/opencontainers/runtime-spec/pull/1282

Signed-off-by: Antti Kervinen <antti.kervinen@intel.com>
2025-10-07 15:06:37 +03:00
Aleksa Sarai
627054d246 lint/revive: add package doc comments
This silences all of the "should have a package comment" lint warnings
from golangci-lint.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-10-03 15:17:43 +10:00
Kir Kolyshkin
491326cdeb int/linux: add/use Recvfrom
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-03-26 14:16:53 -07:00
Kir Kolyshkin
e655abc0da int/linux: add/use Dup3, Open, Openat
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-03-26 14:16:53 -07:00
Kir Kolyshkin
c690b66d7f int/linux: add/use Exec
Drop the libcontainer/system/exec, and use the linux.Exec instead.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-03-26 14:16:53 -07:00
Kir Kolyshkin
431b8bb4d8 int/linux: add/use Getwd
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-03-26 14:16:53 -07:00
Kir Kolyshkin
8cc1eb379b Introduce and use internal/linux
This package is to provide unix.* wrappers to ensure that:
 - they retry on EINTR;
 - a "rich" error is returned on failure.

A first such wrapper, Sendmsg, is introduced.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-03-26 14:16:50 -07:00
Kir Kolyshkin
9cb59b4659 ci: rm "skip on CentOS 7" kludges
We no longer test on CentOS 7.

Remove the internal/testutil package as it has no other uses.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-11-07 13:16:16 -08:00
Kir Kolyshkin
a2f7c6add8 internal/testutil: create, add SkipOnCentOS
CentOS 7 is showing its age and we'd rather skip some tests on it than
find out why they are flaky.

Add internal/testutil package, and move the generalized version of
SkipOnCentOS7 from libcontainer/cgroups/devices to there.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-10-30 16:54:17 -07:00