It appears that when we import github.com/coreos/go-systemd/activation,
it brings in the whole crypto/tls package (which is not used by runc
directly or indirectly), making the runc binary size larger and
potentially creating issues with FIPS compliance.
Let's copy the code of function we use from go-systemd/activation
to avoid that.
The space savings are:
$ size runc.before runc.after
text data bss dec hex filename
7101084 5049593 271560 12422237 bd8c5d runc.before
6508796 4623281 229128 11361205 ad5bb5 runc.after
Reported-by: Dimitri John Ledkov <dimitri.ledkov@surgut.co.uk>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
While CreateInRoot supports hallucinating the target path, we do not use
it directly when constructing device inode targets because we need to
have different handling for mknod and bind-mounts.
The solution is to simply have a more generic MkdirAllParentInRoot
helper that MkdirAll's the parent directory of the target path and then
allows the caller to create the trailing component however they like.
(This can be used by CreateInRoot internally as well!)
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
In order to maintain compatibility with previous releases of runc (which
permitted dangling symlinks as path components by permitting
non-existent path components to be treated like real directories) we
have to first do SecureJoin to construct a target path that is
compatible with the old behaviour but has all dangling symlinks (or
other invalid paths like ".." components after non-existent directories)
removed.
This is effectively a more generic verison of commit 3f925525b4
("rootfs: re-allow dangling symlinks in mount targets") and will let us
remove the need for open-coding SecureJoin workarounds.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Now that MkdirAllInRoot has been removed, we can make MkdirAllInRootOpen
less wordy by renaming it to MkdirAllInRoot. This is a non-functional
change.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
This probably should've been done as part of commit d40b3439a9
("rootfs: switch to fd-based handling of mountpoint targets") but it
seems I missed them when doing the rest of the conversions.
This also lets us remove utils.WithProcfd entirely, as well as
pathrs.MkdirAllInRoot. Unfortunately, WithProcfd was exposed in the
externally-importable "libcontainer/utils" package and so we need to
have a deprecation notice to remove it in runc 1.5.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
These helpers will be needed for the compatibility code added in future
patches in this series, but because "internal/pathrs" is imported by
"libcontainer/utils" we need to move them so that we can avoid circular
dependencies.
Because the old functions were in a non-internal package it is possible
some downstreams use them, so add some wrappers but mark them as
deprecated.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Commit b2f8a74d "clothed" the naked return as inflicted by gofumpt
v0.9.0. Since gofumpt v0.9.2 this rule was moved to "extra" category,
not enabled by default. The only other "extra" rule is to group adjacent
parameters with the same type, which also makes sense.
Enable gofumpt "extra" rules, and reformat the code accordingly.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This is mostly a mechanical change, but we also need to change some
types to match the "mode int" argument that golang.org/x/sys/unix
decided to use.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
This new version includes the fixes for CVE-2025-52881, so we can remove
the internal/third_party copy of the library we added in commit
ed6b1693b8 ("selinux: use safe procfs API for labels") as well as the
"replace" directive in go.mod (which is problematic for "go get"
installs).
Fixes: ed6b1693b8 ("selinux: use safe procfs API for labels")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Previously, we would see a ~3% failure rate when starting containers
with mounts that contain ".." (which can trigger -EAGAIN). To counteract
this, filepath-securejoin v0.5.1 includes a bump of the internal retry
limit from 32 to 128, which lowers the failure rate to 0.12%.
However, there is still a risk of spurious failure on regular systems.
In order to try to provide more resilience (while avoiding DoS attacks),
this patch also includes an additional retry loop that terminates based
on a deadline rather than retry count. The deadline is 2ms, as my
testing found that ~800us for a single pathrs operation was the longest
latency due to -EAGAIN retries, and that was an outlier compared to the
more common ~400us latencies -- so 2ms should be more than enough for
any real system.
The failure rates above were based on more 50k runs of runc with an
attack script (from libpathrs) running a rename attack on all cores of a
16-core system, which is arguably a worst-case but heavily utilised
servers could likely approach similar results.
Tested-by: Phil Estes <estesp@gmail.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Due to the sensitive nature of these fixes, it was not possible to
submit these upstream and vendor the upstream library. Instead, this
patch uses a fork of github.com/opencontainers/selinux, branched at
commit opencontainers/selinux@879a755db5.
In order to permit downstreams to build with this patched version, a
snapshot of the forked version has been included in
internal/third_party/selinux. Note that since we use "go mod vendor",
the patched code is usable even without being "go get"-able. Once the
embargo for this issue is lifted we can submit the patches upstream and
switch back to a proper upstream go.mod entry.
Also, this requires us to temporarily disable the CI job we have that
disallows "replace" directives.
Fixes: GHSA-cgrx-mc8f-2prm CVE-2025-52881
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
An attacker could race with us during mount configuration in order to
trick us into mounting over an unexpected path. This would bypass
checkProcMount() and would allow for security profiles to be left
unapplied by mounting over /proc/self/attr/... (or even more serious
outcomes such as killing the entire system by tricking runc into writing
strings to /proc/sysrq-trigger).
This is a larger issue with our current mount infrastructure, and the
ideal solution would be to rewrite it all to be fd-based (which would
also allow us to support the "new" mount API, which also avoids a bunch
of other issues with mount(8)). However, such a rewrite is not really
workable as a security fix, so this patch is a bit of a compromise
approach to fix the issue while also moving us a bit towards that
eventual end-goal.
The core issue in CVE-2025-52881 is that we currently use the (insecure)
SecureJoin to re-resolve mountpoint target paths multiple times during
mounting. Rather than generating a string from createMountpoint(), we
instead open an *os.File handle to the target mountpoint directly and
then operate on that handle. This will make it easier to remove
utils.WithProcfd() and rework mountViaFds() in the future.
The only real issue we need to work around is that we need to re-open
the mount target after doing the mount in order to get a handle to the
mountpoint -- pathrs.Reopen() doesn't work in this case (it just
re-opens the inode under the mountpoint) so we need to do a naive
re-open using the full path. Note that if we used move_mount(2) this
wouldn't be a problem because we would have a handle to the mountpoint
itself.
Note that this is still somewhat of a temporary solution -- ideally
mountViaFds would use *os.File directly to let us avoid some other
issues with using bare /proc/... paths, as well as also letting us more
easily use the new mount API on modern kernels.
Fixes: GHSA-cgrx-mc8f-2prm CVE-2025-52881
Co-developed-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
sysctls could in principle also be used as a write gadget for arbitrary
procfs files. As this requires getting a non-subset=pid /proc handle we
amortise this by only allocating a single procfs handle for all sysctl
writes.
Fixes: GHSA-cgrx-mc8f-2prm CVE-2025-52881
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
From a safety perspective this might not be strictly required, but it
paves the way for us to remove utils.ProcThreadSelf.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
If an attacker were to make the target of a device inode creation be a
symlink to some host path, os.Create would happily truncate the target
which could lead to all sorts of issues. This exploit is probably not as
exploitable because device inodes are usually only bind-mounted for
rootless containers, which cannot overwrite important host files (though
user files would still be up for grabs).
The regular inode creation logic could also theoretically be tricked
into changing the access mode and ownership of host files if the
newly-created device inode was swapped with a symlink to a host path.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
In order to avoid lint errors due to the deprecation of the top-level
securejoin methods ported from libpathrs, we need to adjust
internal/pathrs to use the new pathrs-lite subpackage instead.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
When opening the peer end of a pty, the old kernel API required us to
open /dev/pts/$num inside the container (at least since we fixed console
handling many years ago in commit 244c9fc426 ("*: console rewrite")).
The problem is that in a hostile container it is possible for
/dev/pts/$num to be an attacker-controlled symlink that runc can be
tricked into resolving when doing bind-mounts. This allows the attacker
to (among other things) persist /proc/... entries that are later masked
by runc, allowing an attacker to escape through the kernel.core_pattern
sysctl (/proc/sys/kernel/core_pattern). This is the original issue
reported by Lei Wang and Li Fu Bang in CVE-2025-52565.
However, it should be noted that this is not entirely a newly-discovered
problem. Way back in Linux 4.13 (2017), I added the TIOCGPTPEER ioctl,
which allows us to get a pty peer without touching the /dev/pts inside
the container. The original threat model was around an attacker
replacing /dev/pts/$n or /dev/pts/ptmx with some malicious inode (a DoS
inode, or possibly a PTY they wanted a confused deputy to operate on).
Unfortunately, there was no practical way for runc to cache a safe
O_PATH handle to /dev/pts/ptmx (unlike other runtimes like LXC, which
switched to TIOCGPTPEER way back in 2017). Since it wasn't clear how we
could protect against the main attack TIOCGPTPEER was meant to protect
against, we never switched to it (even though I implemented it
specifically to harden container runtimes).
Unfortunately, It turns out that mount *sources* are a threat we didn't
fully consider. Since TIOCGPTPEER already solves this problem entirely
for us in a race free way, we should just use that. In a later patch, we
will add some hardening for /dev/pts/$num opening to maintain support
for very old kernels (Linux 4.13 is very old at this point, but RHEL 7
is still kicking and is stuck on Linux 3.10).
Fixes: GHSA-qw9x-cqr3-wc7r CVE-2025-52565
Reported-by: Lei Wang <ssst0n3@gmail.com> (CVE-2025-52565)
Reported-by: lfbzhm <lifubang@acmcoder.com> (CVE-2025-52565)
Reported-by: Aleksa Sarai <cyphar@cyphar.com> (TIOCGPTPEER)
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
filepath-securejoin v0.3 gave us a much safer re-open primitive, we
should use it to avoid any theoretical attacks. Rather than using it
direcly, add a small pathrs wrapper to make libpathrs migrations in the
future easier...
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
We will have more wrappers around filepath-securejoin, and so move them
to their own specific package so that we can eventually use libpathrs
fairly cleanly (by swapping out the implementation).
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
This will be used for a few security patches in later patches in this
patchset. The need to verify what kind of inode we are operating on in a
race-free way turns out to be quite a common pattern...
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
This package is to provide unix.* wrappers to ensure that:
- they retry on EINTR;
- a "rich" error is returned on failure.
A first such wrapper, Sendmsg, is introduced.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
CentOS 7 is showing its age and we'd rather skip some tests on it than
find out why they are flaky.
Add internal/testutil package, and move the generalized version of
SkipOnCentOS7 from libcontainer/cgroups/devices to there.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>