Alas, these new constants are already in v1.4.0 release so we can't
remove those right away, but we can mark them as deprecated now
and target removal for v1.5.0.
So,
- mark them as deprecated;
- redefine via unix.MPOL_* counterparts;
- fix the validator code to use unix.MPOL_* directly.
This amends commit a0e809a8.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This function calls Init what normally never returns, so the defer only
works if there is an error and we can safely use it to close those fds
we opened. This was done for most but not all fds.
Reported in issue 5008.
Reported-by: Arina Cherednik <arinacherednik034@gmail.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
While CreateInRoot supports hallucinating the target path, we do not use
it directly when constructing device inode targets because we need to
have different handling for mknod and bind-mounts.
The solution is to simply have a more generic MkdirAllParentInRoot
helper that MkdirAll's the parent directory of the target path and then
allows the caller to create the trailing component however they like.
(This can be used by CreateInRoot internally as well!)
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
In order to maintain compatibility with previous releases of runc (which
permitted dangling symlinks as path components by permitting
non-existent path components to be treated like real directories) we
have to first do SecureJoin to construct a target path that is
compatible with the old behaviour but has all dangling symlinks (or
other invalid paths like ".." components after non-existent directories)
removed.
This is effectively a more generic verison of commit 3f925525b4
("rootfs: re-allow dangling symlinks in mount targets") and will let us
remove the need for open-coding SecureJoin workarounds.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Now that MkdirAllInRoot has been removed, we can make MkdirAllInRootOpen
less wordy by renaming it to MkdirAllInRoot. This is a non-functional
change.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
This probably should've been done as part of commit d40b3439a9
("rootfs: switch to fd-based handling of mountpoint targets") but it
seems I missed them when doing the rest of the conversions.
This also lets us remove utils.WithProcfd entirely, as well as
pathrs.MkdirAllInRoot. Unfortunately, WithProcfd was exposed in the
externally-importable "libcontainer/utils" package and so we need to
have a deprecation notice to remove it in runc 1.5.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
These helpers will be needed for the compatibility code added in future
patches in this series, but because "internal/pathrs" is imported by
"libcontainer/utils" we need to move them so that we can avoid circular
dependencies.
Because the old functions were in a non-internal package it is possible
some downstreams use them, so add some wrappers but mark them as
deprecated.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
In case early stage of runc init (nsenter) fails for some reason, it
logs error(s) with FATAL log level, via bail().
The runc init log is read by a parent (runc create/run/exec) and is
logged via normal logrus mechanism, which is all fine and dandy, except
when `runc init` fails, we return the error from the parent (which is
usually not too helpful, for example):
runc run failed: unable to start container process: can't get final child's PID from pipe: EOF
Now, the actual underlying error is from runc init and it was logged
earlier; here's how full runc output looks like:
FATA[0000] nsexec-1[3247792]: failed to unshare remaining namespaces: No space left on device
FATA[0000] nsexec-0[3247790]: failed to sync with stage-1: next state
ERRO[0000] runc run failed: unable to start container process: can't get final child's PID from pipe: EOF
The problem is, upper level runtimes tend to ignore everything except
the last line from runc, and thus error reported by e.g. docker is not
very helpful.
This patch tries to improve the situation by collecting FATAL errors
from runc init and appending those to the error returned (instead of
logging). With it, the above error will look like this:
ERRO[0000] runc run failed: unable to start container process: can't get final child's PID from pipe: EOF; runc init error(s): nsexec-1[141549]: failed to unshare remaining namespaces: No space left on device; nsexec-0[141547]: failed to sync with stage-1: next state
Yes, it is long and ugly, but at least the upper level runtime will
report it.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Since the recent CVE fixes, TestFdLeaksSystemd sometimes fails:
=== RUN TestFdLeaksSystemd
exec_test.go:1750: extra fd 9 -> /12224/task/13831/fd
exec_test.go:1753: found 1 extra fds after container.Run
--- FAIL: TestFdLeaksSystemd (0.10s)
It might have been caused by the change to the test code in commit
ff6fe13 ("utils: use safe procfs for /proc/self/fd loop code") -- we are
now opening a file descriptor during the logic to get a list of file
descriptors. If the file descriptor happens to be allocated to a
different number, you'll get an error.
Let's try to filter out the fd used to read a directory.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The main update is actually in github.com/opencontainers/cgroups, but we
need to also update runtime-spec to a newer pre-release version to get
the updates from there as well.
In short, the behaviour change is now that "0" is treated as a valid
value to set in "pids.max", "-1" means "max" and unset/nil means "do
nothing". As described in the opencontainers/cgroups PR, this change is
actually backwards compatible because our internal state.json stores
PidsLimit, and that entry is marked as "omitempty". So, an old runc
would omit PidsLimit=0 in state.json, and this will be parsed by a new
runc as being "nil" -- and both would treat this case as "do not set
anything".
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Commit b2f8a74d "clothed" the naked return as inflicted by gofumpt
v0.9.0. Since gofumpt v0.9.2 this rule was moved to "extra" category,
not enabled by default. The only other "extra" rule is to group adjacent
parameters with the same type, which also makes sense.
Enable gofumpt "extra" rules, and reformat the code accordingly.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This is mostly a mechanical change, but we also need to change some
types to match the "mode int" argument that golang.org/x/sys/unix
decided to use.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
This was always the intended behaviour but commit 72fbb34f50 ("rootfs:
switch to fd-based handling of mountpoint targets") regressed it when
adding a mechanism to create a file handle to the target if it didn't
already exist (causing the later stat to always succeed).
A lot of people depend on this functionality, so add some tests to make
sure we don't break it in the future.
Fixes: 72fbb34f50 ("rootfs: switch to fd-based handling of mountpoint targets")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
On MIPS arches, Rdev is uint32 so we have to convert it.
Fixes issue 4962.
Fixes: 8476df83 ("libct: add/use isDevNull, verifyDevNull")
Fixes: de87203e ("console: verify /dev/pts/ptmx before use")
Fixes: 398955bc ("console: add fallback for pre-TIOCGPTPEER kernels")
Reported-by: Tianon Gravi <admwiggin@gmail.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
It seems there are a fair few images where dangling symlinks are used as
path components for mount targets, which pathrs-lite does not support
(and it would be difficult to fully support this in a race-free way).
This was actually meant to be blocked by commit 63c2908164 ("rootfs:
try to scope MkdirAll to stay inside the rootfs"), followed by commit
dd827f7b71 ("utils: switch to securejoin.MkdirAllHandle"). However, we
still used SecureJoin to construct mountpoint targets, which means that
dangling symlinks were "resolved" before reaching pathrs-lite.
This patch basically re-adds this hack in order to reduce the breakages
we've seen so far.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
An attacker could race with us during mount configuration in order to
trick us into mounting over an unexpected path. This would bypass
checkProcMount() and would allow for security profiles to be left
unapplied by mounting over /proc/self/attr/... (or even more serious
outcomes such as killing the entire system by tricking runc into writing
strings to /proc/sysrq-trigger).
This is a larger issue with our current mount infrastructure, and the
ideal solution would be to rewrite it all to be fd-based (which would
also allow us to support the "new" mount API, which also avoids a bunch
of other issues with mount(8)). However, such a rewrite is not really
workable as a security fix, so this patch is a bit of a compromise
approach to fix the issue while also moving us a bit towards that
eventual end-goal.
The core issue in CVE-2025-52881 is that we currently use the (insecure)
SecureJoin to re-resolve mountpoint target paths multiple times during
mounting. Rather than generating a string from createMountpoint(), we
instead open an *os.File handle to the target mountpoint directly and
then operate on that handle. This will make it easier to remove
utils.WithProcfd() and rework mountViaFds() in the future.
The only real issue we need to work around is that we need to re-open
the mount target after doing the mount in order to get a handle to the
mountpoint -- pathrs.Reopen() doesn't work in this case (it just
re-opens the inode under the mountpoint) so we need to do a naive
re-open using the full path. Note that if we used move_mount(2) this
wouldn't be a problem because we would have a handle to the mountpoint
itself.
Note that this is still somewhat of a temporary solution -- ideally
mountViaFds would use *os.File directly to let us avoid some other
issues with using bare /proc/... paths, as well as also letting us more
easily use the new mount API on modern kernels.
Fixes: GHSA-cgrx-mc8f-2prm CVE-2025-52881
Co-developed-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
sysctls could in principle also be used as a write gadget for arbitrary
procfs files. As this requires getting a non-subset=pid /proc handle we
amortise this by only allocating a single procfs handle for all sysctl
writes.
Fixes: GHSA-cgrx-mc8f-2prm CVE-2025-52881
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
All of the callers of EnsureProcHandle now use filepath-securejoin's
ProcThreadSelf to get a file handle, which has much stricter
verification to avoid procfs attacks than EnsureProcHandle's very
simplistic filesystem type check.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
From a safety perspective this might not be strictly required, but it
paves the way for us to remove utils.ProcThreadSelf.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
EnsureProcHandle only protects us against a tmpfs mount, but the risk of
a procfs path being used (such as /proc/self/sched) has been known for a
while. Now that filepath-securejoin has a reasonably safe procfs API,
switch to it.
Fixes: GHSA-cgrx-mc8f-2prm CVE-2025-52881
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
os.Create is shorthand for open(O_CREAT|O_TRUNC) *without* O_EXCL, which
is incredibly unsafe for us to do when interacting with a container
rootfs (especially before pivot_root) as an attacker could swap the
target path with a symlink that points to the host filesystem, causing
us to delete the contents of or create host files.
We did have a similar bug in CVE-2024-45310, but in that case we
(luckily) didn't have O_TRUNC set which avoided the worst possible case.
However, os.Create does set O_TRUNC and we were using it in scenarios
that may have been exploitable.
Because of how easy it us for us to accidentally introduce this kind of
bug, we should simply not allow the usage of os.Create in our entire
codebase.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
If an attacker were to make the target of a device inode creation be a
symlink to some host path, os.Create would happily truncate the target
which could lead to all sorts of issues. This exploit is probably not as
exploitable because device inodes are usually only bind-mounted for
rootless containers, which cannot overwrite important host files (though
user files would still be up for grabs).
The regular inode creation logic could also theoretically be tricked
into changing the access mode and ownership of host files if the
newly-created device inode was swapped with a symlink to a host path.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
This is primarily done out of an abudance of caution against runc exec
being attacked by a container where /dev/pts/ptmx has been replaced with
some other bad inode (a disconnected NFS handle, a symlink that goes
through a leaked runc file descriptor to reference a host ptmx, etc).
Unfortunately, we cannot trivially verify that /dev/pts/ptmx is actually
the /dev/pts from the container without storing stuff like the fsid in
the runc state.json, which is probably not worth the extra effort. This
should at least avoid the most concerning cases.
Reported-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
An attacker could make /dev/console a symlink. This presents two
possible issues:
1. os.Create will happily truncate targets, which could have resulted
in a worse version of CVE-2024-4531. Luckily, this all happens after
pivot_root(2) so the scope of that particular attack is fairly
limited (you are unlikely to be able to easily access host rootfs
files -- though it might be possible to take advantage of leaks such
as in CVE-2024-21626). However, O_CREAT|O_NOFOLLOW is what we should
be doing for all file creations.
2. Because we passed /dev/console as the only mount path (as opposed to
using a /proc/self/fd/$n path), an attacker could swap the symlink
to point to any other path and thus cause us to mount over some
other path. This is not as big of a problem because all the mounts
are in the container namespace after pivot_root(2), and users
usually can create arbitrary mount targets inside the container.
These issues don't seem particularly exploitable, but they deserve to be
hardened regardless.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
The pty driver has very consistent allocation rules for the major:minor
numbers of /dev/pts/$n inodes, so it is possible to somewhat safely open
/dev/pts/* paths if we validate that the inode is the one we expect.
It is possible for an attacker to have over-mounted a pts peer from a
different devpts instance, but to fix this would require more tracking
of devpts instances than runc currently can do.
This means runc should continue to work on very old kernels.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
When opening the peer end of a pty, the old kernel API required us to
open /dev/pts/$num inside the container (at least since we fixed console
handling many years ago in commit 244c9fc426 ("*: console rewrite")).
The problem is that in a hostile container it is possible for
/dev/pts/$num to be an attacker-controlled symlink that runc can be
tricked into resolving when doing bind-mounts. This allows the attacker
to (among other things) persist /proc/... entries that are later masked
by runc, allowing an attacker to escape through the kernel.core_pattern
sysctl (/proc/sys/kernel/core_pattern). This is the original issue
reported by Lei Wang and Li Fu Bang in CVE-2025-52565.
However, it should be noted that this is not entirely a newly-discovered
problem. Way back in Linux 4.13 (2017), I added the TIOCGPTPEER ioctl,
which allows us to get a pty peer without touching the /dev/pts inside
the container. The original threat model was around an attacker
replacing /dev/pts/$n or /dev/pts/ptmx with some malicious inode (a DoS
inode, or possibly a PTY they wanted a confused deputy to operate on).
Unfortunately, there was no practical way for runc to cache a safe
O_PATH handle to /dev/pts/ptmx (unlike other runtimes like LXC, which
switched to TIOCGPTPEER way back in 2017). Since it wasn't clear how we
could protect against the main attack TIOCGPTPEER was meant to protect
against, we never switched to it (even though I implemented it
specifically to harden container runtimes).
Unfortunately, It turns out that mount *sources* are a threat we didn't
fully consider. Since TIOCGPTPEER already solves this problem entirely
for us in a race free way, we should just use that. In a later patch, we
will add some hardening for /dev/pts/$num opening to maintain support
for very old kernels (Linux 4.13 is very old at this point, but RHEL 7
is still kicking and is stuck on Linux 3.10).
Fixes: GHSA-qw9x-cqr3-wc7r CVE-2025-52565
Reported-by: Lei Wang <ssst0n3@gmail.com> (CVE-2025-52565)
Reported-by: lfbzhm <lifubang@acmcoder.com> (CVE-2025-52565)
Reported-by: Aleksa Sarai <cyphar@cyphar.com> (TIOCGPTPEER)
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Currently, we rely on mount returning ENOTDIR when the destination is a
directory (and so mount tells us that the source is not), and fall back
to read-only tmpfs bind mount for such cases.
Theoretically, ENOTDIR can also be returned in some other cases,
resulting in the wrong type of mount being used.
Let's be more straightforward here -- call fstat on destination file
descriptor, and use the proper mount depending on whether it is a
directory.
Reported-by: Rodrigo Campos <rodrigoca@microsoft.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
When mounting a path being masked, the /dev/null might disappear from
under us, and mount (even on an opened /dev/null file descriptor) will
return ENOENT, which we deliberately ignore, as there's no need to mask
non-existent paths.
Let's open the destination path and ignore ENOENT during open, then
mount via the destination file descriptor, not ignoring ENOENT.
Reported-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The /dev/null in a container should not be trusted, because when /dev
is a bind mount, /dev/null is not created by runc itself.
1. Add isDevNull which checks the fd minor/major and device type,
and verifyDevNull which does the stat and the check.
2. Rewrite maskPath to open and check /dev/null, and use its fd to
perform mounts. Move the loop over the MaskPaths into the function,
and rename it to maskPaths.
3. reOpenDevNull: use verifyDevNull and isDevNull.
4. fixStdioPermissions: use isDevNull instead of stat.
Fixes: GHSA-9493-h29p-rfm2 CVE-2025-31133
Co-authored-by: Rodrigo Campos <rodrigoca@microsoft.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
filepath-securejoin v0.3 gave us a much safer re-open primitive, we
should use it to avoid any theoretical attacks. Rather than using it
direcly, add a small pathrs wrapper to make libpathrs migrations in the
future easier...
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
We will have more wrappers around filepath-securejoin, and so move them
to their own specific package so that we can eventually use libpathrs
fairly cleanly (by swapping out the implementation).
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Introduce and use iobail, xread, and xwrite wrappers so that we can
properly check read/write return value and call either bail or bailx on
error, with proper diagnostics (distinguishing failed read/write from a
short read/write).
This prevents the "Success" prefix in errors like:
failed to sync with stage-1: next state: Success
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Add a few missing sane_kill calls where they make sense.
Remove one useless sane_kill of stage2_pid, as during SYNC_USERMAP stage2
is not yet started. It is harmless yet it makes the code slightly harder
to read.
Set the child pid to -1 upon receiving SYNC_CHILD_FINISH
to minimize the chances of killing an unrelated process.
When a child sends SYNC_CHILD_FINISH it is about to exit
(although theoretically it could be stuck during debug logging).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
We use bail to report fatal errors, and bail always append %m
(aka strerror(errno)). In case an error condition did not set
errno, the log message will end up with ": Success" or an error
from a stale errno value. Either case is confusing for users.
Introduce bailx which is the same as bail except it does not
append %m, and use it where appropriate.
The naming follows libc's err(3) and errx(3).
PS we still use bail in a few cases after read or write, even
if that read/write did not return an error, because the code
does not distinguish between short read/write and error (-1).
This will be addressed by the next commit.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Since sane_kill after a failed read or write, but before reporting the
error from that read or write, it may change the errno value in case
kill(2) fails.
Save and restore the errno around the call to kill.
While at it,
- change the code to return early;
- don't return kill return value as no one is using it, and the errno
value no longer correlates.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>