3014 Commits

Author SHA1 Message Date
Kir Kolyshkin
652269729d libc/int: use strings.Builder
Generated by modernize@latest (v0.21.0).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-12-16 15:04:04 -08:00
Akihiro Suda
4dcda051da Merge pull request #5055 from kolyshkin/mpol-2
libct/configs: mark MPOL_* constants as deprecated
2025-12-16 10:39:09 +09:00
Curd Becker
536e183451 Replace os.Is* error checking functions with their errors.Is counterpart
Signed-off-by: Curd Becker <me@curd-becker.de>
2025-12-11 03:16:02 +01:00
Kir Kolyshkin
3741f9186d libct/configs: mark MPOL_* constants as deprecated
Alas, these new constants are already in v1.4.0 release so we can't
remove those right away, but we can mark them as deprecated now
and target removal for v1.5.0.

So,
 - mark them as deprecated;
 - redefine via unix.MPOL_* counterparts;
 - fix the validator code to use unix.MPOL_* directly.

This amends commit a0e809a8.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-12-08 15:36:29 -08:00
Kir Kolyshkin
8a9b4dcda6 libct: mountFd: close mountFile on error
Reported in issue 5008.

Reported-by: Arina Cherednik <arinacherednik034@gmail.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-12-02 15:15:23 -08:00
Kir Kolyshkin
c24965b742 libct: newProcessComm: close fds on error
Reported in issue 5008.

Reported-by: Arina Cherednik <arinacherednik034@gmail.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-12-02 15:15:23 -08:00
Kir Kolyshkin
88f897160c libct: startInitialization: add defer close
This function calls Init what normally never returns, so the defer only
works if there is an error and we can safely use it to close those fds
we opened. This was done for most but not all fds.

Reported in issue 5008.

Reported-by: Arina Cherednik <arinacherednik034@gmail.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-12-02 15:15:23 -08:00
Kir Kolyshkin
1f1ff4be06 Merge pull request #5051 from cyphar/libct-utils-deprecated
libct/utils: remove Deprecated functions
2025-12-02 15:06:01 -08:00
Akihiro Suda
64c3c8eea6 Merge pull request #4994 from kolyshkin/gofumpt-extra
Enable gofumpt extra rules
2025-11-28 09:30:57 +09:00
Aleksa Sarai
a412bd93e9 libct/utils: remove Deprecated functions
These were all marked for deprecation in runc 1.5.0, so remove them now
to make sure we don't forget.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-28 11:11:11 +11:00
Aleksa Sarai
195e9551e4 pathrs: add MkdirAllParentInRoot helper
While CreateInRoot supports hallucinating the target path, we do not use
it directly when constructing device inode targets because we need to
have different handling for mknod and bind-mounts.

The solution is to simply have a more generic MkdirAllParentInRoot
helper that MkdirAll's the parent directory of the target path and then
allows the caller to create the trailing component however they like.
(This can be used by CreateInRoot internally as well!)

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-26 21:04:05 +11:00
Aleksa Sarai
cfb74326be pathrs: add "hallucination" helpers for SecureJoin magic
In order to maintain compatibility with previous releases of runc (which
permitted dangling symlinks as path components by permitting
non-existent path components to be treated like real directories) we
have to first do SecureJoin to construct a target path that is
compatible with the old behaviour but has all dangling symlinks (or
other invalid paths like ".." components after non-existent directories)
removed.

This is effectively a more generic verison of commit 3f925525b4
("rootfs: re-allow dangling symlinks in mount targets") and will let us
remove the need for open-coding SecureJoin workarounds.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-26 21:04:05 +11:00
Aleksa Sarai
20c5a8ec4a pathrs: rename MkdirAllInRootOpen -> MkdirAllInRoot
Now that MkdirAllInRoot has been removed, we can make MkdirAllInRootOpen
less wordy by renaming it to MkdirAllInRoot. This is a non-functional
change.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-26 21:04:04 +11:00
Aleksa Sarai
9dbd37e06f libct: switch final WithProcfd users to WithProcfdFile
This probably should've been done as part of commit d40b3439a9
("rootfs: switch to fd-based handling of mountpoint targets") but it
seems I missed them when doing the rest of the conversions.

This also lets us remove utils.WithProcfd entirely, as well as
pathrs.MkdirAllInRoot. Unfortunately, WithProcfd was exposed in the
externally-importable "libcontainer/utils" package and so we need to
have a deprecation notice to remove it in runc 1.5.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-26 21:03:30 +11:00
Aleksa Sarai
42a1e19d67 libcontainer: move CleanPath and StripRoot to internal/pathrs
These helpers will be needed for the compatibility code added in future
patches in this series, but because "internal/pathrs" is imported by
"libcontainer/utils" we need to move them so that we can avoid circular
dependencies.

Because the old functions were in a non-internal package it is possible
some downstreams use them, so add some wrappers but mark them as
deprecated.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-26 21:03:29 +11:00
Akihiro Suda
475473d869 Merge pull request #5026 from lifubang/ci-detect-fdleak-try-best
fix fd leaks and detect them as comprehensively as possible
2025-11-26 08:33:53 +09:00
Kir Kolyshkin
f944ccecb2 runc create/run/exec: show fatal errors from init
In case early stage of runc init (nsenter) fails for some reason, it
logs error(s) with FATAL log level, via bail().

The runc init log is read by a parent (runc create/run/exec) and is
logged via normal logrus mechanism, which is all fine and dandy, except
when `runc init` fails, we return the error from the parent (which is
usually not too helpful, for example):

	runc run failed: unable to start container process: can't get final child's PID from pipe: EOF

Now, the actual underlying error is from runc init and it was logged
earlier; here's how full runc output looks like:

	FATA[0000] nsexec-1[3247792]: failed to unshare remaining namespaces: No space left on device
	FATA[0000] nsexec-0[3247790]: failed to sync with stage-1: next state
	ERRO[0000] runc run failed: unable to start container process: can't get final child's PID from pipe: EOF

The problem is, upper level runtimes tend to ignore everything except
the last line from runc, and thus error reported by e.g. docker is not
very helpful.

This patch tries to improve the situation by collecting FATAL errors
from runc init and appending those to the error returned (instead of
logging). With it, the above error will look like this:

	ERRO[0000] runc run failed: unable to start container process: can't get final child's PID from pipe: EOF; runc init error(s): nsexec-1[141549]: failed to unshare remaining namespaces: No space left on device; nsexec-0[141547]: failed to sync with stage-1: next state

Yes, it is long and ugly, but at least the upper level runtime will
report it.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-11-22 12:11:20 +07:00
lifubang
6ac151d69b libct: add a defer fd close in createDeviceNode
Signed-off-by: lifubang <lifubang@acmcoder.com>
2025-11-20 19:43:22 +08:00
lifubang
69785c117c libct: always close m.dstFile in mountToRootfs
Signed-off-by: lifubang <lifubang@acmcoder.com>
2025-11-20 19:43:22 +08:00
Kir Kolyshkin
5fbc3bb019 libct/int: TestFdLeaks: deflake
Since the recent CVE fixes, TestFdLeaksSystemd sometimes fails:

	=== RUN   TestFdLeaksSystemd
	    exec_test.go:1750: extra fd 9 -> /12224/task/13831/fd
	    exec_test.go:1753: found 1 extra fds after container.Run
	--- FAIL: TestFdLeaksSystemd (0.10s)

It might have been caused by the change to the test code in commit
ff6fe13 ("utils: use safe procfs for /proc/self/fd loop code") -- we are
now opening a file descriptor during the logic to get a list of file
descriptors. If the file descriptor happens to be allocated to a
different number, you'll get an error.

Let's try to filter out the fd used to read a directory.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-11-20 15:36:14 +08:00
Aleksa Sarai
3b75374cc7 runtime-spec: update pids.limit handling to match new guidance
The main update is actually in github.com/opencontainers/cgroups, but we
need to also update runtime-spec to a newer pre-release version to get
the updates from there as well.

In short, the behaviour change is now that "0" is treated as a valid
value to set in "pids.max", "-1" means "max" and unset/nil means "do
nothing". As described in the opencontainers/cgroups PR, this change is
actually backwards compatible because our internal state.json stores
PidsLimit, and that entry is marked as "omitempty". So, an old runc
would omit PidsLimit=0 in state.json, and this will be parsed by a new
runc as being "nil" -- and both would treat this case as "do not set
anything".

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-11 15:15:27 +11:00
Kir Kolyshkin
67840cce4b Enable gofumpt extra rules
Commit b2f8a74d "clothed" the naked return as inflicted by gofumpt
v0.9.0. Since gofumpt v0.9.2 this rule was moved to "extra" category,
not enabled by default. The only other "extra" rule is to group adjacent
parameters with the same type, which also makes sense.

Enable gofumpt "extra" rules, and reformat the code accordingly.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-11-10 13:18:45 -08:00
Aleksa Sarai
eec1f7e34b merge #4973 into opencontainers/runc:main
Aleksa Sarai (1):
  rootfs: only set mode= for tmpfs mount if target already existed

LGTMS: lifubang thaJeztah
2025-11-11 03:10:01 +11:00
Aleksa Sarai
a0e809a8ba libct: switch to unix.SetMemPolicy wrapper
This is mostly a mechanical change, but we also need to change some
types to match the "mode int" argument that golang.org/x/sys/unix
decided to use.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-10 16:03:02 +11:00
Aleksa Sarai
9a9719eeb4 rootfs: only set mode= for tmpfs mount if target already existed
This was always the intended behaviour but commit 72fbb34f50 ("rootfs:
switch to fd-based handling of mountpoint targets") regressed it when
adding a mechanism to create a file handle to the target if it didn't
already exist (causing the later stat to always succeed).

A lot of people depend on this functionality, so add some tests to make
sure we don't break it in the future.

Fixes: 72fbb34f50 ("rootfs: switch to fd-based handling of mountpoint targets")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-08 23:11:57 +11:00
Kir Kolyshkin
1b954f1f06 libct: fix mips compilation
On MIPS arches, Rdev is uint32 so we have to convert it.

Fixes issue 4962.

Fixes: 8476df83 ("libct: add/use isDevNull, verifyDevNull")
Fixes: de87203e ("console: verify /dev/pts/ptmx before use")
Fixes: 398955bc ("console: add fallback for pre-TIOCGPTPEER kernels")
Reported-by: Tianon Gravi <admwiggin@gmail.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-11-05 17:56:14 -08:00
Aleksa Sarai
3f925525b4 rootfs: re-allow dangling symlinks in mount targets
It seems there are a fair few images where dangling symlinks are used as
path components for mount targets, which pathrs-lite does not support
(and it would be difficult to fully support this in a race-free way).

This was actually meant to be blocked by commit 63c2908164 ("rootfs:
try to scope MkdirAll to stay inside the rootfs"), followed by commit
dd827f7b71 ("utils: switch to securejoin.MkdirAllHandle"). However, we
still used SecureJoin to construct mountpoint targets, which means that
dangling symlinks were "resolved" before reaching pathrs-lite.

This patch basically re-adds this hack in order to reduce the breakages
we've seen so far.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-05 18:58:07 +11:00
Aleksa Sarai
d40b3439a9 rootfs: switch to fd-based handling of mountpoint targets
An attacker could race with us during mount configuration in order to
trick us into mounting over an unexpected path. This would bypass
checkProcMount() and would allow for security profiles to be left
unapplied by mounting over /proc/self/attr/... (or even more serious
outcomes such as killing the entire system by tricking runc into writing
strings to /proc/sysrq-trigger).

This is a larger issue with our current mount infrastructure, and the
ideal solution would be to rewrite it all to be fd-based (which would
also allow us to support the "new" mount API, which also avoids a bunch
of other issues with mount(8)). However, such a rewrite is not really
workable as a security fix, so this patch is a bit of a compromise
approach to fix the issue while also moving us a bit towards that
eventual end-goal.

The core issue in CVE-2025-52881 is that we currently use the (insecure)
SecureJoin to re-resolve mountpoint target paths multiple times during
mounting. Rather than generating a string from createMountpoint(), we
instead open an *os.File handle to the target mountpoint directly and
then operate on that handle. This will make it easier to remove
utils.WithProcfd() and rework mountViaFds() in the future.

The only real issue we need to work around is that we need to re-open
the mount target after doing the mount in order to get a handle to the
mountpoint -- pathrs.Reopen() doesn't work in this case (it just
re-opens the inode under the mountpoint) so we need to do a naive
re-open using the full path. Note that if we used move_mount(2) this
wouldn't be a problem because we would have a handle to the mountpoint
itself.

Note that this is still somewhat of a temporary solution -- ideally
mountViaFds would use *os.File directly to let us avoid some other
issues with using bare /proc/... paths, as well as also letting us more
easily use the new mount API on modern kernels.

Fixes: GHSA-cgrx-mc8f-2prm CVE-2025-52881
Co-developed-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-01 21:24:06 +11:00
lifubang
4b37cd93f8 libct: align param type for mountCgroupV1/V2 functions
Signed-off-by: lifubang <lifubang@acmcoder.com>
2025-11-01 21:24:05 +11:00
Aleksa Sarai
d61fd29d85 libct/system: use securejoin for /proc/$pid/stat
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-01 21:24:05 +11:00
Aleksa Sarai
435cc81be6 init: use securejoin for /proc/self/setgroups
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-01 21:24:05 +11:00
Aleksa Sarai
77d217c7c3 init: write sysctls using safe procfs API
sysctls could in principle also be used as a write gadget for arbitrary
procfs files. As this requires getting a non-subset=pid /proc handle we
amortise this by only allocating a single procfs handle for all sysctl
writes.

Fixes: GHSA-cgrx-mc8f-2prm CVE-2025-52881
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-01 21:24:05 +11:00
Aleksa Sarai
b3dd1bc562 utils: remove unneeded EnsureProcHandle
All of the callers of EnsureProcHandle now use filepath-securejoin's
ProcThreadSelf to get a file handle, which has much stricter
verification to avoid procfs attacks than EnsureProcHandle's very
simplistic filesystem type check.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-01 21:24:05 +11:00
Aleksa Sarai
ff6fe13246 utils: use safe procfs for /proc/self/fd loop code
From a safety perspective this might not be strictly required, but it
paves the way for us to remove utils.ProcThreadSelf.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-01 21:24:04 +11:00
Aleksa Sarai
fdcc9d3cad apparmor: use safe procfs API for labels
EnsureProcHandle only protects us against a tmpfs mount, but the risk of
a procfs path being used (such as /proc/self/sched) has been known for a
while. Now that filepath-securejoin has a reasonably safe procfs API,
switch to it.

Fixes: GHSA-cgrx-mc8f-2prm CVE-2025-52881
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-01 21:24:04 +11:00
Aleksa Sarai
aee7d3fe35 ci: add lint to forbid the usage of os.Create
os.Create is shorthand for open(O_CREAT|O_TRUNC) *without* O_EXCL, which
is incredibly unsafe for us to do when interacting with a container
rootfs (especially before pivot_root) as an attacker could swap the
target path with a symlink that points to the host filesystem, causing
us to delete the contents of or create host files.

We did have a similar bug in CVE-2024-45310, but in that case we
(luckily) didn't have O_TRUNC set which avoided the worst possible case.
However, os.Create does set O_TRUNC and we were using it in scenarios
that may have been exploitable.

Because of how easy it us for us to accidentally introduce this kind of
bug, we should simply not allow the usage of os.Create in our entire
codebase.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-01 21:24:04 +11:00
Aleksa Sarai
01de9d65dc rootfs: avoid using os.Create for new device inodes
If an attacker were to make the target of a device inode creation be a
symlink to some host path, os.Create would happily truncate the target
which could lead to all sorts of issues. This exploit is probably not as
exploitable because device inodes are usually only bind-mounted for
rootless containers, which cannot overwrite important host files (though
user files would still be up for grabs).

The regular inode creation logic could also theoretically be tricked
into changing the access mode and ownership of host files if the
newly-created device inode was swapped with a symlink to a host path.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-01 21:24:04 +11:00
Aleksa Sarai
de87203e62 console: verify /dev/pts/ptmx before use
This is primarily done out of an abudance of caution against runc exec
being attacked by a container where /dev/pts/ptmx has been replaced with
some other bad inode (a disconnected NFS handle, a symlink that goes
through a leaked runc file descriptor to reference a host ptmx, etc).

Unfortunately, we cannot trivially verify that /dev/pts/ptmx is actually
the /dev/pts from the container without storing stuff like the fsid in
the runc state.json, which is probably not worth the extra effort. This
should at least avoid the most concerning cases.

Reported-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-01 21:24:03 +11:00
Aleksa Sarai
9be1dbf4ac console: avoid trivial symlink attacks for /dev/console
An attacker could make /dev/console a symlink. This presents two
possible issues:

 1. os.Create will happily truncate targets, which could have resulted
    in a worse version of CVE-2024-4531. Luckily, this all happens after
    pivot_root(2) so the scope of that particular attack is fairly
    limited (you are unlikely to be able to easily access host rootfs
    files -- though it might be possible to take advantage of leaks such
    as in CVE-2024-21626). However, O_CREAT|O_NOFOLLOW is what we should
    be doing for all file creations.

 2. Because we passed /dev/console as the only mount path (as opposed to
    using a /proc/self/fd/$n path), an attacker could swap the symlink
    to point to any other path and thus cause us to mount over some
    other path. This is not as big of a problem because all the mounts
    are in the container namespace after pivot_root(2), and users
    usually can create arbitrary mount targets inside the container.

These issues don't seem particularly exploitable, but they deserve to be
hardened regardless.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-01 21:24:03 +11:00
Aleksa Sarai
398955bccb console: add fallback for pre-TIOCGPTPEER kernels
The pty driver has very consistent allocation rules for the major:minor
numbers of /dev/pts/$n inodes, so it is possible to somewhat safely open
/dev/pts/* paths if we validate that the inode is the one we expect.

It is possible for an attacker to have over-mounted a pts peer from a
different devpts instance, but to fix this would require more tracking
of devpts instances than runc currently can do.

This means runc should continue to work on very old kernels.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-01 21:24:03 +11:00
Aleksa Sarai
531ef794e4 console: use TIOCGPTPEER when allocating peer PTY
When opening the peer end of a pty, the old kernel API required us to
open /dev/pts/$num inside the container (at least since we fixed console
handling many years ago in commit 244c9fc426 ("*: console rewrite")).

The problem is that in a hostile container it is possible for
/dev/pts/$num to be an attacker-controlled symlink that runc can be
tricked into resolving when doing bind-mounts. This allows the attacker
to (among other things) persist /proc/... entries that are later masked
by runc, allowing an attacker to escape through the kernel.core_pattern
sysctl (/proc/sys/kernel/core_pattern). This is the original issue
reported by Lei Wang and Li Fu Bang in CVE-2025-52565.

However, it should be noted that this is not entirely a newly-discovered
problem. Way back in Linux 4.13 (2017), I added the TIOCGPTPEER ioctl,
which allows us to get a pty peer without touching the /dev/pts inside
the container. The original threat model was around an attacker
replacing /dev/pts/$n or /dev/pts/ptmx with some malicious inode (a DoS
inode, or possibly a PTY they wanted a confused deputy to operate on).
Unfortunately, there was no practical way for runc to cache a safe
O_PATH handle to /dev/pts/ptmx (unlike other runtimes like LXC, which
switched to TIOCGPTPEER way back in 2017). Since it wasn't clear how we
could protect against the main attack TIOCGPTPEER was meant to protect
against, we never switched to it (even though I implemented it
specifically to harden container runtimes).

Unfortunately, It turns out that mount *sources* are a threat we didn't
fully consider. Since TIOCGPTPEER already solves this problem entirely
for us in a race free way, we should just use that. In a later patch, we
will add some hardening for /dev/pts/$num opening to maintain support
for very old kernels (Linux 4.13 is very old at this point, but RHEL 7
is still kicking and is stuck on Linux 3.10).

Fixes: GHSA-qw9x-cqr3-wc7r CVE-2025-52565
Reported-by: Lei Wang <ssst0n3@gmail.com> (CVE-2025-52565)
Reported-by: lfbzhm <lifubang@acmcoder.com> (CVE-2025-52565)
Reported-by: Aleksa Sarai <cyphar@cyphar.com> (TIOCGPTPEER)
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-01 21:24:03 +11:00
Kir Kolyshkin
5d7b242407 libct: maskPaths: don't rely on ENOTDIR for mount
Currently, we rely on mount returning ENOTDIR when the destination is a
directory (and so mount tells us that the source is not), and fall back
to read-only tmpfs bind mount for such cases.

Theoretically, ENOTDIR can also be returned in some other cases,
resulting in the wrong type of mount being used.

Let's be more straightforward here -- call fstat on destination file
descriptor, and use the proper mount depending on whether it is a
directory.

Reported-by: Rodrigo Campos <rodrigoca@microsoft.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-11-01 21:24:02 +11:00
Kir Kolyshkin
1a30a8f3d9 libct: maskPaths: only ignore ENOENT on mount dest
When mounting a path being masked, the /dev/null might disappear from
under us, and mount (even on an opened /dev/null file descriptor) will
return ENOENT, which we deliberately ignore, as there's no need to mask
non-existent paths.

Let's open the destination path and ignore ENOENT during open, then
mount via the destination file descriptor, not ignoring ENOENT.

Reported-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-11-01 21:24:02 +11:00
Kir Kolyshkin
8476df83b5 libct: add/use isDevNull, verifyDevNull
The /dev/null in a container should not be trusted, because when /dev
is a bind mount, /dev/null is not created by runc itself.

1. Add isDevNull which checks the fd minor/major and device type,
   and verifyDevNull which does the stat and the check.

2. Rewrite maskPath to open and check /dev/null, and use its fd to
   perform mounts. Move the loop over the MaskPaths into the function,
   and rename it to maskPaths.

3. reOpenDevNull: use verifyDevNull and isDevNull.

4. fixStdioPermissions: use isDevNull instead of stat.

Fixes: GHSA-9493-h29p-rfm2 CVE-2025-31133
Co-authored-by: Rodrigo Campos <rodrigoca@microsoft.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-01 21:24:02 +11:00
Aleksa Sarai
ff94f9991b *: switch to safer securejoin.Reopen
filepath-securejoin v0.3 gave us a much safer re-open primitive, we
should use it to avoid any theoretical attacks. Rather than using it
direcly, add a small pathrs wrapper to make libpathrs migrations in the
future easier...

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-01 21:24:02 +11:00
Aleksa Sarai
6fc1914491 internal: move utils.MkdirAllInRoot to internal/pathrs
We will have more wrappers around filepath-securejoin, and so move them
to their own specific package so that we can eventually use libpathrs
fairly cleanly (by swapping out the implementation).

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-01 21:24:02 +11:00
Kir Kolyshkin
6c18b25cdc libct/nsenter: better read/write errors
Introduce and use iobail, xread, and xwrite wrappers so that we can
properly check read/write return value and call either bail or bailx on
error, with proper diagnostics (distinguishing failed read/write from a
short read/write).

This prevents the "Success" prefix in errors like:

	failed to sync with stage-1: next state: Success

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-10-28 17:26:09 -07:00
Kir Kolyshkin
aea52d0ab0 libct/nsenter: sprinkle missing sane_kill
Add a few missing sane_kill calls where they make sense.

Remove one useless sane_kill of stage2_pid, as during SYNC_USERMAP stage2
is not yet started. It is harmless yet it makes the code slightly harder
to read.

Set the child pid to -1 upon receiving SYNC_CHILD_FINISH
to minimize the chances of killing an unrelated process.
When a child sends SYNC_CHILD_FINISH it is about to exit
(although theoretically it could be stuck during debug logging).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-10-28 17:21:38 -07:00
Kir Kolyshkin
067b8335e7 libct/nsenter: add and use bailx
We use bail to report fatal errors, and bail always append %m
(aka strerror(errno)). In case an error condition did not set
errno, the log message will end up with ": Success" or an error
from a stale errno value. Either case is confusing for users.

Introduce bailx which is the same as bail except it does not
append %m, and use it where appropriate.

The naming follows libc's err(3) and errx(3).

PS we still use bail in a few cases after read or write, even
if that read/write did not return an error, because the code
does not distinguish between short read/write and error (-1).
This will be addressed by the next commit.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-10-28 17:21:38 -07:00
Kir Kolyshkin
9c8f476cb6 libct/nsenter: save errno in sane_kill
Since sane_kill after a failed read or write, but before reporting the
error from that read or write, it may change the errno value in case
kill(2) fails.

Save and restore the errno around the call to kill.

While at it,
 - change the code to return early;
 - don't return kill return value as no one is using it, and the errno
   value no longer correlates.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-10-28 17:21:25 -07:00