Commit Graph

230 Commits

Author SHA1 Message Date
Curd Becker
536e183451 Replace os.Is* error checking functions with their errors.Is counterpart
Signed-off-by: Curd Becker <me@curd-becker.de>
2025-12-11 03:16:02 +01:00
Akihiro Suda
64c3c8eea6 Merge pull request #4994 from kolyshkin/gofumpt-extra
Enable gofumpt extra rules
2025-11-28 09:30:57 +09:00
Aleksa Sarai
195e9551e4 pathrs: add MkdirAllParentInRoot helper
While CreateInRoot supports hallucinating the target path, we do not use
it directly when constructing device inode targets because we need to
have different handling for mknod and bind-mounts.

The solution is to simply have a more generic MkdirAllParentInRoot
helper that MkdirAll's the parent directory of the target path and then
allows the caller to create the trailing component however they like.
(This can be used by CreateInRoot internally as well!)

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-26 21:04:05 +11:00
Aleksa Sarai
cfb74326be pathrs: add "hallucination" helpers for SecureJoin magic
In order to maintain compatibility with previous releases of runc (which
permitted dangling symlinks as path components by permitting
non-existent path components to be treated like real directories) we
have to first do SecureJoin to construct a target path that is
compatible with the old behaviour but has all dangling symlinks (or
other invalid paths like ".." components after non-existent directories)
removed.

This is effectively a more generic verison of commit 3f925525b4
("rootfs: re-allow dangling symlinks in mount targets") and will let us
remove the need for open-coding SecureJoin workarounds.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-26 21:04:05 +11:00
Aleksa Sarai
20c5a8ec4a pathrs: rename MkdirAllInRootOpen -> MkdirAllInRoot
Now that MkdirAllInRoot has been removed, we can make MkdirAllInRootOpen
less wordy by renaming it to MkdirAllInRoot. This is a non-functional
change.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-26 21:04:04 +11:00
Aleksa Sarai
9dbd37e06f libct: switch final WithProcfd users to WithProcfdFile
This probably should've been done as part of commit d40b3439a9
("rootfs: switch to fd-based handling of mountpoint targets") but it
seems I missed them when doing the rest of the conversions.

This also lets us remove utils.WithProcfd entirely, as well as
pathrs.MkdirAllInRoot. Unfortunately, WithProcfd was exposed in the
externally-importable "libcontainer/utils" package and so we need to
have a deprecation notice to remove it in runc 1.5.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-26 21:03:30 +11:00
Aleksa Sarai
42a1e19d67 libcontainer: move CleanPath and StripRoot to internal/pathrs
These helpers will be needed for the compatibility code added in future
patches in this series, but because "internal/pathrs" is imported by
"libcontainer/utils" we need to move them so that we can avoid circular
dependencies.

Because the old functions were in a non-internal package it is possible
some downstreams use them, so add some wrappers but mark them as
deprecated.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-26 21:03:29 +11:00
lifubang
6ac151d69b libct: add a defer fd close in createDeviceNode
Signed-off-by: lifubang <lifubang@acmcoder.com>
2025-11-20 19:43:22 +08:00
lifubang
69785c117c libct: always close m.dstFile in mountToRootfs
Signed-off-by: lifubang <lifubang@acmcoder.com>
2025-11-20 19:43:22 +08:00
Kir Kolyshkin
67840cce4b Enable gofumpt extra rules
Commit b2f8a74d "clothed" the naked return as inflicted by gofumpt
v0.9.0. Since gofumpt v0.9.2 this rule was moved to "extra" category,
not enabled by default. The only other "extra" rule is to group adjacent
parameters with the same type, which also makes sense.

Enable gofumpt "extra" rules, and reformat the code accordingly.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-11-10 13:18:45 -08:00
Aleksa Sarai
9a9719eeb4 rootfs: only set mode= for tmpfs mount if target already existed
This was always the intended behaviour but commit 72fbb34f50 ("rootfs:
switch to fd-based handling of mountpoint targets") regressed it when
adding a mechanism to create a file handle to the target if it didn't
already exist (causing the later stat to always succeed).

A lot of people depend on this functionality, so add some tests to make
sure we don't break it in the future.

Fixes: 72fbb34f50 ("rootfs: switch to fd-based handling of mountpoint targets")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-08 23:11:57 +11:00
Kir Kolyshkin
1b954f1f06 libct: fix mips compilation
On MIPS arches, Rdev is uint32 so we have to convert it.

Fixes issue 4962.

Fixes: 8476df83 ("libct: add/use isDevNull, verifyDevNull")
Fixes: de87203e ("console: verify /dev/pts/ptmx before use")
Fixes: 398955bc ("console: add fallback for pre-TIOCGPTPEER kernels")
Reported-by: Tianon Gravi <admwiggin@gmail.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-11-05 17:56:14 -08:00
Aleksa Sarai
3f925525b4 rootfs: re-allow dangling symlinks in mount targets
It seems there are a fair few images where dangling symlinks are used as
path components for mount targets, which pathrs-lite does not support
(and it would be difficult to fully support this in a race-free way).

This was actually meant to be blocked by commit 63c2908164 ("rootfs:
try to scope MkdirAll to stay inside the rootfs"), followed by commit
dd827f7b71 ("utils: switch to securejoin.MkdirAllHandle"). However, we
still used SecureJoin to construct mountpoint targets, which means that
dangling symlinks were "resolved" before reaching pathrs-lite.

This patch basically re-adds this hack in order to reduce the breakages
we've seen so far.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-05 18:58:07 +11:00
Aleksa Sarai
d40b3439a9 rootfs: switch to fd-based handling of mountpoint targets
An attacker could race with us during mount configuration in order to
trick us into mounting over an unexpected path. This would bypass
checkProcMount() and would allow for security profiles to be left
unapplied by mounting over /proc/self/attr/... (or even more serious
outcomes such as killing the entire system by tricking runc into writing
strings to /proc/sysrq-trigger).

This is a larger issue with our current mount infrastructure, and the
ideal solution would be to rewrite it all to be fd-based (which would
also allow us to support the "new" mount API, which also avoids a bunch
of other issues with mount(8)). However, such a rewrite is not really
workable as a security fix, so this patch is a bit of a compromise
approach to fix the issue while also moving us a bit towards that
eventual end-goal.

The core issue in CVE-2025-52881 is that we currently use the (insecure)
SecureJoin to re-resolve mountpoint target paths multiple times during
mounting. Rather than generating a string from createMountpoint(), we
instead open an *os.File handle to the target mountpoint directly and
then operate on that handle. This will make it easier to remove
utils.WithProcfd() and rework mountViaFds() in the future.

The only real issue we need to work around is that we need to re-open
the mount target after doing the mount in order to get a handle to the
mountpoint -- pathrs.Reopen() doesn't work in this case (it just
re-opens the inode under the mountpoint) so we need to do a naive
re-open using the full path. Note that if we used move_mount(2) this
wouldn't be a problem because we would have a handle to the mountpoint
itself.

Note that this is still somewhat of a temporary solution -- ideally
mountViaFds would use *os.File directly to let us avoid some other
issues with using bare /proc/... paths, as well as also letting us more
easily use the new mount API on modern kernels.

Fixes: GHSA-cgrx-mc8f-2prm CVE-2025-52881
Co-developed-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-01 21:24:06 +11:00
lifubang
4b37cd93f8 libct: align param type for mountCgroupV1/V2 functions
Signed-off-by: lifubang <lifubang@acmcoder.com>
2025-11-01 21:24:05 +11:00
Aleksa Sarai
77d217c7c3 init: write sysctls using safe procfs API
sysctls could in principle also be used as a write gadget for arbitrary
procfs files. As this requires getting a non-subset=pid /proc handle we
amortise this by only allocating a single procfs handle for all sysctl
writes.

Fixes: GHSA-cgrx-mc8f-2prm CVE-2025-52881
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-01 21:24:05 +11:00
Aleksa Sarai
01de9d65dc rootfs: avoid using os.Create for new device inodes
If an attacker were to make the target of a device inode creation be a
symlink to some host path, os.Create would happily truncate the target
which could lead to all sorts of issues. This exploit is probably not as
exploitable because device inodes are usually only bind-mounted for
rootless containers, which cannot overwrite important host files (though
user files would still be up for grabs).

The regular inode creation logic could also theoretically be tricked
into changing the access mode and ownership of host files if the
newly-created device inode was swapped with a symlink to a host path.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-01 21:24:04 +11:00
Kir Kolyshkin
5d7b242407 libct: maskPaths: don't rely on ENOTDIR for mount
Currently, we rely on mount returning ENOTDIR when the destination is a
directory (and so mount tells us that the source is not), and fall back
to read-only tmpfs bind mount for such cases.

Theoretically, ENOTDIR can also be returned in some other cases,
resulting in the wrong type of mount being used.

Let's be more straightforward here -- call fstat on destination file
descriptor, and use the proper mount depending on whether it is a
directory.

Reported-by: Rodrigo Campos <rodrigoca@microsoft.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-11-01 21:24:02 +11:00
Kir Kolyshkin
1a30a8f3d9 libct: maskPaths: only ignore ENOENT on mount dest
When mounting a path being masked, the /dev/null might disappear from
under us, and mount (even on an opened /dev/null file descriptor) will
return ENOENT, which we deliberately ignore, as there's no need to mask
non-existent paths.

Let's open the destination path and ignore ENOENT during open, then
mount via the destination file descriptor, not ignoring ENOENT.

Reported-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-11-01 21:24:02 +11:00
Kir Kolyshkin
8476df83b5 libct: add/use isDevNull, verifyDevNull
The /dev/null in a container should not be trusted, because when /dev
is a bind mount, /dev/null is not created by runc itself.

1. Add isDevNull which checks the fd minor/major and device type,
   and verifyDevNull which does the stat and the check.

2. Rewrite maskPath to open and check /dev/null, and use its fd to
   perform mounts. Move the loop over the MaskPaths into the function,
   and rename it to maskPaths.

3. reOpenDevNull: use verifyDevNull and isDevNull.

4. fixStdioPermissions: use isDevNull instead of stat.

Fixes: GHSA-9493-h29p-rfm2 CVE-2025-31133
Co-authored-by: Rodrigo Campos <rodrigoca@microsoft.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-01 21:24:02 +11:00
Aleksa Sarai
6fc1914491 internal: move utils.MkdirAllInRoot to internal/pathrs
We will have more wrappers around filepath-securejoin, and so move them
to their own specific package so that we can eventually use libpathrs
fairly cleanly (by swapping out the implementation).

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-01 21:24:02 +11:00
Kir Kolyshkin
b2f8a74de5 all: format sources with gofumpt v0.9.1
Since gofumpt v0.9.0 there's a new formatting rule to "clothe" any naked
returns.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-10-07 17:08:56 -07:00
Kir Kolyshkin
89e59902c4 Modernize code for Go 1.24
Brought to you by

	modernize -fix -test ./...

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-08-27 19:11:02 -07:00
Aleksa Sarai
3620185d06 rootfs: remove /proc/net/dev from allowed overmount list
This was added in 2ee9cbbd12 ("It's /proc/stat, not /proc/stats") with
no actual justification, and doesn't really make much sense on further
inspection:

 * /proc/net is a symlink to "self/net", which means that /proc/net/dev
   is a per-process file, and so overmounting it would only affect pid1.
   Any other program that cares about /proc/net/dev would see their own
   process's configuration, and unprivileged processes wouldn't be able
   to see /proc/1/... data anyway.

   In addition, the fact that this is a symlink means that runc will
   deny the overmount because /proc/1/net/dev is not in the proc
   overmount allowlist. This means that this has not worked for many
   years, and probably never worked in the first place.

 * /proc/self/net is already namespaced with network namespaces, so the
   primary argument for allowing /proc overmounts (lxcfs-like masking of
   procfs files to emulate namespacing for files that are not properly
   namespaced for containers -- such as /proc/cpuinfo) is moot.

   It goes without saying that lxcfs has never overmounted
   /proc/self/net/... files, so the general "because lxcfs"
   justification doesn't hold water either.

 * The kernel has slowly been moving towards blocking overmounts in
   /proc/self/. Linux 6.12 blocked overmounts for fd, fdinfo, and
   map_files; future Linux versions will probably end up blocking
   everything under /proc/self/.

Fixes: 2ee9cbbd12 ("It's /proc/stat, not /proc/stats")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-07-20 15:40:37 +10:00
Yusuke Sakurai
04be81b6a3 fix rootfs propagation mode
Signed-off-by: Yusuke Sakurai <yusuke.sakurai@3-shake.com>
2025-05-19 12:55:35 +00:00
Aleksa Sarai
58c3ab77b0 rootfs: improve error messages for bind-mount vfs flag setting
While debugging an issue involving failing mounts, I discovered that
just returning the plain mount error message when we are in the fallback
code for handling locked mounts leads to unnecessary confusion.

It also doesn't help that podman currently forcefully sets "rw" on
mounts, which means that rootless containers are likely to hit the
locked mounts issue fairly often.

So we should improve our error messages to explain why the mount is
failing in the locked flags case.

Fixes: 7c71a22705 ("rootfs: remove --no-mount-fallback and finally fix MS_REMOUNT")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-04-21 13:01:03 +10:00
Kir Kolyshkin
17570625c0 Use for range over integers
This appears in Go 1.22 (see https://tip.golang.org/ref/spec#For_range).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-03-31 17:15:06 -07:00
Sebastiaan van Stijn
e8a97bae27 Merge pull request #4692 from kolyshkin/golangci-v2
ci: switch to golangci-lint v2
2025-03-31 16:31:28 +02:00
Kir Kolyshkin
e655abc0da int/linux: add/use Dup3, Open, Openat
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-03-26 14:16:53 -07:00
Kir Kolyshkin
9510ffb658 Fix a few staticcheck QF1001 warnings
Like these:

> libcontainer/criu_linux.go:959:3: QF1001: could apply De Morgan's law (staticcheck)
> 		!(req.GetType() == criurpc.CriuReqType_FEATURE_CHECK ||
> 		^
> libcontainer/rootfs_linux.go:360:19: QF1001: could apply De Morgan's law (staticcheck)
> 	if err == nil || !(errors.Is(err, unix.EPERM) || errors.Is(err, unix.EBUSY)) {
> 	                 ^

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-03-25 16:06:44 -07:00
Kir Kolyshkin
6405725ca2 libct: fix staticcheck QF1006 warning
> libcontainer/rootfs_linux.go:1255:13: QF1004: could use strings.ReplaceAll instead (staticcheck)
> 	keyPath := strings.Replace(key, ".", "/", -1)

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-03-25 16:06:44 -07:00
Kir Kolyshkin
a638f1330b .golangci.yml: add nolintlint, fix found issues
The errrolint linter can finally ignore errors from Close,
and it also ignores direct comparisons of errors from x/sys/unix.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-03-24 11:59:54 -07:00
Kir Kolyshkin
a75076b4a4 Switch to opencontainers/cgroups
This removes libcontainer/cgroups packages and starts
using those from github.com/opencontainers/cgroups repo.

Mostly generated by:

  git rm -f libcontainer/cgroups

  find . -type f -name "*.go" -exec sed -i \
    's|github.com/opencontainers/runc/libcontainer/cgroups|github.com/opencontainers/cgroups|g' \
    {} +

  go get github.com/opencontainers/cgroups@v0.0.1
  make vendor
  gofumpt -w .

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-28 15:20:33 -08:00
Kir Kolyshkin
f26ec92221 libct: rm Rootless* properties from initConfig
They are passed in initConfig twice, so it does not make sense.

NB: the alternative to that would be to remove Config field from
initConfig, but it results in a much bigger patch and more maintenance
down the road.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-11 18:01:30 -08:00
Kir Kolyshkin
6c9ddcc648 libct: switch from libct/devices to libct/cgroups/devices/config
Use the old package name as an alias to minimize the patch.

No functional change; this just eliminates a bunch of deprecation
warnings.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-01-31 16:51:09 -08:00
Kir Kolyshkin
93091e6ac2 libct: don't pass SpecState to init unless needed
SpecState field of initConfig is only needed to run hooks that are
executed inside a container -- namely CreateContainer and
StartContainer.

If these hooks are not configured, there is no need to fill, marshal and
unmarshal SpecState.

While at it, inline updateSpecState as it is trivial and only has one user.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-12-22 17:52:15 -08:00
Sebastiaan van Stijn
9b60a93cf3 libcontainer/userns: migrate to github.com/moby/sys/userns
The userns package was moved to the moby/sys/userns module
at commit 3778ae603c.

This patch deprecates the old location, and adds it as an alias
for the moby/sys/userns package.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2024-10-09 22:20:25 +08:00
Kir Kolyshkin
13a6f56097 runc run: fix mount leak
When preparing to mount container root, we need to make its parent mount
private (i.e. disable propagation), otherwise the new in-container
mounts are leaked to the host.

To find a parent mount, we use to read mountinfo and find the longest
entry which can be a parent of the container root directory.

Unfortunately, due to kernel bug in all Linux kernels older than v5.8
(see [1], [2]), sometimes mountinfo can't be read in its entirety. In
this case, getParentMount may occasionally return a wrong parent mount.

As a result, we do not change the mount propagation to private, and
container mounts are leaked.

Alas, we can not fix the kernel, and reading mountinfo a few times to
ensure its consistency (like it's done in, say, Kubernetes) does not
look like a good solution for performance reasons.

Fortunately, we don't need mountinfo. Let's just traverse the directory
tree, trying to remount it private until we find a mount point (any
error other than EINVAL means we just found it).

Fixes issue 2404.

[1]: https://github.com/kolyshkin/procfs-test
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9f6c61f96f2d97cbb5f
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-10-02 13:58:27 -07:00
Aleksa Sarai
63c2908164 rootfs: try to scope MkdirAll to stay inside the rootfs
While we use SecureJoin to try to make all of our target paths inside
the container safe, SecureJoin is not safe against an attacker than can
change the path after we "resolve" it.

os.MkdirAll can inadvertently follow symlinks and thus an attacker could
end up tricking runc into creating empty directories on the host (note
that the container doesn't get access to these directories, and the host
just sees empty directories). However, this could potentially cause DoS
issues by (for instance) creating a directory in a conf.d directory for
a daemon that doesn't handle subdirectories properly.

In addition, the handling for creating file bind-mounts did a plain
open(O_CREAT) on the SecureJoin'd path, which is even more obviously
unsafe (luckily we didn't use O_TRUNC, or this bug could've allowed an
attacker to cause data loss...). Regardless of the symlink issue,
opening an untrusted file could result in a DoS if the file is a hung
tty or some other "nasty" file. We can use mknodat to safely create a
regular file without opening anything anyway (O_CREAT|O_EXCL would also
work but it makes the logic a bit more complicated, and we don't want to
open the file for any particular reason anyway).

libpathrs[1] is the long-term solution for these kinds of problems, but
for now we can patch this particular issue by creating a more restricted
MkdirAll that refuses to resolve symlinks and does the creation using
file descriptors. This is loosely based on a more secure version that
filepath-securejoin now has[2] and will be added to libpathrs soon[3].

[1]: https://github.com/openSUSE/libpathrs
[2]: https://github.com/cyphar/filepath-securejoin/releases/tag/v0.3.0
[3]: https://github.com/openSUSE/libpathrs/issues/10

Fixes: CVE-2024-45310
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2024-09-03 02:34:13 +10:00
Aleksa Sarai
1410a6988d rootfs: consolidate mountpoint creation logic
The logic for how we create mountpoints is spread over each mountpoint
preparation function, when in reality the behaviour is pretty uniform
with only a handful of exceptions. So just move it all to one function
that is easier to understand.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2024-07-25 14:16:05 +10:00
Sohan Kunkerkar
cde1d0908a libcontainer: force apps to think fips is enabled/disabled for testing
The motivation behind this change is to provide a flexible mechanism for
containers within a Kubernetes cluster to opt out of FIPS mode when necessary.
This change enables apps to simulate FIPS mode being enabled or disabled for testing
purposes. Users can control whether apps believe FIPS mode is on or off by manipulating
`/proc/sys/crypto/fips_enabled`.

Signed-off-by: Sohan Kunkerkar <sohank2602@gmail.com>
2024-04-10 18:58:34 -04:00
Aleksa Sarai
cdff09ab87 rootfs: fix 'can we mount on top of /proc' check
Our previous test for whether we can mount on top of /proc incorrectly
assumed that it would only be called with bind-mount sources. This meant
that having a non bind-mount entry for a pseudo-filesystem (like
overlayfs) with a dummy source set to /proc on the host would let you
bypass the check, which could easily lead to security issues.

In addition, the check should be applied more uniformly to all mount
types, so fix that as well. And add some tests for some of the tricky
cases to make sure we protect against them properly.

Fixes: 331692baa7 ("Only allow proc mount if it is procfs")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-12-14 11:36:42 +11:00
Aleksa Sarai
8e8b136c49 tree-wide: use /proc/thread-self for thread-local state
With the idmap work, we will have a tainted Go thread in our
thread-group that has a different mount namespace to the other threads.
It seems that (due to some bad luck) the Go scheduler tends to make this
thread the thread-group leader in our tests, which results in very
baffling failures where /proc/self/mountinfo produces gibberish results.

In order to avoid this, switch to using /proc/thread-self for everything
that is thread-local. This primarily includes switching all file
descriptor paths (CLONE_FS), all of the places that check the current
cgroup (technically we never will run a single runc thread in a separate
cgroup, but better to be safe than sorry), and the aforementioned
mountinfo code. We don't need to do anything for the following because
the results we need aren't thread-local:

 * Checks that certain namespaces are supported by stat(2)ing
   /proc/self/ns/...

 * /proc/self/exe and /proc/self/cmdline are not thread-local.

 * While threads can be in different cgroups, we do not do this for the
   runc binary (or libcontainer) and thus we do not need to switch to
   the thread-local version of /proc/self/cgroups.

 * All of the CLONE_NEWUSER files are not thread-local because you
   cannot set the usernamespace of a single thread (setns(CLONE_NEWUSER)
   is blocked for multi-threaded programs).

Note that we have to use runtime.LockOSThread when we have an open
handle to a tid-specific procfs file that we are operating on multiple
times. Go can reschedule us such that we are running on a different
thread and then kill the original thread (causing -ENOENT or similarly
confusing errors). This is not strictly necessary for most usages of
/proc/thread-self (such as using /proc/thread-self/fd/$n directly) since
only operating on the actual inodes associated with the tid requires
this locking, but because of the pre-3.17 fallback for CentOS, we have
to do this in most cases.

In addition, CentOS's kernel is too old for /proc/thread-self, which
requires us to emulate it -- however in rootfs_linux.go, we are in the
container pid namespace but /proc is the host's procfs. This leads to
the incredibly frustrating situation where there is no way (on pre-4.1
Linux) to figure out which /proc/self/task/... entry refers to the
current tid. We can just use /proc/self in this case.

Yes this is all pretty ugly. I also wish it wasn't necessary.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-12-14 11:36:41 +11:00
Aleksa Sarai
ba0b5e2698 libcontainer: remove all mount logic from nsexec
With open_tree(OPEN_TREE_CLONE), it is possible to implement both the
id-mapped mounts and bind-mount source file descriptor logic entirely in
Go without requiring any complicated handling from nsexec.

However, implementing it the naive way (do the OPEN_TREE_CLONE in the
host namespace before the rootfs is set up -- which is what the existing
implementation did) exposes issues in how mount ordering (in particular
when handling mount sources from inside the container rootfs, but also
in relation to mount propagation) was handled for idmapped mounts and
bind-mount sources. In order to solve this problem completely, it is
necessary to spawn a thread which joins the container mount namespace
and provides mountfds when requested by the rootfs setup code (ensuring
that the mount order and mount propagation of the source of the
bind-mount are handled correctly). While the need to join the mount
namespace leads to other complicated (such as with the usage of
/proc/self -- fixed in a later patch) the resulting code is still
reasonable and is the only real way to solve the issue.

This allows us to reduce the amount of C code we have in nsexec, as well
as simplifying a whole host of places that were made more complicated
with the addition of id-mapped mounts and the bind sourcefd logic.
Because we join the container namespace, we can continue to use regular
O_PATH file descriptors for non-id-mapped bind-mount sources (which
means we don't have to raise the kernel requirement for that case).

In addition, we can easily add support for id-mappings that don't match
the container's user namespace. The approach taken here is to use Go's
officially supported mechanism for spawning a process in a user
namespace, but (ab)use PTRACE_TRACEME to avoid actually having to exec a
different process. The most efficient way to implement this would be to
do clone() in cgo directly to run a function that just does
kill(getpid(), SIGSTOP) -- we can always switch to that if it turns out
this approach is too slow. It should be noted that the included
micro-benchmark seems to indicate this is Fast Enough(TM):

  goos: linux
  goarch: amd64
  pkg: github.com/opencontainers/runc/libcontainer/userns
  cpu: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
  BenchmarkSpawnProc
  BenchmarkSpawnProc-8        1670            770065 ns/op

Fixes: fda12ab101 ("Support idmap mounts on volumes")
Fixes: 9c444070ec ("Open bind mount sources from the host userns")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-12-14 11:36:40 +11:00
Aleksa Sarai
7c71a22705 rootfs: remove --no-mount-fallback and finally fix MS_REMOUNT
The original reasoning for this option was to avoid having mount options
be overwritten by runc. However, adding command-line arguments has
historically been a bad idea because it forces strict-runc-compatible
OCI runtimes to copy out-of-spec features directly from runc and these
flags are usually quite difficult to enable by users when using runc
through several layers of engines and orchestrators.

A far more preferable solution is to have a heuristic which detects
whether copying the original mount's mount options would override an
explicit mount option specified by the user. In this case, we should
return an error. You only end up in this path in the userns case, if you
have a bind-mount source with locked flags.

During the course of writing this patch, I discovered that several
aspects of our handling of flags for bind-mounts left much to be
desired. We have completely botched the handling of explicitly cleared
flags since commit 97f5ee4e6a ("Only remount if requested flags differ
from current"), with our behaviour only becoming increasingly more weird
with 50105de1d8 ("Fix failure with rw bind mount of a ro fuse") and
da780e4d27 ("Fix bind mounts of filesystems with certain options
set"). In short, we would only clear flags explicitly request by the
user purely by chance, in ways that it really should've been reported to
us by now. The most egregious is that mounts explicitly marked "rw" were
actually mounted "ro" if the bind-mount source was "ro" and no other
special flags were included. In addition, our handling of atime was
completely broken -- mostly due to how subtle the semantics of atime are
on Linux.

Unfortunately, while the runtime-spec requires us to implement
mount(8)'s behaviour, several aspects of the util-linux mount(8)'s
behaviour are broken and thus copying them makes little sense. Since the
runtime-spec behaviour for this case (should mount options for a "bind"
mount use the "mount --bind -o ..." or "mount --bind -o remount,..."
semantics? Is the fallback code we have for userns actually
spec-compliant?) and the mount(8) behaviour (see [1]) are not
well-defined, this commit simply fixes the most obvious aspects of the
behaviour that are broken while keeping the current spirit of the
implementation.

NOTE: The handling of atime in the base case is left for a future PR to
deal with. This means that the atime of the source mount will be
silently left alone unless the fallback path needs to be taken, and any
flags not explicitly set will be cleared in the base case. Whether we
should always be operating as "mount --bind -o remount,..." (where we
default to the original mount source flags) is a topic for a separate PR
and (probably) associated runtime-spec PR.

So, to resolve this:

* We store which flags were explicitly requested to be cleared by the
  user, so that we can detect whether the userns fallback path would end
  up setting a flag the user explicitly wished to clear. If so, we
  return an error because we couldn't fulfil the configuration settings.

* Revert 97f5ee4e6a ("Only remount if requested flags differ from
  current"), as missing flags do not mean we can skip MS_REMOUNT (in
  fact, missing flags are how you indicate a flag needs to be cleared
  with mount(2)). The original purpose of the patch was to fix the
  userns issue, but as mentioned above the correct mechanism is to do a
  fallback mount that copies the lockable flags from statfs(2).

* Improve handling of atime in the fallback case by:
    - Correctly handling the returned flags in statfs(2).
    - Implement the MNT_LOCK_ATIME checks in our code to ensure we
      produce errors rather than silently producing incorrect atime
      mounts.

* Improve the tests so we correctly detect all of these contingencies,
  including a general "bind-mount atime handling" test to ensure that
  the behaviour described here is accurate.

This change also inlines the remount() function -- it was only ever used
for the bind-mount remount case, and its behaviour is very bind-mount
specific.

[1]: https://github.com/util-linux/util-linux/issues/2433

Reverts: 97f5ee4e6a ("Only remount if requested flags differ from current")
Fixes: 50105de1d8 ("Fix failure with rw bind mount of a ro fuse")
Fixes: da780e4d27 ("Fix bind mounts of filesystems with certain options set")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-10-24 17:28:25 +11:00
Aleksa Sarai
9350f9013e merge #4039 into opencontainers/runc:main
Kir Kolyshkin (1):
  libct: use chmod instead of umask

LGTMs: lifubang cyphar
2023-10-04 14:55:07 +11:00
Kir Kolyshkin
2e2ecf29ff libct: use chmod instead of umask
Umask is problematic for Go programs as it affects other goroutines
(see [1] for more details).

Instead of using it, let's just prop up with Chmod.

Note this patch misses the MkdirAll call in createDeviceNode. Since the
runtime spec does not say anything about creating intermediary
directories for device nodes, let's assume that doing it via mkdir with
the current umask set is sufficient (if not, we have to reimplement
MkdirAll from scratch, with added call to os.Chmod).

[1] https://github.com/opencontainers/runc/pull/3563#discussion_r990293788

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-09-27 16:46:53 -07:00
Aleksa Sarai
8da42aaec2 sync: split init config (stream) and synchronisation (seqpacket) pipes
We have different requirements for the initial configuration and
initWaiter pipe (just send netlink and JSON blobs with no complicated
handling needed for message coalescing) and the packet-based
synchronisation pipe.

Tests with switching everything to SOCK_SEQPACKET lead to endless issues
with runc hanging on start-up because random things would try to do
short reads (which SOCK_SEQPACKET will not allow and the Go stdlib
explicitly treats as a streaming source), so splitting it was the only
reasonable solution. Even doing somewhat dodgy tricks such as adding a
Read() wrapper which actually calls ReadPacket() and makes it seem like
a stream source doesn't work -- and is a bit too magical.

One upside is that doing it this way makes the difference between the
modes clearer -- INITPIPE is still used for initWaiter syncrhonisation
but aside from that all other synchronisation is done by SYNCPIPE.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-09-24 20:31:14 +08:00
Kir Kolyshkin
6a4870e4ac libct: better errors for hooks
When a hook has failed, the error message looks like this:

> error running hook: error running hook #1: exit status 1, stdout: ...

The two problems here are:
1. it is impossible to know what kind of hook it was;
2. "error running hook" stuttering;

Change that to

> error running createContainer hook #1: exit status 1, stdout: ...

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-08-24 19:44:05 -07:00
Aleksa Sarai
5c7839b503 rootfs: use empty src for MS_REMOUNT
The kernel ignores these arguments, and passing them can lead to
confusing error messages (the old source is irrelevant for MS_REMOUNT),
as well as causing issues for a future patch where we switch to
move_mount(2).

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-08-15 19:54:24 -07:00