Commit 4316df8b53 isolated RunningInUserNS
to a separate package to make it easier to consume without bringing in
additional dependencies, and with the potential to move it separate in
a similar fashion as libcontainer/user was moved to a separate module
in commit ca32014adb. While RunningInUserNS
is fairly trivial to implement, it (or variants of this utility) is used
in many codebases, and moving to a separate module could consolidate
those implementations, as well as making it easier to consume without
large dependency trees (when being a package as part of a larger code
base).
Commit 1912d5988b and follow-ups introduced
cgo code into the userns package, and code introduced in those commits
are not intended for external use, therefore complicating the potential
of moving the userns package separate.
This commit moves the new code to a separate package; some of this code
was included in v1.1.11 and up, but I could not find external consumers
of `GetUserNamespaceMappings` and `IsSameMapping`. The `Mapping` and
`Handles` types (added in ba0b5e2698) only
exist in main and in non-stable releases (v1.2.0-rc.x), so don't need
an alias / deprecation.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
While it doesn't make much of a practical difference, it seems far more
reasonable to use os.NewFile to wrap all of our passed file descriptors
to make sure they're tracked by the Go runtime and that we don't
double-close them.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
This gives a more clear error message when idmap mounts are not
supported on the source filesystem. For example, a k8s user will see
this now in kubectl describe pod:
Warning Failed 2s (x2 over 4s) kubelet, 127.0.0.1 Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: failed to fulfil mount request: failed to set MOUNT_ATTR_IDMAP on /var/lib/kubelet/pods/f037a704-742c-40fe-8dbf-17ed9225c4df/volumes/kubernetes.io~empty-dir/hugepage: invalid argument (maybe the source filesystem doesn't support idmap mounts on this kernel?): unknown
This gives a hint on where to look at.
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
ridmap indicates that the id mapping should be applied recursively (only
really relevant for rbind mount entries), and idmap indicates that it
should not be applied recursively (the default). If no mappings are
specified for the mount, we use the userns configuration of the
container. This matches the behaviour in the currently-unreleased
runtime-spec.
This includes a minor change to the state.json serialisation format, but
because there has been no released version of runc with commit
fbf183c6f8 ("Add uid and gid mappings to mounts"), we can safely make
this change without affecting running containers. Doing it this way
makes it much easier to handle m.IsIDMapped() and indicating that a
mapping has been specified.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
With the idmap work, we will have a tainted Go thread in our
thread-group that has a different mount namespace to the other threads.
It seems that (due to some bad luck) the Go scheduler tends to make this
thread the thread-group leader in our tests, which results in very
baffling failures where /proc/self/mountinfo produces gibberish results.
In order to avoid this, switch to using /proc/thread-self for everything
that is thread-local. This primarily includes switching all file
descriptor paths (CLONE_FS), all of the places that check the current
cgroup (technically we never will run a single runc thread in a separate
cgroup, but better to be safe than sorry), and the aforementioned
mountinfo code. We don't need to do anything for the following because
the results we need aren't thread-local:
* Checks that certain namespaces are supported by stat(2)ing
/proc/self/ns/...
* /proc/self/exe and /proc/self/cmdline are not thread-local.
* While threads can be in different cgroups, we do not do this for the
runc binary (or libcontainer) and thus we do not need to switch to
the thread-local version of /proc/self/cgroups.
* All of the CLONE_NEWUSER files are not thread-local because you
cannot set the usernamespace of a single thread (setns(CLONE_NEWUSER)
is blocked for multi-threaded programs).
Note that we have to use runtime.LockOSThread when we have an open
handle to a tid-specific procfs file that we are operating on multiple
times. Go can reschedule us such that we are running on a different
thread and then kill the original thread (causing -ENOENT or similarly
confusing errors). This is not strictly necessary for most usages of
/proc/thread-self (such as using /proc/thread-self/fd/$n directly) since
only operating on the actual inodes associated with the tid requires
this locking, but because of the pre-3.17 fallback for CentOS, we have
to do this in most cases.
In addition, CentOS's kernel is too old for /proc/thread-self, which
requires us to emulate it -- however in rootfs_linux.go, we are in the
container pid namespace but /proc is the host's procfs. This leads to
the incredibly frustrating situation where there is no way (on pre-4.1
Linux) to figure out which /proc/self/task/... entry refers to the
current tid. We can just use /proc/self in this case.
Yes this is all pretty ugly. I also wish it wasn't necessary.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
With open_tree(OPEN_TREE_CLONE), it is possible to implement both the
id-mapped mounts and bind-mount source file descriptor logic entirely in
Go without requiring any complicated handling from nsexec.
However, implementing it the naive way (do the OPEN_TREE_CLONE in the
host namespace before the rootfs is set up -- which is what the existing
implementation did) exposes issues in how mount ordering (in particular
when handling mount sources from inside the container rootfs, but also
in relation to mount propagation) was handled for idmapped mounts and
bind-mount sources. In order to solve this problem completely, it is
necessary to spawn a thread which joins the container mount namespace
and provides mountfds when requested by the rootfs setup code (ensuring
that the mount order and mount propagation of the source of the
bind-mount are handled correctly). While the need to join the mount
namespace leads to other complicated (such as with the usage of
/proc/self -- fixed in a later patch) the resulting code is still
reasonable and is the only real way to solve the issue.
This allows us to reduce the amount of C code we have in nsexec, as well
as simplifying a whole host of places that were made more complicated
with the addition of id-mapped mounts and the bind sourcefd logic.
Because we join the container namespace, we can continue to use regular
O_PATH file descriptors for non-id-mapped bind-mount sources (which
means we don't have to raise the kernel requirement for that case).
In addition, we can easily add support for id-mappings that don't match
the container's user namespace. The approach taken here is to use Go's
officially supported mechanism for spawning a process in a user
namespace, but (ab)use PTRACE_TRACEME to avoid actually having to exec a
different process. The most efficient way to implement this would be to
do clone() in cgo directly to run a function that just does
kill(getpid(), SIGSTOP) -- we can always switch to that if it turns out
this approach is too slow. It should be noted that the included
micro-benchmark seems to indicate this is Fast Enough(TM):
goos: linux
goarch: amd64
pkg: github.com/opencontainers/runc/libcontainer/userns
cpu: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
BenchmarkSpawnProc
BenchmarkSpawnProc-8 1670 770065 ns/op
Fixes: fda12ab101 ("Support idmap mounts on volumes")
Fixes: 9c444070ec ("Open bind mount sources from the host userns")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Previously to this commit, we used a string for srcFD as /proc/self/fd/NN.
This commit modified to this behavior, so srcFD is only an *int and the full path
is constructed in mountViaFDs() if srcFD is different than nil.
Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com>
1. Simplify mount call by removing the procfd argument, and use the new
mount() where procfd is not used. Now, the mount() arguments are the
same as for unix.Mount.
2. Introduce a new mountViaFDs function, which is similar to the old
mount(), except it can take procfd for both source and target.
The new arguments are called srcFD and dstFD.
3. Modify the mount error to show both srcFD and dstFD so it's clear
which one is used for which purpose. This fixes the issue of having
a somewhat cryptic errors like this:
> mount /proc/self/fd/11:/sys/fs/cgroup/systemd (via /proc/self/fd/12), flags: 0x20502f: operation not permitted
(in which fd 11 is actually the source, and fd 12 is the target).
After this change, it looks like
> mount src=/proc/self/fd/11, dst=/sys/fs/cgroup/systemd, dstFD=/proc/self/fd/12, flags=0x20502f: operation not permitted
so it's clear that 12 is a destination fd.
4. Fix the mountViaFDs callers to use dstFD (rather than procfd) for the
variable name.
5. Use srcFD where mountFd is set.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Errors returned by unix are bare. In some cases it's impossible to find
out what went wrong because there's is not enough context.
Add a mountError type (mostly copy-pasted from github.com/moby/sys/mount),
and mount/unmount helpers. Use these where appropriate, and convert error
checks to use errors.Is.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>