Commit Graph

17 Commits

Author SHA1 Message Date
Aleksa Sarai
ba0b5e2698 libcontainer: remove all mount logic from nsexec
With open_tree(OPEN_TREE_CLONE), it is possible to implement both the
id-mapped mounts and bind-mount source file descriptor logic entirely in
Go without requiring any complicated handling from nsexec.

However, implementing it the naive way (do the OPEN_TREE_CLONE in the
host namespace before the rootfs is set up -- which is what the existing
implementation did) exposes issues in how mount ordering (in particular
when handling mount sources from inside the container rootfs, but also
in relation to mount propagation) was handled for idmapped mounts and
bind-mount sources. In order to solve this problem completely, it is
necessary to spawn a thread which joins the container mount namespace
and provides mountfds when requested by the rootfs setup code (ensuring
that the mount order and mount propagation of the source of the
bind-mount are handled correctly). While the need to join the mount
namespace leads to other complicated (such as with the usage of
/proc/self -- fixed in a later patch) the resulting code is still
reasonable and is the only real way to solve the issue.

This allows us to reduce the amount of C code we have in nsexec, as well
as simplifying a whole host of places that were made more complicated
with the addition of id-mapped mounts and the bind sourcefd logic.
Because we join the container namespace, we can continue to use regular
O_PATH file descriptors for non-id-mapped bind-mount sources (which
means we don't have to raise the kernel requirement for that case).

In addition, we can easily add support for id-mappings that don't match
the container's user namespace. The approach taken here is to use Go's
officially supported mechanism for spawning a process in a user
namespace, but (ab)use PTRACE_TRACEME to avoid actually having to exec a
different process. The most efficient way to implement this would be to
do clone() in cgo directly to run a function that just does
kill(getpid(), SIGSTOP) -- we can always switch to that if it turns out
this approach is too slow. It should be noted that the included
micro-benchmark seems to indicate this is Fast Enough(TM):

  goos: linux
  goarch: amd64
  pkg: github.com/opencontainers/runc/libcontainer/userns
  cpu: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
  BenchmarkSpawnProc
  BenchmarkSpawnProc-8        1670            770065 ns/op

Fixes: fda12ab101 ("Support idmap mounts on volumes")
Fixes: 9c444070ec ("Open bind mount sources from the host userns")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-12-14 11:36:40 +11:00
Chethan Suresh
ebc2e7c435 Support time namespace
"time" namespace was introduced in Linux v5.6
support new time namespace to set boottime and monotonic time offset

Example runtime spec

"timeOffsets": {
    "monotonic": {
        "secs": 172800,
        "nanosecs": 0
    },
    "boottime": {
        "secs": 604800,
        "nanosecs": 0
    }
}

Signed-off-by: Chethan Suresh <chethan.suresh@sony.com>
2023-08-03 10:12:01 +05:30
Rodrigo Campos
fda12ab101 Support idmap mounts on volumes
This commit adds support for idmap mounts as specified in the runtime-spec.

We open the idmap source paths and call mount_setattr() in runc PARENT,
as we need privileges in the init userns for that, and then sends the
fds to the child process. For this fd passing we use the same mechanism
used in other parts of thecode, the _LIBCONTAINER_ env vars.

The mount is finished (unix.MoveMount) from go code, inside the userns,
so we reuse all the prepareBindMount() security checks and the remount
logic for some flags too.

This commit only supports idmap mounts when userns are used AND the mappings
are the same specified for the userns mapping. This limitation is to
simplify the initial implementation, as all our users so far only need
this, and we can avoid sending over netlink the mappings, creating a
userns with this custom mapping, etc. Future PRs will remove this
limitation.

Co-authored-by: Francis Laniel <flaniel@linux.microsoft.com>
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
2023-07-17 13:30:12 +02:00
Aleksa Sarai
d72d057ba7 runc init: avoid netlink message length overflows
When writing netlink messages, it is possible to have a byte array
larger than UINT16_MAX which would result in the length field
overflowing and allowing user-controlled data to be parsed as control
characters (such as creating custom mount points, changing which set of
namespaces to allow, and so on).

Co-authored-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2021-12-03 16:07:40 +11:00
Alban Crequy
9c444070ec Open bind mount sources from the host userns
The source of the bind mount might not be accessible in a different user
namespace because a component of the source path might not be traversed
under the users and groups mapped inside the user namespace. This caused
errors such as the following:

  # time="2020-06-22T13:48:26Z" level=error msg="container_linux.go:367:
  starting container process caused: process_linux.go:459:
  container init caused: rootfs_linux.go:58:
  mounting \"/tmp/busyboxtest/source-inaccessible/dir\"
  to rootfs at \"/tmp/inaccessible\" caused:
  stat /tmp/busyboxtest/source-inaccessible/dir: permission denied"

To solve this problem, this patch performs the following:

1. in nsexec.c, it opens the source path in the host userns (so we have
   the right permissions to open it) but in the container mntns (so the
   kernel cross mntns mount check let us mount it later:
   https://github.com/torvalds/linux/blob/v5.8/fs/namespace.c#L2312).

2. in nsexec.c, it passes the file descriptors of the source to the
   child process with SCM_RIGHTS.

3. In runc-init in Golang, it finishes the mounts while inside the
   userns even without access to the some components of the source
   paths.

Passing the fds with SCM_RIGHTS is necessary because once the child
process is in the container mntns, it is already in the container userns
so it cannot temporarily join the host mntns.

This patch uses the existing mechanism with _LIBCONTAINER_* environment
variables to pass the file descriptors from runc to runc init.

This patch uses the existing mechanism with the Netlink-style bootstrap
to pass information about the list of source mounts to nsexec.c.

Rootless containers don't use this bind mount sources fdpassing
mechanism because we can't setns() to the target mntns in a rootless
container (we don't have the privileges when we are in the host userns).

This patch takes care of using O_CLOEXEC on mount fds, and close them
early.

Fixes: #2484.

Signed-off-by: Alban Crequy <alban@kinvolk.io>
Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>
2021-10-12 15:13:45 +02:00
Kir Kolyshkin
9ff64c3d97 *: rm redundant linux build tag
For files that end with _linux.go or _linux_test.go, there is no need to
specify linux build tag, as it is assumed from the file name.

In addition, rename libcontainer/notify_linux_v2.go -> libcontainer/notify_v2_linux.go
for the file name to make sense.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-08-30 20:15:00 -07:00
Akihiro Suda
06f789cf26 Disable rootless mode except RootlessCgMgr when executed as the root in userns
This PR decomposes `libcontainer/configs.Config.Rootless bool` into `RootlessEUID bool` and
`RootlessCgroups bool`, so as to make "runc-in-userns" to be more compatible with "rootful" runc.

`RootlessEUID` denotes that runc is being executed as a non-root user (euid != 0) in
the current user namespace. `RootlessEUID` is almost identical to the former `Rootless`
except cgroups stuff.

`RootlessCgroups` denotes that runc is unlikely to have the full access to cgroups.
`RootlessCgroups` is set to false if runc is executed as the root (euid == 0) in the initial namespace.
Otherwise `RootlessCgroups` is set to true.
(Hint: if `RootlessEUID` is true, `RootlessCgroups` becomes true as well)

When runc is executed as the root (euid == 0) in an user namespace (e.g. by Docker-in-LXD, Podman, Usernetes),
`RootlessEUID` is set to false but `RootlessCgroups` is set to true.
So, "runc-in-userns" behaves almost same as "rootful" runc except that cgroups errors are ignored.

This PR does not have any impact on CLI flags and `state.json`.

Note about CLI:
* Now `runc --rootless=(auto|true|false)` CLI flag is only used for setting `RootlessCgroups`.
* Now `runc spec --rootless` is only required when `RootlessEUID` is set to true.
  For runc-in-userns, `runc spec`  without `--rootless` should work, when sufficient numbers of
  UID/GID are mapped.

Note about `$XDG_RUNTIME_DIR` (e.g. `/run/user/1000`):
* `$XDG_RUNTIME_DIR` is ignored if runc is being executed as the root (euid == 0) in the initial namespace, for backward compatibility.
  (`/run/runc` is used)
* If runc is executed as the root (euid == 0) in an user namespace, `$XDG_RUNTIME_DIR` is honored if `$USER != "" && $USER != "root"`.
  This allows unprivileged users to allow execute runc as the root in userns, without mounting writable `/run/runc`.

Note about `state.json`:
* `rootless` is set to true when `RootlessEUID == true && RootlessCgroups == true`.

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-09-07 15:05:03 +09:00
Akihiro Suda
ed58366cc8 libcontainer: fix Boolmsg alignment
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-03-26 14:44:03 +09:00
Giuseppe Scrivano
d8b669400a rootless: allow multiple user/group mappings
Take advantage of the newuidmap/newgidmap tools to allow multiple
users/groups to be mapped into the new user namespace in the rootless
case.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
[ rebased to handle intelrdt changes. ]
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-09-09 12:45:32 +10:00
Tobias Klauser
306b4980f7 Use NLA_* constants from x/sys/unix instead of syscall
Use the NLA_ALIGNTO and NLA_HDRLEN constants from x/sys/unix instead of
syscall, as the syscall package shouldn't be used anymore (except for a
few exceptions).

This also makes the syscall_NLA_HDRLEN workaround for gccgo unnecessary.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-06-02 10:42:11 +02:00
Aleksa Sarai
d2f49696b0 runc: add support for rootless containers
This enables the support for the rootless container mode. There are many
restrictions on what rootless containers can do, so many different runC
commands have been disabled:

* runc checkpoint
* runc events
* runc pause
* runc ps
* runc restore
* runc resume
* runc update

The following commands work:

* runc create
* runc delete
* runc exec
* runc kill
* runc list
* runc run
* runc spec
* runc state

In addition, any specification options that imply joining cgroups have
also been disabled. This is due to support for unprivileged subtree
management not being available from Linux upstream.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-03-23 20:45:24 +11:00
Aleksa Sarai
6bd4bd9030 *: handle unprivileged operations and !dumpable
Effectively, !dumpable makes implementing rootless containers quite
hard, due to a bunch of different operations on /proc/self no longer
being possible without reordering everything.

!dumpable only really makes sense when you are switching between
different security contexts, which is only the case when we are joining
namespaces. Unfortunately this means that !dumpable will still have
issues in this instance, and it should only be necessary to set
!dumpable if we are not joining USER namespaces (new kernels have
protections that make !dumpable no longer necessary). But that's a topic
for another time.

This also includes code to unset and then re-set dumpable when doing the
USER namespace mappings. This should also be safe because in principle
processes in a container can't see us until after we fork into the PID
namespace (which happens after the user mapping).

In rootless containers, it is not possible to set a non-dumpable
process's /proc/self/oom_score_adj (it's owned by root and thus not
writeable). Thus, it needs to be set inside nsexec before we set
ourselves as non-dumpable.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-03-23 20:45:19 +11:00
Aleksa Sarai
244c9fc426 *: console rewrite
This implements {createTTY, detach} and all of the combinations and
negations of the two that were previously implemented. There are some
valid questions about out-of-OCI-scope topics like !createTTY and how
things should be handled (why do we dup the current stdio to the
process, and how is that not a security issue). However, these will be
dealt with in a separate patchset.

In order to allow for late console setup, split setupRootfs into the
"preparation" section where all of the mounts are created and the
"finalize" section where we pivot_root and set things as ro. In between
the two we can set up all of the console mountpoints and symlinks we
need.

We use two-stage synchronisation to ensures that when the syscalls are
reordered in a suboptimal way, an out-of-place read() on the parentPipe
will not gobble the ancilliary information.

This patch is part of the console rewrite patchset.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2016-12-01 15:49:36 +11:00
Akihiro Suda
1829531241 Fix trivial style errors reported by go vet and golint
No substantial code change.
Note that some style errors reported by `golint` are not fixed due to possible compatibility issues.

Signed-off-by: Akihiro Suda <suda.kyoto@gmail.com>
2016-04-12 08:13:16 +00:00
Daniel, Dao Quang Minh
42d5d04801 Sets custom namespaces for init processes
An init process can join other namespaces (pidns, ipc etc.). This leverages
C code defined in nsenter package to spawn a process with correct namespaces
and clone if necessary.

This moves all setns and cloneflags related code to nsenter layer, which mean
that we dont use Go os/exec to create process with cloneflags and set
uid/gid_map or setgroups anymore. The necessary data is passed from Go to C
using a netlink binary-encoding format.

With this change, setns and init processes are almost the same, which brings
some opportunity for refactoring.

Signed-off-by: Daniel, Dao Quang Minh <dqminh89@gmail.com>
[mickael.laventure@docker.com: adapted to apply on master @ d97d5e]
Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@docker.com>
2016-02-28 12:26:53 -08:00
Christy Perez
ced8e5e7ba Caclulate NLA_HDRLEN as gccgo workaround
syscall.NLA_HDRLEN is not in gccgo (as of 5.3), so in the meantime
use the #defines taken from linux/netlink.h.

See https://github.com/golang/go/issues/13629

Signed-off-by: Christy Perez <christy@linux.vnet.ibm.com>
2015-12-17 17:36:47 -06:00
Daniel, Dao Quang Minh
7d423cb7a1 setns: replace env with netlink for bootstrap data
replace passing of pid and console path via environment variable with passing
them with netlink message via an established pipe.

this change requires us to set _LIBCONTAINER_INITTYPE and
_LIBCONTAINER_INITPIPE as the env environment of the bootstrap process as we
only send the bootstrap data for setns process right now. When init and setns
bootstrap process are unified (i.e., init use nsexec instead of Go to clone new
process), we can remove _LIBCONTAINER_INITTYPE.

Note:
- we read nlmsghdr first before reading the content so we can get the total
  length of the payload and allocate buffer properly instead of allocating
  one large buffer.

- check read bytes vs the wanted number. It's an error if we failed to read
  the desired number of bytes from the pipe into the buffer.

Signed-off-by: Daniel, Dao Quang Minh <dqminh89@gmail.com>
2015-12-03 18:03:48 +00:00