Commit Graph

181 Commits

Author SHA1 Message Date
Aleksa Sarai
5c7839b503 rootfs: use empty src for MS_REMOUNT
The kernel ignores these arguments, and passing them can lead to
confusing error messages (the old source is irrelevant for MS_REMOUNT),
as well as causing issues for a future patch where we switch to
move_mount(2).

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-08-15 19:54:24 -07:00
Aleksa Sarai
f81ef1493d libcontainer: sync: cleanup synchronisation code
This includes quite a few cleanups and improvements to the way we do
synchronisation. The core behaviour is unchanged, but switching to
embedding json.RawMessage into the synchronisation structure will allow
us to do more complicated synchronisation operations in future patches.

The file descriptor passing through the synchronisation system feature
will be used as part of the idmapped-mount and bind-mount-source
features when switching that code to use the new mount API outside of
nsexec.c.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-08-15 19:54:24 -07:00
Rodrigo Campos
ec2ffae5f1 libct: Allow rel paths for idmap mounts
The idea was to make them strict on dest path from the beginning for
idmap mounts, as runc would do that for all mounts in the future. But
that is causing too many problems.

For now, let's just allow relative paths for idmap mounts too. It just
seems safer.

Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
2023-08-08 13:45:31 +02:00
lifubang
6092a4b42d fix some file mode bits missing when doing mount syscall
Signed-off-by: lifubang <lifubang@acmcoder.com>
2023-08-03 08:44:00 +08:00
Ruediger Pluem
da780e4d27 Fix bind mounts of filesystems with certain options set
Currently bind mounts of filesystems with nodev, nosuid, noexec,
noatime, relatime, strictatime, nodiratime options set fail in rootless
mode if the same options are not set for the bind mount.
For ro filesystems this was resolved by #2570 by remounting again
with ro set.

Follow the same approach for nodev, nosuid, noexec, noatime, relatime,
strictatime, nodiratime but allow to revert back to the old behaviour
via the new `--no-mount-fallback` command line option.

Add a testcase to verify that bind mounts of filesystems with nodev,
nosuid, noexec, noatime options set work in rootless mode.
Add a testcase that mounts a nodev, nosuid, noexec, noatime filesystem
with a ro flag.
Add two further testcases that ensure that the above testcases would
fail if the `--no-mount-fallback` command line option is set.

* contrib/completions/bash/runc:
      Add `--no-mount-fallback` command line option for bash completion.

* create.go:
      Add `--no-mount-fallback` command line option.

* restore.go:
      Add `--no-mount-fallback` command line option.

* run.go:
      Add `--no-mount-fallback` command line option.

* libcontainer/configs/config.go:
      Add `NoMountFallback` field to the `Config` struct to store
      the command line option value.

* libcontainer/specconv/spec_linux.go:
      Add `NoMountFallback` field to the `CreateOpts` struct to store
      the command line option value and store it in the libcontainer
      config.

* utils_linux.go:
      Store the command line option value in the `CreateOpts` struct.

* libcontainer/rootfs_linux.go:
      In case that `--no-mount-fallback` is not set try to remount the
      bind filesystem again with the options nodev, nosuid, noexec,
      noatime, relatime, strictatime or nodiratime if they are set on
      the source filesystem.

* tests/integration/mounts_sshfs.bats:
      Add testcases and rework sshfs setup to allow specifying
      different mount options depending on the test case.

Signed-off-by: Ruediger Pluem <ruediger.pluem@vodafone.com>
2023-07-28 16:32:02 -07:00
Francis Laniel
a3785c88ec Remove idmapFD field for mountEntry
We cannot have both srcFD and idMapFD set at the same time.
So, we can simplify this struct to only have one field which is used a srcFD
most of the time and as idMapFD when we do an id map mount.

Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com>
2023-07-21 13:55:34 +02:00
Francis Laniel
46ada59ba2 Use an *int for srcFD
Previously to this commit, we used a string for srcFD as /proc/self/fd/NN.
This commit modified to this behavior, so srcFD is only an *int and the full path
is constructed in mountViaFDs() if srcFD is different than nil.

Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com>
2023-07-21 13:55:34 +02:00
Rodrigo Campos
fda12ab101 Support idmap mounts on volumes
This commit adds support for idmap mounts as specified in the runtime-spec.

We open the idmap source paths and call mount_setattr() in runc PARENT,
as we need privileges in the init userns for that, and then sends the
fds to the child process. For this fd passing we use the same mechanism
used in other parts of thecode, the _LIBCONTAINER_ env vars.

The mount is finished (unix.MoveMount) from go code, inside the userns,
so we reuse all the prepareBindMount() security checks and the remount
logic for some flags too.

This commit only supports idmap mounts when userns are used AND the mappings
are the same specified for the userns mapping. This limitation is to
simplify the initial implementation, as all our users so far only need
this, and we can avoid sending over netlink the mappings, creating a
userns with this custom mapping, etc. Future PRs will remove this
limitation.

Co-authored-by: Francis Laniel <flaniel@linux.microsoft.com>
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
2023-07-17 13:30:12 +02:00
Rodrigo Campos
fe4528b176 libcontainer: Just print the mountFds slice len on errors
Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
2023-07-11 16:17:48 +02:00
Rodrigo Campos
73b649705a libcontainer: Add mountFds struct
We will need to pass more slices of fds to these functions in future
patches. Let's add a struct that just contains them all, instead of
adding lot of parameters to these functions.

Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>
2023-07-11 16:17:48 +02:00
Brian Goff
9fa8b9de3e Fix tmpfs mode opts when dir already exists
When a directory already exists (or after a container is restarted) the
perms of the directory being mounted to were being used even when a
different permission is set on the tmpfs mount options.

This prepends the original directory perms to the mount options.
If the perms were already set in the mount opts then those perms will
win.
This eliminates the need to perform a chmod after mount entirely.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2023-06-26 21:53:21 +00:00
Kir Kolyshkin
a60933bb24 libct/rootfs: introduce and use mountEntry
Adding fd field to mountConfig was not a good thing since mountConfig
contains data that is not specific to a particular mount, while fd is
a mount entry attribute.

Introduce mountEntry structure, which embeds configs.Mount and adds
srcFd to replace the removed mountConfig.fd.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-05-02 18:54:38 -07:00
Kir Kolyshkin
976748e8d6 libct: add mountViaFDs, simplify mount
1. Simplify mount call by removing the procfd argument, and use the new
   mount() where procfd is not used. Now, the mount() arguments are the
   same as for unix.Mount.

2. Introduce a new mountViaFDs function, which is similar to the old
   mount(), except it can take procfd for both source and target.
   The new arguments are called srcFD and dstFD.

3. Modify the mount error to show both srcFD and dstFD so it's clear
   which one is used for which purpose. This fixes the issue of having
   a somewhat cryptic errors like this:

> mount /proc/self/fd/11:/sys/fs/cgroup/systemd (via /proc/self/fd/12), flags: 0x20502f: operation not permitted

  (in which fd 11 is actually the source, and fd 12 is the target).

   After this change, it looks like

> mount src=/proc/self/fd/11, dst=/sys/fs/cgroup/systemd, dstFD=/proc/self/fd/12, flags=0x20502f: operation not permitted

   so it's clear that 12 is a destination fd.

4. Fix the mountViaFDs callers to use dstFD (rather than procfd) for the
   variable name.

5. Use srcFD where mountFd is set.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-05-02 18:41:09 -07:00
Qiang Huang
0d62b950e6 Merge pull request from GHSA-m8cg-xc2p-r3fc
rootless: fix /sys/fs/cgroup mounts
2023-03-29 14:18:15 +08:00
Kir Kolyshkin
da98076c97 mountToRootfs: minor refactor
The setRecAttr is only called for "bind" case, as cases end with a
return statement. Indeed, recursive mount attributes only make sense for
bind mounts.

Move the code to under case "bind" to improve readability. No change in
logic.

Fixes: 382eba4354
Reported-by: Sebastiaan van Stijn <github@gone.nl>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-03-27 12:47:05 -07:00
Kir Kolyshkin
0d72adf96d Prohibit /proc and /sys to be symlinks
Commit 3291d66b98 introduced a check for /proc and /sys, making sure
the destination (dest) is a directory (and not e.g. a symlink).

Later, a hunk from commit 0ca91f44f switched from using filepath.Join
to SecureJoin for dest. As SecureJoin follows and resolves symlinks,
the check whether dest is a symlink no longer works.

To fix, do the check without/before using SecureJoin.

Add integration tests to make sure we won't regress.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-03-17 11:03:44 -07:00
Akihiro Suda
df4eae457b rootless: fix /sys/fs/cgroup mounts
It was found that rootless runc makes `/sys/fs/cgroup` writable in following conditons:

1. when runc is executed inside the user namespace, and the config.json does not specify the cgroup namespace to be unshared
   (e.g.., `(docker|podman|nerdctl) run --cgroupns=host`, with Rootless Docker/Podman/nerdctl)
2. or, when runc is executed outside the user namespace, and `/sys` is mounted with `rbind, ro`
   (e.g., `runc spec --rootless`; this condition is very rare)

A container may gain the write access to user-owned cgroup hierarchy `/sys/fs/cgroup/user.slice/...` on the host.
Other users's cgroup hierarchies are not affected.

To fix the issue, this commit does:
1. Remount `/sys/fs/cgroup` to apply `MS_RDONLY` when it is being bind-mounted
2. Mask `/sys/fs/cgroup` when the bind source is unavailable

Fix CVE-2023-25809 (GHSA-m8cg-xc2p-r3fc)

Co-authored-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2023-03-14 14:16:25 +09:00
Kir Kolyshkin
d370e3c046 libct: fix mounting via wrong proc fd
Due to a bug in commit 9c444070ec, when the user and mount namespaces
are used, and the bind mount is followed by the cgroup mount in the
spec, the cgroup is mounted using the bind mount's mount fd.

This can be reproduced with podman 4.1 (when configured to use runc):

$ podman run --uidmap 0💯10000 quay.io/libpod/testimage:20210610 mount
Error: /home/kir/git/runc/runc: runc create failed: unable to start container process: error during container init: error mounting "cgroup" to rootfs at "/sys/fs/cgroup": mount /proc/self/fd/11:/sys/fs/cgroup/systemd (via /proc/self/fd/12), flags: 0x20502f: operation not permitted: OCI permission denied

or manually with the spec mounts containing something like this:

    {
      "destination": "/etc/resolv.conf",
      "type": "bind",
      "source": "/userdata/resolv.conf",
      "options": [
        "bind"
      ]
    },
    {
      "destination": "/sys/fs/cgroup",
      "type": "cgroup",
      "source": "cgroup",
      "options": [
        "rprivate",
        "nosuid",
        "noexec",
        "nodev",
        "relatime",
        "ro"
      ]
    }

The issue was not found earlier since it requires using userns, and even then
mount fd is ignored by mountToRootfs, except for bind mounts, and all the bind
mounts have mountfd set, except for the case of cgroup v1's /sys/fs/cgroup
which is internally transformed into a bunch of bind mounts.

This is a minimal fix for the issue, suitable for backporting.

A test case is added which reproduces the issue without the fix applied.

Fixes: 9c444070ec ("Open bind mount sources from the host userns")
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-06-16 11:54:42 -07:00
Irwin D'Souza
b76b6b9338 Allow mounting of /proc/sys/kernel/ns_last_pid
The CAP_CHECKPOINT_RESTORE linux capability provides the ability to
update /proc/sys/kernel/ns_last_pid. However, because this file is under
/proc, and by default both K8s and CRI-O specify that /proc/sys should
be mounted as Read-Only, by default even with the capability specified,
a process will not be able to write to ns_last_pid.

To get around this, a pod author can specify a volume mount and a
hostpath to bind-mount /proc/sys/kernel/ns_last_pid. However, runc does
not allow specifying mounts under /proc.

This commit adds /proc/sys/kernel/ns_last_pid to the validProcMounts
string array to enable a pod author to mount ns_last_pid as read-write.
The default remains unchanged; unless explicitly requested as a volume
mount, ns_last_pid will remain read-only regardless of whether or not
CAP_CHECKPOINT_RESTORE is specified.

Signed-off-by: Irwin D'Souza <dsouzai.gh@gmail.com>
2022-04-07 14:08:59 -04:00
Kir Kolyshkin
0fec1c2d8c libct: Mount: rm {Pre,Post}mountCmds
Those were added by commit 59c5c3ac0 back in Apr 2015, but AFAICS were
never used and are obsoleted by more generic container hooks (initially
added by commit 05567f2c94 in Sep 2015).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-01-26 15:51:55 -08:00
Akihiro Suda
382eba4354 Support recursive mount attrs ("rro", "rnosuid", "rnodev", ...)
The new mount option "rro" makes the mount point recursively read-only,
by calling `mount_setattr(2)` with `MOUNT_ATTR_RDONLY` and `AT_RECURSIVE`.
https://man7.org/linux/man-pages/man2/mount_setattr.2.html

Requires kernel >= 5.12.

The "rro" option string conforms to the proposal in util-linux/util-linux Issue 1501.

Fix issue 2823

Similary, this commit also adds the following mount options:
- rrw
- r[no]{suid,dev,exec,relatime,atime,strictatime,diratime,symfollow}

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2021-12-07 17:39:57 +09:00
Aleksa Sarai
19d696ec29 merge branch 'pr-3276'
Kir Kolyshkin (2):
  runc run: fix ro /dev
  test/int/mount.bats: refer to github issue

LGTMs: thaJeztah cyphar
Closes #3276
2021-11-26 09:37:43 +11:00
Kir Kolyshkin
50105de1d8 Fix failure with rw bind mount of a ro fuse
As reported in [1], in a case where read-only fuse (sshfs) mount
is used as a volume without specifying ro flag, the kernel fails
to remount it (when adding various flags such as nosuid and nodev),
returning EPERM.

Here's the relevant strace line:

> [pid 333966] mount("/tmp/bats-run-PRVfWc/runc.RbNv8g/bundle/mnt", "/proc/self/fd/7", 0xc0001e9164, MS_NOSUID|MS_NODEV|MS_REMOUNT|MS_BIND|MS_REC, NULL) = -1 EPERM (Operation not permitted)

I was not able to reproduce it with other read-only mounts as the source
(tried tmpfs, read-only bind mount, and an ext2 mount), so somehow this
might be specific to fuse.

The fix is to check whether the source has RDONLY flag, and retry the
remount with this flag added.

A test case (which was kind of hard to write) is added, and it fails
without the fix. Note that rootless user need to be able to ssh to
rootless@localhost in order to sshfs to work -- amend setup scripts
to make it work, and skip the test if the setup is not working.

[1] https://github.com/containers/podman/issues/12205

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-11-18 13:09:41 -08:00
Kir Kolyshkin
b247cd392a runc run: fix ro /dev
Commit fb4c27c4b7 (went into v1.0.0-rc93) fixed a bug with
read-only tmpfs, but introduced a bug with read-only /dev.

This happens because /dev is a tmpfs mount and is therefore remounted
read-only a bit earlier than before.

To fix,

1. Revert the part of the above commit which remounts all tmpfs mounts
   as read-only in mountToRootfs.

2. Reuse finalizeRootfs (which is already used to remount /dev
   read-only) to also remount all ro tmpfs mounts that were previously
   mounted rw in mountPropagate.

3. Remove the break in finalizeRootfs, as now we have more than one
   mount to care about.

4. Reorder the if statements in finalizeRootfs to perform the fast check
   (for ro flag) first, and compare the strings second. Since /dev is
   most probably also a tmpfs mount, do the m.Device check first.

Add a test case to validate the fix and prevent future regressions;
make sure it fails before the fix:

 ✗ runc run [ro /dev mount]
   (in test file tests/integration/mounts.bats, line 45)
     `[ "$status" -eq 0 ]' failed
   runc spec (status=0):

   runc run test_busybox (status=1):
   time="2021-11-12T12:19:48-08:00" level=error msg="runc run failed: unable to start container process: error during container init: error mounting \"devpts\" to rootfs at \"/dev/pts\": mkdir /tmp/bats-run-VJXQk7/runc.0Fj70w/bundle/rootfs/dev/pts: read-only file system"

Fixes: fb4c27c4b7
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-11-15 10:37:16 -08:00
Kir Kolyshkin
7563a8f06d libct: wrap more unix errors
When I tried to start a rootless container under a different/wrong user,
I got:

	$ ../runc/runc --systemd-cgroup --root /tmp/runc.$$ run 445
	ERRO[0000] runc run failed: operation not permitted

This is obviously not good enough. With this commit, the error is:

	ERRO[0000] runc run failed: fchown fd 9: operation not permitted

Alas, there are still some code that returns unwrapped errnos from
various unix calls.

This is a followup to commit d8ba4128b2 which wrapped many, but not
all, bare unix errors. Do wrap some more, using either os.PathError or
os.SyscallError.

While at it,
 - use os.SyscallError instead of os.NewSyscallError;
 - use errors.Is(err, os.ErrXxx) instead of os.IsXxx(err).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-11-12 00:33:59 -08:00
Akihiro Suda
4d17654479 Merge pull request #2576 from kinvolk/alban/userns-2484-take2
Open bind mount sources from the host userns
2021-10-28 14:50:33 +09:00
Kir Kolyshkin
5516294172 Remove io/ioutil use
See https://golang.org/doc/go1.16#ioutil

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-10-14 13:46:02 -07:00
Alban Crequy
9c444070ec Open bind mount sources from the host userns
The source of the bind mount might not be accessible in a different user
namespace because a component of the source path might not be traversed
under the users and groups mapped inside the user namespace. This caused
errors such as the following:

  # time="2020-06-22T13:48:26Z" level=error msg="container_linux.go:367:
  starting container process caused: process_linux.go:459:
  container init caused: rootfs_linux.go:58:
  mounting \"/tmp/busyboxtest/source-inaccessible/dir\"
  to rootfs at \"/tmp/inaccessible\" caused:
  stat /tmp/busyboxtest/source-inaccessible/dir: permission denied"

To solve this problem, this patch performs the following:

1. in nsexec.c, it opens the source path in the host userns (so we have
   the right permissions to open it) but in the container mntns (so the
   kernel cross mntns mount check let us mount it later:
   https://github.com/torvalds/linux/blob/v5.8/fs/namespace.c#L2312).

2. in nsexec.c, it passes the file descriptors of the source to the
   child process with SCM_RIGHTS.

3. In runc-init in Golang, it finishes the mounts while inside the
   userns even without access to the some components of the source
   paths.

Passing the fds with SCM_RIGHTS is necessary because once the child
process is in the container mntns, it is already in the container userns
so it cannot temporarily join the host mntns.

This patch uses the existing mechanism with _LIBCONTAINER_* environment
variables to pass the file descriptors from runc to runc init.

This patch uses the existing mechanism with the Netlink-style bootstrap
to pass information about the list of source mounts to nsexec.c.

Rootless containers don't use this bind mount sources fdpassing
mechanism because we can't setns() to the target mntns in a rootless
container (we don't have the privileges when we are in the host userns).

This patch takes care of using O_CLOEXEC on mount fds, and close them
early.

Fixes: #2484.

Signed-off-by: Alban Crequy <alban@kinvolk.io>
Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>
2021-10-12 15:13:45 +02:00
Kir Kolyshkin
9ff64c3d97 *: rm redundant linux build tag
For files that end with _linux.go or _linux_test.go, there is no need to
specify linux build tag, as it is assumed from the file name.

In addition, rename libcontainer/notify_linux_v2.go -> libcontainer/notify_v2_linux.go
for the file name to make sense.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-08-30 20:15:00 -07:00
Aleksa Sarai
09b80811f6 Revert "libct/devices: change devices.Type to be a string"
This reverts commit 814f3ae1d9. This
changed the on-disk state which breaks runc when it has to operate on
containers started with an older runc version. Working around this is
far more complicated than just reverting it.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2021-08-25 14:11:32 +10:00
Kir Kolyshkin
34df203d13 Merge pull request #3159 from thaJeztah/norunes
libct/devices: change devices.Type to be a string
2021-08-23 16:58:34 -07:00
Kir Kolyshkin
75761bccf7 Fix codespell warnings, add codespell to ci
The two exceptions I had to add to codespellrc are:
 - CLOS (used by intelrtd);
 - creat (syscall name used in tests/integration/testdata/seccomp_*.json).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-08-17 16:12:35 -07:00
Sebastiaan van Stijn
814f3ae1d9 libct/devices: change devices.Type to be a string
Possibly there was a specific reason to use a rune for this, but I noticed
that there's various parts in the code that has to convert values from a
string to this type. Using a string as type for this can simplify some of
that code.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-08-13 00:55:22 +02:00
Akihiro Suda
5547b5774f Merge pull request #3033 from kolyshkin/rm-own-errors
libcontainer: rm own error system
2021-07-01 13:47:27 +09:00
Kailun Qin
c508a7bc0a libct/rootfs: consolidate utils imports
Signed-off-by: Kailun Qin <kailun.qin@intel.com>
2021-06-30 06:49:38 -04:00
Kir Kolyshkin
e918d02139 libcontainer: rm own error system
This removes libcontainer's own error wrapping system, consisting of a
few types and functions, aimed at typization, wrapping and unwrapping
of errors, as well as saving error stack traces.

Since Go 1.13 now provides its own error wrapping mechanism and a few
related functions, it makes sense to switch to it.

While doing that, improve some error messages so that they start
with "error", "unable to", or "can't".

A few things that are worth mentioning:

1. We lose stack traces (which were never shown anyway).

2. Users of libcontainer that relied on particular errors (like
   ContainerNotExists) need to switch to using errors.Is with
   the new errors defined in error.go.

3. encoding/json is unable to unmarshal the built-in error type,
   so we have to introduce initError and wrap the errors into it
   (basically passing the error as a string). This is the same
   as it was before, just a tad simpler (actually the initError
   is a type that got removed in commit afa844311; also suddenly
   ierr variable name makes sense now).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-24 10:21:04 -07:00
Kir Kolyshkin
7be93a66b9 *: fmt.Errorf: use %w when appropriate
This should result in no change when the error is printed, but make the
errors returned unwrappable, meaning errors.As and errors.Is will work.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-22 16:09:47 -07:00
Kir Kolyshkin
d8ba4128b2 libct/rootfs: improve some errors
Errors from os.Open, os.Symlink etc do not need to be wrapped, as they
are already wrapped into os.PathError.

Error from unix are bare errnos and need to be wrapped. Same
os.PathError is a good candidate.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-22 16:09:47 -07:00
Kir Kolyshkin
36aefad45d libct: wrap unix.Mount/Unmount errors
Errors returned by unix are bare. In some cases it's impossible to find
out what went wrong because there's is not enough context.

Add a mountError type (mostly copy-pasted from github.com/moby/sys/mount),
and mount/unmount helpers. Use these where appropriate, and convert error
checks to use errors.Is.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-22 16:09:37 -07:00
Kir Kolyshkin
e6048715e4 Use gofumpt to format code
gofumpt (mvdan.cc/gofumpt) is a fork of gofmt with stricter rules.

Brought to you by

	git ls-files \*.go | grep -v ^vendor/ | xargs gofumpt -s -w

Looking at the diff, all these changes make sense.

Also, replace gofmt with gofumpt in golangci.yml.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-01 12:17:27 -07:00
Sebastiaan van Stijn
b45fbd43b8 errcheck: libcontainer
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-05-20 14:19:26 +02:00
Aleksa Sarai
0ca91f44f1 rootfs: add mount destination validation
Because the target of a mount is inside a container (which may be a
volume that is shared with another container), there exists a race
condition where the target of the mount may change to a path containing
a symlink after we have sanitised the path -- resulting in us
inadvertently mounting the path outside of the container.

This is not immediately useful because we are in a mount namespace with
MS_SLAVE mount propagation applied to "/", so we cannot mount on top of
host paths in the host namespace. However, if any subsequent mountpoints
in the configuration use a subdirectory of that host path as a source,
those subsequent mounts will use an attacker-controlled source path
(resolved within the host rootfs) -- allowing the bind-mounting of "/"
into the container.

While arguably configuration issues like this are not entirely within
runc's threat model, within the context of Kubernetes (and possibly
other container managers that provide semi-arbitrary container creation
privileges to untrusted users) this is a legitimate issue. Since we
cannot block mounting from the host into the container, we need to block
the first stage of this attack (mounting onto a path outside the
container).

The long-term plan to solve this would be to migrate to libpathrs, but
as a stop-gap we implement libpathrs-like path verification through
readlink(/proc/self/fd/$n) and then do mount operations through the
procfd once it's been verified to be inside the container. The target
could move after we've checked it, but if it is inside the container
then we can assume that it is safe for the same reason that libpathrs
operations would be safe.

A slight wrinkle is the "copyup" functionality we provide for tmpfs,
which is the only case where we want to do a mount on the host
filesystem. To facilitate this, I split out the copy-up functionality
entirely so that the logic isn't interspersed with the regular tmpfs
logic. In addition, all dependencies on m.Destination being overwritten
have been removed since that pattern was just begging to be a source of
more mount-target bugs (we do still have to modify m.Destination for
tmpfs-copyup but we only do it temporarily).

Fixes: CVE-2021-30465
Reported-by: Etienne Champetier <champetier.etienne@gmail.com>
Co-authored-by: Noah Meyerhans <nmeyerha@amazon.com>
Reviewed-by: Samuel Karp <skarp@amazon.com>
Reviewed-by: Kir Kolyshkin <kolyshkin@gmail.com> (@kolyshkin)
Reviewed-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2021-05-19 16:58:35 +10:00
Kir Kolyshkin
ff692f289b Fix cgroup2 mount for rootless case
In case of rootless, cgroup2 mount is not possible (see [1] for more
details), so since commit 9c81440fb5 runc bind-mounts the whole
/sys/fs/cgroup into container.

Problem is, if cgroupns is enabled, /sys/fs/cgroup inside the container
is supposed to show the cgroup files for this cgroup, not the root one.

The fix is to pass through and use the cgroup path in case cgroup2
mount failed, cgroupns is enabled, and the path is non-empty.

Surely this requires the /sys/fs/cgroup mount in the spec, so modify
runc spec --rootless to keep it.

Before:

	$ ./runc run aaa
	# find /sys/fs/cgroup/ -type d
	/sys/fs/cgroup
	/sys/fs/cgroup/user.slice
	/sys/fs/cgroup/user.slice/user-1000.slice
	/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service
	...
	# ls -l /sys/fs/cgroup/cgroup.controllers
	-r--r--r--    1 nobody   nogroup          0 Feb 24 02:22 /sys/fs/cgroup/cgroup.controllers
	# wc -w /sys/fs/cgroup/cgroup.procs
	142 /sys/fs/cgroup/cgroup.procs
	# cat /sys/fs/cgroup/memory.current
	cat: can't open '/sys/fs/cgroup/memory.current': No such file or directory

After:

	# find /sys/fs/cgroup/ -type d
	/sys/fs/cgroup/
	# ls -l /sys/fs/cgroup/cgroup.controllers
	-r--r--r--    1 root     root             0 Feb 24 02:43 /sys/fs/cgroup/cgroup.controllers
	# wc -w /sys/fs/cgroup/cgroup.procs
	2 /sys/fs/cgroup/cgroup.procs
	# cat /sys/fs/cgroup/memory.current
	577536

[1] https://github.com/opencontainers/runc/issues/2158

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-04-20 12:35:40 -07:00
Kir Kolyshkin
3826db196d libct/rootfs/mountCgroupV2: minor refactor
1. s/cgroupPath/dest/

2. don't hardcode /sys/fs/cgroup

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-04-20 12:30:30 -07:00
Kir Kolyshkin
1e476578b6 libct/rootfs: introduce and use mountConfig
The code is already passing three parameters around from
mountToRootfs to mountCgroupV* to mountToRootfs again.

I am about to add another parameter, so let's introduce and
use struct mountConfig to pass around.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-04-20 12:30:06 -07:00
Kir Kolyshkin
a2050ea471 runc run: fix start for rootless + host pidns
Currently, runc fails like this when used from rootless podman
with host PID namespace:

> $ podman --runtime=runc run --pid=host --rm -it busybox sh
> WARN[0000] additional gid=10 is not present in the user namespace, skip setting it
> Error: container_linux.go:380: starting container process caused:
> process_linux.go:545: container init caused: readonly path /proc/asound:
> operation not permitted: OCI permission denied

(Here /proc/asound is the first path from OCI spec's readonlyPaths).

The code uses MS_BIND|MS_REMOUNT flags that have a special meaning in
the kernel ("keep the flags like nodev, nosuid, noexec as is").
For some reason, this "special meaning" trick is not working for the
above use case (rootless podman + no PID namespace), and I don't know
how to reproduce this without podman.

Instead of relying on the kernel feature, let's just get the current
mount flags using fstatfs(2) and add those that needs to be preserved.

While at it, wrap errors from unix.Mount into os.PathError to make
errors a bit less cryptic.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-04-14 17:32:08 -07:00
Sebastiaan van Stijn
4316df8b53 libcontainer/system: move userns utilities to separate package
Moving these utilities to a separate package, so that consumers of this
package don't have to pull in the whole "system" package.

Looking at uses of these utilities (outside of runc itself);

`RunningInUserNS()` is used by [various external consumers][1],
so adding a "Deprecated" alias for this.

[1]: https://grep.app/search?current=2&q=.RunningInUserNS

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-04-04 22:42:03 +02:00
Akihiro Suda
e7bd1fb10a Merge pull request #2717 from kolyshkin/check-proc-opt
libct/checkProcMounts: optimize
2021-01-29 17:32:45 +09:00
Kir Kolyshkin
692fab0936 libct/checkProcMounts: optimize
Commit 9c1242ecb ("Add white list for bind mount chec", Jan 6 2016)
added a set of entries under /proc which we allow to be mounted to,
for the benefit of lxcfs-like fuse-backed hack to have container's
own version of /proc/meminfo etc.

For some reason, the allow list check is performed at the very
beginning of the function, which is not optimal.

Move the check to the end -- at this point in the code we already
know we're under /proc, so it make sense to consult the allow list.

This makes the code slightly more logical and hopefully slightly faster.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-01-06 15:02:38 -08:00
Kir Kolyshkin
637f82d6b8 runc run: resolve tmpfs mount dest in container scope
In case a tmpfs mount path contains absolute symlinks, runc errors out
because those symlinks are resolved in the host (rather than container)
filesystem scope.

The fix is similar to that for bind mounts -- resolve the destination
in container rootfs scope using securejoin, and use the resolved path.

A simple integration test case is added to prevent future regressions.

Fixes https://github.com/opencontainers/runc/issues/2683.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-01-06 14:57:02 -08:00