Commit Graph

132 Commits

Author SHA1 Message Date
Kir Kolyshkin
692fab0936 libct/checkProcMounts: optimize
Commit 9c1242ecb ("Add white list for bind mount chec", Jan 6 2016)
added a set of entries under /proc which we allow to be mounted to,
for the benefit of lxcfs-like fuse-backed hack to have container's
own version of /proc/meminfo etc.

For some reason, the allow list check is performed at the very
beginning of the function, which is not optimal.

Move the check to the end -- at this point in the code we already
know we're under /proc, so it make sense to consult the allow list.

This makes the code slightly more logical and hopefully slightly faster.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-01-06 15:02:38 -08:00
Feng Sun
48b8eb0952 checkProcMount: add /proc/slabinfo to whitelist
With lxcfs commit, slabinfo should can be mounted:
"proc_fuse: add /proc/slabinfo with slab accounting memcg"
https://github.com/lxc/lxcfs/commit/1cc68c8bfa

Signed-off-by: Feng Sun <loyou85@gmail.com>
2020-12-16 09:40:04 +08:00
Sebastiaan van Stijn
677baf22d2 libcontainer: isolate libcontainer/devices
Move the Device-related types to libcontainer/devices, so that
the package can be used in isolation. Aliases have been created
in libcontainer/configs for backward compatibility.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-12-01 11:11:21 +01:00
Giuseppe Scrivano
41aa764010 linux: drop MS_REC for readonly remount
it has no effect.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2020-11-06 11:36:50 +01:00
Giuseppe Scrivano
a4e6955e31 linux: fix remount readonly in a user namespace
if we are remounting root read only when in a user namespace, make
sure the existing flags (e.g. MS_NOEXEC, MS_NODEV) are maintained
otherwise the mount fails with EPERM.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2020-11-06 11:35:40 +01:00
Akihiro Suda
9d4c02cf29 Merge pull request #2570 from EduardoVega/2246-fix-chmod-ro-tmpfs-mount
Fix mount error when chmod RO tmpfs
2020-10-26 18:19:16 +09:00
Aleksa Sarai
b8bf572812 rootfs: handle nested procfs mounts for MS_MOVE
In a case where the host /proc mount has already been overmounted, the
MS_MOVE handling would get ENOENT when trying to hide (for instance)
"/proc/bus" because it had already hidden away "/proc". This revealed
two issues in the previous implementation of this hardening feaure:

1. No checks were done to make sure the mount was a "full" mount (it is
   a mount of the root of the filesystem), but the kernel doesn't permit
   a non-full mount to be converted to a full mount (for reference, see
   mnt_already_visible). This just removes extra busy-work during setup.

2. ENOENT was treated as a critical error, even though it actually
   indicates the mount doesn't exist and thus isn't a problem. A more
   theoretically pure solution would be to store the set of mountpoints
   to be hidden and only ignore the error if an ancestor directory of
   the current mountpoint was already hidden, but that would just add
   complexity with little justification.

In addition, better document the reasoning behind this logic so that
folks aren't confused when looking at it.

Fixes: 28a697cce3 ("rootfs: umount all procfs and sysfs with --no-pivot")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2020-10-13 20:54:56 +11:00
Eduardo Vega
fb4c27c4b7 Fix mount error when chmod RO tmpfs
Signed-off-by: Eduardo Vega <edvegavalerio@gmail.com>
2020-10-05 21:23:30 -06:00
Kir Kolyshkin
87412ee435 vendor: bump mountinfo v0.3.1
It contains some breaking changes, so fix the code.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-10-01 18:51:25 -07:00
Akihiro Suda
e5f2eae5a5 Merge pull request #2558 from rhatdan/windows
Since no kernels support direct labeling of /dev/mqueue remove label
2020-08-22 04:43:36 +09:00
Daniel J Walsh
0445fd60a4 Since no kernels support direct labeling of /dev/mqueue remove label
This looks like this is just filling logs for years, since the kernel never
added the support for automatically labeling /dev/mqueue.

Removes these dmesg lines

[ 1731.969847] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[ 1736.985146] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[ 1738.356796] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[ 1738.479952] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[ 1738.628935] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[ 1763.433276] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[ 1806.802133] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[ 1806.982003] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[ 1808.955390] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[ 1815.951076] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[ 1827.257757] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[ 1828.947888] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[ 1834.964451] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[ 1835.941465] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)

Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
2020-08-20 13:56:19 -04:00
Giuseppe Scrivano
a63f99fcc5 Add support for umask
Signed-off-by: Ashley Cui <acui@redhat.com>
2020-08-20 11:39:43 -04:00
Aleksa Sarai
2265daa55b merge branch 'pr-2522' into master
Cesar Talledo (2):
  Remove runc default devices that overlap with spec devices.
  Skip redundant setup for /dev/ptmx when specified explicitly in the OCI spec.

LGTMs: @AkihiroSuda @cyphar
Closes #2522
2020-08-19 16:58:23 +10:00
Cesar Talledo
9a699e1a9f Skip redundant setup for /dev/ptmx when specified explicitly in the OCI spec.
Per the OCI spec, /dev/ptmx is always a symlink to /dev/pts/ptmx. As such, if
the OCI spec has an explicit entry for /dev/ptmx, runc shall ignore it.

This change ensures this is the case. A integration test was also added
(in tests/integration/dev.bats).

Signed-off-by: Cesar Talledo <ctalledo@nestybox.com>
2020-08-07 16:46:26 -07:00
Sebastiaan van Stijn
901dccf05d vendor: update runtime-spec v1.0.3-0.20200728170252-4d89ac9fbff6
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-07-30 22:08:54 +02:00
Xiaodong Liu
af283b3f47 remove redundant the parameter of chroot function
Signed-off-by: Xiaodong Liu <liuxiaodong@loongson.cn>
2020-07-15 16:22:07 +08:00
Renaud Gaubert
ccdd75760c Add the CreateRuntime, CreateContainer and StartContainer Hooks
Signed-off-by: Renaud Gaubert <rgaubert@nvidia.com>
2020-06-17 02:10:00 +00:00
Aleksa Sarai
24388be71e configs: use different types for .Devices and .Resources.Devices
Making them the same type is simply confusing, but also means that you
could accidentally use one in the wrong context. This eliminates that
problem. This also includes a whole bunch of cleanups for the types
within DeviceRule, so that they can be used more ergonomically.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:38:45 +10:00
Aleksa Sarai
b2bec9806f cgroup: devices: eradicate the Allow/Deny lists
These lists have been in the codebase for a very long time, and have
been unused for a large portion of that time -- specconv doesn't
generate them and the only user of these flags has been tests (which
doesn't inspire much confidence).

In addition, we had an incorrect implementation of a white-list policy.
This wasn't exploitable because all of our users explicitly specify
"deny all" as the first rule, but it was a pretty glaring issue that
came from the "feature" that users can select whether they prefer a
white- or black- list. Fix this by always writing a deny-all rule (which
is what our users were doing anyway, to work around this bug).

This is one of many changes needed to clean up the devices cgroup code.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:38:45 +10:00
Sebastiaan van Stijn
64ca54816c libcontainer: simplify error message
The error message was including both the rootfs path, and the full
mount path, which also includes the path of the rootfs.

This patch removes the rootfs path from the error message, as it
was redundant, and made the error message overly verbose

Before this patch (errors wrapped for readability):

```
container_linux.go:348: starting container process caused: process_linux.go:438:
container init caused: rootfs_linux.go:58: mounting "/foo.txt"
to rootfs "/var/lib/docker/overlay2/de506d67da606b807009e23b548fec60d72359c77eec88785d8c7ecd54a6e4b2/merged"
at "/var/lib/docker/overlay2/de506d67da606b807009e23b548fec60d72359c77eec88785d8c7ecd54a6e4b2/merged/usr/share/nginx/html"
caused: not a directory: unknown
```

With this patch applied:

```
container_linux.go:348: starting container process caused: process_linux.go:438:
container init caused: rootfs_linux.go:58: mounting "/foo.txt"
to rootfs at "/var/lib/docker/overlay2/de506d67da606b807009e23b548fec60d72359c77eec88785d8c7ecd54a6e4b2/merged/usr/share/nginx/html"
caused: not a directory: unknown
```

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-05-03 02:59:46 +02:00
Kir Kolyshkin
55d5c99ca7 libct/mountToRootfs: rm useless code
To make a bind mount read-only, it needs to be remounted. This is what
the code removed does, but it is not needed here.

We have to deal with three cases here:

1. cgroup v2 unified mode. In this case the mount is real mount with
   fstype=cgroup2, and there is no need to have a bind mount on top,
   as we pass readonly flag to the mount as is.

2. cgroup v1 + cgroupns (enableCgroupns == true). In this case the
   "mount" is in fact a set of real mounts with fstype=cgroup, and
   they are all performed in mountCgroupV1, with readonly flag
   added if needed.

3. cgroup v1 as is (enableCgroupns == false). In this case
   mountCgroupV1() calls mountToRootfs() again with an argument
   from the list obtained from getCgroupMounts(), i.e. a bind
   mount with the same flags as the original mount has (plus
   unix.MS_BIND | unix.MS_REC), and mountToRootfs() does remounting
   (under the case "bind":).

So, the code which this patch is removing is not needed -- it
essentially does nothing in case 3 above (since the bind mount
is already remounted readonly), and in cases 1 and 2 it
creates an unneeded extra bind mount on top of a real one (or set of
real ones).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-23 16:49:12 -07:00
Kir Kolyshkin
9280e3566d checkpoint/restore: fix cgroupv2 handling
In case of cgroupv2 unified hierarchy, the /sys/fs/cgroup mount
is the real mount with fstype of cgroup2 (rather than a set of
external bind mounts like for cgroupv1).

So, we should not add it to the list of "external bind mounts"
on both checkpoint and restore.

Without this fix, checkpoint integration tests fail on cgroup v2.

Also, same is true for cgroup v1 + cgroupns.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-22 11:26:43 -07:00
Kir Kolyshkin
dd7b34618f libct/msMoveRoot: benefit from GetMounts filter
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-21 10:33:43 -07:00
Kir Kolyshkin
fc4357a8b0 libct/msMoveRoot: rm redundant filepath.Abs() calls
1. rootfs is already validated to be kosher by (*ConfigValidator).rootfs()

2. mount points from /proc/self/mountinfo are absolute and clean, too

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-21 10:33:43 -07:00
Kir Kolyshkin
dce0de8975 getParentMount: benefit from GetMounts filter
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-21 10:33:43 -07:00
Kir Kolyshkin
c7ab2c036b libcontainer: switch to moby/sys/mountinfo package
Delete libcontainer/mount in favor of github.com/moby/sys/mountinfo,
which is fast mountinfo parser.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-21 10:33:43 -07:00
Aleksa Sarai
3291d66b98 rootfs: do not permit /proc mounts to non-directories
mount(2) will blindly follow symlinks, which is a problem because it
allows a malicious container to trick runc into mounting /proc to an
entirely different location (and thus within the attacker's control for
a rename-exchange attack).

This is just a hotfix (to "stop the bleeding"), and the more complete
fix would be finish libpathrs and port runc to it (to avoid these types
of attacks entirely, and defend against a variety of other /proc-related
attacks). It can be bypased by someone having "/" be a volume controlled
by another container.

Fixes: CVE-2019-19921
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-01-17 14:00:30 +11:00
Akihiro Suda
9c81440fb5 cgroup2: allow mounting /sys/fs/cgroup in UserNS without unsharing CgroupNS
Bind-mount /sys/fs/cgroup when we are in UserNS but CgroupNS is not unshared,
because we cannot mount cgroup2.

This behavior correspond to crun v0.10.2.

Fix #2158

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-10-27 23:09:41 +09:00
Michael Crosby
331692baa7 Only allow proc mount if it is procfs
Fixes #2128

This allows proc to be bind mounted for host and rootless namespace usecases but
it removes the ability to mount over the top of proc with a directory.

```bash
> sudo docker run --rm  apparmor
docker: Error response from daemon: OCI runtime create failed:
container_linux.go:346: starting container process caused "process_linux.go:449:
container init caused \"rootfs_linux.go:58: mounting
\\\"/var/lib/docker/volumes/aae28ea068c33d60e64d1a75916cf3ec2dc3634f97571854c9ed30c8401460c1/_data\\\"
to rootfs
\\\"/var/lib/docker/overlay2/a6be5ae911bf19f8eecb23a295dec85be9a8ee8da66e9fb55b47c841d1e381b7/merged\\\"
at \\\"/proc\\\" caused
\\\"\\\\\\\"/var/lib/docker/overlay2/a6be5ae911bf19f8eecb23a295dec85be9a8ee8da66e9fb55b47c841d1e381b7/merged/proc\\\\\\\"
cannot be mounted because it is not of type proc\\\"\"": unknown.

> sudo docker run --rm -v /proc:/proc apparmor

docker-default (enforce)        root     18989  0.9  0.0   1288     4 ?
Ss   16:47   0:00 sleep 20
```

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2019-09-24 11:00:18 -04:00
Giuseppe Scrivano
718a566e02 cgroup: support mount of cgroup2
convert a "cgroup" mount to "cgroup2" when the system uses cgroups v2
unified hierarchy.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2019-09-06 17:57:14 +02:00
Adrian Reber
f661e02343 factor out bind mount mountpoint creation
During rootfs setup all mountpoints (directory and files) are created
before bind mounting the bind mounts. This does not happen during
container restore via CRIU. If restoring in an identical but newly created
rootfs, the restore fails right now. This just factors out the code to
create the bind mount mountpoints so that it also can be used during
restore.

Signed-off-by: Adrian Reber <areber@redhat.com>
2019-02-08 15:59:51 +01:00
Giuseppe Scrivano
28a697cce3 rootfs: umount all procfs and sysfs with --no-pivot
When creating a new user namespace, the kernel doesn't allow to mount
a new procfs or sysfs file system if there is not already one instance
fully visible in the current mount namespace.

When using --no-pivot we were effectively inhibiting this protection
from the kernel, as /proc and /sys from the host are still present in
the container mount namespace.

A container without full access to /proc could then create a new user
namespace, and from there able to mount a fully visible /proc, bypassing
the limitations in the container.

A simple reproducer for this issue is:

unshare -mrfp sh -c "mount -t proc none /proc && echo c > /proc/sysrq-trigger"

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2019-01-14 09:53:35 +01:00
Mrunal Patel
4769cdf607 Merge pull request #1916 from crosbymichael/cgns
Add support for cgroup namespace
2018-11-13 12:21:38 -08:00
Yuanhong Peng
df3fa115f9 Add support for cgroup namespace
Cgroup namespace can be configured in `config.json` as other
namespaces. Here is an example:

```
"namespaces": [
	{
		"type": "pid"
	},
	{
		"type": "network"
	},
	{
		"type": "ipc"
	},
	{
		"type": "uts"
	},
	{
		"type": "mount"
	},
	{
		"type": "cgroup"
	}
],

```

Note that if you want to run a container which has shared cgroup ns with
another container, then it's strongly recommended that you set
proper `CgroupsPath` of both containers(the second container's cgroup
path must be the subdirectory of the first one). Or there might be
some unexpected results.

Signed-off-by: Yuanhong Peng <pengyuanhong@huawei.com>
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2018-10-31 10:51:43 -04:00
Dominik Süß
0b412e9482 various cleanups to address linter issues
Signed-off-by: Dominik Süß <dominik@suess.wtf>
2018-10-13 21:14:03 +02:00
Mrunal Patel
9cda583235 Merge pull request #1832 from giuseppe/runc-drop-invalid-proc-destination-with-chroot
linux: drop check for /proc as invalid dest
2018-09-04 09:26:21 -07:00
ChangFeng
3ce8fac7c4 libcontainer: add /proc/loadavg to the white list of bind mount
Signed-off-by: JunLi <lijun.git@gmail.com>
2018-08-30 21:30:23 +08:00
Giuseppe Scrivano
636b664027 linux: drop check for /proc as invalid dest
it is now allowed to bind mount /proc.  This is useful for rootless
containers when the PID namespace is shared with the host.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2018-08-30 09:56:18 +02:00
Daniel J Walsh
62a4763a7a When doing a copyup, /tmp can not be a shared mount point
MOVE_MOUNT will fail under certain situations.

You are not allowed to MS_MOVE if the parent directory is shared.

man mount
...
   The move operation
       Move a mounted tree to another place (atomically).  The call is:

              mount --move olddir newdir

       This  will cause the contents which previously appeared under olddir to
       now be accessible under newdir.  The physical location of the files  is
       not changed.  Note that olddir has to be a mountpoint.

       Note  also that moving a mount residing under a shared mount is invalid
       and unsupported.  Use findmnt -o TARGET,PROPAGATION to see the  current
       propagation flags.

Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
2018-08-20 17:41:06 -04:00
Mrunal Patel
26ec8a9783 Revert "libcontainer/rootfs_linux: minor cleanup"
This reverts commit 1b27db67f1.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2018-08-14 15:50:18 -07:00
Bin Chen
1b27db67f1 libcontainer/rootfs_linux: minor cleanup
move variable close to where is used

Signed-off-by: Bin Chen <nk@devicu.com>
2018-04-16 22:25:48 +10:00
Daniel J Walsh
43aea05946 Label the masked tmpfs with the mount label
Currently if a confined container process tries to list these directories
AVC's are generated because they are labeled with external labels.  Adding
the mountlabel will remove these AVC's.

Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
2018-03-09 14:29:06 -05:00
Michael Crosby
91ca331474 chroot when no mount namespaces is provided
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2018-01-25 11:36:37 -05:00
Vincent Demeester
3ca4c78b1a Import docker/docker/pkg/mount into runc
This will help get rid of docker/docker dependency in runc 👼

Signed-off-by: Vincent Demeester <vincent@sbr.pm>
2017-11-08 16:25:58 +01:00
Vincent Demeester
594501475e Use cyphar/filepath-securejoin instead of docker pkg/symlink
runc shouldn't depend on docker and be more self-contained.
Removing github.com/pkg/symlink dep is the first step to not depend on docker anymore

Signed-off-by: Vincent Demeester <vincent@sbr.pm>
2017-10-31 16:53:45 +01:00
Aleksa Sarai
2430a98e64 merge branch 'pr-1500'
rootfs: switch ms_private remount of oldroot to ms_slave

LGTMs: @crosbymichael @hqhq
Closes opencontainers/runc#1500
2017-10-14 09:32:59 +11:00
Akihiro Suda
2edd36fdff libcontainer: create Cwd when it does not exist
The benefit for doing this within runc is that it works well with
userns.
Actually, runc already does the same thing for mount points.

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2017-10-05 05:31:46 +00:00
Tycho Andersen
66eb2a3e8f fix --read-only containers under --userns-remap
The documentation here:
https://docs.docker.com/engine/security/userns-remap/#user-namespace-known-limitations

says that readonly containers can't be used with user namespaces do to some
kernel restriction. In fact, there is a special case in the kernel to be
able to do stuff like this, so let's use it.

This takes us from:

ubuntu@docker:~$ docker run -it --read-only ubuntu
docker: Error response from daemon: oci runtime error: container_linux.go:262: starting container process caused "process_linux.go:339: container init caused \"rootfs_linux.go:125: remounting \\\"/dev\\\" as readonly caused \\\"operation not permitted\\\"\"".

to:

ubuntu@docker:~$ docker-runc --version
runc version 1.0.0-rc4+dev
commit: ae2948042b08ad3d6d13cd09f40a50ffff4fc688-dirty
spec: 1.0.0
ubuntu@docker:~$ docker run -it --read-only ubuntu
root@181e2acb909a:/# touch foo
touch: cannot touch 'foo': Read-only file system

Signed-off-by: Tycho Andersen <tycho@docker.com>
2017-08-24 16:43:21 -06:00
Aleksa Sarai
117c92745b rootfs: switch ms_private remount of oldroot to ms_slave
Using MS_PRIVATE meant that there was a race between the mount(2) and
the umount2(2) calls where runc inadvertently has a live reference to a
mountpoint that existed on the host (which the host cannot kill
implicitly through an unmount and peer sharing).

In particular, this means that if we have a devicemapper mountpoint and
the host is trying to delete the underlying device, the delete will fail
because it is "in use" during the race. While the race is _very_ small
(and libdm actually retries to avoid these sorts of cases) this appears
to manifest in various cases.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-06-29 01:20:23 +10:00
Christy Perez
3d7cb4293c Move libcontainer to x/sys/unix
Since syscall is outdated and broken for some architectures,
use x/sys/unix instead.

There are still some dependencies on the syscall package that will
remain in syscall for the forseeable future:

Errno
Signal
SysProcAttr

Additionally:
- os still uses syscall, so it needs to be kept for anything
returning *os.ProcessState, such as process.Wait.

Signed-off-by: Christy Perez <christy@linux.vnet.ibm.com>
2017-05-22 17:35:20 -05:00