zishuo/runc

mirror of https://github.com/opencontainers/runc.git synced 2025-10-21 14:39:36 +08:00

Author	SHA1	Message	Date
Kir Kolyshkin	692fab0936	libct/checkProcMounts: optimize Commit `9c1242ecb` ("Add white list for bind mount chec", Jan 6 2016) added a set of entries under /proc which we allow to be mounted to, for the benefit of lxcfs-like fuse-backed hack to have container's own version of /proc/meminfo etc. For some reason, the allow list check is performed at the very beginning of the function, which is not optimal. Move the check to the end -- at this point in the code we already know we're under /proc, so it makes sense to consult the allow list. This makes the code slightly more logical and hopefully slightly faster. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-01-06 15:02:38 -08:00
Feng Sun	48b8eb0952	checkProcMount: add /proc/slabinfo to whitelist With lxcfs commit, slabinfo should be mountable: "proc_fuse: add /proc/slabinfo with slab accounting memcg" https://github.com/lxc/lxcfs/commit/1cc68c8bfa Signed-off-by: Feng Sun <loyou85@gmail.com>	2020-12-16 09:40:04 +08:00
Sebastiaan van Stijn	677baf22d2	libcontainer: isolate libcontainer/devices Move the Device-related types to libcontainer/devices, so that the package can be used in isolation. Aliases have been created in libcontainer/configs for backward compatibility. Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2020-12-01 11:11:21 +01:00
Giuseppe Scrivano	41aa764010	linux: drop MS_REC for readonly remount it has no effect. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2020-11-06 11:36:50 +01:00
Giuseppe Scrivano	a4e6955e31	linux: fix remount readonly in a user namespace if we are remounting root read only when in a user namespace, make sure the existing flags (e.g. MS_NOEXEC, MS_NODEV) are maintained otherwise the mount fails with EPERM. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2020-11-06 11:35:40 +01:00
Akihiro Suda	9d4c02cf29	Merge pull request #2570 from EduardoVega/2246-fix-chmod-ro-tmpfs-mount Fix mount error when chmod RO tmpfs	2020-10-26 18:19:16 +09:00
Aleksa Sarai	b8bf572812	rootfs: handle nested procfs mounts for MS_MOVE In a case where the host /proc mount has already been overmounted, the MS_MOVE handling would get ENOENT when trying to hide (for instance) "/proc/bus" because it had already hidden away "/proc". This revealed two issues in the previous implementation of this hardening feature: 1. No checks were done to make sure the mount was a "full" mount (it is a mount of the root of the filesystem), but the kernel doesn't permit a non-full mount to be converted to a full mount (for reference, see mnt_already_visible). This just removes extra busy-work during setup. 2. ENOENT was treated as a critical error, even though it actually indicates the mount doesn't exist and thus isn't a problem. A more theoretically pure solution would be to store the set of mountpoints to be hidden and only ignore the error if an ancestor directory of the current mountpoint was already hidden, but that would just add complexity with little justification. In addition, better document the reasoning behind this logic so that folks aren't confused when looking at it. Fixes: `28a697cce3` ("rootfs: umount all procfs and sysfs with --no-pivot") Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2020-10-13 20:54:56 +11:00
Eduardo Vega	fb4c27c4b7	Fix mount error when chmod RO tmpfs Signed-off-by: Eduardo Vega <edvegavalerio@gmail.com>	2020-10-05 21:23:30 -06:00
Kir Kolyshkin	87412ee435	vendor: bump mountinfo v0.3.1 It contains some breaking changes, so fix the code. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-10-01 18:51:25 -07:00
Akihiro Suda	e5f2eae5a5	Merge pull request #2558 from rhatdan/windows Since no kernels support direct labeling of /dev/mqueue remove label	2020-08-22 04:43:36 +09:00
Daniel J Walsh	0445fd60a4	Since no kernels support direct labeling of /dev/mqueue remove label This looks like this is just filling logs for years, since the kernel never added the support for automatically labeling /dev/mqueue. Removes these dmesg lines [ 1731.969847] SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue) [ 1736.985146] SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue) [ 1738.356796] SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue) [ 1738.479952] SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue) [ 1738.628935] SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue) [ 1763.433276] SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue) [ 1806.802133] SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue) [ 1806.982003] SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue) [ 1808.955390] SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue) [ 1815.951076] SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue) [ 1827.257757] SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue) [ 1828.947888] SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue) [ 1834.964451] SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue) [ 1835.941465] SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue) Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>	2020-08-20 13:56:19 -04:00
Giuseppe Scrivano	a63f99fcc5	Add support for umask Signed-off-by: Ashley Cui <acui@redhat.com>	2020-08-20 11:39:43 -04:00
Aleksa Sarai	2265daa55b	merge branch 'pr-2522' into master Cesar Talledo (2): Remove runc default devices that overlap with spec devices. Skip redundant setup for /dev/ptmx when specified explicitly in the OCI spec. LGTMs: @AkihiroSuda @cyphar Closes #2522	2020-08-19 16:58:23 +10:00
Cesar Talledo	9a699e1a9f	Skip redundant setup for /dev/ptmx when specified explicitly in the OCI spec. Per the OCI spec, /dev/ptmx is always a symlink to /dev/pts/ptmx. As such, if the OCI spec has an explicit entry for /dev/ptmx, runc shall ignore it. This change ensures this is the case. An integration test was also added (in tests/integration/dev.bats). Signed-off-by: Cesar Talledo <ctalledo@nestybox.com>	2020-08-07 16:46:26 -07:00
Sebastiaan van Stijn	901dccf05d	vendor: update runtime-spec v1.0.3-0.20200728170252-4d89ac9fbff6 Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2020-07-30 22:08:54 +02:00
Xiaodong Liu	af283b3f47	remove the redundant parameter of chroot function Signed-off-by: Xiaodong Liu <liuxiaodong@loongson.cn>	2020-07-15 16:22:07 +08:00
Renaud Gaubert	ccdd75760c	Add the CreateRuntime, CreateContainer and StartContainer Hooks Signed-off-by: Renaud Gaubert <rgaubert@nvidia.com>	2020-06-17 02:10:00 +00:00
Aleksa Sarai	24388be71e	configs: use different types for .Devices and .Resources.Devices Making them the same type is simply confusing, but also means that you could accidentally use one in the wrong context. This eliminates that problem. This also includes a whole bunch of cleanups for the types within DeviceRule, so that they can be used more ergonomically. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2020-05-13 17:38:45 +10:00
Aleksa Sarai	b2bec9806f	cgroup: devices: eradicate the Allow/Deny lists These lists have been in the codebase for a very long time, and have been unused for a large portion of that time -- specconv doesn't generate them and the only user of these flags has been tests (which doesn't inspire much confidence). In addition, we had an incorrect implementation of a white-list policy. This wasn't exploitable because all of our users explicitly specify "deny all" as the first rule, but it was a pretty glaring issue that came from the "feature" that users can select whether they prefer a white- or black- list. Fix this by always writing a deny-all rule (which is what our users were doing anyway, to work around this bug). This is one of many changes needed to clean up the devices cgroup code. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2020-05-13 17:38:45 +10:00
Sebastiaan van Stijn	64ca54816c	libcontainer: simplify error message The error message was including both the rootfs path, and the full mount path, which also includes the path of the rootfs. This patch removes the rootfs path from the error message, as it was redundant, and made the error message overly verbose Before this patch (errors wrapped for readability): ``` container_linux.go:348: starting container process caused: process_linux.go:438: container init caused: rootfs_linux.go:58: mounting "/foo.txt" to rootfs "/var/lib/docker/overlay2/de506d67da606b807009e23b548fec60d72359c77eec88785d8c7ecd54a6e4b2/merged" at "/var/lib/docker/overlay2/de506d67da606b807009e23b548fec60d72359c77eec88785d8c7ecd54a6e4b2/merged/usr/share/nginx/html" caused: not a directory: unknown ``` With this patch applied: ``` container_linux.go:348: starting container process caused: process_linux.go:438: container init caused: rootfs_linux.go:58: mounting "/foo.txt" to rootfs at "/var/lib/docker/overlay2/de506d67da606b807009e23b548fec60d72359c77eec88785d8c7ecd54a6e4b2/merged/usr/share/nginx/html" caused: not a directory: unknown ``` Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2020-05-03 02:59:46 +02:00
Kir Kolyshkin	55d5c99ca7	libct/mountToRootfs: rm useless code To make a bind mount read-only, it needs to be remounted. This is what the code removed does, but it is not needed here. We have to deal with three cases here: 1. cgroup v2 unified mode. In this case the mount is real mount with fstype=cgroup2, and there is no need to have a bind mount on top, as we pass readonly flag to the mount as is. 2. cgroup v1 + cgroupns (enableCgroupns == true). In this case the "mount" is in fact a set of real mounts with fstype=cgroup, and they are all performed in mountCgroupV1, with readonly flag added if needed. 3. cgroup v1 as is (enableCgroupns == false). In this case mountCgroupV1() calls mountToRootfs() again with an argument from the list obtained from getCgroupMounts(), i.e. a bind mount with the same flags as the original mount has (plus unix.MS_BIND | unix.MS_REC), and mountToRootfs() does remounting (under the case "bind":). So, the code which this patch is removing is not needed -- it essentially does nothing in case 3 above (since the bind mount is already remounted readonly), and in cases 1 and 2 it creates an unneeded extra bind mount on top of a real one (or set of real ones). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-04-23 16:49:12 -07:00
Kir Kolyshkin	9280e3566d	checkpoint/restore: fix cgroupv2 handling In case of cgroupv2 unified hierarchy, the /sys/fs/cgroup mount is the real mount with fstype of cgroup2 (rather than a set of external bind mounts like for cgroupv1). So, we should not add it to the list of "external bind mounts" on both checkpoint and restore. Without this fix, checkpoint integration tests fail on cgroup v2. Also, same is true for cgroup v1 + cgroupns. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-04-22 11:26:43 -07:00
Kir Kolyshkin	dd7b34618f	libct/msMoveRoot: benefit from GetMounts filter Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-03-21 10:33:43 -07:00
Kir Kolyshkin	fc4357a8b0	libct/msMoveRoot: rm redundant filepath.Abs() calls 1. rootfs is already validated to be kosher by (*ConfigValidator).rootfs() 2. mount points from /proc/self/mountinfo are absolute and clean, too Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-03-21 10:33:43 -07:00
Kir Kolyshkin	dce0de8975	getParentMount: benefit from GetMounts filter Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-03-21 10:33:43 -07:00
Kir Kolyshkin	c7ab2c036b	libcontainer: switch to moby/sys/mountinfo package Delete libcontainer/mount in favor of github.com/moby/sys/mountinfo, which is fast mountinfo parser. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-03-21 10:33:43 -07:00
Aleksa Sarai	3291d66b98	rootfs: do not permit /proc mounts to non-directories mount(2) will blindly follow symlinks, which is a problem because it allows a malicious container to trick runc into mounting /proc to an entirely different location (and thus within the attacker's control for a rename-exchange attack). This is just a hotfix (to "stop the bleeding"), and the more complete fix would be to finish libpathrs and port runc to it (to avoid these types of attacks entirely, and defend against a variety of other /proc-related attacks). It can be bypassed by someone having "/" be a volume controlled by another container. Fixes: CVE-2019-19921 Signed-off-by: Aleksa Sarai <asarai@suse.de>	2020-01-17 14:00:30 +11:00
Akihiro Suda	9c81440fb5	cgroup2: allow mounting /sys/fs/cgroup in UserNS without unsharing CgroupNS Bind-mount /sys/fs/cgroup when we are in UserNS but CgroupNS is not unshared, because we cannot mount cgroup2. This behavior corresponds to crun v0.10.2. Fix #2158 Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>	2019-10-27 23:09:41 +09:00
Michael Crosby	331692baa7	Only allow proc mount if it is procfs Fixes #2128 This allows proc to be bind mounted for host and rootless namespace usecases but it removes the ability to mount over the top of proc with a directory. ```bash > sudo docker run --rm apparmor docker: Error response from daemon: OCI runtime create failed: container_linux.go:346: starting container process caused "process_linux.go:449: container init caused \"rootfs_linux.go:58: mounting \\\"/var/lib/docker/volumes/aae28ea068c33d60e64d1a75916cf3ec2dc3634f97571854c9ed30c8401460c1/_data\\\" to rootfs \\\"/var/lib/docker/overlay2/a6be5ae911bf19f8eecb23a295dec85be9a8ee8da66e9fb55b47c841d1e381b7/merged\\\" at \\\"/proc\\\" caused \\\"\\\\\\\"/var/lib/docker/overlay2/a6be5ae911bf19f8eecb23a295dec85be9a8ee8da66e9fb55b47c841d1e381b7/merged/proc\\\\\\\" cannot be mounted because it is not of type proc\\\"\"": unknown. > sudo docker run --rm -v /proc:/proc apparmor docker-default (enforce) root 18989 0.9 0.0 1288 4 ? Ss 16:47 0:00 sleep 20 ``` Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2019-09-24 11:00:18 -04:00
Giuseppe Scrivano	718a566e02	cgroup: support mount of cgroup2 convert a "cgroup" mount to "cgroup2" when the system uses cgroups v2 unified hierarchy. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2019-09-06 17:57:14 +02:00
Adrian Reber	f661e02343	factor out bind mount mountpoint creation During rootfs setup all mountpoints (directory and files) are created before bind mounting the bind mounts. This does not happen during container restore via CRIU. If restoring in an identical but newly created rootfs, the restore fails right now. This just factors out the code to create the bind mount mountpoints so that it also can be used during restore. Signed-off-by: Adrian Reber <areber@redhat.com>	2019-02-08 15:59:51 +01:00
Giuseppe Scrivano	28a697cce3	rootfs: umount all procfs and sysfs with --no-pivot When creating a new user namespace, the kernel doesn't allow mounting a new procfs or sysfs file system if there is not already one instance fully visible in the current mount namespace. When using --no-pivot we were effectively inhibiting this protection from the kernel, as /proc and /sys from the host are still present in the container mount namespace. A container without full access to /proc could then create a new user namespace, and from there be able to mount a fully visible /proc, bypassing the limitations in the container. A simple reproducer for this issue is: unshare -mrfp sh -c "mount -t proc none /proc && echo c > /proc/sysrq-trigger" Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2019-01-14 09:53:35 +01:00
Mrunal Patel	4769cdf607	Merge pull request #1916 from crosbymichael/cgns Add support for cgroup namespace	2018-11-13 12:21:38 -08:00
Yuanhong Peng	df3fa115f9	Add support for cgroup namespace Cgroup namespace can be configured in `config.json` as other namespaces. Here is an example: ``` "namespaces": [ { "type": "pid" }, { "type": "network" }, { "type": "ipc" }, { "type": "uts" }, { "type": "mount" }, { "type": "cgroup" } ], ``` Note that if you want to run a container which has shared cgroup ns with another container, then it's strongly recommended that you set proper `CgroupsPath` of both containers (the second container's cgroup path must be the subdirectory of the first one). Otherwise there might be some unexpected results. Signed-off-by: Yuanhong Peng <pengyuanhong@huawei.com> Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2018-10-31 10:51:43 -04:00
Dominik Süß	0b412e9482	various cleanups to address linter issues Signed-off-by: Dominik Süß <dominik@suess.wtf>	2018-10-13 21:14:03 +02:00
Mrunal Patel	9cda583235	Merge pull request #1832 from giuseppe/runc-drop-invalid-proc-destination-with-chroot linux: drop check for /proc as invalid dest	2018-09-04 09:26:21 -07:00
ChangFeng	3ce8fac7c4	libcontainer: add /proc/loadavg to the white list of bind mount Signed-off-by: JunLi <lijun.git@gmail.com>	2018-08-30 21:30:23 +08:00
Giuseppe Scrivano	636b664027	linux: drop check for /proc as invalid dest it is now allowed to bind mount /proc. This is useful for rootless containers when the PID namespace is shared with the host. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2018-08-30 09:56:18 +02:00
Daniel J Walsh	62a4763a7a	When doing a copyup, /tmp can not be a shared mount point MOVE_MOUNT will fail under certain situations. You are not allowed to MS_MOVE if the parent directory is shared. man mount ... The move operation Move a mounted tree to another place (atomically). The call is: mount --move olddir newdir This will cause the contents which previously appeared under olddir to now be accessible under newdir. The physical location of the files is not changed. Note that olddir has to be a mountpoint. Note also that moving a mount residing under a shared mount is invalid and unsupported. Use findmnt -o TARGET,PROPAGATION to see the current propagation flags. Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>	2018-08-20 17:41:06 -04:00
Mrunal Patel	26ec8a9783	Revert "libcontainer/rootfs_linux: minor cleanup" This reverts commit `1b27db67f1`. Signed-off-by: Mrunal Patel <mrunalp@gmail.com>	2018-08-14 15:50:18 -07:00
Bin Chen	1b27db67f1	libcontainer/rootfs_linux: minor cleanup move variable close to where it is used Signed-off-by: Bin Chen <nk@devicu.com>	2018-04-16 22:25:48 +10:00
Daniel J Walsh	43aea05946	Label the masked tmpfs with the mount label Currently if a confined container process tries to list these directories AVC's are generated because they are labeled with external labels. Adding the mountlabel will remove these AVC's. Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>	2018-03-09 14:29:06 -05:00
Michael Crosby	91ca331474	chroot when no mount namespaces is provided Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2018-01-25 11:36:37 -05:00
Vincent Demeester	3ca4c78b1a	Import docker/docker/pkg/mount into runc This will help get rid of docker/docker dependency in runc 👼 Signed-off-by: Vincent Demeester <vincent@sbr.pm>	2017-11-08 16:25:58 +01:00
Vincent Demeester	594501475e	Use cyphar/filepath-securejoin instead of docker pkg/symlink runc shouldn't depend on docker and be more self-contained. Removing github.com/pkg/symlink dep is the first step to not depend on docker anymore Signed-off-by: Vincent Demeester <vincent@sbr.pm>	2017-10-31 16:53:45 +01:00
Aleksa Sarai	2430a98e64	merge branch 'pr-1500' rootfs: switch ms_private remount of oldroot to ms_slave LGTMs: @crosbymichael @hqhq Closes opencontainers/runc#1500	2017-10-14 09:32:59 +11:00
Akihiro Suda	2edd36fdff	libcontainer: create Cwd when it does not exist The benefit for doing this within runc is that it works well with userns. Actually, runc already does the same thing for mount points. Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>	2017-10-05 05:31:46 +00:00
Tycho Andersen	66eb2a3e8f	fix --read-only containers under --userns-remap The documentation here: https://docs.docker.com/engine/security/userns-remap/#user-namespace-known-limitations says that readonly containers can't be used with user namespaces due to some kernel restriction. In fact, there is a special case in the kernel to be able to do stuff like this, so let's use it. This takes us from: ubuntu@docker:~$ docker run -it --read-only ubuntu docker: Error response from daemon: oci runtime error: container_linux.go:262: starting container process caused "process_linux.go:339: container init caused \"rootfs_linux.go:125: remounting \\\"/dev\\\" as readonly caused \\\"operation not permitted\\\"\"". to: ubuntu@docker:~$ docker-runc --version runc version 1.0.0-rc4+dev commit: ae2948042b08ad3d6d13cd09f40a50ffff4fc688-dirty spec: 1.0.0 ubuntu@docker:~$ docker run -it --read-only ubuntu root@181e2acb909a:/# touch foo touch: cannot touch 'foo': Read-only file system Signed-off-by: Tycho Andersen <tycho@docker.com>	2017-08-24 16:43:21 -06:00
Aleksa Sarai	117c92745b	rootfs: switch ms_private remount of oldroot to ms_slave Using MS_PRIVATE meant that there was a race between the mount(2) and the umount2(2) calls where runc inadvertently has a live reference to a mountpoint that existed on the host (which the host cannot kill implicitly through an unmount and peer sharing). In particular, this means that if we have a devicemapper mountpoint and the host is trying to delete the underlying device, the delete will fail because it is "in use" during the race. While the race is _very_ small (and libdm actually retries to avoid these sorts of cases) this appears to manifest in various cases. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-06-29 01:20:23 +10:00
Christy Perez	3d7cb4293c	Move libcontainer to x/sys/unix Since syscall is outdated and broken for some architectures, use x/sys/unix instead. There are still some dependencies on the syscall package that will remain in syscall for the foreseeable future: Errno Signal SysProcAttr Additionally: - os still uses syscall, so it needs to be kept for anything returning *os.ProcessState, such as process.Wait. Signed-off-by: Christy Perez <christy@linux.vnet.ibm.com>	2017-05-22 17:35:20 -05:00
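
Several entries above (692fab0936, 48b8eb0952, 3ce8fac7c4) concern the allow list of /proc files that may serve as bind mount destinations. The ordering described in 692fab0936, where the code first decides whether the destination is under /proc and only then consults the allow list, could look roughly like the Go sketch below. This is an illustration only, not runc's code: the names `checkProcDest` and `allowedProcFiles` are hypothetical, the list holds just the three files named in this log, and the procfs source-type check from 331692baa7 is left out.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// allowedProcFiles is a hypothetical allow list holding only the files
// named in the log above; the list in runc itself may be longer.
var allowedProcFiles = map[string]bool{
	"/proc/meminfo":  true,
	"/proc/loadavg":  true,
	"/proc/slabinfo": true,
}

// checkProcDest is a hypothetical helper (not runc's checkProcMount).
// It returns an error when dest lies inside <rootfs>/proc and is not
// on the allow list.
func checkProcDest(rootfs, dest string) error {
	procPath := filepath.Join(rootfs, "proc")
	rel, err := filepath.Rel(procPath, filepath.Clean(dest))
	if err != nil {
		return err
	}
	// Destinations outside /proc need no further checks here.
	if rel == ".." || strings.HasPrefix(rel, "../") {
		return nil
	}
	// A mount on /proc itself: the source-type check is omitted in this sketch.
	if rel == "." {
		return nil
	}
	// Only now, knowing dest is under /proc, consult the allow list.
	if allowedProcFiles["/proc/"+rel] {
		return nil
	}
	return fmt.Errorf("%q cannot be mounted because it is inside /proc", dest)
}

func main() {
	for _, d := range []string{
		"/rootfs/proc/meminfo",
		"/rootfs/etc/hosts",
		"/rootfs/proc/sys/kernel",
	} {
		fmt.Printf("%s: %v\n", d, checkProcDest("/rootfs", d))
	}
}
```

Checking the prefix first means that the common case, a destination outside /proc, returns without touching the allow list at all.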

132 Commits