zishuo/runc

mirror of https://github.com/opencontainers/runc.git synced 2025-10-05 23:46:57 +08:00

Author	SHA1	Message	Date
Kir Kolyshkin	f26ec92221	libct: rm Rootless* properties from initConfig They are passed in initConfig twice, so it does not make sense. NB: the alternative to that would be to remove Config field from initConfig, but it results in a much bigger patch and more maintenance down the road. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-02-11 18:01:30 -08:00
Kir Kolyshkin	6c9ddcc648	libct: switch from libct/devices to libct/cgroups/devices/config Use the old package name as an alias to minimize the patch. No functional change; this just eliminates a bunch of deprecation warnings. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-01-31 16:51:09 -08:00
Kir Kolyshkin	93091e6ac2	libct: don't pass SpecState to init unless needed SpecState field of initConfig is only needed to run hooks that are executed inside a container -- namely CreateContainer and StartContainer. If these hooks are not configured, there is no need to fill, marshal and unmarshal SpecState. While at it, inline updateSpecState as it is trivial and only has one user. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-12-22 17:52:15 -08:00
Sebastiaan van Stijn	9b60a93cf3	libcontainer/userns: migrate to github.com/moby/sys/userns The userns package was moved to the moby/sys/userns module at commit `3778ae603c`. This patch deprecates the old location, and adds it as an alias for the moby/sys/userns package. Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2024-10-09 22:20:25 +08:00
Kir Kolyshkin	13a6f56097	runc run: fix mount leak When preparing to mount container root, we need to make its parent mount private (i.e. disable propagation), otherwise the new in-container mounts are leaked to the host. To find a parent mount, we used to read mountinfo and find the longest entry which can be a parent of the container root directory. Unfortunately, due to a kernel bug in all Linux kernels older than v5.8 (see [1], [2]), sometimes mountinfo can't be read in its entirety. In this case, getParentMount may occasionally return a wrong parent mount. As a result, we do not change the mount propagation to private, and container mounts are leaked. Alas, we can not fix the kernel, and reading mountinfo a few times to ensure its consistency (like it's done in, say, Kubernetes) does not look like a good solution for performance reasons. Fortunately, we don't need mountinfo. Let's just traverse the directory tree, trying to remount it private until we find a mount point (any error other than EINVAL means we just found it). Fixes issue 2404. [1]: https://github.com/kolyshkin/procfs-test [2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9f6c61f96f2d97cbb5f Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-10-02 13:58:27 -07:00
Aleksa Sarai	63c2908164	rootfs: try to scope MkdirAll to stay inside the rootfs While we use SecureJoin to try to make all of our target paths inside the container safe, SecureJoin is not safe against an attacker that can change the path after we "resolve" it. os.MkdirAll can inadvertently follow symlinks and thus an attacker could end up tricking runc into creating empty directories on the host (note that the container doesn't get access to these directories, and the host just sees empty directories). However, this could potentially cause DoS issues by (for instance) creating a directory in a conf.d directory for a daemon that doesn't handle subdirectories properly. In addition, the handling for creating file bind-mounts did a plain open(O_CREAT) on the SecureJoin'd path, which is even more obviously unsafe (luckily we didn't use O_TRUNC, or this bug could've allowed an attacker to cause data loss...). Regardless of the symlink issue, opening an untrusted file could result in a DoS if the file is a hung tty or some other "nasty" file. We can use mknodat to safely create a regular file without opening anything anyway (O_CREAT|O_EXCL would also work but it makes the logic a bit more complicated, and we don't want to open the file for any particular reason anyway). libpathrs[1] is the long-term solution for these kinds of problems, but for now we can patch this particular issue by creating a more restricted MkdirAll that refuses to resolve symlinks and does the creation using file descriptors. This is loosely based on a more secure version that filepath-securejoin now has[2] and will be added to libpathrs soon[3]. [1]: https://github.com/openSUSE/libpathrs [2]: https://github.com/cyphar/filepath-securejoin/releases/tag/v0.3.0 [3]: https://github.com/openSUSE/libpathrs/issues/10 Fixes: CVE-2024-45310 Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2024-09-03 02:34:13 +10:00
Aleksa Sarai	1410a6988d	rootfs: consolidate mountpoint creation logic The logic for how we create mountpoints is spread over each mountpoint preparation function, when in reality the behaviour is pretty uniform with only a handful of exceptions. So just move it all to one function that is easier to understand. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2024-07-25 14:16:05 +10:00
Sohan Kunkerkar	cde1d0908a	libcontainer: force apps to think fips is enabled/disabled for testing The motivation behind this change is to provide a flexible mechanism for containers within a Kubernetes cluster to opt out of FIPS mode when necessary. This change enables apps to simulate FIPS mode being enabled or disabled for testing purposes. Users can control whether apps believe FIPS mode is on or off by manipulating `/proc/sys/crypto/fips_enabled`. Signed-off-by: Sohan Kunkerkar <sohank2602@gmail.com>	2024-04-10 18:58:34 -04:00
Aleksa Sarai	cdff09ab87	rootfs: fix 'can we mount on top of /proc' check Our previous test for whether we can mount on top of /proc incorrectly assumed that it would only be called with bind-mount sources. This meant that having a non bind-mount entry for a pseudo-filesystem (like overlayfs) with a dummy source set to /proc on the host would let you bypass the check, which could easily lead to security issues. In addition, the check should be applied more uniformly to all mount types, so fix that as well. And add some tests for some of the tricky cases to make sure we protect against them properly. Fixes: `331692baa7` ("Only allow proc mount if it is procfs") Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-12-14 11:36:42 +11:00
Aleksa Sarai	8e8b136c49	tree-wide: use /proc/thread-self for thread-local state With the idmap work, we will have a tainted Go thread in our thread-group that has a different mount namespace to the other threads. It seems that (due to some bad luck) the Go scheduler tends to make this thread the thread-group leader in our tests, which results in very baffling failures where /proc/self/mountinfo produces gibberish results. In order to avoid this, switch to using /proc/thread-self for everything that is thread-local. This primarily includes switching all file descriptor paths (CLONE_FS), all of the places that check the current cgroup (technically we never will run a single runc thread in a separate cgroup, but better to be safe than sorry), and the aforementioned mountinfo code. We don't need to do anything for the following because the results we need aren't thread-local: * Checks that certain namespaces are supported by stat(2)ing /proc/self/ns/... * /proc/self/exe and /proc/self/cmdline are not thread-local. * While threads can be in different cgroups, we do not do this for the runc binary (or libcontainer) and thus we do not need to switch to the thread-local version of /proc/self/cgroups. * All of the CLONE_NEWUSER files are not thread-local because you cannot set the usernamespace of a single thread (setns(CLONE_NEWUSER) is blocked for multi-threaded programs). Note that we have to use runtime.LockOSThread when we have an open handle to a tid-specific procfs file that we are operating on multiple times. Go can reschedule us such that we are running on a different thread and then kill the original thread (causing -ENOENT or similarly confusing errors). This is not strictly necessary for most usages of /proc/thread-self (such as using /proc/thread-self/fd/$n directly) since only operating on the actual inodes associated with the tid requires this locking, but because of the pre-3.17 fallback for CentOS, we have to do this in most cases. 
In addition, CentOS's kernel is too old for /proc/thread-self, which requires us to emulate it -- however in rootfs_linux.go, we are in the container pid namespace but /proc is the host's procfs. This leads to the incredibly frustrating situation where there is no way (on pre-4.1 Linux) to figure out which /proc/self/task/... entry refers to the current tid. We can just use /proc/self in this case. Yes this is all pretty ugly. I also wish it wasn't necessary. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-12-14 11:36:41 +11:00
Aleksa Sarai	ba0b5e2698	libcontainer: remove all mount logic from nsexec With open_tree(OPEN_TREE_CLONE), it is possible to implement both the id-mapped mounts and bind-mount source file descriptor logic entirely in Go without requiring any complicated handling from nsexec. However, implementing it the naive way (do the OPEN_TREE_CLONE in the host namespace before the rootfs is set up -- which is what the existing implementation did) exposes issues in how mount ordering (in particular when handling mount sources from inside the container rootfs, but also in relation to mount propagation) was handled for idmapped mounts and bind-mount sources. In order to solve this problem completely, it is necessary to spawn a thread which joins the container mount namespace and provides mountfds when requested by the rootfs setup code (ensuring that the mount order and mount propagation of the source of the bind-mount are handled correctly). While the need to join the mount namespace leads to other complicated (such as with the usage of /proc/self -- fixed in a later patch) the resulting code is still reasonable and is the only real way to solve the issue. This allows us to reduce the amount of C code we have in nsexec, as well as simplifying a whole host of places that were made more complicated with the addition of id-mapped mounts and the bind sourcefd logic. Because we join the container namespace, we can continue to use regular O_PATH file descriptors for non-id-mapped bind-mount sources (which means we don't have to raise the kernel requirement for that case). In addition, we can easily add support for id-mappings that don't match the container's user namespace. The approach taken here is to use Go's officially supported mechanism for spawning a process in a user namespace, but (ab)use PTRACE_TRACEME to avoid actually having to exec a different process. 
The most efficient way to implement this would be to do clone() in cgo directly to run a function that just does kill(getpid(), SIGSTOP) -- we can always switch to that if it turns out this approach is too slow. It should be noted that the included micro-benchmark seems to indicate this is Fast Enough(TM): goos: linux goarch: amd64 pkg: github.com/opencontainers/runc/libcontainer/userns cpu: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz BenchmarkSpawnProc BenchmarkSpawnProc-8 1670 770065 ns/op Fixes: `fda12ab101` ("Support idmap mounts on volumes") Fixes: `9c444070ec` ("Open bind mount sources from the host userns") Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-12-14 11:36:40 +11:00
Aleksa Sarai	7c71a22705	rootfs: remove --no-mount-fallback and finally fix MS_REMOUNT The original reasoning for this option was to avoid having mount options be overwritten by runc. However, adding command-line arguments has historically been a bad idea because it forces strict-runc-compatible OCI runtimes to copy out-of-spec features directly from runc and these flags are usually quite difficult to enable by users when using runc through several layers of engines and orchestrators. A far more preferable solution is to have a heuristic which detects whether copying the original mount's mount options would override an explicit mount option specified by the user. In this case, we should return an error. You only end up in this path in the userns case, if you have a bind-mount source with locked flags. During the course of writing this patch, I discovered that several aspects of our handling of flags for bind-mounts left much to be desired. We have completely botched the handling of explicitly cleared flags since commit `97f5ee4e6a` ("Only remount if requested flags differ from current"), with our behaviour only becoming increasingly more weird with `50105de1d8` ("Fix failure with rw bind mount of a ro fuse") and `da780e4d27` ("Fix bind mounts of filesystems with certain options set"). In short, we would only clear flags explicitly request by the user purely by chance, in ways that it really should've been reported to us by now. The most egregious is that mounts explicitly marked "rw" were actually mounted "ro" if the bind-mount source was "ro" and no other special flags were included. In addition, our handling of atime was completely broken -- mostly due to how subtle the semantics of atime are on Linux. Unfortunately, while the runtime-spec requires us to implement mount(8)'s behaviour, several aspects of the util-linux mount(8)'s behaviour are broken and thus copying them makes little sense. 
Since the runtime-spec behaviour for this case (should mount options for a "bind" mount use the "mount --bind -o ..." or "mount --bind -o remount,..." semantics? Is the fallback code we have for userns actually spec-compliant?) and the mount(8) behaviour (see [1]) are not well-defined, this commit simply fixes the most obvious aspects of the behaviour that are broken while keeping the current spirit of the implementation. NOTE: The handling of atime in the base case is left for a future PR to deal with. This means that the atime of the source mount will be silently left alone unless the fallback path needs to be taken, and any flags not explicitly set will be cleared in the base case. Whether we should always be operating as "mount --bind -o remount,..." (where we default to the original mount source flags) is a topic for a separate PR and (probably) associated runtime-spec PR. So, to resolve this: * We store which flags were explicitly requested to be cleared by the user, so that we can detect whether the userns fallback path would end up setting a flag the user explicitly wished to clear. If so, we return an error because we couldn't fulfil the configuration settings. * Revert `97f5ee4e6a` ("Only remount if requested flags differ from current"), as missing flags do not mean we can skip MS_REMOUNT (in fact, missing flags are how you indicate a flag needs to be cleared with mount(2)). The original purpose of the patch was to fix the userns issue, but as mentioned above the correct mechanism is to do a fallback mount that copies the lockable flags from statfs(2). * Improve handling of atime in the fallback case by: - Correctly handling the returned flags in statfs(2). - Implement the MNT_LOCK_ATIME checks in our code to ensure we produce errors rather than silently producing incorrect atime mounts. 
* Improve the tests so we correctly detect all of these contingencies, including a general "bind-mount atime handling" test to ensure that the behaviour described here is accurate. This change also inlines the remount() function -- it was only ever used for the bind-mount remount case, and its behaviour is very bind-mount specific. [1]: https://github.com/util-linux/util-linux/issues/2433 Reverts: `97f5ee4e6a` ("Only remount if requested flags differ from current") Fixes: `50105de1d8` ("Fix failure with rw bind mount of a ro fuse") Fixes: `da780e4d27` ("Fix bind mounts of filesystems with certain options set") Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-10-24 17:28:25 +11:00
Aleksa Sarai	9350f9013e	merge #4039 into opencontainers/runc:main Kir Kolyshkin (1): libct: use chmod instead of umask LGTMs: lifubang cyphar	2023-10-04 14:55:07 +11:00
Kir Kolyshkin	2e2ecf29ff	libct: use chmod instead of umask Umask is problematic for Go programs as it affects other goroutines (see [1] for more details). Instead of using it, let's just prop up with Chmod. Note this patch misses the MkdirAll call in createDeviceNode. Since the runtime spec does not say anything about creating intermediary directories for device nodes, let's assume that doing it via mkdir with the current umask set is sufficient (if not, we have to reimplement MkdirAll from scratch, with added call to os.Chmod). [1] https://github.com/opencontainers/runc/pull/3563#discussion_r990293788 Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-09-27 16:46:53 -07:00
Aleksa Sarai	8da42aaec2	sync: split init config (stream) and synchronisation (seqpacket) pipes We have different requirements for the initial configuration and initWaiter pipe (just send netlink and JSON blobs with no complicated handling needed for message coalescing) and the packet-based synchronisation pipe. Tests with switching everything to SOCK_SEQPACKET lead to endless issues with runc hanging on start-up because random things would try to do short reads (which SOCK_SEQPACKET will not allow and the Go stdlib explicitly treats as a streaming source), so splitting it was the only reasonable solution. Even doing somewhat dodgy tricks such as adding a Read() wrapper which actually calls ReadPacket() and makes it seem like a stream source doesn't work -- and is a bit too magical. One upside is that doing it this way makes the difference between the modes clearer -- INITPIPE is still used for initWaiter synchronisation but aside from that all other synchronisation is done by SYNCPIPE. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-09-24 20:31:14 +08:00
Kir Kolyshkin	6a4870e4ac	libct: better errors for hooks When a hook has failed, the error message looks like this: > error running hook: error running hook #1: exit status 1, stdout: ... The two problems here are: 1. it is impossible to know what kind of hook it was; 2. "error running hook" stuttering; Change that to > error running createContainer hook #1: exit status 1, stdout: ... Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-08-24 19:44:05 -07:00
Aleksa Sarai	5c7839b503	rootfs: use empty src for MS_REMOUNT The kernel ignores these arguments, and passing them can lead to confusing error messages (the old source is irrelevant for MS_REMOUNT), as well as causing issues for a future patch where we switch to move_mount(2). Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-08-15 19:54:24 -07:00
Aleksa Sarai	f81ef1493d	libcontainer: sync: cleanup synchronisation code This includes quite a few cleanups and improvements to the way we do synchronisation. The core behaviour is unchanged, but switching to embedding json.RawMessage into the synchronisation structure will allow us to do more complicated synchronisation operations in future patches. The file descriptor passing through the synchronisation system feature will be used as part of the idmapped-mount and bind-mount-source features when switching that code to use the new mount API outside of nsexec.c. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-08-15 19:54:24 -07:00
Rodrigo Campos	ec2ffae5f1	libct: Allow rel paths for idmap mounts The idea was to make them strict on dest path from the beginning for idmap mounts, as runc would do that for all mounts in the future. But that is causing too many problems. For now, let's just allow relative paths for idmap mounts too. It just seems safer. Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>	2023-08-08 13:45:31 +02:00
lifubang	6092a4b42d	fix some file mode bits missing when doing mount syscall Signed-off-by: lifubang <lifubang@acmcoder.com>	2023-08-03 08:44:00 +08:00
Ruediger Pluem	da780e4d27	Fix bind mounts of filesystems with certain options set Currently bind mounts of filesystems with nodev, nosuid, noexec, noatime, relatime, strictatime, nodiratime options set fail in rootless mode if the same options are not set for the bind mount. For ro filesystems this was resolved by #2570 by remounting again with ro set. Follow the same approach for nodev, nosuid, noexec, noatime, relatime, strictatime, nodiratime but allow to revert back to the old behaviour via the new `--no-mount-fallback` command line option. Add a testcase to verify that bind mounts of filesystems with nodev, nosuid, noexec, noatime options set work in rootless mode. Add a testcase that mounts a nodev, nosuid, noexec, noatime filesystem with a ro flag. Add two further testcases that ensure that the above testcases would fail if the `--no-mount-fallback` command line option is set. * contrib/completions/bash/runc: Add `--no-mount-fallback` command line option for bash completion. * create.go: Add `--no-mount-fallback` command line option. * restore.go: Add `--no-mount-fallback` command line option. * run.go: Add `--no-mount-fallback` command line option. * libcontainer/configs/config.go: Add `NoMountFallback` field to the `Config` struct to store the command line option value. * libcontainer/specconv/spec_linux.go: Add `NoMountFallback` field to the `CreateOpts` struct to store the command line option value and store it in the libcontainer config. * utils_linux.go: Store the command line option value in the `CreateOpts` struct. * libcontainer/rootfs_linux.go: In case that `--no-mount-fallback` is not set try to remount the bind filesystem again with the options nodev, nosuid, noexec, noatime, relatime, strictatime or nodiratime if they are set on the source filesystem. * tests/integration/mounts_sshfs.bats: Add testcases and rework sshfs setup to allow specifying different mount options depending on the test case. 
Signed-off-by: Ruediger Pluem <ruediger.pluem@vodafone.com>	2023-07-28 16:32:02 -07:00
Francis Laniel	a3785c88ec	Remove idmapFD field for mountEntry We cannot have both srcFD and idMapFD set at the same time. So, we can simplify this struct to only have one field which is used as srcFD most of the time and as idMapFD when we do an id map mount. Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com>	2023-07-21 13:55:34 +02:00
Francis Laniel	46ada59ba2	Use an int for srcFD Prior to this commit, we used a string for srcFD, in the form /proc/self/fd/NN. This commit changes that behavior, so srcFD is only an int and the full path is constructed in mountViaFDs() if srcFD is not nil. Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com>	2023-07-21 13:55:34 +02:00
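The change in 46ada59ba2 amounts to deferring path construction until the mount call. A rough sketch, assuming the descriptor is carried as a `*int` where nil means "no descriptor"; `mountSource` is a hypothetical name, not the function used in runc:

```go
package main

import (
	"fmt"
	"strconv"
)

// mountSource returns the path to hand to mount(2): the configured source
// when no descriptor was passed, otherwise the procfs path of the descriptor.
func mountSource(source string, srcFD *int) string {
	if srcFD == nil {
		return source
	}
	return "/proc/self/fd/" + strconv.Itoa(*srcFD)
}

func main() {
	fd := 7
	fmt.Println(mountSource("/userdata/resolv.conf", nil))
	fmt.Println(mountSource("/userdata/resolv.conf", &fd))
}
```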
Rodrigo Campos	fda12ab101	Support idmap mounts on volumes This commit adds support for idmap mounts as specified in the runtime-spec. We open the idmap source paths and call mount_setattr() in runc PARENT, as we need privileges in the init userns for that, and then send the fds to the child process. For this fd passing we use the same mechanism used in other parts of the code, the _LIBCONTAINER_ env vars. The mount is finished (unix.MoveMount) from go code, inside the userns, so we reuse all the prepareBindMount() security checks and the remount logic for some flags too. This commit only supports idmap mounts when userns are used AND the mappings are the same specified for the userns mapping. This limitation is to simplify the initial implementation, as all our users so far only need this, and we can avoid sending over netlink the mappings, creating a userns with this custom mapping, etc. Future PRs will remove this limitation. Co-authored-by: Francis Laniel <flaniel@linux.microsoft.com> Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>	2023-07-17 13:30:12 +02:00
Rodrigo Campos	fe4528b176	libcontainer: Just print the mountFds slice len on errors Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>	2023-07-11 16:17:48 +02:00
Rodrigo Campos	73b649705a	libcontainer: Add mountFds struct We will need to pass more slices of fds to these functions in future patches. Let's add a struct that just contains them all, instead of adding lot of parameters to these functions. Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>	2023-07-11 16:17:48 +02:00
Brian Goff	9fa8b9de3e	Fix tmpfs mode opts when dir already exists When a directory already exists (or after a container is restarted) the perms of the directory being mounted to were being used even when a different permission is set on the tmpfs mount options. This prepends the original directory perms to the mount options. If the perms were already set in the mount opts then those perms will win. This eliminates the need to perform a chmod after mount entirely. Signed-off-by: Brian Goff <cpuguy83@gmail.com>	2023-06-26 21:53:21 +00:00
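The option ordering described in 9fa8b9de3e can be sketched as plain string assembly. `tmpfsData` is a hypothetical helper; this sketch keeps only the permission bits (special bits such as sticky are ignored here for brevity) and relies on the commit's statement that a mode already present in the mount options wins over the prepended one.

```go
package main

import (
	"fmt"
	"os"
)

// tmpfsData prepends the existing directory's permission bits to the
// user-supplied mount data. An explicit mode= in userData comes later in the
// string and is therefore the one that takes effect.
func tmpfsData(dirMode os.FileMode, userData string) string {
	data := fmt.Sprintf("mode=%04o", uint32(dirMode.Perm()))
	if userData != "" {
		data += "," + userData
	}
	return data
}

func main() {
	fmt.Println(tmpfsData(0o755, "size=64m"))
	fmt.Println(tmpfsData(0o755, "mode=1777,size=64m"))
}
```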
Kir Kolyshkin	a60933bb24	libct/rootfs: introduce and use mountEntry Adding fd field to mountConfig was not a good thing since mountConfig contains data that is not specific to a particular mount, while fd is a mount entry attribute. Introduce mountEntry structure, which embeds configs.Mount and adds srcFd to replace the removed mountConfig.fd. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-05-02 18:54:38 -07:00
Kir Kolyshkin	976748e8d6	libct: add mountViaFDs, simplify mount 1. Simplify mount call by removing the procfd argument, and use the new mount() where procfd is not used. Now, the mount() arguments are the same as for unix.Mount. 2. Introduce a new mountViaFDs function, which is similar to the old mount(), except it can take procfd for both source and target. The new arguments are called srcFD and dstFD. 3. Modify the mount error to show both srcFD and dstFD so it's clear which one is used for which purpose. This fixes the issue of having somewhat cryptic errors like this: > mount /proc/self/fd/11:/sys/fs/cgroup/systemd (via /proc/self/fd/12), flags: 0x20502f: operation not permitted (in which fd 11 is actually the source, and fd 12 is the target). After this change, it looks like > mount src=/proc/self/fd/11, dst=/sys/fs/cgroup/systemd, dstFD=/proc/self/fd/12, flags=0x20502f: operation not permitted so it's clear that 12 is a destination fd. 4. Fix the mountViaFDs callers to use dstFD (rather than procfd) for the variable name. 5. Use srcFD where mountFd is set. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-05-02 18:41:09 -07:00
Qiang Huang	0d62b950e6	Merge pull request from GHSA-m8cg-xc2p-r3fc rootless: fix /sys/fs/cgroup mounts	2023-03-29 14:18:15 +08:00
Kir Kolyshkin	da98076c97	mountToRootfs: minor refactor The setRecAttr is only called for "bind" case, as cases end with a return statement. Indeed, recursive mount attributes only make sense for bind mounts. Move the code to under case "bind" to improve readability. No change in logic. Fixes: `382eba4354` Reported-by: Sebastiaan van Stijn <github@gone.nl> Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-03-27 12:47:05 -07:00
Kir Kolyshkin	0d72adf96d	Prohibit /proc and /sys to be symlinks Commit `3291d66b98` introduced a check for /proc and /sys, making sure the destination (dest) is a directory (and not e.g. a symlink). Later, a hunk from commit `0ca91f44f` switched from using filepath.Join to SecureJoin for dest. As SecureJoin follows and resolves symlinks, the check whether dest is a symlink no longer works. To fix, do the check without/before using SecureJoin. Add integration tests to make sure we won't regress. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-03-17 11:03:44 -07:00
Akihiro Suda	df4eae457b	rootless: fix /sys/fs/cgroup mounts It was found that rootless runc makes `/sys/fs/cgroup` writable in the following conditions: 1. when runc is executed inside the user namespace, and the config.json does not specify the cgroup namespace to be unshared (e.g., `(docker|podman|nerdctl) run --cgroupns=host`, with Rootless Docker/Podman/nerdctl) 2. or, when runc is executed outside the user namespace, and `/sys` is mounted with `rbind, ro` (e.g., `runc spec --rootless`; this condition is very rare) A container may gain the write access to user-owned cgroup hierarchy `/sys/fs/cgroup/user.slice/...` on the host. Other users' cgroup hierarchies are not affected. To fix the issue, this commit does: 1. Remount `/sys/fs/cgroup` to apply `MS_RDONLY` when it is being bind-mounted 2. Mask `/sys/fs/cgroup` when the bind source is unavailable Fix CVE-2023-25809 (GHSA-m8cg-xc2p-r3fc) Co-authored-by: Kir Kolyshkin <kolyshkin@gmail.com> Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>	2023-03-14 14:16:25 +09:00
Kir Kolyshkin	d370e3c046	libct: fix mounting via wrong proc fd Due to a bug in commit `9c444070ec`, when the user and mount namespaces are used, and the bind mount is followed by the cgroup mount in the spec, the cgroup is mounted using the bind mount's mount fd. This can be reproduced with podman 4.1 (when configured to use runc): $ podman run --uidmap 0:100:10000 quay.io/libpod/testimage:20210610 mount Error: /home/kir/git/runc/runc: runc create failed: unable to start container process: error during container init: error mounting "cgroup" to rootfs at "/sys/fs/cgroup": mount /proc/self/fd/11:/sys/fs/cgroup/systemd (via /proc/self/fd/12), flags: 0x20502f: operation not permitted: OCI permission denied or manually with the spec mounts containing something like this: { "destination": "/etc/resolv.conf", "type": "bind", "source": "/userdata/resolv.conf", "options": [ "bind" ] }, { "destination": "/sys/fs/cgroup", "type": "cgroup", "source": "cgroup", "options": [ "rprivate", "nosuid", "noexec", "nodev", "relatime", "ro" ] } The issue was not found earlier since it requires using userns, and even then mount fd is ignored by mountToRootfs, except for bind mounts, and all the bind mounts have mountfd set, except for the case of cgroup v1's /sys/fs/cgroup which is internally transformed into a bunch of bind mounts. This is a minimal fix for the issue, suitable for backporting. A test case is added which reproduces the issue without the fix applied. Fixes: `9c444070ec` ("Open bind mount sources from the host userns") Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-06-16 11:54:42 -07:00
Irwin D'Souza	b76b6b9338	Allow mounting of /proc/sys/kernel/ns_last_pid The CAP_CHECKPOINT_RESTORE linux capability provides the ability to update /proc/sys/kernel/ns_last_pid. However, because this file is under /proc, and by default both K8s and CRI-O specify that /proc/sys should be mounted as Read-Only, by default even with the capability specified, a process will not be able to write to ns_last_pid. To get around this, a pod author can specify a volume mount and a hostpath to bind-mount /proc/sys/kernel/ns_last_pid. However, runc does not allow specifying mounts under /proc. This commit adds /proc/sys/kernel/ns_last_pid to the validProcMounts string array to enable a pod author to mount ns_last_pid as read-write. The default remains unchanged; unless explicitly requested as a volume mount, ns_last_pid will remain read-only regardless of whether or not CAP_CHECKPOINT_RESTORE is specified. Signed-off-by: Irwin D'Souza <dsouzai.gh@gmail.com>	2022-04-07 14:08:59 -04:00
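The allowlist idea behind b76b6b9338 can be sketched as a simple membership test. The list below is illustrative and deliberately short: only `/proc/sys/kernel/ns_last_pid` comes from the commit message, the other entry and all names in the code are assumptions made for this example.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// allowedProcMounts is an illustrative allowlist of paths under /proc that a
// config may mount over. Only the ns_last_pid entry is taken from the commit
// message; the rest of the real list is not reproduced here.
var allowedProcMounts = []string{
	"/proc/cpuinfo", // assumed example entry
	"/proc/sys/kernel/ns_last_pid",
}

// checkProcMount rejects destinations inside /proc unless they are on the
// allowlist. Destinations outside /proc (and /proc itself, which a separate
// check would handle) are not judged by this function.
func checkProcMount(dest string) error {
	dest = filepath.Clean(dest)
	if !strings.HasPrefix(dest, "/proc/") {
		return nil
	}
	for _, allowed := range allowedProcMounts {
		if dest == allowed {
			return nil
		}
	}
	return fmt.Errorf("%q cannot be mounted because it is inside /proc", dest)
}

func main() {
	fmt.Println(checkProcMount("/proc/sys/kernel/ns_last_pid"))
	fmt.Println(checkProcMount("/proc/kcore"))
}
```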
Kir Kolyshkin	0fec1c2d8c	libct: Mount: rm {Pre,Post}mountCmds Those were added by commit `59c5c3ac0` back in Apr 2015, but AFAICS were never used and are obsoleted by more generic container hooks (initially added by commit `05567f2c94` in Sep 2015). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-01-26 15:51:55 -08:00
Akihiro Suda	382eba4354	Support recursive mount attrs ("rro", "rnosuid", "rnodev", ...) The new mount option "rro" makes the mount point recursively read-only, by calling `mount_setattr(2)` with `MOUNT_ATTR_RDONLY` and `AT_RECURSIVE`. https://man7.org/linux/man-pages/man2/mount_setattr.2.html Requires kernel >= 5.12. The "rro" option string conforms to the proposal in util-linux/util-linux Issue 1501. Fix issue 2823 Similarly, this commit also adds the following mount options: - rrw - r[no]{suid,dev,exec,relatime,atime,strictatime,diratime,symfollow} Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>	2021-12-07 17:39:57 +09:00
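Translating the recursive option strings from 382eba4354 into set and clear masks for mount_setattr(2) could be sketched as below. This covers only a subset of the listed options, uses local constants that mirror the MOUNT_ATTR_* values, and assumes that a later option overrides an earlier conflicting one; the type and function names are hypothetical.

```go
package main

import "fmt"

// Local copies of a few MOUNT_ATTR_* bits from the Linux mount API.
const (
	attrRdonly uint64 = 0x1 // MOUNT_ATTR_RDONLY
	attrNosuid uint64 = 0x2 // MOUNT_ATTR_NOSUID
	attrNodev  uint64 = 0x4 // MOUNT_ATTR_NODEV
	attrNoexec uint64 = 0x8 // MOUNT_ATTR_NOEXEC
)

// recAttr holds the masks that would be passed as attr_set and attr_clr.
type recAttr struct{ set, clr uint64 }

// recOptions maps a subset of the recursive option strings to their effect.
var recOptions = map[string]recAttr{
	"rro":     {set: attrRdonly},
	"rrw":     {clr: attrRdonly},
	"rnosuid": {set: attrNosuid},
	"rsuid":   {clr: attrNosuid},
	"rnodev":  {set: attrNodev},
	"rdev":    {clr: attrNodev},
	"rnoexec": {set: attrNoexec},
	"rexec":   {clr: attrNoexec},
}

// parseRecAttr splits opts into recursive attributes and everything else.
// Assumption of this sketch: a later option overrides an earlier one.
func parseRecAttr(opts []string) (recAttr, []string) {
	var out recAttr
	var rest []string
	for _, o := range opts {
		a, ok := recOptions[o]
		if !ok {
			rest = append(rest, o)
			continue
		}
		out.set = (out.set &^ a.clr) | a.set
		out.clr = (out.clr &^ a.set) | a.clr
	}
	return out, rest
}

func main() {
	attrs, rest := parseRecAttr([]string{"rbind", "rro", "rnosuid"})
	fmt.Printf("set=%#x clr=%#x rest=%v\n", attrs.set, attrs.clr, rest)
}
```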
Aleksa Sarai	19d696ec29	merge branch 'pr-3276' Kir Kolyshkin (2): runc run: fix ro /dev test/int/mount.bats: refer to github issue LGTMs: thaJeztah cyphar Closes #3276	2021-11-26 09:37:43 +11:00
Kir Kolyshkin	50105de1d8	Fix failure with rw bind mount of a ro fuse As reported in [1], in a case where read-only fuse (sshfs) mount is used as a volume without specifying ro flag, the kernel fails to remount it (when adding various flags such as nosuid and nodev), returning EPERM. Here's the relevant strace line: > [pid 333966] mount("/tmp/bats-run-PRVfWc/runc.RbNv8g/bundle/mnt", "/proc/self/fd/7", 0xc0001e9164, MS_NOSUID|MS_NODEV|MS_REMOUNT|MS_BIND|MS_REC, NULL) = -1 EPERM (Operation not permitted) I was not able to reproduce it with other read-only mounts as the source (tried tmpfs, read-only bind mount, and an ext2 mount), so somehow this might be specific to fuse. The fix is to check whether the source has RDONLY flag, and retry the remount with this flag added. A test case (which was kind of hard to write) is added, and it fails without the fix. Note that the rootless user needs to be able to ssh to rootless@localhost in order for sshfs to work -- amend setup scripts to make it work, and skip the test if the setup is not working. [1] https://github.com/containers/podman/issues/12205 Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-11-18 13:09:41 -08:00
Kir Kolyshkin	b247cd392a	runc run: fix ro /dev Commit `fb4c27c4b7` (went into v1.0.0-rc93) fixed a bug with read-only tmpfs, but introduced a bug with read-only /dev. This happens because /dev is a tmpfs mount and is therefore remounted read-only a bit earlier than before. To fix, 1. Revert the part of the above commit which remounts all tmpfs mounts as read-only in mountToRootfs. 2. Reuse finalizeRootfs (which is already used to remount /dev read-only) to also remount all ro tmpfs mounts that were previously mounted rw in mountPropagate. 3. Remove the break in finalizeRootfs, as now we have more than one mount to care about. 4. Reorder the if statements in finalizeRootfs to perform the fast check (for ro flag) first, and compare the strings second. Since /dev is most probably also a tmpfs mount, do the m.Device check first. Add a test case to validate the fix and prevent future regressions; make sure it fails before the fix: ✗ runc run [ro /dev mount] (in test file tests/integration/mounts.bats, line 45) `[ "$status" -eq 0 ]' failed runc spec (status=0): runc run test_busybox (status=1): time="2021-11-12T12:19:48-08:00" level=error msg="runc run failed: unable to start container process: error during container init: error mounting \"devpts\" to rootfs at \"/dev/pts\": mkdir /tmp/bats-run-VJXQk7/runc.0Fj70w/bundle/rootfs/dev/pts: read-only file system" Fixes: `fb4c27c4b7` Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-11-15 10:37:16 -08:00
Kir Kolyshkin	7563a8f06d	libct: wrap more unix errors When I tried to start a rootless container under a different/wrong user, I got: $ ../runc/runc --systemd-cgroup --root /tmp/runc.$$ run 445 ERRO[0000] runc run failed: operation not permitted This is obviously not good enough. With this commit, the error is: ERRO[0000] runc run failed: fchown fd 9: operation not permitted Alas, there is still some code that returns unwrapped errnos from various unix calls. This is a followup to commit `d8ba4128b2` which wrapped many, but not all, bare unix errors. Do wrap some more, using either os.PathError or os.SyscallError. While at it, - use os.SyscallError instead of os.NewSyscallError; - use errors.Is(err, os.ErrXxx) instead of os.IsXxx(err). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-11-12 00:33:59 -08:00
Akihiro Suda	4d17654479	Merge pull request #2576 from kinvolk/alban/userns-2484-take2 Open bind mount sources from the host userns	2021-10-28 14:50:33 +09:00
Kir Kolyshkin	5516294172	Remove io/ioutil use See https://golang.org/doc/go1.16#ioutil Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-10-14 13:46:02 -07:00
Alban Crequy	9c444070ec	Open bind mount sources from the host userns The source of the bind mount might not be accessible in a different user namespace because a component of the source path might not be traversed under the users and groups mapped inside the user namespace. This caused errors such as the following: # time="2020-06-22T13:48:26Z" level=error msg="container_linux.go:367: starting container process caused: process_linux.go:459: container init caused: rootfs_linux.go:58: mounting \"/tmp/busyboxtest/source-inaccessible/dir\" to rootfs at \"/tmp/inaccessible\" caused: stat /tmp/busyboxtest/source-inaccessible/dir: permission denied" To solve this problem, this patch performs the following: 1. in nsexec.c, it opens the source path in the host userns (so we have the right permissions to open it) but in the container mntns (so the kernel cross mntns mount check lets us mount it later: https://github.com/torvalds/linux/blob/v5.8/fs/namespace.c#L2312). 2. in nsexec.c, it passes the file descriptors of the source to the child process with SCM_RIGHTS. 3. In runc-init in Golang, it finishes the mounts while inside the userns even without access to some components of the source paths. Passing the fds with SCM_RIGHTS is necessary because once the child process is in the container mntns, it is already in the container userns so it cannot temporarily join the host mntns. This patch uses the existing mechanism with _LIBCONTAINER_* environment variables to pass the file descriptors from runc to runc init. This patch uses the existing mechanism with the Netlink-style bootstrap to pass information about the list of source mounts to nsexec.c. Rootless containers don't use this bind mount sources fdpassing mechanism because we can't setns() to the target mntns in a rootless container (we don't have the privileges when we are in the host userns). This patch takes care of using O_CLOEXEC on mount fds, and close them early. Fixes: #2484.
Signed-off-by: Alban Crequy <alban@kinvolk.io> Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io> Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>	2021-10-12 15:13:45 +02:00
Kir Kolyshkin	9ff64c3d97	*: rm redundant linux build tag For files that end with _linux.go or _linux_test.go, there is no need to specify linux build tag, as it is assumed from the file name. In addition, rename libcontainer/notify_linux_v2.go -> libcontainer/notify_v2_linux.go for the file name to make sense. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-08-30 20:15:00 -07:00
Aleksa Sarai	09b80811f6	Revert "libct/devices: change devices.Type to be a string" This reverts commit `814f3ae1d9`. This changed the on-disk state which breaks runc when it has to operate on containers started with an older runc version. Working around this is far more complicated than just reverting it. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2021-08-25 14:11:32 +10:00
Kir Kolyshkin	34df203d13	Merge pull request #3159 from thaJeztah/norunes libct/devices: change devices.Type to be a string	2021-08-23 16:58:34 -07:00
Kir Kolyshkin	75761bccf7	Fix codespell warnings, add codespell to ci The two exceptions I had to add to codespellrc are: - CLOS (used by intelrtd); - creat (syscall name used in tests/integration/testdata/seccomp_*.json). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-08-17 16:12:35 -07:00
Sebastiaan van Stijn	814f3ae1d9	libct/devices: change devices.Type to be a string Possibly there was a specific reason to use a rune for this, but I noticed that there's various parts in the code that has to convert values from a string to this type. Using a string as type for this can simplify some of that code. Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2021-08-13 00:55:22 +02:00
Akihiro Suda	5547b5774f	Merge pull request #3033 from kolyshkin/rm-own-errors libcontainer: rm own error system	2021-07-01 13:47:27 +09:00

197 Commits