zishuo/runc

mirror of https://github.com/opencontainers/runc.git synced 2025-10-04 23:23:09 +08:00

Author	SHA1	Message	Date
Aleksa Sarai	dd827f7b71	utils: switch to securejoin.MkdirAllHandle filepath-securejoin has a bunch of extra hardening features and is very well-tested, so we should use it instead of our own homebrew solution. A lot of rootfs_linux.go callers pass a SecureJoin'd path, which means we need to keep the wrapper helpers in utils, but at least the core logic is no longer in runc. In future we will want to remove this dodgy logic and just use file handles for everything (using libpathrs, ideally). Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2024-09-03 23:06:47 +10:00
Aleksa Sarai	63c2908164	rootfs: try to scope MkdirAll to stay inside the rootfs While we use SecureJoin to try to make all of our target paths inside the container safe, SecureJoin is not safe against an attacker than can change the path after we "resolve" it. os.MkdirAll can inadvertently follow symlinks and thus an attacker could end up tricking runc into creating empty directories on the host (note that the container doesn't get access to these directories, and the host just sees empty directories). However, this could potentially cause DoS issues by (for instance) creating a directory in a conf.d directory for a daemon that doesn't handle subdirectories properly. In addition, the handling for creating file bind-mounts did a plain open(O_CREAT) on the SecureJoin'd path, which is even more obviously unsafe (luckily we didn't use O_TRUNC, or this bug could've allowed an attacker to cause data loss...). Regardless of the symlink issue, opening an untrusted file could result in a DoS if the file is a hung tty or some other "nasty" file. We can use mknodat to safely create a regular file without opening anything anyway (O_CREAT\|O_EXCL would also work but it makes the logic a bit more complicated, and we don't want to open the file for any particular reason anyway). libpathrs[1] is the long-term solution for these kinds of problems, but for now we can patch this particular issue by creating a more restricted MkdirAll that refuses to resolve symlinks and does the creation using file descriptors. This is loosely based on a more secure version that filepath-securejoin now has[2] and will be added to libpathrs soon[3]. [1]: https://github.com/openSUSE/libpathrs [2]: https://github.com/cyphar/filepath-securejoin/releases/tag/v0.3.0 [3]: https://github.com/openSUSE/libpathrs/issues/10 Fixes: CVE-2024-45310 Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2024-09-03 02:34:13 +10:00
Aleksa Sarai	1410a6988d	rootfs: consolidate mountpoint creation logic The logic for how we create mountpoints is spread over each mountpoint preparation function, when in reality the behaviour is pretty uniform with only a handful of exceptions. So just move it all to one function that is easier to understand. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2024-07-25 14:16:05 +10:00
Sebastiaan van Stijn	c14213399a	remove pre-go1.17 build-tags Removed pre-go1.17 build-tags with go fix; go fix -mod=readonly ./... Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2024-06-29 15:45:25 +02:00
Aleksa Sarai	02120488a4	Merge pull request from GHSA-xr7r-f8xq-vfvv fix GHSA-xr7r-f8xq-vfvv and harden fd leaks	2024-02-01 07:04:29 +11:00
Aleksa Sarai	d8edada9f2	init: don't special-case logrus fds We close the logfd before execve so there's no need to special case it. In addition, it turns out that (*os.File).Fd() doesn't handle the case where the file was closed and so it seems suspect to use that kind of check. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2024-01-24 00:20:59 +11:00
Aleksa Sarai	f2f16213e1	init: close internal fds before execve If we leak a file descriptor referencing the host filesystem, an attacker could use a /proc/self/fd magic-link as the source for execve to execute a host binary in the container. This would allow the binary itself (or a process inside the container in the 'runc exec' case) to write to a host binary, leading to a container escape. The simple solution is to make sure we close all file descriptors immediately before the execve(2) step. Doing this earlier can lead to very serious issues in Go (as file descriptors can be reused, any (*os.File) reference could start silently operating on a different file) so we have to do it as late as possible. Unfortunately, there are some Go runtime file descriptors that we must not close (otherwise the Go scheduler panics randomly). The only way of being sure which file descriptors cannot be closed is to sneakily go:linkname the runtime internal "internal/poll.IsPollDescriptor" function. This is almost certainly not recommended but there isn't any other way to be absolutely sure, while also closing any other possible files. In addition, we can keep the logrus forwarding logfd open because you cannot execve a pipe and the contents of the pipe are so restricted (JSON-encoded in a format we pick) that it seems unlikely you could even construct shellcode. Closing the logfd causes issues if there is an error returned from execve. In mainline runc, runc-dmz protects us against this attack because the intermediate execve(2) closes all of the O_CLOEXEC internal runc file descriptors and thus runc-dmz cannot access them to attack the host. Fixes: GHSA-xr7r-f8xq-vfvv CVE-2024-21626 Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2024-01-24 00:20:58 +11:00
Aleksa Sarai	7094efb192	init: use *os.File for passed file descriptors While it doesn't make much of a practical difference, it seems far more reasonable to use os.NewFile to wrap all of our passed file descriptors to make sure they're tracked by the Go runtime and that we don't double-close them. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2024-01-22 17:34:14 +11:00
Aleksa Sarai	8e8b136c49	tree-wide: use /proc/thread-self for thread-local state With the idmap work, we will have a tainted Go thread in our thread-group that has a different mount namespace to the other threads. It seems that (due to some bad luck) the Go scheduler tends to make this thread the thread-group leader in our tests, which results in very baffling failures where /proc/self/mountinfo produces gibberish results. In order to avoid this, switch to using /proc/thread-self for everything that is thread-local. This primarily includes switching all file descriptor paths (CLONE_FS), all of the places that check the current cgroup (technically we never will run a single runc thread in a separate cgroup, but better to be safe than sorry), and the aforementioned mountinfo code. We don't need to do anything for the following because the results we need aren't thread-local: * Checks that certain namespaces are supported by stat(2)ing /proc/self/ns/... * /proc/self/exe and /proc/self/cmdline are not thread-local. * While threads can be in different cgroups, we do not do this for the runc binary (or libcontainer) and thus we do not need to switch to the thread-local version of /proc/self/cgroups. * All of the CLONE_NEWUSER files are not thread-local because you cannot set the usernamespace of a single thread (setns(CLONE_NEWUSER) is blocked for multi-threaded programs). Note that we have to use runtime.LockOSThread when we have an open handle to a tid-specific procfs file that we are operating on multiple times. Go can reschedule us such that we are running on a different thread and then kill the original thread (causing -ENOENT or similarly confusing errors). This is not strictly necessary for most usages of /proc/thread-self (such as using /proc/thread-self/fd/$n directly) since only operating on the actual inodes associated with the tid requires this locking, but because of the pre-3.17 fallback for CentOS, we have to do this in most cases. In addition, CentOS's kernel is too old for /proc/thread-self, which requires us to emulate it -- however in rootfs_linux.go, we are in the container pid namespace but /proc is the host's procfs. This leads to the incredibly frustrating situation where there is no way (on pre-4.1 Linux) to figure out which /proc/self/task/... entry refers to the current tid. We can just use /proc/self in this case. Yes this is all pretty ugly. I also wish it wasn't necessary. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-12-14 11:36:41 +11:00
Aleksa Sarai	8da42aaec2	sync: split init config (stream) and synchronisation (seqpacket) pipes We have different requirements for the initial configuration and initWaiter pipe (just send netlink and JSON blobs with no complicated handling needed for message coalescing) and the packet-based synchronisation pipe. Tests with switching everything to SOCK_SEQPACKET lead to endless issues with runc hanging on start-up because random things would try to do short reads (which SOCK_SEQPACKET will not allow and the Go stdlib explicitly treats as a streaming source), so splitting it was the only reasonable solution. Even doing somewhat dodgy tricks such as adding a Read() wrapper which actually calls ReadPacket() and makes it seem like a stream source doesn't work -- and is a bit too magical. One upside is that doing it this way makes the difference between the modes clearer -- INITPIPE is still used for initWaiter syncrhonisation but aside from that all other synchronisation is done by SYNCPIPE. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-09-24 20:31:14 +08:00
Aleksa Sarai	f81ef1493d	libcontainer: sync: cleanup synchronisation code This includes quite a few cleanups and improvements to the way we do synchronisation. The core behaviour is unchanged, but switching to embedding json.RawMessage into the synchronisation structure will allow us to do more complicated synchronisation operations in future patches. The file descriptor passing through the synchronisation system feature will be used as part of the idmapped-mount and bind-mount-source features when switching that code to use the new mount API outside of nsexec.c. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-08-15 19:54:24 -07:00
Aleksa Sarai	70f4e46e68	utils: use close_range(2) to close leftover file descriptors close_range(2) is far more efficient than a readdir of /proc/self/fd and then doing a syscall for each file descriptor. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-08-01 15:37:58 +10:00
Kir Kolyshkin	5516294172	Remove io/ioutil use See https://golang.org/doc/go1.16#ioutil Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-10-14 13:46:02 -07:00
Kir Kolyshkin	d8da00355e	*: add go-1.17+ go:build tags Go 1.17 introduce this new (and better) way to specify build tags. For more info, see https://golang.org/design/draft-gobuild. As a way to seamlessly switch from old to new build tags, gofmt (and gopls) from go 1.17 adds the new tags along with the old ones. Later, when go < 1.17 is no longer supported, the old build tags can be removed. Now, as I started to use latest gopls (v0.7.1), it adds these tags while I edit. Rather than to randomly add new build tags, I guess it is better to do it once for all files. Mind that previous commits removed some tags that were useless, so this one only touches packages that can at least be built on non-linux. Brought to you by go1.17 fmt ./... Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-08-30 20:58:22 -07:00
Kir Kolyshkin	7be93a66b9	*: fmt.Errorf: use %w when appropriate This should result in no change when the error is printed, but make the errors returned unwrappable, meaning errors.As and errors.Is will work. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-06-22 16:09:47 -07:00
Aleksa Sarai	d463f6485b	*: verify that operations on /proc/... are on procfs This is an additional mitigation for CVE-2019-16884. The primary problem is that Docker can be coerced into bind-mounting a file system on top of /proc (resulting in label-related writes to /proc no longer happening). While we are working on mitigations against permitting the mounts, this helps avoid our code from being tricked into writing to non-procfs files. This is not a perfect solution (after all, there might be a bind-mount of a different procfs file over the target) but in order to exploit that you would need to be able to tweak a config.json pretty specifically (which thankfully Docker doesn't allow). Specifically this stops AppArmor from not labeling a process silently due to /proc/self/attr/... being incorrectly set, and stops any accidental fd leaks because /proc/self/fd/... is not real. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2019-09-30 09:06:48 +10:00
Christy Perez	3d7cb4293c	Move libcontainer to x/sys/unix Since syscall is outdated and broken for some architectures, use x/sys/unix instead. There are still some dependencies on the syscall package that will remain in syscall for the forseeable future: Errno Signal SysProcAttr Additionally: - os still uses syscall, so it needs to be kept for anything returning *os.ProcessState, such as process.Wait. Signed-off-by: Christy Perez <christy@linux.vnet.ibm.com>	2017-05-22 17:35:20 -05:00
Michael Crosby	00a0ecf554	Add separate console socket Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2017-03-16 10:23:59 -07:00
John Howard	9f80f3f181	Windows: Factor out CloseExecFrom Signed-off-by: John Howard <jhoward@microsoft.com>	2015-06-26 20:13:17 -07:00

19 Commits