zishuo/runc

mirror of https://github.com/opencontainers/runc.git synced 2025-10-05 15:37:02 +08:00

Author	SHA1	Message	Date
Aleksa Sarai	8e8b136c49	tree-wide: use /proc/thread-self for thread-local state With the idmap work, we will have a tainted Go thread in our thread-group that has a different mount namespace to the other threads. It seems that (due to some bad luck) the Go scheduler tends to make this thread the thread-group leader in our tests, which results in very baffling failures where /proc/self/mountinfo produces gibberish results. In order to avoid this, switch to using /proc/thread-self for everything that is thread-local. This primarily includes switching all file descriptor paths (CLONE_FS), all of the places that check the current cgroup (technically we never will run a single runc thread in a separate cgroup, but better to be safe than sorry), and the aforementioned mountinfo code. We don't need to do anything for the following because the results we need aren't thread-local: * Checks that certain namespaces are supported by stat(2)ing /proc/self/ns/... * /proc/self/exe and /proc/self/cmdline are not thread-local. * While threads can be in different cgroups, we do not do this for the runc binary (or libcontainer) and thus we do not need to switch to the thread-local version of /proc/self/cgroups. * All of the CLONE_NEWUSER files are not thread-local because you cannot set the usernamespace of a single thread (setns(CLONE_NEWUSER) is blocked for multi-threaded programs). Note that we have to use runtime.LockOSThread when we have an open handle to a tid-specific procfs file that we are operating on multiple times. Go can reschedule us such that we are running on a different thread and then kill the original thread (causing -ENOENT or similarly confusing errors). This is not strictly necessary for most usages of /proc/thread-self (such as using /proc/thread-self/fd/$n directly) since only operating on the actual inodes associated with the tid requires this locking, but because of the pre-3.17 fallback for CentOS, we have to do this in most cases. In addition, CentOS's kernel is too old for /proc/thread-self, which requires us to emulate it -- however in rootfs_linux.go, we are in the container pid namespace but /proc is the host's procfs. This leads to the incredibly frustrating situation where there is no way (on pre-4.1 Linux) to figure out which /proc/self/task/... entry refers to the current tid. We can just use /proc/self in this case. Yes this is all pretty ugly. I also wish it wasn't necessary. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-12-14 11:36:41 +11:00
Aleksa Sarai	8da42aaec2	sync: split init config (stream) and synchronisation (seqpacket) pipes We have different requirements for the initial configuration and initWaiter pipe (just send netlink and JSON blobs with no complicated handling needed for message coalescing) and the packet-based synchronisation pipe. Tests with switching everything to SOCK_SEQPACKET lead to endless issues with runc hanging on start-up because random things would try to do short reads (which SOCK_SEQPACKET will not allow and the Go stdlib explicitly treats as a streaming source), so splitting it was the only reasonable solution. Even doing somewhat dodgy tricks such as adding a Read() wrapper which actually calls ReadPacket() and makes it seem like a stream source doesn't work -- and is a bit too magical. One upside is that doing it this way makes the difference between the modes clearer -- INITPIPE is still used for initWaiter syncrhonisation but aside from that all other synchronisation is done by SYNCPIPE. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-09-24 20:31:14 +08:00
Aleksa Sarai	f81ef1493d	libcontainer: sync: cleanup synchronisation code This includes quite a few cleanups and improvements to the way we do synchronisation. The core behaviour is unchanged, but switching to embedding json.RawMessage into the synchronisation structure will allow us to do more complicated synchronisation operations in future patches. The file descriptor passing through the synchronisation system feature will be used as part of the idmapped-mount and bind-mount-source features when switching that code to use the new mount API outside of nsexec.c. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-08-15 19:54:24 -07:00
Aleksa Sarai	70f4e46e68	utils: use close_range(2) to close leftover file descriptors close_range(2) is far more efficient than a readdir of /proc/self/fd and then doing a syscall for each file descriptor. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-08-01 15:37:58 +10:00
Kir Kolyshkin	5516294172	Remove io/ioutil use See https://golang.org/doc/go1.16#ioutil Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-10-14 13:46:02 -07:00
Kir Kolyshkin	d8da00355e	*: add go-1.17+ go:build tags Go 1.17 introduce this new (and better) way to specify build tags. For more info, see https://golang.org/design/draft-gobuild. As a way to seamlessly switch from old to new build tags, gofmt (and gopls) from go 1.17 adds the new tags along with the old ones. Later, when go < 1.17 is no longer supported, the old build tags can be removed. Now, as I started to use latest gopls (v0.7.1), it adds these tags while I edit. Rather than to randomly add new build tags, I guess it is better to do it once for all files. Mind that previous commits removed some tags that were useless, so this one only touches packages that can at least be built on non-linux. Brought to you by go1.17 fmt ./... Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-08-30 20:58:22 -07:00
Kir Kolyshkin	7be93a66b9	*: fmt.Errorf: use %w when appropriate This should result in no change when the error is printed, but make the errors returned unwrappable, meaning errors.As and errors.Is will work. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-06-22 16:09:47 -07:00
Aleksa Sarai	d463f6485b	*: verify that operations on /proc/... are on procfs This is an additional mitigation for CVE-2019-16884. The primary problem is that Docker can be coerced into bind-mounting a file system on top of /proc (resulting in label-related writes to /proc no longer happening). While we are working on mitigations against permitting the mounts, this helps avoid our code from being tricked into writing to non-procfs files. This is not a perfect solution (after all, there might be a bind-mount of a different procfs file over the target) but in order to exploit that you would need to be able to tweak a config.json pretty specifically (which thankfully Docker doesn't allow). Specifically this stops AppArmor from not labeling a process silently due to /proc/self/attr/... being incorrectly set, and stops any accidental fd leaks because /proc/self/fd/... is not real. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2019-09-30 09:06:48 +10:00
Christy Perez	3d7cb4293c	Move libcontainer to x/sys/unix Since syscall is outdated and broken for some architectures, use x/sys/unix instead. There are still some dependencies on the syscall package that will remain in syscall for the forseeable future: Errno Signal SysProcAttr Additionally: - os still uses syscall, so it needs to be kept for anything returning *os.ProcessState, such as process.Wait. Signed-off-by: Christy Perez <christy@linux.vnet.ibm.com>	2017-05-22 17:35:20 -05:00
Michael Crosby	00a0ecf554	Add separate console socket Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2017-03-16 10:23:59 -07:00
John Howard	9f80f3f181	Windows: Factor out CloseExecFrom Signed-off-by: John Howard <jhoward@microsoft.com>	2015-06-26 20:13:17 -07:00

11 Commits