zishuo/runc

mirror of https://github.com/opencontainers/runc.git synced 2025-10-07 08:21:01 +08:00

Author	SHA1	Message	Date
Rodrigo Campos	b0b186e64d	Merge pull request #4630 from kolyshkin/clean-path libc/utils: simplify CleanPath	2025-02-13 13:59:23 -03:00
Kir Kolyshkin	8db6ffbeef	libc/utils: simplify CleanPath This simplifies the code flow and basically removes the last filepath.Clean, which is not necessary in either case: - for absolute path, single filepath.Clean is enough (as it is guaranteed to remove all dot and dot-dot elements); - for relative path, filepath.Rel calls Clean at the end (which is even documented). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-02-12 20:17:51 -08:00
Kir Kolyshkin	055041e874	libct: use strings.CutPrefix where possible Using strings.CutPrefix (available since Go 1.20) instead of strings.HasPrefix and/or strings.TrimPrefix makes the code a tad more straightforward. No functional change. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-02-06 19:42:35 -08:00
Kir Kolyshkin	259b71c042	libct/utils: stripRoot: rm useless HasPrefix Using strings.HasPrefix with strings.TrimPrefix results in doing the same thing (checking if prefix exists) twice. In this case, using strings.TrimPrefix right away is sufficient. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-02-06 19:42:35 -08:00
Amir M. Ghazanfari	faffe1b9ee	replace strings.SplitN with strings.Cut Signed-off-by: Amir M. Ghazanfari <a.m.ghazanfari76@gmail.com>	2024-09-28 10:02:21 +03:30
Kir Kolyshkin	a31efe7045	libct/seccomp/patchbpf: use binary.NativeEndian It is available since Go 1.21 and is defined during compile time (i.e. based on GOARCH during build). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-09-11 22:06:58 -07:00
Aleksa Sarai	8e8b136c49	tree-wide: use /proc/thread-self for thread-local state With the idmap work, we will have a tainted Go thread in our thread-group that has a different mount namespace to the other threads. It seems that (due to some bad luck) the Go scheduler tends to make this thread the thread-group leader in our tests, which results in very baffling failures where /proc/self/mountinfo produces gibberish results. In order to avoid this, switch to using /proc/thread-self for everything that is thread-local. This primarily includes switching all file descriptor paths (CLONE_FS), all of the places that check the current cgroup (technically we never will run a single runc thread in a separate cgroup, but better to be safe than sorry), and the aforementioned mountinfo code. We don't need to do anything for the following because the results we need aren't thread-local: * Checks that certain namespaces are supported by stat(2)ing /proc/self/ns/... * /proc/self/exe and /proc/self/cmdline are not thread-local. * While threads can be in different cgroups, we do not do this for the runc binary (or libcontainer) and thus we do not need to switch to the thread-local version of /proc/self/cgroups. * All of the CLONE_NEWUSER files are not thread-local because you cannot set the usernamespace of a single thread (setns(CLONE_NEWUSER) is blocked for multi-threaded programs). Note that we have to use runtime.LockOSThread when we have an open handle to a tid-specific procfs file that we are operating on multiple times. Go can reschedule us such that we are running on a different thread and then kill the original thread (causing -ENOENT or similarly confusing errors). This is not strictly necessary for most usages of /proc/thread-self (such as using /proc/thread-self/fd/$n directly) since only operating on the actual inodes associated with the tid requires this locking, but because of the pre-3.17 fallback for CentOS, we have to do this in most cases. 
In addition, CentOS's kernel is too old for /proc/thread-self, which requires us to emulate it -- however in rootfs_linux.go, we are in the container pid namespace but /proc is the host's procfs. This leads to the incredibly frustrating situation where there is no way (on pre-4.1 Linux) to figure out which /proc/self/task/... entry refers to the current tid. We can just use /proc/self in this case. Yes this is all pretty ugly. I also wish it wasn't necessary. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-12-14 11:36:41 +11:00
Aleksa Sarai	8da42aaec2	sync: split init config (stream) and synchronisation (seqpacket) pipes We have different requirements for the initial configuration and initWaiter pipe (just send netlink and JSON blobs with no complicated handling needed for message coalescing) and the packet-based synchronisation pipe. Tests with switching everything to SOCK_SEQPACKET led to endless issues with runc hanging on start-up because random things would try to do short reads (which SOCK_SEQPACKET will not allow and the Go stdlib explicitly treats as a streaming source), so splitting it was the only reasonable solution. Even doing somewhat dodgy tricks such as adding a Read() wrapper which actually calls ReadPacket() and makes it seem like a stream source doesn't work -- and is a bit too magical. One upside is that doing it this way makes the difference between the modes clearer -- INITPIPE is still used for initWaiter synchronisation but aside from that all other synchronisation is done by SYNCPIPE. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-09-24 20:31:14 +08:00
Kir Kolyshkin	3d86d31b9f	libct/utils: SearchLabels: optimize Using strings.Split generates temporary strings for GC to collect. Rewrite the function to not do that. Also, add a second return value, so that the caller can distinguish between an empty value found and no key found cases. Fix the test accordingly. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-01-26 14:01:11 -08:00
Kir Kolyshkin	b950b778c2	libct/utils: ResolveRootfs: remove Since commit `8850636eb3` (February 2015) this function is no longer used (replaced by (*ConfigValidator).rootfs), so let's remove it, together with its unit tests (which were added by commit `917c1f6d6` in April 2016). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-11-29 19:21:26 -08:00
Kir Kolyshkin	7be93a66b9	*: fmt.Errorf: use %w when appropriate This should result in no change when the error is printed, but make the errors returned unwrappable, meaning errors.As and errors.Is will work. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-06-22 16:09:47 -07:00
Aleksa Sarai	0ca91f44f1	rootfs: add mount destination validation Because the target of a mount is inside a container (which may be a volume that is shared with another container), there exists a race condition where the target of the mount may change to a path containing a symlink after we have sanitised the path -- resulting in us inadvertently mounting the path outside of the container. This is not immediately useful because we are in a mount namespace with MS_SLAVE mount propagation applied to "/", so we cannot mount on top of host paths in the host namespace. However, if any subsequent mountpoints in the configuration use a subdirectory of that host path as a source, those subsequent mounts will use an attacker-controlled source path (resolved within the host rootfs) -- allowing the bind-mounting of "/" into the container. While arguably configuration issues like this are not entirely within runc's threat model, within the context of Kubernetes (and possibly other container managers that provide semi-arbitrary container creation privileges to untrusted users) this is a legitimate issue. Since we cannot block mounting from the host into the container, we need to block the first stage of this attack (mounting onto a path outside the container). The long-term plan to solve this would be to migrate to libpathrs, but as a stop-gap we implement libpathrs-like path verification through readlink(/proc/self/fd/$n) and then do mount operations through the procfd once it's been verified to be inside the container. The target could move after we've checked it, but if it is inside the container then we can assume that it is safe for the same reason that libpathrs operations would be safe. A slight wrinkle is the "copyup" functionality we provide for tmpfs, which is the only case where we want to do a mount on the host filesystem. To facilitate this, I split out the copy-up functionality entirely so that the logic isn't interspersed with the regular tmpfs logic. 
In addition, all dependencies on m.Destination being overwritten have been removed since that pattern was just begging to be a source of more mount-target bugs (we do still have to modify m.Destination for tmpfs-copyup but we only do it temporarily). Fixes: CVE-2021-30465 Reported-by: Etienne Champetier <champetier.etienne@gmail.com> Co-authored-by: Noah Meyerhans <nmeyerha@amazon.com> Reviewed-by: Samuel Karp <skarp@amazon.com> Reviewed-by: Kir Kolyshkin <kolyshkin@gmail.com> (@kolyshkin) Reviewed-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2021-05-19 16:58:35 +10:00
Aleksa Sarai	7a8d7162f9	seccomp: prepend -ENOSYS stub to all filters Having -EPERM as the default was a fairly significant mistake from a future-proofing standpoint in that it makes any new syscall return a non-ignorable error (from glibc's point of view). We need to correct this now because faccessat2(2) is something glibc critically needs to have support for, but they're blocked on container runtimes because we return -EPERM unconditionally (leading to confusion in glibc). This is also a problem we're probably going to keep running into in the future. Unfortunately there are several issues which stop us from having a clean solution to this problem: 1. libseccomp has several limitations which require us to emulate behaviour we want: a. We cannot do logic based on syscall number, meaning we cannot specify a "largest known syscall number"; b. libseccomp doesn't know in which kernel version a syscall was added, and has no API for "minimum kernel version" so we cannot simply ask libseccomp to generate sane -ENOSYS rules for us. c. Additional seccomp rules for the same syscall are not treated as distinct rules -- if rules overlap, seccomp will merge them. This means we cannot add per-syscall -EPERM fallbacks; d. There is no inverse operation for SCMP_CMP_MASKED_EQ; e. libseccomp does not allow you to specify multiple rules for a single argument, making it impossible to invert OR rules for arguments. 2. The runtime-spec does not have any way of specifying: a. The errno for the default action; b. The minimum kernel version or "newest syscall at time of profile creation"; nor c. Which syscalls were intentionally excluded from the allow list (weird syscalls that are no longer used were excluded entirely, but Docker et al expect those syscalls to get EPERM not ENOSYS). 3. Certain syscalls should not return -ENOSYS (especially only for certain argument combinations) because this could also trigger glibc confusion. 
This means we have to return -EPERM for certain syscalls but not as a global default. 4. There is not an obvious (and reasonable) upper limit to syscall numbers, so we cannot create a set of rules for each syscall above the largest syscall number in libseccomp. This means we must handle inverse rules as described below. 5. Any syscall can be specified multiple times, which can make generation of hotfix rules much harder. As a result, we have to work around all of these things by coming up with a heuristic to stop the bleeding. In the future we could hopefully improve the situation in the runtime-spec and libseccomp. The solution applied here is to prepend a "stub" filter which returns -ENOSYS if the requested syscall has a larger syscall number than any syscall mentioned in the filter. The reason for this specific rule is that syscall numbers are (roughly) allocated sequentially and thus newer syscalls will (usually) have a larger syscall number -- thus causing our filters to produce -ENOSYS if the filter was written before the syscall existed. Sadly this is not a perfect solution because syscalls can be added out-of-order and the syscall table can contain holes for several releases. Unfortunately we do not have a nicer solution at the moment because there is no library which provides information about which Linux version a syscall was introduced in. Until that exists, this workaround will have to be good enough. The above behaviour only happens if the default action is a blocking action (in other words it is not SCMP_ACT_LOG or SCMP_ACT_ALLOW). If the default action is permissive then we don't do any patching. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2021-01-28 23:11:22 +11:00
Shengjing Zhu	f4d153b086	Fix int overflow in test on 32 bit system Signed-off-by: Shengjing Zhu <zhsj@debian.org>	2021-01-24 16:37:32 +08:00
Mrunal Patel	fe3d5c4c6e	Remove unused veth setup code Networking is set up by plugins for users of runc so it makes sense to get rid of the veth strategy. Signed-off-by: Mrunal Patel <mrunalp@gmail.com>	2018-08-24 15:41:52 -07:00
Christy Perez	3d7cb4293c	Move libcontainer to x/sys/unix Since syscall is outdated and broken for some architectures, use x/sys/unix instead. There are still some dependencies on the syscall package that will remain in syscall for the foreseeable future: Errno Signal SysProcAttr Additionally: - os still uses syscall, so it needs to be kept for anything returning *os.ProcessState, such as process.Wait. Signed-off-by: Christy Perez <christy@linux.vnet.ibm.com>	2017-05-22 17:35:20 -05:00
Xianglin Gao	9df4847a23	tiny fix Signed-off-by: Xianglin Gao <xlgao@zju.edu.cn>	2016-10-11 16:32:56 +08:00
Qiang Huang	dc0a4cf488	Fix TestGetAdditionalGroups on i686 Fixes: #941 Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>	2016-09-27 18:25:53 +08:00
Michael Crosby	5abffd3100	Add annotations to list and state output Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2016-06-02 12:44:43 -07:00
George Lestaris	f7ae27bfb7	HookState adheres to OCI Signed-off-by: George Lestaris <glestaris@pivotal.io> Signed-off-by: Ed King <eking@pivotal.io>	2016-04-06 16:57:59 +01:00
Aleksa Sarai	b8dc5213e8	libcontainer: cgroups: fs: fix path safety Ensure that path safety is maintained, this essentially reapplies `c0cad6aa5e` ("cgroups: fs: fix cgroup.Parent path sanitisation"), which was accidentally removed in `256f3a8ebc` ("Add support for CgroupsPath field"). Signed-off-by: Aleksa Sarai <asarai@suse.com>	2016-02-14 00:37:21 +11:00
Kenfe-Mickael Laventure	dceeb0d0df	Move pathClean to libcontainer/utils.CleanPath Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>	2016-02-09 16:21:58 -08:00
Michael Crosby	ddcee3cc2a	Do not use stream encoders Marshal the raw objects for the sync pipes so that no newline characters are left behind in the pipe causing errors. Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2016-01-26 11:22:05 -08:00
rajasec	58e3cde8f3	Fixing typo in the comment for exit Signed-off-by: rajasec <rajasec79@gmail.com>	2015-10-22 19:08:03 +05:30
John Howard	9f80f3f181	Windows: Factor out CloseExecFrom Signed-off-by: John Howard <jhoward@microsoft.com>	2015-06-26 20:13:17 -07:00
Michael Crosby	8f97d39dd2	Move libcontainer into subdirectory Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2015-06-21 19:29:15 -07:00
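
Commits 055041e874, 259b71c042 and faffe1b9ee in the list above describe swapping strings.HasPrefix/strings.TrimPrefix for strings.CutPrefix and strings.SplitN for strings.Cut. The following is a minimal sketch of those two patterns, not the code from runc itself; the helper names (stripPrefixOld, stripPrefixNew, splitKeyValue) and the sample inputs are hypothetical and exist only for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// stripPrefixOld shows the older two-step pattern: test for the prefix,
// then trim it. The prefix is effectively checked twice.
// (Hypothetical helper, for illustration only.)
func stripPrefixOld(s, prefix string) (string, bool) {
	if strings.HasPrefix(s, prefix) {
		return strings.TrimPrefix(s, prefix), true
	}
	return s, false
}

// stripPrefixNew does the same work in one call using strings.CutPrefix,
// which is available since Go 1.20.
func stripPrefixNew(s, prefix string) (string, bool) {
	return strings.CutPrefix(s, prefix)
}

// splitKeyValue splits at the first "=" using strings.Cut, which can
// stand in for strings.SplitN(s, "=", 2) plus a length check.
func splitKeyValue(s string) (key, value string, found bool) {
	return strings.Cut(s, "=")
}

func main() {
	rest, ok := stripPrefixNew("/rootfs/etc/hosts", "/rootfs")
	fmt.Println(rest, ok)

	k, v, found := splitKeyValue("bundle=/run/ctr")
	fmt.Println(k, v, found)
}
```

Both prefix helpers return the input unchanged with a false flag when the prefix is absent, so in this sketch the one-call form behaves the same as the two-step form.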

26 Commits