zishuo/runc

mirror of https://github.com/opencontainers/runc.git synced 2025-10-08 17:00:13 +08:00

Author	SHA1	Message	Date
Kir Kolyshkin	83350c24a9	libct/system: rm Fexecve This helper was added for runc-dmz in commit `dac417174`, but runc-dmz was later removed in commit `871057d`, which forgot to remove the helper. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-01-03 13:57:05 -08:00
Aleksa Sarai	dd827f7b71	utils: switch to securejoin.MkdirAllHandle filepath-securejoin has a bunch of extra hardening features and is very well-tested, so we should use it instead of our own homebrew solution. A lot of rootfs_linux.go callers pass a SecureJoin'd path, which means we need to keep the wrapper helpers in utils, but at least the core logic is no longer in runc. In future we will want to remove this dodgy logic and just use file handles for everything (using libpathrs, ideally). Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2024-09-03 23:06:47 +10:00
Aleksa Sarai	63c2908164	rootfs: try to scope MkdirAll to stay inside the rootfs While we use SecureJoin to try to make all of our target paths inside the container safe, SecureJoin is not safe against an attacker that can change the path after we "resolve" it. os.MkdirAll can inadvertently follow symlinks and thus an attacker could end up tricking runc into creating empty directories on the host (note that the container doesn't get access to these directories, and the host just sees empty directories). However, this could potentially cause DoS issues by (for instance) creating a directory in a conf.d directory for a daemon that doesn't handle subdirectories properly. In addition, the handling for creating file bind-mounts did a plain open(O_CREAT) on the SecureJoin'd path, which is even more obviously unsafe (luckily we didn't use O_TRUNC, or this bug could've allowed an attacker to cause data loss...). Regardless of the symlink issue, opening an untrusted file could result in a DoS if the file is a hung tty or some other "nasty" file. We can use mknodat to safely create a regular file without opening anything anyway (O_CREAT|O_EXCL would also work but it makes the logic a bit more complicated, and we don't want to open the file for any particular reason anyway). libpathrs[1] is the long-term solution for these kinds of problems, but for now we can patch this particular issue by creating a more restricted MkdirAll that refuses to resolve symlinks and does the creation using file descriptors. This is loosely based on a more secure version that filepath-securejoin now has[2] and will be added to libpathrs soon[3]. [1]: https://github.com/openSUSE/libpathrs [2]: https://github.com/cyphar/filepath-securejoin/releases/tag/v0.3.0 [3]: https://github.com/openSUSE/libpathrs/issues/10 Fixes: CVE-2024-45310 Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2024-09-03 02:34:13 +10:00
Sebastiaan van Stijn	c14213399a	remove pre-go1.17 build-tags Removed pre-go1.17 build-tags with go fix; go fix -mod=readonly ./... Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2024-06-29 15:45:25 +02:00
Kir Kolyshkin	584afc6756	libct/system: ClearRlimitNofileCache for go 1.23 Go 1.23 tightens access to internal symbols, and even puts runc into "hall of shame" for using an internal symbol (recently added by commit `da68c8e3`). So, while not impossible, it becomes harder to access those internal symbols, and it is a bad idea in general. Since Go 1.23 includes https://go.dev/cl/588076, we can clean the internal rlimit cache by setting the RLIMIT_NOFILE for ourselves, essentially disabling the rlimit cache. Once Go 1.22 is no longer supported, we will remove the go:linkname hack. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-06-01 13:02:29 -07:00
lifubang	a35f7d8093	fix comments for ClearRlimitNofileCache Signed-off-by: lifubang <lifubang@acmcoder.com>	2024-05-16 20:34:43 +08:00
ls-ggg	da68c8e37b	libct: clean cached rlimit nofile in go runtime As reported in issue #4195, newer versions (since 1.19) of the Go runtime cache rlimit-nofile. Before executing execve, the rlimit-nofile of the process is restored from the cache. In runc, this causes the rlimit-nofile set by the parent process for the container to become invalid. It can be solved by clearing the cache. Signed-off-by: ls-ggg <335814617@qq.com> (cherry picked from commit `f9f8abf310`) Signed-off-by: lifubang <lifubang@acmcoder.com>	2024-05-08 10:40:13 +00:00
Kir Kolyshkin	dbd0c3349f	libct/system: rm Execv This is not used since commit `dac41717`. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-05-07 14:09:22 -07:00
Zheao.Li	98511bb40e	linux: Support setting execution domain via linux personality carry #3126 Co-authored-by: Aditya R <arajan@redhat.com> Signed-off-by: Zheao.Li <me@manjusaka.me>	2023-10-27 19:33:37 +08:00
Aleksa Sarai	90c8d36afe	dmz: use sendfile(2) when cloning /proc/self/exe This results in a 5-20% speedup of dmz.CloneBinary(), depending on the machine. io.Copy: goos: linux goarch: amd64 pkg: github.com/opencontainers/runc/libcontainer/dmz cpu: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz BenchmarkCloneBinary BenchmarkCloneBinary-8 139 8075074 ns/op PASS ok github.com/opencontainers/runc/libcontainer/dmz 2.286s unix.Sendfile: goos: linux goarch: amd64 pkg: github.com/opencontainers/runc/libcontainer/dmz cpu: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz BenchmarkCloneBinary BenchmarkCloneBinary-8 192 6382121 ns/op PASS ok github.com/opencontainers/runc/libcontainer/dmz 2.415s Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-09-22 15:51:36 +10:00
lifubang	dac4171746	runc-dmz: reduce memfd binary cloning cost with small C binary The idea is to remove the need for cloning the entire runc binary by replacing the final execve() call of the container process with an execve() call to a clone of a small C binary which just does an execve() of its arguments. This provides similar protection against CVE-2019-5736 but without requiring a >10MB binary copy for each "runc init". When compiled with musl, runc-dmz is 13kB (though unfortunately with glibc, it is 1.1MB which is still quite large). It should be noted that there is still a window where the container processes could get access to the host runc binary, but because we set ourselves as non-dumpable the container would need CAP_SYS_PTRACE (which is not enabled by default in Docker) in order to get around the proc_fd_access_allowed() checks. In addition, since Linux 4.10[1] the kernel blocks access entirely for user namespaced containers in this scenario. For those cases we cannot use runc-dmz, but most containers won't have this issue. This new runc-dmz binary can be opted out of at compile time by setting the "runc_nodmz" buildtag, and at runtime by setting the RUNC_DMZ=legacy environment variable. In both cases, runc will fall back to the classic /proc/self/exe-based cloning trick. If /proc/self/exe is already a sealed memfd (namely if the user is using contrib/cmd/memfd-bind to create a persistent sealed memfd for runc), neither runc-dmz nor /proc/self/exe cloning will be used because they are not necessary. 
[1]: `bfedb58925` Co-authored-by: lifubang <lifubang@acmcoder.com> Signed-off-by: lifubang <lifubang@acmcoder.com> [cyphar: address various review nits] [cyphar: fix runc-dmz cross-compilation] [cyphar: embed runc-dmz into runc binary and clone in Go code] [cyphar: make runc-dmz optional, with fallback to /proc/self/exe cloning] [cyphar: do not use runc-dmz when the container has certain privs] Co-authored-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-09-22 15:38:19 +10:00
Aleksa Sarai	0e9a3358f8	nsexec: migrate memfd /proc/self/exe logic to Go code This allows us to reduce the amount of C code in runc quite substantially, as well as removing a whole execve(2) from the nsexec path because we no longer spawn "runc init" only to re-exec "runc init" after doing the clone. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-09-22 15:13:18 +10:00
Kir Kolyshkin	f62f0bdfbf	Remove nolint annotations for unix errno comparisons golangci-lint v1.54.2 comes with errorlint v1.4.4, which contains the fix [1] whitelisting all errno comparisons for errors coming from x/sys/unix. Thus, these annotations are no longer necessary. Hooray! [1] https://github.com/polyfloyd/go-errorlint/pull/47 Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-08-24 17:28:10 -07:00
Kir Kolyshkin	8491d33482	Fix runc run "permission denied" when rootless Since commit `957d97bcf4` was made to fix issue [7], a few things happened: - a similar functionality appeared in go 1.20 [1], so the issue mentioned in the comment (being removed) is no longer true; - a bug in runc was found [2], which also affects go [3]; - the bug was fixed in go 1.21 [4] and 1.20.2 [5]; - a similar fix was made to x/sys/unix.Faccessat [6]. The essence of [2] is, even if a (non-root) user that the container is run as does not have execute permission bit set for the executable, it should still work in case runc has the CAP_DAC_OVERRIDE capability set. To fix this [2] without reintroducing the older bug [7]: - drop own Eaccess implementation; - use the one from x/sys/unix for Go 1.19 (depends on [6]); - do not use anything when Go 1.20+ is used. NOTE it is virtually impossible to fix the bug [2] when Go 1.20 or Go 1.20.1 is used because of [3]. A test case is added by a separate commit. Fixes: #3715. [1] https://go-review.googlesource.com/c/go/+/414824 [2] https://github.com/opencontainers/runc/issues/3715 [3] https://go.dev/issue/58552 [4] https://go-review.googlesource.com/c/go/+/468735 [5] https://go-review.googlesource.com/c/go/+/469956 [6] https://go-review.googlesource.com/c/sys/+/468877 [7] https://github.com/opencontainers/runc/issues/3520 Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-03-27 15:15:48 -07:00
Kir Kolyshkin	957d97bcf4	Fix error from runc run on noexec fs When starting a new container, and the very last step of executing a user process fails (last lines of (*linuxStandardInit).Init), it is too late to print a proper error since both the log pipe and the init pipe are closed. This is partially mitigated by using exec.LookPath() which is supposed to say whether we will be able to execute or not. Alas, it fails to do so when the binary to be executed resides on a filesystem mounted with noexec flag. A workaround would be to use access(2) with X_OK flag. Alas, it is not working when runc itself is a setuid (or setgid) binary. In this case, faccessat2(2) with AT_EACCESS can be used, but it is only available since Linux v5.8. So, use faccessat2(2) with AT_EACCESS if available. If not, fall back to access(2) for non-setuid runc, and do nothing for setuid runc (as there is nothing we can do). Note that this check is in addition to whatever exec.LookPath does. Fixes https://github.com/opencontainers/runc/issues/3520 Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-07-01 10:02:42 -07:00
Kir Kolyshkin	db4ad6a7f1	libcontainer/system: rm Prlimit It is now available from golang.org/x/sys/unix (https://go-review.googlesource.com/c/sys/+/332029) Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-11-11 19:57:56 -08:00
Kir Kolyshkin	794cd66df8	libct/system: Exec: wrap the error If the container binary to be run is removed in between runc create and runc start, the latter spits the following error: > can't exec user process: no such file or directory This is a bit confusing since we don't see what file is missing. Wrap the unix.Exec error into os.PathError, like in many other cases, to provide some context. Remove the error wrapping from (*linuxStandardInit).Init as it is now redundant. With this patch, the error is now: > exec /bin/false: no such file or directory Reported-by: Daniel J Walsh <dwalsh@redhat.com> Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-10-07 11:09:08 -07:00
Kir Kolyshkin	d8da00355e	*: add go-1.17+ go:build tags Go 1.17 introduces this new (and better) way to specify build tags. For more info, see https://golang.org/design/draft-gobuild. As a way to seamlessly switch from old to new build tags, gofmt (and gopls) from go 1.17 adds the new tags along with the old ones. Later, when go < 1.17 is no longer supported, the old build tags can be removed. Now, as I started to use latest gopls (v0.7.1), it adds these tags while I edit. Rather than randomly adding new build tags, I guess it is better to do it once for all files. Mind that previous commits removed some tags that were useless, so this one only touches packages that can at least be built on non-linux. Brought to you by go1.17 fmt ./... Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-08-30 20:58:22 -07:00
Maksim An	e39ad65059	retry unix.EINTR for container init process When running a script from an azure file share interrupted syscall occurs quite frequently, to remedy this add retries around execve syscall, when EINTR is returned. Signed-off-by: Maksim An <maksiman@microsoft.com>	2021-06-30 22:22:31 -07:00
Sebastiaan van Stijn	4316df8b53	libcontainer/system: move userns utilities to separate package Moving these utilities to a separate package, so that consumers of this package don't have to pull in the whole "system" package. Looking at uses of these utilities (outside of runc itself); `RunningInUserNS()` is used by [various external consumers][1], so adding a "Deprecated" alias for this. [1]: https://grep.app/search?current=2&q=.RunningInUserNS Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2021-04-04 22:42:03 +02:00
Sebastiaan van Stijn	e7fd383bce	libcontainer/system: un-export UIDMapInUserNS() `UIDMapInUserNS()` is not used anywhere, only internally, so un-export it. Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2021-04-04 22:42:02 +02:00
Sebastiaan van Stijn	249356a1a4	libcontainer/system: remove unused GetParentNSeuid() This function was added in `f103de57ec`, but is no longer used since `06f789cf26` (v1.0.0-rc6) Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2021-04-04 22:41:59 +02:00
Sebastiaan van Stijn	9df0b5e268	libcontainer: RunningInUserNS() use sync.Once Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2020-05-04 15:53:33 +02:00
Kenta Tada	4474795388	libcontainer: use x/sys/unix instead of the hardcoded value PR_SET_CHILD_SUBREAPER is defined in x/sys/unix. Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>	2020-04-23 10:49:51 +09:00
Kir Kolyshkin	af6b9e7fa9	nit: do not use syscall package In many places (not all of them though) we can use `unix.` instead of `syscall.` as these are identical. In particular, x/sys/unix defines: ```go type Signal = syscall.Signal type Errno = syscall.Errno type SysProcAttr = syscall.SysProcAttr const ENODEV = syscall.Errno(0x13) ``` and unix.Exec() calls syscall.Exec(). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-04-18 16:16:49 -07:00
Tibor Vass	c205e9fb64	libcontainer: fix compilation on GOARCH=arm GOARM=6 (32 bits) This fixes the following compilation error on 32bit ARM: ``` $ GOARCH=arm GOARM=6 go build ./libcontainer/system/ libcontainer/system/linux.go:119:89: constant 4294967295 overflows int ``` Signed-off-by: Tibor Vass <tibor@docker.com>	2018-06-14 18:33:14 +00:00
Akihiro Suda	f103de57ec	main: support rootless mode in userns Running rootless containers in userns is useful for mounting filesystems (e.g. overlay) with mapped euid 0, but without actual root privilege. Usage: (Note that `unshare --mount` requires `--map-root-user`) user$ mkdir lower upper work rootfs user$ curl http://dl-cdn.alpinelinux.org/alpine/v3.7/releases/x86_64/alpine-minirootfs-3.7.0-x86_64.tar.gz | tar Cxz ./lower || ( true; echo "mknod errors were ignored" ) user$ unshare --mount --map-root-user mappedroot# runc spec --rootless mappedroot# sed -i 's/"readonly": true/"readonly": false/g' config.json mappedroot# mount -t overlay -o lowerdir=./lower,upperdir=./upper,workdir=./work overlayfs ./rootfs mappedroot# runc run foo Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>	2018-05-10 12:16:43 +09:00
Akihiro Suda	9c7d8bc1fd	libcontainer: add parser for /etc/sub{u,g}id and /proc/PID/{u,g}id_map Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>	2018-05-10 12:16:43 +09:00
Sebastien Boeuf	bb912eb00c	libcontainer: Do not wait for signalled processes if subreaper is set When a subreaper is enabled, it might expect to reap a process and retrieve its exit code. That's the reason why this patch is giving the possibility to define the usage of a subreaper as a consumer of libcontainer. Relying on this information, libcontainer will not wait for signalled processes in case a subreaper has been set. Fixes #1677 Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>	2017-12-14 10:37:38 -08:00
Tobias Klauser	078e903296	libcontainer: use ioctl wrappers from x/sys/unix Use IoctlGetInt and IoctlGetTermios/IoctlSetTermios instead of manually reimplementing them. Because of unlockpt, the ioctl wrapper is still needed as it needs to pass a pointer to a value, which is not supported by any ioctl function in x/sys/unix yet. Signed-off-by: Tobias Klauser <tklauser@distanz.ch>	2017-07-10 10:56:58 +02:00
Tobias Klauser	a380fae959	libcontainer: use Prctl() from x/sys/unix Use unix.Prctl() instead of manually reimplementing it using unix.RawSyscall. Also use unix.SECCOMP_MODE_FILTER instead of locally defining it. Signed-off-by: Tobias Klauser <tklauser@distanz.ch>	2017-07-10 10:56:58 +02:00
Tobias Klauser	553016d7da	Use Prctl() from x/sys/unix instead of own wrapper Use unix.Prctl() instead of reimplementing it as system.Prctl(). Signed-off-by: Tobias Klauser <tklauser@distanz.ch>	2017-06-07 15:03:15 +02:00
Christy Perez	3d7cb4293c	Move libcontainer to x/sys/unix Since syscall is outdated and broken for some architectures, use x/sys/unix instead. There are still some dependencies on the syscall package that will remain in syscall for the foreseeable future: Errno Signal SysProcAttr Additionally: - os still uses syscall, so it needs to be kept for anything returning *os.ProcessState, such as process.Wait. Signed-off-by: Christy Perez <christy@linux.vnet.ibm.com>	2017-05-22 17:35:20 -05:00
Akihiro Suda	1829531241	Fix trivial style errors reported by `go vet` and `golint` No substantial code change. Note that some style errors reported by `golint` are not fixed due to possible compatibility issues. Signed-off-by: Akihiro Suda <suda.kyoto@gmail.com>	2016-04-12 08:13:16 +00:00
Julian Friedman	e91b2b8aca	Set rlimits using prlimit in parent Fixes #680 This changes setupRlimit to use the Prlimit syscall (rather than Setrlimit) and moves the call to the parent process. This is necessary because Setrlimit would affect the libcontainer consumer if called in the parent, and would fail if called from the child if the child process is in a user namespace and the requested rlimit is higher than that in the parent. Signed-off-by: Julian Friedman <julz.friedman@uk.ibm.com>	2016-03-25 15:11:44 +00:00
Michael Crosby	fdb100d247	Destroy container along with processes before stdio We need to make sure the container is destroyed before closing the stdio for the container. This becomes a big issue when running in the host's pid namespace because the other processes could have inherited the stdio of the initial process. The call to close will just block as they still have the io open. Calling destroy before closing io, especially in the host pid namespace, will cause all additional processes to be killed in the container's cgroup. This will allow the io to be closed successfully. This change makes sure the order for destroy and close is correct as well as ensuring that any errors encountered during start or exec will be handled by terminating the process and destroying the container. We cannot use defers here because we need to enforce the correct ordering on destroy. This also sets the subreaper setting for runc so that when running in pid host, runc can wait on the additional processes launched by the container, useful on destroy, but also good for reaping the additional processes that were launched. Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2016-03-15 13:17:11 -07:00
Mrunal Patel	38b39645d9	Implement NoNewPrivileges support in libcontainer Signed-off-by: Mrunal Patel <mrunalp@gmail.com>	2016-02-16 06:57:50 -08:00
Serge Hallyn	c0ad40c5e6	Do not create devices when in user namespace When we launch a container in a new user namespace, we cannot create devices, so we bind mount the host's devices into place instead. If we are running in a user namespace (i.e. nested in a container), then we need to do the same thing. Add a function to detect that and check for it before doing mknod. Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com> --- Changelog - add a comment clarifying what's going on with the uidmap file.	2016-01-08 12:54:08 -08:00
Michael Crosby	8f97d39dd2	Move libcontainer into subdirectory Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2015-06-21 19:29:15 -07:00
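
Several commits above (c0ad40c5e6, 9c7d8bc1fd, c205e9fb64) concern detecting a user namespace from /proc/PID/uid_map: a single mapping of 4294967295 IDs starting at 0 is treated as the initial namespace. The following is a minimal sketch of that check, not the runc implementation. The names `idMap`, `parseIDMap` and `inUserNS` are hypothetical, and the fields are `int64` so that the constant 4294967295 does not overflow on 32-bit platforms, which is the compilation error c205e9fb64 reports.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// idMap is a hypothetical type holding one line of a uid_map/gid_map file.
// int64 is used so that 4294967295 fits even where int is 32 bits wide.
type idMap struct {
	ID, ParentID, Count int64
}

// parseIDMap (hypothetical helper) parses the text of a uid_map/gid_map file.
func parseIDMap(s string) ([]idMap, error) {
	var maps []idMap
	for _, line := range strings.Split(s, "\n") {
		f := strings.Fields(line)
		if len(f) == 0 {
			continue // skip blank lines, including the trailing one
		}
		if len(f) != 3 {
			return nil, fmt.Errorf("malformed id map line %q", line)
		}
		var m idMap
		for i, dst := range []*int64{&m.ID, &m.ParentID, &m.Count} {
			v, err := strconv.ParseInt(f[i], 10, 64)
			if err != nil {
				return nil, err
			}
			*dst = v
		}
		maps = append(maps, m)
	}
	return maps, nil
}

// inUserNS (hypothetical helper) assumes the initial user namespace is the
// only one whose map is a single full range: 4294967295 IDs starting at 0.
func inUserNS(maps []idMap) bool {
	full := idMap{ID: 0, ParentID: 0, Count: 4294967295}
	return !(len(maps) == 1 && maps[0] == full)
}

func main() {
	// Sample inputs only; a real caller would read /proc/self/uid_map.
	for _, sample := range []string{
		"         0          0 4294967295\n",
		"         0       1000          1\n",
	} {
		maps, err := parseIDMap(sample)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("%q in userns: %v\n", strings.TrimSpace(sample), inUserNS(maps))
	}
}
```

Keeping the parser separate from the decision makes the check testable with plain strings, without needing to create a namespace.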

39 Commits