zishuo/runc

mirror of https://github.com/opencontainers/runc.git synced 2025-09-27 03:46:19 +08:00

Author	SHA1	Message	Date
Kir Kolyshkin	e655abc0da	int/linux: add/use Dup3, Open, Openat Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-03-26 14:16:53 -07:00
Kir Kolyshkin	c690b66d7f	int/linux: add/use Exec Drop the libcontainer/system/exec, and use the linux.Exec instead. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-03-26 14:16:53 -07:00
Kir Kolyshkin	8598f6ec4a	Merge pull request #4354 from ningmingxiao/dev3 skip read /proc/filesystems if process_label is null	2025-03-24 12:09:03 -07:00
Kir Kolyshkin	539315534f	libct: log a warning on join session keyring failure This addresses a TODO item added by commit `40f146841` ("keyring: handle ENOSYS with keyctl(KEYCTL_JOIN_SESSION_KEYRING)"), as we do have runc init logging working fine for quite some time. While at it, fix a typo in a comment (standart -> standard). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-03-13 08:42:22 -07:00
ningmingxiao	6a3f8ea3b4	skip read /proc/filesystems if process_label is null Signed-off-by: ningmingxiao <ning.mingxiao@zte.com.cn>	2025-02-12 12:44:39 +08:00
Kir Kolyshkin	99f9ed94dc	runc exec: fix setting process.Scheduler Commit `770728e1` added Scheduler field into both Config and Process, but forgot to add a mechanism to actually use Process.Scheduler. As a result, runc exec does not set Process.Scheduler ever. Fix it, and add a test case (which fails before the fix). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-02-11 18:01:30 -08:00
Kir Kolyshkin	b9114d91e2	runc exec: fix setting process.ioPriority Commit `bfbd0305b` added IOPriority field into both Config and Process, but forgot to add a mechanism to actually use Process.IOPriority. As a result, runc exec does not set Process.IOPriority ever. Fix it, and add a test case (which fails before the fix). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-02-11 18:01:30 -08:00
Kir Kolyshkin	7dc2486889	libct: switch to numeric UID/GID/groups This addresses the following TODO in the code (added back in 2015 by commit `845fc65e5`): > // TODO: fix libcontainer's API to better support uid/gid in a typesafe way. Historically, libcontainer internally uses strings for user, group, and additional (aka supplementary) groups. Yet, runc receives those credentials as part of runtime-spec's process, which uses integers for all of them (see [1], [2]). What happens next is: 1. runc start/run/exec converts those credentials to strings (a User string containing "UID:GID", and a []string for additional GIDs) and passes those onto runc init. 2. runc init converts them back to int, in the most complicated way possible (parsing container's /etc/passwd and /etc/group). All this conversion and, especially, parsing is totally unnecessary, but is performed on every container exec (and start). The only benefit of all this is, a libcontainer user could use user and group names instead of numeric IDs (but runc itself is not using this feature, and we don't know if there are any other users of this). Let's remove this back and forth translation, hopefully increasing runc exec performance. The only remaining need to parse /etc/passwd is to set HOME environment variable for a specified UID, in case $HOME is not explicitly set in process.Env. This can now be done right in prepareEnv, which simplifies the code flow a lot. Alas, we can not use standard os/user.LookupId, as it could cache host's /etc/passwd or the current user (even with the osusergo tag). PS Note that the structures being changed (initConfig and Process) are never saved to disk as JSON by runc, so there is no compatibility issue for runc users. Still, this is a breaking change in libcontainer, but we never promised that libcontainer API will be stable (and there's a special package that can handle it -- github.com/moby/sys/user). Reflect this in CHANGELOG. For 3998. 
[1]: https://github.com/opencontainers/runtime-spec/blob/v1.0.2/config.md#posix-platform-user [2]: https://github.com/opencontainers/runtime-spec/blob/v1.0.2/specs-go/config.go#L86 Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-02-06 17:49:17 -08:00
Kir Kolyshkin	06f1e07655	libct: speedup process.Env handling The current implementation sets all the environment variables passed in Process.Env in the current process, one by one, then uses os.Environ to read those back. As pointed out in [1], this is slow, as runc calls os.Setenv for every variable, and there may be a few thousands of those. Looking into how os.Setenv is implemented, it is indeed slow, especially when cgo is enabled. Looking into why it was implemented the way it is, I found commit `9744d72c` and traced it to [2], which discusses the actual reasons. It boils down to these two: - HOME is not passed into container as it is set in setupUser by os.Setenv and has no effect on config.Env; - there is a need to deduplicate the environment variables. Yet it was decided in [2] to not go ahead with this patch, but later [3] was opened with the carry of this patch, and merged. Now, from what I see: 1. Passing environment to exec is way faster than using os.Setenv and os.Environ (tests show ~20x speed improvement in a simple Go test, and ~3x improvement in real-world test, see below). 2. Setting environment variables in the runc context may result in some ugly side effects (think GODEBUG, LD_PRELOAD, or _LIBCONTAINER_*). 3. Nothing in runtime spec says that the environment needs to be deduplicated, or the order of preference (whether the first or the last value of a variable with the same name is to be used). We should stick to what we have in order to maintain backward compatibility. So, this patch: - switches to passing env directly to exec; - adds deduplication mechanism to retain backward compatibility; - takes care to set PATH from process.Env in the current process (so that supplied PATH is used to find the binary to execute), also to retain backward compatibility; - adds HOME to process.Env if not set; - ensures any StartContainer CommandHook entries with no environment set explicitly are run with the same environment as before. 
Thanks to @lifubang who noticed that peculiarity. The benchmark added by the previous commit shows ~3x improvement: │ before │ after │ │ sec/op │ sec/op vs base │ ExecInBigEnv-20 61.53m ± 1% 21.87m ± 16% -64.46% (p=0.000 n=10) [1]: https://github.com/opencontainers/runc/pull/1983 [2]: https://github.com/docker-archive/libcontainer/pull/418 [3]: https://github.com/docker-archive/libcontainer/pull/432 Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-01-09 18:22:53 +08:00
lfbzhm	d48d9cfefc	Merge pull request #4459 from kolyshkin/prio-nits Fixups to scheduler/priority settings	2024-12-25 23:41:27 +08:00
Kir Kolyshkin	5d3942eec3	libct: unify IOPriority setting For some reason, io priority is set in different places between runc start/run and runc exec: - for runc start/run, it is done in the middle of (linuxStandardInit).Init, close to the place where we exec runc init. - for runc exec, it is done much earlier, in (setnsProcess) start(). Let's move setIOPriority call for runc exec to (linuxSetnsInit).Init, so it is in the same logical place as for runc start/run. Also, move the function itself to init_linux.go as it's part of init. Should not have any visible effect, except part of runc init is run with a different I/O priority. While at it, rename setIOPriority to setupIOPriority, and make it accept the whole configs.Config, for uniformity with other similar functions. Fixes: `bfbd0305` ("Add I/O priority") Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-12-22 18:15:31 -08:00
Kir Kolyshkin	2dc3ea4b87	libct: simplify setIOPriority/setupScheduler calls Move the nil check inside, simplifying the callers. Fixes: `bfbd0305` ("Add I/O priority") Fixes: `770728e1` ("Support `process.scheduler`") Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-12-22 18:06:20 -08:00
Kir Kolyshkin	93091e6ac2	libct: don't pass SpecState to init unless needed SpecState field of initConfig is only needed to run hooks that are executed inside a container -- namely CreateContainer and StartContainer. If these hooks are not configured, there is no need to fill, marshal and unmarshal SpecState. While at it, inline updateSpecState as it is trivial and only has one user. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-12-22 17:52:15 -08:00
Kir Kolyshkin	5586d7caa1	libct: rm obsoleted comment This was added by commit `f2f16213e` when runc-dmz was still a thing. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-10-29 17:11:56 -07:00
lifubang	871057d863	drop runc-dmz solution according to overlay solution Because we have the overlay solution, we can drop runc-dmz binary solution since it has too many limitations. Signed-off-by: lifubang <lifubang@acmcoder.com>	2024-10-28 15:18:07 +00:00
Kir Kolyshkin	a1e87f8d76	libct: rm eaccess It is not needed since Go 1.20 (which was released in February 2023 and is no longer supported since February 2024). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-06-07 10:18:59 -07:00
Kir Kolyshkin	bac506463d	libct: fix a comment Do not refer to the function which was removed. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-05-07 14:10:16 -07:00
utam0k	bfbd0305ba	Add I/O priority Signed-off-by: utam0k <k0ma@utam0k.jp>	2024-03-30 22:31:54 +09:00
Aleksa Sarai	02120488a4	Merge pull request from GHSA-xr7r-f8xq-vfvv fix GHSA-xr7r-f8xq-vfvv and harden fd leaks	2024-02-01 07:04:29 +11:00
Aleksa Sarai	f2f16213e1	init: close internal fds before execve If we leak a file descriptor referencing the host filesystem, an attacker could use a /proc/self/fd magic-link as the source for execve to execute a host binary in the container. This would allow the binary itself (or a process inside the container in the 'runc exec' case) to write to a host binary, leading to a container escape. The simple solution is to make sure we close all file descriptors immediately before the execve(2) step. Doing this earlier can lead to very serious issues in Go (as file descriptors can be reused, any (*os.File) reference could start silently operating on a different file) so we have to do it as late as possible. Unfortunately, there are some Go runtime file descriptors that we must not close (otherwise the Go scheduler panics randomly). The only way of being sure which file descriptors cannot be closed is to sneakily go:linkname the runtime internal "internal/poll.IsPollDescriptor" function. This is almost certainly not recommended but there isn't any other way to be absolutely sure, while also closing any other possible files. In addition, we can keep the logrus forwarding logfd open because you cannot execve a pipe and the contents of the pipe are so restricted (JSON-encoded in a format we pick) that it seems unlikely you could even construct shellcode. Closing the logfd causes issues if there is an error returned from execve. In mainline runc, runc-dmz protects us against this attack because the intermediate execve(2) closes all of the O_CLOEXEC internal runc file descriptors and thus runc-dmz cannot access them to attack the host. Fixes: GHSA-xr7r-f8xq-vfvv CVE-2024-21626 Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2024-01-24 00:20:58 +11:00
Aleksa Sarai	7094efb192	init: use *os.File for passed file descriptors While it doesn't make much of a practical difference, it seems far more reasonable to use os.NewFile to wrap all of our passed file descriptors to make sure they're tracked by the Go runtime and that we don't double-close them. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2024-01-22 17:34:14 +11:00
Aleksa Sarai	8e8b136c49	tree-wide: use /proc/thread-self for thread-local state With the idmap work, we will have a tainted Go thread in our thread-group that has a different mount namespace to the other threads. It seems that (due to some bad luck) the Go scheduler tends to make this thread the thread-group leader in our tests, which results in very baffling failures where /proc/self/mountinfo produces gibberish results. In order to avoid this, switch to using /proc/thread-self for everything that is thread-local. This primarily includes switching all file descriptor paths (CLONE_FS), all of the places that check the current cgroup (technically we never will run a single runc thread in a separate cgroup, but better to be safe than sorry), and the aforementioned mountinfo code. We don't need to do anything for the following because the results we need aren't thread-local: * Checks that certain namespaces are supported by stat(2)ing /proc/self/ns/... * /proc/self/exe and /proc/self/cmdline are not thread-local. * While threads can be in different cgroups, we do not do this for the runc binary (or libcontainer) and thus we do not need to switch to the thread-local version of /proc/self/cgroups. * All of the CLONE_NEWUSER files are not thread-local because you cannot set the usernamespace of a single thread (setns(CLONE_NEWUSER) is blocked for multi-threaded programs). Note that we have to use runtime.LockOSThread when we have an open handle to a tid-specific procfs file that we are operating on multiple times. Go can reschedule us such that we are running on a different thread and then kill the original thread (causing -ENOENT or similarly confusing errors). This is not strictly necessary for most usages of /proc/thread-self (such as using /proc/thread-self/fd/$n directly) since only operating on the actual inodes associated with the tid requires this locking, but because of the pre-3.17 fallback for CentOS, we have to do this in most cases. 
In addition, CentOS's kernel is too old for /proc/thread-self, which requires us to emulate it -- however in rootfs_linux.go, we are in the container pid namespace but /proc is the host's procfs. This leads to the incredibly frustrating situation where there is no way (on pre-4.1 Linux) to figure out which /proc/self/task/... entry refers to the current tid. We can just use /proc/self in this case. Yes this is all pretty ugly. I also wish it wasn't necessary. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-12-14 11:36:41 +11:00
Aleksa Sarai	ba0b5e2698	libcontainer: remove all mount logic from nsexec With open_tree(OPEN_TREE_CLONE), it is possible to implement both the id-mapped mounts and bind-mount source file descriptor logic entirely in Go without requiring any complicated handling from nsexec. However, implementing it the naive way (do the OPEN_TREE_CLONE in the host namespace before the rootfs is set up -- which is what the existing implementation did) exposes issues in how mount ordering (in particular when handling mount sources from inside the container rootfs, but also in relation to mount propagation) was handled for idmapped mounts and bind-mount sources. In order to solve this problem completely, it is necessary to spawn a thread which joins the container mount namespace and provides mountfds when requested by the rootfs setup code (ensuring that the mount order and mount propagation of the source of the bind-mount are handled correctly). While the need to join the mount namespace leads to other complications (such as with the usage of /proc/self -- fixed in a later patch) the resulting code is still reasonable and is the only real way to solve the issue. This allows us to reduce the amount of C code we have in nsexec, as well as simplifying a whole host of places that were made more complicated with the addition of id-mapped mounts and the bind sourcefd logic. Because we join the container namespace, we can continue to use regular O_PATH file descriptors for non-id-mapped bind-mount sources (which means we don't have to raise the kernel requirement for that case). In addition, we can easily add support for id-mappings that don't match the container's user namespace. The approach taken here is to use Go's officially supported mechanism for spawning a process in a user namespace, but (ab)use PTRACE_TRACEME to avoid actually having to exec a different process. 
The most efficient way to implement this would be to do clone() in cgo directly to run a function that just does kill(getpid(), SIGSTOP) -- we can always switch to that if it turns out this approach is too slow. It should be noted that the included micro-benchmark seems to indicate this is Fast Enough(TM): goos: linux goarch: amd64 pkg: github.com/opencontainers/runc/libcontainer/userns cpu: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz BenchmarkSpawnProc BenchmarkSpawnProc-8 1670 770065 ns/op Fixes: `fda12ab101` ("Support idmap mounts on volumes") Fixes: `9c444070ec` ("Open bind mount sources from the host userns") Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-12-14 11:36:40 +11:00
lfbzhm	95a93c132c	Merge pull request #4045 from fuweid/support-pidfd-socket [feature request] *: introduce pidfd-socket flag	2023-11-22 09:13:55 +08:00
Wei Fu	94505a046a	*: introduce pidfd-socket flag A container manager such as containerd-shim can't use the cgroup.kill feature or freeze all the processes in the cgroup to terminate the exec init process. It's unsafe to call kill(2) since the pid can be recycled. It's good to provide the pidfd of the init process through the pidfd-socket. It's similar to the console-socket. With the pidfd, a container manager such as containerd-shim can send the signal to the target process safely. And for the standard init process, we can have polling support to get the exit event instead of blocking on wait4. Signed-off-by: Wei Fu <fuweid89@gmail.com>	2023-11-21 18:28:50 +08:00
Zheao.Li	98511bb40e	linux: Support setting execution domain via linux personality carry #3126 Co-authored-by: Aditya R <arajan@redhat.com> Signed-off-by: Zheao.Li <me@manjusaka.me>	2023-10-27 19:33:37 +08:00
utam0k	770728e16e	Support `process.scheduler` Spec: https://github.com/opencontainers/runtime-spec/pull/1188 Fix: https://github.com/opencontainers/runc/issues/3895 Co-authored-by: lifubang <lifubang@acmcoder.com> Signed-off-by: utam0k <k0ma@utam0k.jp> Signed-off-by: lifubang <lifubang@acmcoder.com>	2023-10-04 15:53:18 +08:00
Aleksa Sarai	8da42aaec2	sync: split init config (stream) and synchronisation (seqpacket) pipes We have different requirements for the initial configuration and initWaiter pipe (just send netlink and JSON blobs with no complicated handling needed for message coalescing) and the packet-based synchronisation pipe. Tests with switching everything to SOCK_SEQPACKET lead to endless issues with runc hanging on start-up because random things would try to do short reads (which SOCK_SEQPACKET will not allow and the Go stdlib explicitly treats as a streaming source), so splitting it was the only reasonable solution. Even doing somewhat dodgy tricks such as adding a Read() wrapper which actually calls ReadPacket() and makes it seem like a stream source doesn't work -- and is a bit too magical. One upside is that doing it this way makes the difference between the modes clearer -- INITPIPE is still used for initWaiter synchronisation but aside from that all other synchronisation is done by SYNCPIPE. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-09-24 20:31:14 +08:00
lifubang	dac4171746	runc-dmz: reduce memfd binary cloning cost with small C binary The idea is to remove the need for cloning the entire runc binary by replacing the final execve() call of the container process with an execve() call to a clone of a small C binary which just does an execve() of its arguments. This provides similar protection against CVE-2019-5736 but without requiring a >10MB binary copy for each "runc init". When compiled with musl, runc-dmz is 13kB (though unfortunately with glibc, it is 1.1MB which is still quite large). It should be noted that there is still a window where the container processes could get access to the host runc binary, but because we set ourselves as non-dumpable the container would need CAP_SYS_PTRACE (which is not enabled by default in Docker) in order to get around the proc_fd_access_allowed() checks. In addition, since Linux 4.10[1] the kernel blocks access entirely for user namespaced containers in this scenario. For those cases we cannot use runc-dmz, but most containers won't have this issue. This new runc-dmz binary can be opted out of at compile time by setting the "runc_nodmz" buildtag, and at runtime by setting the RUNC_DMZ=legacy environment variable. In both cases, runc will fall back to the classic /proc/self/exe-based cloning trick. If /proc/self/exe is already a sealed memfd (namely if the user is using contrib/cmd/memfd-bind to create a persistent sealed memfd for runc), neither runc-dmz nor /proc/self/exe cloning will be used because they are not necessary. 
[1]: `bfedb58925` Co-authored-by: lifubang <lifubang@acmcoder.com> Signed-off-by: lifubang <lifubang@acmcoder.com> [cyphar: address various review nits] [cyphar: fix runc-dmz cross-compilation] [cyphar: embed runc-dmz into runc binary and clone in Go code] [cyphar: make runc-dmz optional, with fallback to /proc/self/exe cloning] [cyphar: do not use runc-dmz when the container has certain privs] Co-authored-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-09-22 15:38:19 +10:00
Kir Kolyshkin	6a4870e4ac	libct: better errors for hooks When a hook has failed, the error message looks like this: > error running hook: error running hook #1: exit status 1, stdout: ... The two problems here are: 1. it is impossible to know what kind of hook it was; 2. "error running hook" stuttering; Change that to > error running createContainer hook #1: exit status 1, stdout: ... Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-08-24 19:44:05 -07:00
Rodrigo Campos	fda12ab101	Support idmap mounts on volumes This commit adds support for idmap mounts as specified in the runtime-spec. We open the idmap source paths and call mount_setattr() in runc PARENT, as we need privileges in the init userns for that, and then send the fds to the child process. For this fd passing we use the same mechanism used in other parts of the code, the _LIBCONTAINER_ env vars. The mount is finished (unix.MoveMount) from go code, inside the userns, so we reuse all the prepareBindMount() security checks and the remount logic for some flags too. This commit only supports idmap mounts when userns are used AND the mappings are the same specified for the userns mapping. This limitation is to simplify the initial implementation, as all our users so far only need this, and we can avoid sending over netlink the mappings, creating a userns with this custom mapping, etc. Future PRs will remove this limitation. Co-authored-by: Francis Laniel <flaniel@linux.microsoft.com> Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>	2023-07-17 13:30:12 +02:00
Rodrigo Campos	73b649705a	libcontainer: Add mountFds struct We will need to pass more slices of fds to these functions in future patches. Let's add a struct that just contains them all, instead of adding lot of parameters to these functions. Signed-off-by: Rodrigo Campos <rodrigoca@microsoft.com>	2023-07-11 16:17:48 +02:00
utam0k	d9230602e9	Implement to set a domainname opencontainers/runtime-spec#1156 Signed-off-by: utam0k <k0ma@utam0k.jp>	2023-04-12 13:31:20 +00:00
Kir Kolyshkin	8491d33482	Fix runc run "permission denied" when rootless Since commit `957d97bcf4` was made to fix issue [7], a few things happened: - a similar functionality appeared in go 1.20 [1], so the issue mentioned in the comment (being removed) is no longer true; - a bug in runc was found [2], which also affects go [3]; - the bug was fixed in go 1.21 [4] and 1.20.2 [5]; - a similar fix was made to x/sys/unix.Faccessat [6]. The essence of [2] is, even if a (non-root) user that the container is run as does not have execute permission bit set for the executable, it should still work in case runc has the CAP_DAC_OVERRIDE capability set. To fix this [2] without reintroducing the older bug [7]: - drop own Eaccess implementation; - use the one from x/sys/unix for Go 1.19 (depends on [6]); - do not use anything when Go 1.20+ is used. NOTE it is virtually impossible to fix the bug [2] when Go 1.20 or Go 1.20.1 is used because of [3]. A test case is added by a separate commit. Fixes: #3715. [1] https://go-review.googlesource.com/c/go/+/414824 [2] https://github.com/opencontainers/runc/issues/3715 [3] https://go.dev/issue/58552 [4] https://go-review.googlesource.com/c/go/+/468735 [5] https://go-review.googlesource.com/c/go/+/469956 [6] https://go-review.googlesource.com/c/sys/+/468877 [7] https://github.com/opencontainers/runc/issues/3520 Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-03-27 15:15:48 -07:00
Kir Kolyshkin	957d97bcf4	Fix error from runc run on noexec fs When starting a new container, and the very last step of executing a user process fails (last lines of (*linuxStandardInit).Init), it is too late to print a proper error since both the log pipe and the init pipe are closed. This is partially mitigated by using exec.LookPath() which is supposed to say whether we will be able to execute or not. Alas, it fails to do so when the binary to be executed resides on a filesystem mounted with noexec flag. A workaround would be to use access(2) with X_OK flag. Alas, it is not working when runc itself is a setuid (or setgid) binary. In this case, faccessat2(2) with AT_EACCESS can be used, but it is only available since Linux v5.8. So, use faccessat2(2) with AT_EACCESS if available. If not, fall back to access(2) for non-setuid runc, and do nothing for setuid runc (as there is nothing we can do). Note that this check is in addition to whatever exec.LookPath does. Fixes https://github.com/opencontainers/runc/issues/3520 Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-07-01 10:02:42 -07:00
Kir Kolyshkin	bb6a838876	libct: initContainer: rename Id -> ID Since the next commit is going to touch this structure, our CI (lint-extra) is about to complain about improperly named field: > Warning: var-naming: struct field ContainerId should be ContainerID (revive) Make it happy. Brought to use by gopls rename. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-01-26 18:59:47 -08:00
Kir Kolyshkin	982b9a1dd3	libct/standard_init: fix linter warning The staticcheck linter points out that the err != nil comparison after system.Exec is always true: > libcontainer/standard_init_linux.go#L253 > SA4023: this comparison is always true (staticcheck) > libcontainer/system/linux.go#L43 > SA4023(related information): github.com/opencontainers/runc/libcontainer/system.Exec never returns a nil interface value (staticcheck) Indeed, Exec either returns an error or does not return at all. Remove the (useless) check. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-11-18 08:53:50 -08:00
Akihiro Suda	4d17654479	Merge pull request #2576 from kinvolk/alban/userns-2484-take2 Open bind mount sources from the host userns	2021-10-28 14:50:33 +09:00
Mrunal Patel	d5c9905be8	Merge pull request #3235 from kolyshkin/rm-exc-lock libct: Init: remove LockOSThread	2021-10-18 13:52:26 -07:00
Alban Crequy	9c444070ec	Open bind mount sources from the host userns The source of the bind mount might not be accessible in a different user namespace because a component of the source path might not be traversed under the users and groups mapped inside the user namespace. This caused errors such as the following: # time="2020-06-22T13:48:26Z" level=error msg="container_linux.go:367: starting container process caused: process_linux.go:459: container init caused: rootfs_linux.go:58: mounting \"/tmp/busyboxtest/source-inaccessible/dir\" to rootfs at \"/tmp/inaccessible\" caused: stat /tmp/busyboxtest/source-inaccessible/dir: permission denied" To solve this problem, this patch performs the following: 1. in nsexec.c, it opens the source path in the host userns (so we have the right permissions to open it) but in the container mntns (so the kernel cross mntns mount check lets us mount it later: https://github.com/torvalds/linux/blob/v5.8/fs/namespace.c#L2312). 2. in nsexec.c, it passes the file descriptors of the source to the child process with SCM_RIGHTS. 3. In runc-init in Golang, it finishes the mounts while inside the userns even without access to some components of the source paths. Passing the fds with SCM_RIGHTS is necessary because once the child process is in the container mntns, it is already in the container userns so it cannot temporarily join the host mntns. This patch uses the existing mechanism with _LIBCONTAINER_* environment variables to pass the file descriptors from runc to runc init. This patch uses the existing mechanism with the Netlink-style bootstrap to pass information about the list of source mounts to nsexec.c. Rootless containers don't use this bind mount sources fdpassing mechanism because we can't setns() to the target mntns in a rootless container (we don't have the privileges when we are in the host userns). This patch takes care of using O_CLOEXEC on mount fds, and close them early. Fixes: #2484. 
Signed-off-by: Alban Crequy <alban@kinvolk.io> Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io> Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>	2021-10-12 15:13:45 +02:00
Kir Kolyshkin	794cd66df8	libct/system: Exec: wrap the error If the container binary to be run is removed in between runc create and runc start, the latter spits the following error: > can't exec user process: no such file or directory This is a bit confusing since we don't see what file is missing. Wrap the unix.Exec error into os.PathError, like in many other cases, to provide some context. Remove the error wrapping from (*linuxStandardInit).Init as it is now redundant. With this patch, the error is now: > exec /bin/false: no such file or directory Reported-by: Daniel J Walsh <dwalsh@redhat.com> Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-10-07 11:09:08 -07:00
Kir Kolyshkin	e395d2dc50	libct: Init: remove LockOSThread This call is already made in init.go, no need for a duplicate. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-10-05 19:19:40 -07:00
Alban Crequy	2b025c0173	Implement Seccomp Notify This commit implements support for the SCMP_ACT_NOTIFY action. It requires libseccomp-2.5.0 to work but runc still works with older libseccomp if the seccomp policy does not use the SCMP_ACT_NOTIFY action. A new synchronization step between runc[INIT] and runc run is introduced to pass the seccomp fd. runc run fetches the seccomp fd with pidfd_get from the runc[INIT] process and sends it to the seccomp agent using SCM_RIGHTS. As suggested by @kolyshkin, we also make writeSync() a wrapper of writeSyncWithFd() and wrap the error there. To avoid pointless errors, we made some existing code paths just return the error instead of re-wrapping it. If we don't do it, error will look like: writing syncT <act>: writing syncT: <err> By adjusting the code path, now they just look like this writing syncT <act>: <err> Signed-off-by: Alban Crequy <alban@kinvolk.io> Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io> Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>	2021-09-07 13:04:24 +02:00
Kir Kolyshkin	9ff64c3d97	*: rm redundant linux build tag For files that end with _linux.go or _linux_test.go, there is no need to specify linux build tag, as it is assumed from the file name. In addition, rename libcontainer/notify_linux_v2.go -> libcontainer/notify_v2_linux.go for the file name to make sense. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-08-30 20:15:00 -07:00
Kir Kolyshkin	75761bccf7	Fix codespell warnings, add codespell to ci The two exceptions I had to add to codespellrc are: - CLOS (used by intelrtd); - creat (syscall name used in tests/integration/testdata/seccomp_*.json). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-08-17 16:12:35 -07:00
Maksim An	e39ad65059	retry unix.EINTR for container init process When running a script from an Azure file share, interrupted syscalls occur quite frequently. To remedy this, add retries around the execve syscall when EINTR is returned. Signed-off-by: Maksim An <maksiman@microsoft.com>	2021-06-30 22:22:31 -07:00
Kir Kolyshkin	e918d02139	libcontainer: rm own error system This removes libcontainer's own error wrapping system, consisting of a few types and functions, aimed at typization, wrapping and unwrapping of errors, as well as saving error stack traces. Since Go 1.13 now provides its own error wrapping mechanism and a few related functions, it makes sense to switch to it. While doing that, improve some error messages so that they start with "error", "unable to", or "can't". A few things that are worth mentioning: 1. We lose stack traces (which were never shown anyway). 2. Users of libcontainer that relied on particular errors (like ContainerNotExists) need to switch to using errors.Is with the new errors defined in error.go. 3. encoding/json is unable to unmarshal the built-in error type, so we have to introduce initError and wrap the errors into it (basically passing the error as a string). This is the same as it was before, just a tad simpler (actually the initError is a type that got removed in commit afa844311; also suddenly ierr variable name makes sense now). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-06-24 10:21:04 -07:00
Kir Kolyshkin	a7cfb23b88	*: stop using pkg/errors Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-06-22 16:09:47 -07:00
Kir Kolyshkin	c6fed264da	libct/keys: stop using pkg/errors Use fmt.Errorf with %w instead. Convert the users to the new wrapping. This fixes an errorlint warning. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-06-22 16:09:47 -07:00
Sebastiaan van Stijn	b45fbd43b8	errcheck: libcontainer Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2021-05-20 14:19:26 +02:00
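
Commit 06f1e07655 in the log above describes passing the environment directly to exec while deduplicating it for backward compatibility. The following is only a minimal sketch of such a deduplication step, not the code from that commit. The helper name `dedupEnv` is hypothetical, and the "last value wins" rule is an assumption based on how setting variables one by one with os.Setenv would behave.

```go
package main

import (
	"fmt"
	"strings"
)

// dedupEnv is a hypothetical helper (not runc's actual function).
// It removes duplicate variable names from env, letting a later
// value override an earlier one. That rule is an assumption here.
// The position of the first occurrence is kept in the output.
func dedupEnv(env []string) []string {
	seen := make(map[string]int, len(env)) // variable name -> index in out
	out := make([]string, 0, len(env))
	for _, kv := range env {
		name, _, ok := strings.Cut(kv, "=")
		if !ok || name == "" {
			continue // this sketch simply skips malformed entries
		}
		if i, dup := seen[name]; dup {
			out[i] = kv // later value replaces the earlier one
			continue
		}
		seen[name] = len(out)
		out = append(out, kv)
	}
	return out
}

func main() {
	env := []string{"PATH=/bin", "HOME=/root", "PATH=/usr/bin"}
	fmt.Println(dedupEnv(env))
}
```

The resulting slice could then be handed to an exec call as its environment argument, so the current process never has to call os.Setenv for each variable.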


105 Commits