zishuo/runc

mirror of https://github.com/opencontainers/runc.git synced 2025-10-05 15:37:02 +08:00

Author	SHA1	Message	Date
Aleksa Sarai	121192ade6	libct: reset CPU affinity by default In certain deployments, it's possible for runc to be spawned by a process with a restrictive cpumask (such as from a systemd unit with CPUAffinity=... configured) which will be inherited by runc and thus the container process by default. The cpuset cgroup used to reconfigure the cpumask automatically for joining processes, but kcommit da019032819a ("sched: Enforce user requested affinity") changed this behaviour in Linux 6.2. The solution is to try to emulate the expected behaviour by resetting our cpumask to correspond with the configured cpuset (in the case of "runc exec", if the user did not configure an alternative one). Normally we would have to parse /proc/stat and /sys/fs/cgroup, but luckily sched_setaffinity(2) will transparently convert an all-set cpumask (even if it has more entries than the number of CPUs on the system) to the correct value for our usecase. For some reason, in our CI it seems that rootless --systemd-cgroup results in the cpuset (presumably temporarily?) being configured such that sched_setaffinity(2) will allow the full set of CPUs. For this particular case, all we care about is that it is different to the original set, so include some special-casing (but we should probably investigate this further...). Reported-by: ningmingxiao <ning.mingxiao@zte.com.cn> Reported-by: Martin Sivak <msivak@redhat.com> Reported-by: Peter Hunt <pehunt@redhat.com> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2025-08-28 08:25:46 +10:00
lfbzhm	4d4cedd650	Merge pull request #4796 from astrawind/fix/seccomp-agent-conn-leak libcontainer: close seccomp agent connection to prevent resource leaks	2025-07-04 00:08:21 +08:00
Pavel Liubimov	aa0e7989c4	libcontainer: close seccomp agent connection to prevent resource leaks Add missing defer conn.Close(). Signed-off-by: Pavel Liubimov <prlyubimov@gmail.com>	2025-07-01 12:31:55 +03:00
HirazawaUi	1b39997e73	Preventing containers from being unable to be deleted Signed-off-by: HirazawaUi <695097494plus@gmail.com>	2025-06-19 20:17:50 +08:00
Antonio Ojea	8d180e9658	Add support for Linux Network Devices Implement support for passing Linux Network Devices to the container network namespace. The network device is passed during the creation of the container, before the process is started. It implements the logic defined in the OCI runtime specification. Signed-off-by: Antonio Ojea <aojea@google.com>	2025-06-18 15:52:30 +01:00
Kir Kolyshkin	17570625c0	Use for range over integers This appears in Go 1.22 (see https://tip.golang.org/ref/spec#For_range). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-03-31 17:15:06 -07:00
lfbzhm	346c80d714	libct: replace unix.Kill with os.Process.Signal Because we should switch to unix.PidFDSendSignal in new kernels, it has been supported in go runtime. We don't need to add fall back to unix.Kill code here. Signed-off-by: lifubang <lifubang@acmcoder.com> Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-03-07 14:07:47 -08:00
Kir Kolyshkin	10ca66bff5	runc exec: implement CPU affinity As per - https://github.com/opencontainers/runtime-spec/pull/1253 - https://github.com/opencontainers/runtime-spec/pull/1261 CPU affinity can be set in two ways: 1. When creating/starting a container, in config.json's Process.ExecCPUAffinity, which is when applied to all execs. 2. When running an exec, in process.json's CPUAffinity, which applied to a given exec and overrides the value from (1). Add some basic tests. Note that older kernels (RHEL8, Ubuntu 20.04) change CPU affinity of a process to that of a container's cgroup, as soon as it is moved to that cgroup, while newer kernels (Ubuntu 24.04, Fedora 41) don't do that. Because of the above, - it's impossible to really test initial CPU affinity without adding debug logging to libcontainer/nsenter; - for older kernels, there can be a brief moment when exec's affinity is different than either initial or final affinity being set; - exec's final CPU affinity, if not specified, can be different depending on the kernel, therefore we don't test it. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-03-02 19:17:41 -08:00
Kir Kolyshkin	a75076b4a4	Switch to opencontainers/cgroups This removes libcontainer/cgroups packages and starts using those from github.com/opencontainers/cgroups repo. Mostly generated by: git rm -f libcontainer/cgroups find . -type f -name "*.go" -exec sed -i \ 's\|github.com/opencontainers/runc/libcontainer/cgroups\|github.com/opencontainers/cgroups\|g' \ {} + go get github.com/opencontainers/cgroups@v0.0.1 make vendor gofumpt -w . Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2025-02-28 15:20:33 -08:00
lfbzhm	d48d9cfefc	Merge pull request #4459 from kolyshkin/prio-nits Fixups to scheduler/priority settings	2024-12-25 23:41:27 +08:00
Kir Kolyshkin	5d3942eec3	libct: unify IOPriority setting For some reason, io priority is set in different places between runc start/run and runc exec: - for runc start/run, it is done in the middle of (linuxStandardInit).Init, close to the place where we exec runc init. - for runc exec, it is done much earlier, in (setnsProcess) start(). Let's move setIOPriority call for runc exec to (linuxSetnsInit).Init, so it is in the same logical place as for runc start/run. Also, move the function itself to init_linux.go as it's part of init. Should not have any visible effect, except part of runc init is run with a different I/O priority. While at it, rename setIOPriority to setupIOPriority, and make it accept the whole configs.Config, for uniformity with other similar functions. Fixes: `bfbd0305` ("Add I/O priority") Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-12-22 18:15:31 -08:00
Kir Kolyshkin	2dc3ea4b87	libct: simplify setIOPriority/setupScheduler calls Move the nil check inside, simplifying the callers. Fixes: `bfbd0305` ("Add I/O priority") Fixes: `770728e1` ("Support `process.scheduler`") Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-12-22 18:06:20 -08:00
Kir Kolyshkin	93091e6ac2	libct: don't pass SpecState to init unless needed SpecState field of initConfig is only needed to run hooks that are executed inside a container -- namely CreateContainer and StartContainer. If these hooks are not configured, there is no need to fill, marshal and unmarshal SpecState. While at it, inline updateSpecState as it is trivial and only has one user. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-12-22 17:52:15 -08:00
Kir Kolyshkin	8afeb58398	libct: add/use configs.HasHook This allows to omit a call to c.currentOCIState (which can be somewhat costly when there are many annotations) when the hooks of a given kind won't be run. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-12-22 17:47:09 -08:00
lfbzhm	171c414904	refactor init and setns process Introduce a common parent struct `containerProcess`, let both `initProcess` and `setnsProcess` are inherited from it. Signed-off-by: lfbzhm <lifubang@acmcoder.com>	2024-12-21 19:16:01 -08:00
Kir Kolyshkin	b5bdf592f2	libct: rm initWaiter This initWaiter logic was introduced by commit `4ecff8d9`, but since the logic of /proc/self/exe was moved out of runc init in commit `0e9a335`, this seems unnecessary to have initWaiter. Remove it. This essentially reverts commit `4ecff8d9`. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-10-17 08:05:42 -07:00
Kir Kolyshkin	30f8f51eab	runc create/run: warn on rootless + shared pidns + no cgroup Shared pid namespace means `runc kill` (or `runc delete -f`) have to kill all container processes, not just init. To do so, it needs a cgroup to read the PIDs from. If there is no cgroup, processes will be leaked, and so such configuration is bad and should not be allowed. To keep backward compatibility, though, let's merely warn about this for now. Alas, the only way to know if cgroup access is available is by returning an error from Manager.Apply. Amend fs cgroup managers to do so (systemd doesn't need it, since v1 can't work with rootless, and cgroup v2 does not have a special rootless case). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-09-17 22:49:37 -07:00
Sebastiaan van Stijn	30b530ca94	libct/userns: split userns detection from internal userns code Commit `4316df8b53` isolated RunningInUserNS to a separate package to make it easier to consume without bringing in additional dependencies, and with the potential to move it separate in a similar fashion as libcontainer/user was moved to a separate module in commit `ca32014adb`. While RunningInUserNS is fairly trivial to implement, it (or variants of this utility) is used in many codebases, and moving to a separate module could consolidate those implementations, as well as making it easier to consume without large dependency trees (when being a package as part of a larger code base). Commit `1912d5988b` and follow-ups introduced cgo code into the userns package, and code introduced in those commits are not intended for external use, therefore complicating the potential of moving the userns package separate. This commit moves the new code to a separate package; some of this code was included in v1.1.11 and up, but I could not find external consumers of `GetUserNamespaceMappings` and `IsSameMapping`. The `Mapping` and `Handles` types (added in `ba0b5e2698`) only exist in main and in non-stable releases (v1.2.0-rc.x), so don't need an alias / deprecation. Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2024-06-30 20:06:30 +02:00
Kir Kolyshkin	1c505fffdc	Revert "Set temporary single CPU affinity..." There's too much logic here figuring out which CPUs to use. Runc is a low level tool and is not supposed to be that "smart". What's worse, this logic is executed on every exec, making it slower. Some of the logic in (*setnsProcess).start is executed even if no annotation is set, thus making ALL execs slow. Also, this should be a property of a process, rather than annotation. The plan is to rework this. This reverts commit `afc23e3397`. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-06-10 06:31:03 +08:00
lifubang	a853a82677	runc exec: setupRlimits after syscall.rlimit.init() completed Issue: https://github.com/opencontainers/runc/issues/4195 Since https://go-review.googlesource.com/c/go/+/476097, there is a get/set race between runc exec and syscall.rlimit.init, so we need to call setupRlimits after syscall.rlimit.init() completed. Signed-off-by: lifubang <lifubang@acmcoder.com>	2024-05-08 10:40:07 +00:00
Cédric Clerget	afc23e3397	Set temporary single CPU affinity before cgroup cpuset transition. This handles a corner case when joining a container having all the processes running exclusively on isolated CPU cores to force the kernel to schedule runc process on the first CPU core within the cgroups cpuset. The introduction of the kernel commit 46a87b3851f0d6eb05e6d83d5c5a30df0eca8f76 has affected this deterministic scheduling behavior by distributing tasks across CPU cores within the cgroups cpuset. Some intensive real-time application are relying on this deterministic behavior and use the first CPU core to run a slow thread while other CPU cores are fully used by real-time threads with SCHED_FIFO policy. Such applications prevents runc process from joining a container when the runc process is randomly scheduled on a CPU core owned by a real-time thread. Introduces isolated CPU affinity transition OCI runtime annotation org.opencontainers.runc.exec.isolated-cpu-affinity-transition to restore the behavior during runc exec. Fix issue with kernel >= 6.2 not resetting CPU affinity for container processes. Signed-off-by: Cédric Clerget <cedric.clerget@gmail.com>	2024-04-16 08:59:49 +02:00
utam0k	bfbd0305ba	Add I/O priority Signed-off-by: utam0k <k0ma@utam0k.jp>	2024-03-30 22:31:54 +09:00
Aleksa Sarai	ba0b5e2698	libcontainer: remove all mount logic from nsexec With open_tree(OPEN_TREE_CLONE), it is possible to implement both the id-mapped mounts and bind-mount source file descriptor logic entirely in Go without requiring any complicated handling from nsexec. However, implementing it the naive way (do the OPEN_TREE_CLONE in the host namespace before the rootfs is set up -- which is what the existing implementation did) exposes issues in how mount ordering (in particular when handling mount sources from inside the container rootfs, but also in relation to mount propagation) was handled for idmapped mounts and bind-mount sources. In order to solve this problem completely, it is necessary to spawn a thread which joins the container mount namespace and provides mountfds when requested by the rootfs setup code (ensuring that the mount order and mount propagation of the source of the bind-mount are handled correctly). While the need to join the mount namespace leads to other complicated (such as with the usage of /proc/self -- fixed in a later patch) the resulting code is still reasonable and is the only real way to solve the issue. This allows us to reduce the amount of C code we have in nsexec, as well as simplifying a whole host of places that were made more complicated with the addition of id-mapped mounts and the bind sourcefd logic. Because we join the container namespace, we can continue to use regular O_PATH file descriptors for non-id-mapped bind-mount sources (which means we don't have to raise the kernel requirement for that case). In addition, we can easily add support for id-mappings that don't match the container's user namespace. The approach taken here is to use Go's officially supported mechanism for spawning a process in a user namespace, but (ab)use PTRACE_TRACEME to avoid actually having to exec a different process. The most efficient way to implement this would be to do clone() in cgo directly to run a function that just does kill(getpid(), SIGSTOP) -- we can always switch to that if it turns out this approach is too slow. It should be noted that the included micro-benchmark seems to indicate this is Fast Enough(TM): goos: linux goarch: amd64 pkg: github.com/opencontainers/runc/libcontainer/userns cpu: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz BenchmarkSpawnProc BenchmarkSpawnProc-8 1670 770065 ns/op Fixes: `fda12ab101` ("Support idmap mounts on volumes") Fixes: `9c444070ec` ("Open bind mount sources from the host userns") Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-12-14 11:36:40 +11:00
Aleksa Sarai	8da42aaec2	sync: split init config (stream) and synchronisation (seqpacket) pipes We have different requirements for the initial configuration and initWaiter pipe (just send netlink and JSON blobs with no complicated handling needed for message coalescing) and the packet-based synchronisation pipe. Tests with switching everything to SOCK_SEQPACKET lead to endless issues with runc hanging on start-up because random things would try to do short reads (which SOCK_SEQPACKET will not allow and the Go stdlib explicitly treats as a streaming source), so splitting it was the only reasonable solution. Even doing somewhat dodgy tricks such as adding a Read() wrapper which actually calls ReadPacket() and makes it seem like a stream source doesn't work -- and is a bit too magical. One upside is that doing it this way makes the difference between the modes clearer -- INITPIPE is still used for initWaiter syncrhonisation but aside from that all other synchronisation is done by SYNCPIPE. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-09-24 20:31:14 +08:00
Aleksa Sarai	ccc76713a7	sync: rename procResume -> procHooksDone The old name was quite confusing, and with the addition of the procMountPlease sync message there are now multiple sync messages that are related to "resuming" runc-init. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-09-24 20:02:11 +08:00
Kir Kolyshkin	41778ddc2c	Fix for host mount ns containers If the container does not have own mount namespace configured (i.e. it shares the mount namespace with the host), its "prestart" (obsoleted) and "createRuntime" hooks are called twice, and its cgroups and Intel RDT settings are also applied twice. The code being removed was originally added by commit `2f2764984` ("Move pre-start hooks after container mounts", Feb 17 2016). At that time, the syncParentHooks() was called from setupRootfs(), which was only used when the container config has mount namespace (NEWNS) enabled. Later, commit `244c9fc426` ("*: console rewrite", Jun 4 2016) spli the relevant part of setupRootfs() into prepareRootfs(). It was still called conditionally (only if mount namespace was enabled). Finally, commit `91ca331474` ("chroot when no mount namespaces is provided", Jan 25 2018) removed the above condition, meaning prepareRootfs(), and thus syncParentHooks(), is now called for any container. Meaning, the special case for when mount namespace is not enabled is no longer needed. Remove it. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-08-28 12:45:39 -07:00
Kir Kolyshkin	6a4870e4ac	libct: better errors for hooks When a hook has failed, the error message looks like this: > error running hook: error running hook #1: exit status 1, stdout: ... The two problems here are: 1. it is impossible to know what kind of hook it was; 2. "error running hook" stuttering; Change that to > error running createContainer hook #1: exit status 1, stdout: ... Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-08-24 19:44:05 -07:00
Aleksa Sarai	f81ef1493d	libcontainer: sync: cleanup synchronisation code This includes quite a few cleanups and improvements to the way we do synchronisation. The core behaviour is unchanged, but switching to embedding json.RawMessage into the synchronisation structure will allow us to do more complicated synchronisation operations in future patches. The file descriptor passing through the synchronisation system feature will be used as part of the idmapped-mount and bind-mount-source features when switching that code to use the new mount API outside of nsexec.c. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-08-15 19:54:24 -07:00
Kir Kolyshkin	c6e7b1a8ec	libct: initProcess.start: fix sync logic The code in this function became quite complicated and not entirely correct over time. As a result, if an error is returned from parseSync, it might end up stuck waiting for the child to finish. 1. Let's not wait() for the child twice. We already do it in the defer statement (call p.terminate()) when we are returning an error. 2. Remove sentResume and sentRun since we do not want to check if these were sent or not. Instead, introduce and check seenProcReady, as procReady is always expected from runc init. 3. Eliminate the possibility to wrap nil as an error. 4. Make sure we always call shutdown on the sync socket, and do not let shutdown error shadow the ierr. This fixes the issue of stuck `runc runc` with the optimization patch (sending procSeccompDone earlier) applied. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-08-15 19:54:24 -07:00
Kir Kolyshkin	9583b3d1c2	libct: move killing logic to container.Signal By default, the container has its own PID namespace, and killing (with SIGKILL) its init process from the parent PID namespace also kills all the other processes. Obviously, it does not work that way when the container is sharing its PID namespace with the host or another container, since init is no longer special (it's not PID 1). In this case, killing container's init will result in a bunch of other processes left running (and thus the inability to remove the cgroup). The solution to the above problem is killing all the container processes, not just init. The problem with the current implementation is, the killing logic is implemented in libcontainer's initProcess.wait, and thus only available to libcontainer users, but not the runc kill command (which uses nonChildProcess.kill and does not use wait at all). So, some workarounds exist: - func destroy(c *Container) calls signalAllProcesses; - runc kill implements -a flag. This code became very tangled over time. Let's simplify things by moving the killing all processes from initProcess.wait to container.Signal, and documents the new behavior. In essence, this also makes `runc kill` to automatically kill all container processes when the container does not have its own PID namespace. Document that as well. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-06-08 09:29:25 -07:00
Vipul Newaskar	9fc707e703	Fixed init state error variable Init State Error message was using the err variable instead of uerr, which has been fixed now. The error message should not show "nil" now. Signed-off-by: Vipul Newaskar <vipulnewaskar7@gmail.com>	2022-11-15 09:41:16 +05:30
Kir Kolyshkin	102b8abd26	libct: rm BaseContainer and Container interfaces The only implementation of these is linuxContainer. It does not make sense to have an interface with a single implementation, and we do not foresee other types of containers being added to runc. Remove BaseContainer and Container interfaces, moving their methods documentation to linuxContainer. Rename linuxContainer to Container. Adopt users from using interface to using struct. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-03-23 11:04:12 -07:00
Kir Kolyshkin	89733cd055	Format sources using gofumpt 0.2.1 ... which adds a wee more whitespace fixes. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-03-07 10:42:01 -08:00
Kir Kolyshkin	dbd990d555	libct: rm intelrtd.Manager interface, NewIntelRdtManager Remove intelrtd.Manager interface, since we only have a single implementation, and do not expect another one. Rename intelRdtManager to Manager, and modify its users accordingly. Remove NewIntelRdtManager from factory. Remove IntelRdtfs. Instead, make intelrdt.NewManager return nil if the feature is not available. Remove TestFactoryNewIntelRdt as it is now identical to TestFactoryNew. Add internal function newManager to be used for tests (to make sure some testing is done even when the feature is not available in kernel/hardware). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-02-03 17:33:03 -08:00
Kir Kolyshkin	bb6a838876	libct: initContainer: rename Id -> ID Since the next commit is going to touch this structure, our CI (lint-extra) is about to complain about improperly named field: > Warning: var-naming: struct field ContainerId should be ContainerID (revive) Make it happy. Brought to use by gopls rename. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-01-26 18:59:47 -08:00
Dave Chen	1362291a6d	Avoid non-op when the list of `Hooks` is empty There is no need to run hooks when `Config.Hooks` is just an empty map, (dlv) p p.config.Config.Hooks github.com/opencontainers/runc/libcontainer/configs.Hooks [] Signed-off-by: Dave Chen <dave.chen@arm.com>	2021-11-19 22:37:27 +08:00
Kir Kolyshkin	7563a8f06d	libct: wrap more unix errors When I tried to start a rootless container under a different/wrong user, I got: $ ../runc/runc --systemd-cgroup --root /tmp/runc.$$ run 445 ERRO[0000] runc run failed: operation not permitted This is obviously not good enough. With this commit, the error is: ERRO[0000] runc run failed: fchown fd 9: operation not permitted Alas, there are still some code that returns unwrapped errnos from various unix calls. This is a followup to commit `d8ba4128b2` which wrapped many, but not all, bare unix errors. Do wrap some more, using either os.PathError or os.SyscallError. While at it, - use os.SyscallError instead of os.NewSyscallError; - use errors.Is(err, os.ErrXxx) instead of os.IsXxx(err). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-11-12 00:33:59 -08:00
Kir Kolyshkin	0202c398ff	runc exec: implement --cgroup In some setups, multiple cgroups are used inside a container, and sometime there is a need to execute a process in a particular sub-cgroup (in case of cgroup v1, for a particular controller). This is what this commit implements. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-09-27 10:25:42 -07:00
Kir Kolyshkin	39914db679	runc exec: don't skip non-existing cgroups The function used here, cgroups.EnterPid, silently skips non-existing paths, and it does not look like a good idea to do so for an existing container with already configured cgroups. Switch to cgroups.WriteCgroupProc which does not do that, so in case a cgroup does not exist, we'll get an error. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-09-23 19:26:57 -07:00
Alban Crequy	2b025c0173	Implement Seccomp Notify This commit implements support for the SCMP_ACT_NOTIFY action. It requires libseccomp-2.5.0 to work but runc still works with older libseccomp if the seccomp policy does not use the SCMP_ACT_NOTIFY action. A new synchronization step between runc[INIT] and runc run is introduced to pass the seccomp fd. runc run fetches the seccomp fd with pidfd_get from the runc[INIT] process and sends it to the seccomp agent using SCM_RIGHTS. As suggested by @kolyshkin, we also make writeSync() a wrapper of writeSyncWithFd() and wrap the error there. To avoid pointless errors, we made some existing code paths just return the error instead of re-wrapping it. If we don't do it, error will look like: writing syncT <act>: writing syncT: <err> By adjusting the code path, now they just look like this writing syncT <act>: <err> Signed-off-by: Alban Crequy <alban@kinvolk.io> Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io> Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>	2021-09-07 13:04:24 +02:00
Kir Kolyshkin	9ff64c3d97	*: rm redundant linux build tag For files that end with _linux.go or _linux_test.go, there is no need to specify linux build tag, as it is assumed from the file name. In addition, rename libcontainer/notify_linux_v2.go -> libcontainer/notify_v2_linux.go for the file name to make sense. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-08-30 20:15:00 -07:00
Kir Kolyshkin	5110bd2fc0	nsenter: remove cgroupns sync mechanism As pointed out in TODO item added by commit `64bb59f59`, it is not necessary to have a special sync mechanism for cgroupns, as the parent adds runc init to cgroup way earlier (before sending nl bootstrap data. This sync was added by commit `df3fa115f9`, which was also added a second cgroup manager.Apply() call, later removed in commit `d1ba8e39f8`. It seems the original author had the idea to wait for that second Apply(). Fixes: `df3fa115f9` Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-07-27 12:17:47 -07:00
Kir Kolyshkin	e918d02139	libcontainer: rm own error system This removes libcontainer's own error wrapping system, consisting of a few types and functions, aimed at typization, wrapping and unwrapping of errors, as well as saving error stack traces. Since Go 1.13 now provides its own error wrapping mechanism and a few related functions, it makes sense to switch to it. While doing that, improve some error messages so that they start with "error", "unable to", or "can't". A few things that are worth mentioning: 1. We lose stack traces (which were never shown anyway). 2. Users of libcontainer that relied on particular errors (like ContainerNotExists) need to switch to using errors.Is with the new errors defined in error.go. 3. encoding/json is unable to unmarshal the built-in error type, so we have to introduce initError and wrap the errors into it (basically passing the error as a string). This is the same as it was before, just a tad simpler (actually the initError is a type that got removed in commit afa844311; also suddenly ierr variable name makes sense now). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-06-24 10:21:04 -07:00
Sebastiaan van Stijn	b45fbd43b8	errcheck: libcontainer Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2021-05-20 14:19:26 +02:00
Kir Kolyshkin	3f65946756	libct/cg: make Set accept configs.Resources A cgroup manager's Set method sets cgroup resources, but historically it was accepting configs.Cgroups. Refactor it to accept resources only. This is an improvement from the API point of view, as the method can not change cgroup configuration (such as path to the cgroup etc), it can only set (modify) its resources/limits. This also lays the foundation for complicated resource updates, as now Set has two sets of resources -- the one that was previously specified during cgroup manager creation (or the previous Set), and the one passed in the argument, so it could deduce the difference between these. This is a long term goal though. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-04-29 15:24:19 -07:00
Kir Kolyshkin	4ecff8d9d8	start: don't kill runc init too early The stars can be aligned in a way that results in runc to leave a stale bind mount in container's state directory, which manifests itself later, while trying to remove the container, in an error like this: > remove /run/runc/test2: unlinkat /run/runc/test2/runc.W24K2t: device or resource busy The stale mount happens because runc start/run/exec kills runc init while it is inside ensure_cloned_binary(). One such scenario is when a unified cgroup resource is specified for cgroup v1, a cgroup manager's Apply returns an error (as of commit `b006f4a180`), and when (*initProcess).start() kills runc init just after it was started. One solution is NOT to kill runc init too early. To achieve that, amend the libcontainer/nsenter code to send a \0 byte to signal that it is past the initial setup, and make start() (for both run/start and exec) wait for this byte before proceeding with kill on an error path. While at it, improve some error messages. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-03-31 14:36:52 -07:00
Kir Kolyshkin	201d60c51d	runc run/start/exec: fix init log forwarding race Sometimes debug.bats test cases are failing like this: > not ok 27 global --debug to --log --log-format 'json' > # (in test file tests/integration/debug.bats, line 77) > # `[[ "${output}" == "child process in init()" ]]' failed It happens more when writing to disk. This issue is caused by the fact that runc spawns log forwarding goroutine (ForwardLogs) but does not wait for it to finish, resulting in missing debug lines from nsexec. ForwardLogs itself, though, never finishes, because it reads from a reading side of a pipe which writing side is not closed. This is especially true in case of runc create, which spawns runc init and exits; meanwhile runc init waits on exec fifo for arbitrarily long time before doing execve. So, to fix the failure described above, we need to: 1. Make runc create/run/exec wait for ForwardLogs to finish; 2. Make runc init close its log pipe file descriptor (i.e. the one which value is passed in _LIBCONTAINER_LOGPIPE environment variable). This is exactly what this commit does: 1. Amend ForwardLogs to return a channel, and wait for it in start(). 2. In runc init, save the log fd and close it as late as possible. PS I have to admit I still do not understand why an explicit close of log pipe fd is required in e.g. (*linuxSetnsInit).Init, right before the execve which (thanks to CLOEXEC) closes the fd anyway. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-03-25 19:18:55 -07:00
Kir Kolyshkin	38b2dd391d	runc exec: report possible OOM kill An exec may fail due to memory shortage (cgroup memory limits being too tight), and an error message provided in this case is clueless: > $ sudo ../runc/runc exec xx56 top > ERRO[0000] exec failed: container_linux.go:367: starting container process caused: read init-p: connection reset by peer Same as the previous commit for run/start, check the OOM kill counter and report an OOM kill. The differences from run are 1. The container is already running and OOM kill counter might not be zero. This is why we have to read the counter before exec and after it failed. 2. An unrelated OOM kill event might occur in parallel with our exec (and I see no way to find out which process was killed, except to parse kernel logs which seems excessive and not very reliable). This is why we report _possible_ OOM kill. With this commit, the error message looks like: > ERRO[0000] exec failed: container_linux.go:367: starting container process caused: process_linux.go:105: possibly OOM-killed caused: read init-p: connection reset by peer Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-02-23 16:16:33 -08:00
Kir Kolyshkin	5d0ffbf9c8	runc start/run: report OOM In some cases, container init fails to start because it is killed by the kernel OOM killer. The errors returned by runc in such cases are semi-random and rather cryptic. Below are a few examples. On cgroup v1 + systemd cgroup driver: > process_linux.go:348: copying bootstrap data to pipe caused: write init-p: broken pipe > process_linux.go:352: getting the final child's pid from pipe caused: EOF On cgroup v2: > process_linux.go:495: container init caused: read init-p: connection reset by peer > process_linux.go:484: writing syncT 'resume' caused: write init-p: broken pipe This commits adds the OOM method to cgroup managers, which tells whether the container was OOM-killed. In case that has happened, the original error is discarded (unless --debug is set), and the new OOM error is reported instead: > ERRO[0000] container_linux.go:367: starting container process caused: container init was OOM-killed (memory limit too low?) Also, fix the rootless test cases that are failing because they expect an error in the first line, and we have an additional warning now: > unable to get oom kill count" error="no directory specified for memory.oom_control Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-02-23 16:15:33 -08:00
Chaitanya Bandi	0aa0fae393	Kill all processes in cgroup even if init process Wait fails If the cgroup's init process doesn't complete successfully, Wait returns a non-nil error. We should still kill all the process in the cgroup if process namespace is shared. Otherwise, it may result in process leak. Fixes #2632 Signed-off-by: Chaitanya Bandi <kbandi@cs.stonybrook.edu>	2020-10-08 01:26:34 +00:00

1 2 3

132 Commits