Commit Graph

132 Commits

Author SHA1 Message Date
Aleksa Sarai
121192ade6 libct: reset CPU affinity by default
In certain deployments, it's possible for runc to be spawned by a
process with a restrictive cpumask (such as from a systemd unit with
CPUAffinity=... configured) which will be inherited by runc and thus the
container process by default.

The cpuset cgroup used to reconfigure the cpumask automatically for
joining processes, but kcommit da019032819a ("sched: Enforce user
requested affinity") changed this behaviour in Linux 6.2.

The solution is to try to emulate the expected behaviour by resetting
our cpumask to correspond with the configured cpuset (in the case of
"runc exec", if the user did not configure an alternative one). Normally
we would have to parse /proc/stat and /sys/fs/cgroup, but luckily
sched_setaffinity(2) will transparently convert an all-set cpumask (even
if it has more entries than the number of CPUs on the system) to the
correct value for our usecase.

For some reason, in our CI it seems that rootless --systemd-cgroup
results in the cpuset (presumably temporarily?) being configured such
that sched_setaffinity(2) will allow the full set of CPUs. For this
particular case, all we care about is that it is different to the
original set, so include some special-casing (but we should probably
investigate this further...).

Reported-by: ningmingxiao <ning.mingxiao@zte.com.cn>
Reported-by: Martin Sivak <msivak@redhat.com>
Reported-by: Peter Hunt <pehunt@redhat.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-08-28 08:25:46 +10:00
lfbzhm
4d4cedd650 Merge pull request #4796 from astrawind/fix/seccomp-agent-conn-leak
libcontainer: close seccomp agent connection to prevent resource leaks
2025-07-04 00:08:21 +08:00
Pavel Liubimov
aa0e7989c4 libcontainer: close seccomp agent connection to prevent resource leaks
Add missing defer conn.Close().

Signed-off-by: Pavel Liubimov <prlyubimov@gmail.com>
2025-07-01 12:31:55 +03:00
HirazawaUi
1b39997e73 Preventing containers from being unable to be deleted
Signed-off-by: HirazawaUi <695097494plus@gmail.com>
2025-06-19 20:17:50 +08:00
Antonio Ojea
8d180e9658 Add support for Linux Network Devices
Implement support for passing Linux Network Devices to the container
network namespace.

The network device is passed during the creation of the container,
before the process is started.

It implements the logic defined in the OCI runtime specification.

Signed-off-by: Antonio Ojea <aojea@google.com>
2025-06-18 15:52:30 +01:00
Kir Kolyshkin
17570625c0 Use for range over integers
This appears in Go 1.22 (see https://tip.golang.org/ref/spec#For_range).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-03-31 17:15:06 -07:00
lfbzhm
346c80d714 libct: replace unix.Kill with os.Process.Signal
Because we should switch to unix.PidFDSendSignal in new kernels, it has
been supported in go runtime. We don't need to add fall back to
unix.Kill code here.

Signed-off-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-03-07 14:07:47 -08:00
Kir Kolyshkin
10ca66bff5 runc exec: implement CPU affinity
As per
- https://github.com/opencontainers/runtime-spec/pull/1253
- https://github.com/opencontainers/runtime-spec/pull/1261

CPU affinity can be set in two ways:
1. When creating/starting a container, in config.json's
   Process.ExecCPUAffinity, which is when applied to all execs.
2. When running an exec, in process.json's CPUAffinity, which
   applied to a given exec and overrides the value from (1).

Add some basic tests.

Note that older kernels (RHEL8, Ubuntu 20.04) change CPU affinity of a
process to that of a container's cgroup, as soon as it is moved to that
cgroup, while newer kernels (Ubuntu 24.04, Fedora 41) don't do that.

Because of the above,
 - it's impossible to really test initial CPU affinity without adding
   debug logging to libcontainer/nsenter;
 - for older kernels, there can be a brief moment when exec's affinity
   is different than either initial or final affinity being set;
 - exec's final CPU affinity, if not specified, can be different
   depending on the kernel, therefore we don't test it.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-03-02 19:17:41 -08:00
Kir Kolyshkin
a75076b4a4 Switch to opencontainers/cgroups
This removes libcontainer/cgroups packages and starts
using those from github.com/opencontainers/cgroups repo.

Mostly generated by:

  git rm -f libcontainer/cgroups

  find . -type f -name "*.go" -exec sed -i \
    's|github.com/opencontainers/runc/libcontainer/cgroups|github.com/opencontainers/cgroups|g' \
    {} +

  go get github.com/opencontainers/cgroups@v0.0.1
  make vendor
  gofumpt -w .

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-28 15:20:33 -08:00
lfbzhm
d48d9cfefc Merge pull request #4459 from kolyshkin/prio-nits
Fixups to scheduler/priority settings
2024-12-25 23:41:27 +08:00
Kir Kolyshkin
5d3942eec3 libct: unify IOPriority setting
For some reason, io priority is set in different places between runc
start/run and runc exec:

 - for runc start/run, it is done in the middle of (*linuxStandardInit).Init,
   close to the place where we exec runc init.
 - for runc exec, it is done much earlier, in (*setnsProcess) start().

Let's move setIOPriority call for runc exec to (*linuxSetnsInit).Init,
so it is in the same logical place as for runc start/run.

Also, move the function itself to init_linux.go as it's part of init.

Should not have any visible effect, except part of runc init is run with
a different I/O priority.

While at it, rename setIOPriority to setupIOPriority, and make it accept
the whole *configs.Config, for uniformity with other similar functions.

Fixes: bfbd0305 ("Add I/O priority")
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-12-22 18:15:31 -08:00
Kir Kolyshkin
2dc3ea4b87 libct: simplify setIOPriority/setupScheduler calls
Move the nil check inside, simplifying the callers.

Fixes: bfbd0305 ("Add I/O priority")
Fixes: 770728e1 ("Support `process.scheduler`")
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-12-22 18:06:20 -08:00
Kir Kolyshkin
93091e6ac2 libct: don't pass SpecState to init unless needed
SpecState field of initConfig is only needed to run hooks that are
executed inside a container -- namely CreateContainer and
StartContainer.

If these hooks are not configured, there is no need to fill, marshal and
unmarshal SpecState.

While at it, inline updateSpecState as it is trivial and only has one user.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-12-22 17:52:15 -08:00
Kir Kolyshkin
8afeb58398 libct: add/use configs.HasHook
This allows to omit a call to c.currentOCIState (which can be somewhat
costly when there are many annotations) when the hooks of a given kind
won't be run.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-12-22 17:47:09 -08:00
lfbzhm
171c414904 refactor init and setns process
Introduce a common parent struct `containerProcess`,
let both `initProcess` and `setnsProcess` are inherited
from it.

Signed-off-by: lfbzhm <lifubang@acmcoder.com>
2024-12-21 19:16:01 -08:00
Kir Kolyshkin
b5bdf592f2 libct: rm initWaiter
This initWaiter logic was introduced by commit 4ecff8d9, but since the logic of
/proc/self/exe was moved out of runc init in commit 0e9a335, this
seems unnecessary to have initWaiter.

Remove it.

This essentially reverts commit 4ecff8d9.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-10-17 08:05:42 -07:00
Kir Kolyshkin
30f8f51eab runc create/run: warn on rootless + shared pidns + no cgroup
Shared pid namespace means `runc kill` (or `runc delete -f`) have to
kill all container processes, not just init. To do so, it needs a cgroup
to read the PIDs from.

If there is no cgroup, processes will be leaked, and so such
configuration is bad and should not be allowed. To keep backward
compatibility, though, let's merely warn about this for now.

Alas, the only way to know if cgroup access is available is by returning
an error from Manager.Apply. Amend fs cgroup managers to do so (systemd
doesn't need it, since v1 can't work with rootless, and cgroup v2 does
not have a special rootless case).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-09-17 22:49:37 -07:00
Sebastiaan van Stijn
30b530ca94 libct/userns: split userns detection from internal userns code
Commit 4316df8b53 isolated RunningInUserNS
to a separate package to make it easier to consume without bringing in
additional dependencies, and with the potential to move it separate in
a similar fashion as libcontainer/user was moved to a separate module
in commit ca32014adb. While RunningInUserNS
is fairly trivial to implement, it (or variants of this utility) is used
in many codebases, and moving to a separate module could consolidate
those implementations, as well as making it easier to consume without
large dependency trees (when being a package as part of a larger code
base).

Commit 1912d5988b and follow-ups introduced
cgo code into the userns package, and code introduced in those commits
are not intended for external use, therefore complicating the potential
of moving the userns package separate.

This commit moves the new code to a separate package; some of this code
was included in v1.1.11 and up, but I could not find external consumers
of `GetUserNamespaceMappings` and `IsSameMapping`. The `Mapping` and
`Handles` types (added in ba0b5e2698) only
exist in main and in non-stable releases (v1.2.0-rc.x), so don't need
an alias / deprecation.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2024-06-30 20:06:30 +02:00
Kir Kolyshkin
1c505fffdc Revert "Set temporary single CPU affinity..."
There's too much logic here figuring out which CPUs to use. Runc is a
low level tool and is not supposed to be that "smart". What's worse,
this logic is executed on every exec, making it slower. Some of the
logic in (*setnsProcess).start is executed even if no annotation is set,
thus making ALL execs slow.

Also, this should be a property of a process, rather than annotation.

The plan is to rework this.

This reverts commit afc23e3397.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-06-10 06:31:03 +08:00
lifubang
a853a82677 runc exec: setupRlimits after syscall.rlimit.init() completed
Issue: https://github.com/opencontainers/runc/issues/4195
Since https://go-review.googlesource.com/c/go/+/476097, there is
a get/set race between runc exec and syscall.rlimit.init, so we
need to call setupRlimits after syscall.rlimit.init() completed.

Signed-off-by: lifubang <lifubang@acmcoder.com>
2024-05-08 10:40:07 +00:00
Cédric Clerget
afc23e3397 Set temporary single CPU affinity before cgroup cpuset transition.
This handles a corner case when joining a container having all
the processes running exclusively on isolated CPU cores to force
the kernel to schedule runc process on the first CPU core within the
cgroups cpuset.

The introduction of the kernel commit
46a87b3851f0d6eb05e6d83d5c5a30df0eca8f76 has affected this deterministic
scheduling behavior by distributing tasks across CPU cores within the
cgroups cpuset. Some intensive real-time application are relying on this
deterministic behavior and use the first CPU core to run a slow thread
while other CPU cores are fully used by real-time threads with SCHED_FIFO
policy. Such applications prevents runc process from joining a container
when the runc process is randomly scheduled on a CPU core owned by a
real-time thread.

Introduces isolated CPU affinity transition OCI runtime annotation
org.opencontainers.runc.exec.isolated-cpu-affinity-transition to restore
the behavior during runc exec.

Fix issue with kernel >= 6.2 not resetting CPU affinity for container processes.

Signed-off-by: Cédric Clerget <cedric.clerget@gmail.com>
2024-04-16 08:59:49 +02:00
utam0k
bfbd0305ba Add I/O priority
Signed-off-by: utam0k <k0ma@utam0k.jp>
2024-03-30 22:31:54 +09:00
Aleksa Sarai
ba0b5e2698 libcontainer: remove all mount logic from nsexec
With open_tree(OPEN_TREE_CLONE), it is possible to implement both the
id-mapped mounts and bind-mount source file descriptor logic entirely in
Go without requiring any complicated handling from nsexec.

However, implementing it the naive way (do the OPEN_TREE_CLONE in the
host namespace before the rootfs is set up -- which is what the existing
implementation did) exposes issues in how mount ordering (in particular
when handling mount sources from inside the container rootfs, but also
in relation to mount propagation) was handled for idmapped mounts and
bind-mount sources. In order to solve this problem completely, it is
necessary to spawn a thread which joins the container mount namespace
and provides mountfds when requested by the rootfs setup code (ensuring
that the mount order and mount propagation of the source of the
bind-mount are handled correctly). While the need to join the mount
namespace leads to other complicated (such as with the usage of
/proc/self -- fixed in a later patch) the resulting code is still
reasonable and is the only real way to solve the issue.

This allows us to reduce the amount of C code we have in nsexec, as well
as simplifying a whole host of places that were made more complicated
with the addition of id-mapped mounts and the bind sourcefd logic.
Because we join the container namespace, we can continue to use regular
O_PATH file descriptors for non-id-mapped bind-mount sources (which
means we don't have to raise the kernel requirement for that case).

In addition, we can easily add support for id-mappings that don't match
the container's user namespace. The approach taken here is to use Go's
officially supported mechanism for spawning a process in a user
namespace, but (ab)use PTRACE_TRACEME to avoid actually having to exec a
different process. The most efficient way to implement this would be to
do clone() in cgo directly to run a function that just does
kill(getpid(), SIGSTOP) -- we can always switch to that if it turns out
this approach is too slow. It should be noted that the included
micro-benchmark seems to indicate this is Fast Enough(TM):

  goos: linux
  goarch: amd64
  pkg: github.com/opencontainers/runc/libcontainer/userns
  cpu: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
  BenchmarkSpawnProc
  BenchmarkSpawnProc-8        1670            770065 ns/op

Fixes: fda12ab101 ("Support idmap mounts on volumes")
Fixes: 9c444070ec ("Open bind mount sources from the host userns")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-12-14 11:36:40 +11:00
Aleksa Sarai
8da42aaec2 sync: split init config (stream) and synchronisation (seqpacket) pipes
We have different requirements for the initial configuration and
initWaiter pipe (just send netlink and JSON blobs with no complicated
handling needed for message coalescing) and the packet-based
synchronisation pipe.

Tests with switching everything to SOCK_SEQPACKET lead to endless issues
with runc hanging on start-up because random things would try to do
short reads (which SOCK_SEQPACKET will not allow and the Go stdlib
explicitly treats as a streaming source), so splitting it was the only
reasonable solution. Even doing somewhat dodgy tricks such as adding a
Read() wrapper which actually calls ReadPacket() and makes it seem like
a stream source doesn't work -- and is a bit too magical.

One upside is that doing it this way makes the difference between the
modes clearer -- INITPIPE is still used for initWaiter syncrhonisation
but aside from that all other synchronisation is done by SYNCPIPE.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-09-24 20:31:14 +08:00
Aleksa Sarai
ccc76713a7 sync: rename procResume -> procHooksDone
The old name was quite confusing, and with the addition of the
procMountPlease sync message there are now multiple sync messages that
are related to "resuming" runc-init.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-09-24 20:02:11 +08:00
Kir Kolyshkin
41778ddc2c Fix for host mount ns containers
If the container does not have own mount namespace configured (i.e. it
shares the mount namespace with the host), its "prestart" (obsoleted)
and "createRuntime" hooks are called twice, and its cgroups and Intel
RDT settings are also applied twice.

The code being removed was originally added by commit 2f2764984 ("Move
pre-start hooks after container mounts", Feb 17 2016). At that time,
the syncParentHooks() was called from setupRootfs(), which was only
used when the container config has mount namespace (NEWNS) enabled.

Later, commit 244c9fc426 ("*: console rewrite", Jun 4 2016) spli
the relevant part of setupRootfs() into prepareRootfs(). It was still
called conditionally (only if mount namespace was enabled).

Finally, commit 91ca331474 ("chroot when no mount namespaces is
provided", Jan 25 2018) removed the above condition, meaning
prepareRootfs(), and thus syncParentHooks(), is now called for any
container.

Meaning, the special case for when mount namespace is not enabled is no
longer needed.

Remove it.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-08-28 12:45:39 -07:00
Kir Kolyshkin
6a4870e4ac libct: better errors for hooks
When a hook has failed, the error message looks like this:

> error running hook: error running hook #1: exit status 1, stdout: ...

The two problems here are:
1. it is impossible to know what kind of hook it was;
2. "error running hook" stuttering;

Change that to

> error running createContainer hook #1: exit status 1, stdout: ...

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-08-24 19:44:05 -07:00
Aleksa Sarai
f81ef1493d libcontainer: sync: cleanup synchronisation code
This includes quite a few cleanups and improvements to the way we do
synchronisation. The core behaviour is unchanged, but switching to
embedding json.RawMessage into the synchronisation structure will allow
us to do more complicated synchronisation operations in future patches.

The file descriptor passing through the synchronisation system feature
will be used as part of the idmapped-mount and bind-mount-source
features when switching that code to use the new mount API outside of
nsexec.c.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-08-15 19:54:24 -07:00
Kir Kolyshkin
c6e7b1a8ec libct: initProcess.start: fix sync logic
The code in this function became quite complicated and not entirely
correct over time. As a result, if an error is returned from parseSync,
it might end up stuck waiting for the child to finish.

1. Let's not wait() for the child twice. We already do it in the
defer statement (call p.terminate()) when we are returning an error.

2. Remove sentResume and sentRun since we do not want to check if
these were sent or not. Instead, introduce and check seenProcReady, as
procReady is always expected from runc init.

3. Eliminate the possibility to wrap nil as an error.

4. Make sure we always call shutdown on the sync socket, and do not let
   shutdown error shadow the ierr.

This fixes the issue of stuck `runc runc` with the optimization patch
(sending procSeccompDone earlier) applied.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-08-15 19:54:24 -07:00
Kir Kolyshkin
9583b3d1c2 libct: move killing logic to container.Signal
By default, the container has its own PID namespace, and killing (with
SIGKILL) its init process from the parent PID namespace also kills all
the other processes.

Obviously, it does not work that way when the container is sharing its
PID namespace with the host or another container, since init is no
longer special (it's not PID 1). In this case, killing container's init
will result in a bunch of other processes left running (and thus the
inability to remove the cgroup).

The solution to the above problem is killing all the container
processes, not just init.

The problem with the current implementation is, the killing logic is
implemented in libcontainer's initProcess.wait, and thus only available
to libcontainer users, but not the runc kill command (which uses
nonChildProcess.kill and does not use wait at all). So, some workarounds
exist:
 - func destroy(c *Container) calls signalAllProcesses;
 - runc kill implements -a flag.

This code became very tangled over time. Let's simplify things by moving
the killing all processes from initProcess.wait to container.Signal,
and documents the new behavior.

In essence, this also makes `runc kill` to automatically kill all container
processes when the container does not have its own PID namespace.
Document that as well.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-06-08 09:29:25 -07:00
Vipul Newaskar
9fc707e703 Fixed init state error variable
Init State Error message was using the err variable instead of uerr, which has been fixed now.
The error message should not show "nil" now.

Signed-off-by: Vipul Newaskar <vipulnewaskar7@gmail.com>
2022-11-15 09:41:16 +05:30
Kir Kolyshkin
102b8abd26 libct: rm BaseContainer and Container interfaces
The only implementation of these is linuxContainer. It does not make
sense to have an interface with a single implementation, and we do not
foresee other types of containers being added to runc.

Remove BaseContainer and Container interfaces, moving their methods
documentation to linuxContainer.

Rename linuxContainer to Container.

Adopt users from using interface to using struct.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-03-23 11:04:12 -07:00
Kir Kolyshkin
89733cd055 Format sources using gofumpt 0.2.1
... which adds a wee more whitespace fixes.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-03-07 10:42:01 -08:00
Kir Kolyshkin
dbd990d555 libct: rm intelrtd.Manager interface, NewIntelRdtManager
Remove intelrtd.Manager interface, since we only have a single
implementation, and do not expect another one.

Rename intelRdtManager to Manager, and modify its users accordingly.

Remove NewIntelRdtManager from factory.

Remove IntelRdtfs. Instead, make intelrdt.NewManager return nil if the
feature is not available.

Remove TestFactoryNewIntelRdt as it is now identical to TestFactoryNew.

Add internal function newManager to be used for tests (to make sure
some testing is done even when the feature is not available in
kernel/hardware).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-02-03 17:33:03 -08:00
Kir Kolyshkin
bb6a838876 libct: initContainer: rename Id -> ID
Since the next commit is going to touch this structure, our CI
(lint-extra) is about to complain about improperly named field:

>  Warning: var-naming: struct field ContainerId should be ContainerID (revive)

Make it happy.

Brought to use by gopls rename.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-01-26 18:59:47 -08:00
Dave Chen
1362291a6d Avoid non-op when the list of Hooks is empty
There is no need to run hooks when `Config.Hooks` is just an empty map,

(dlv) p p.config.Config.Hooks
github.com/opencontainers/runc/libcontainer/configs.Hooks []

Signed-off-by: Dave Chen <dave.chen@arm.com>
2021-11-19 22:37:27 +08:00
Kir Kolyshkin
7563a8f06d libct: wrap more unix errors
When I tried to start a rootless container under a different/wrong user,
I got:

	$ ../runc/runc --systemd-cgroup --root /tmp/runc.$$ run 445
	ERRO[0000] runc run failed: operation not permitted

This is obviously not good enough. With this commit, the error is:

	ERRO[0000] runc run failed: fchown fd 9: operation not permitted

Alas, there are still some code that returns unwrapped errnos from
various unix calls.

This is a followup to commit d8ba4128b2 which wrapped many, but not
all, bare unix errors. Do wrap some more, using either os.PathError or
os.SyscallError.

While at it,
 - use os.SyscallError instead of os.NewSyscallError;
 - use errors.Is(err, os.ErrXxx) instead of os.IsXxx(err).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-11-12 00:33:59 -08:00
Kir Kolyshkin
0202c398ff runc exec: implement --cgroup
In some setups, multiple cgroups are used inside a container,
and sometime there is a need to execute a process in a particular
sub-cgroup (in case of cgroup v1, for a particular controller).
This is what this commit implements.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-09-27 10:25:42 -07:00
Kir Kolyshkin
39914db679 runc exec: don't skip non-existing cgroups
The function used here, cgroups.EnterPid, silently skips non-existing
paths, and it does not look like a good idea to do so for an existing
container with already configured cgroups.

Switch to cgroups.WriteCgroupProc which does not do that, so in case
a cgroup does not exist, we'll get an error.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-09-23 19:26:57 -07:00
Alban Crequy
2b025c0173 Implement Seccomp Notify
This commit implements support for the SCMP_ACT_NOTIFY action. It
requires libseccomp-2.5.0 to work but runc still works with older
libseccomp if the seccomp policy does not use the SCMP_ACT_NOTIFY
action.

A new synchronization step between runc[INIT] and runc run is introduced
to pass the seccomp fd. runc run fetches the seccomp fd with pidfd_get
from the runc[INIT] process and sends it to the seccomp agent using
SCM_RIGHTS.

As suggested by @kolyshkin, we also make writeSync() a wrapper of
writeSyncWithFd() and wrap the error there. To avoid pointless errors,
we made some existing code paths just return the error instead of
re-wrapping it. If we don't do it, error will look like:

	writing syncT <act>: writing syncT: <err>

By adjusting the code path, now they just look like this
	writing syncT <act>: <err>

Signed-off-by: Alban Crequy <alban@kinvolk.io>
Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>
2021-09-07 13:04:24 +02:00
Kir Kolyshkin
9ff64c3d97 *: rm redundant linux build tag
For files that end with _linux.go or _linux_test.go, there is no need to
specify linux build tag, as it is assumed from the file name.

In addition, rename libcontainer/notify_linux_v2.go -> libcontainer/notify_v2_linux.go
for the file name to make sense.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-08-30 20:15:00 -07:00
Kir Kolyshkin
5110bd2fc0 nsenter: remove cgroupns sync mechanism
As pointed out in TODO item added by commit 64bb59f59, it is not
necessary to have a special sync mechanism for cgroupns, as the parent
adds runc init to cgroup way earlier (before sending nl bootstrap data.

This sync was added by commit df3fa115f9, which was also added a
second cgroup manager.Apply() call, later removed in commit
d1ba8e39f8. It seems the original author had the idea to wait for
that second Apply().

Fixes: df3fa115f9
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-07-27 12:17:47 -07:00
Kir Kolyshkin
e918d02139 libcontainer: rm own error system
This removes libcontainer's own error wrapping system, consisting of a
few types and functions, aimed at typization, wrapping and unwrapping
of errors, as well as saving error stack traces.

Since Go 1.13 now provides its own error wrapping mechanism and a few
related functions, it makes sense to switch to it.

While doing that, improve some error messages so that they start
with "error", "unable to", or "can't".

A few things that are worth mentioning:

1. We lose stack traces (which were never shown anyway).

2. Users of libcontainer that relied on particular errors (like
   ContainerNotExists) need to switch to using errors.Is with
   the new errors defined in error.go.

3. encoding/json is unable to unmarshal the built-in error type,
   so we have to introduce initError and wrap the errors into it
   (basically passing the error as a string). This is the same
   as it was before, just a tad simpler (actually the initError
   is a type that got removed in commit afa844311; also suddenly
   ierr variable name makes sense now).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-24 10:21:04 -07:00
Sebastiaan van Stijn
b45fbd43b8 errcheck: libcontainer
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-05-20 14:19:26 +02:00
Kir Kolyshkin
3f65946756 libct/cg: make Set accept configs.Resources
A cgroup manager's Set method sets cgroup resources, but historically
it was accepting configs.Cgroups.

Refactor it to accept resources only. This is an improvement from the
API point of view, as the method can not change cgroup configuration
(such as path to the cgroup etc), it can only set (modify) its
resources/limits.

This also lays the foundation for complicated resource updates, as now
Set has two sets of resources -- the one that was previously specified
during cgroup manager creation (or the previous Set), and the one passed
in the argument, so it could deduce the difference between these. This
is a long term goal though.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-04-29 15:24:19 -07:00
Kir Kolyshkin
4ecff8d9d8 start: don't kill runc init too early
The stars can be aligned in a way that results in runc to leave a stale
bind mount in container's state directory, which manifests itself later,
while trying to remove the container, in an error like this:

> remove /run/runc/test2: unlinkat /run/runc/test2/runc.W24K2t: device or resource busy

The stale mount happens because runc start/run/exec kills runc init
while it is inside ensure_cloned_binary(). One such scenario is when
a unified cgroup resource is specified for cgroup v1, a cgroup manager's
Apply returns an error (as of commit b006f4a180), and when
(*initProcess).start() kills runc init just after it was started.

One solution is NOT to kill runc init too early. To achieve that,
amend the libcontainer/nsenter code to send a \0 byte to signal
that it is past the initial setup, and make start() (for both
run/start and exec) wait for this byte before proceeding with
kill on an error path.

While at it, improve some error messages.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-03-31 14:36:52 -07:00
Kir Kolyshkin
201d60c51d runc run/start/exec: fix init log forwarding race
Sometimes debug.bats test cases are failing like this:

> not ok 27 global --debug to --log --log-format 'json'
> # (in test file tests/integration/debug.bats, line 77)
> #   `[[ "${output}" == *"child process in init()"* ]]' failed

It happens more when writing to disk.

This issue is caused by the fact that runc spawns log forwarding goroutine
(ForwardLogs) but does not wait for it to finish, resulting in missing
debug lines from nsexec.

ForwardLogs itself, though, never finishes, because it reads from a
reading side of a pipe which writing side is not closed. This is
especially true in case of runc create, which spawns runc init and
exits; meanwhile runc init waits on exec fifo for arbitrarily long
time before doing execve.

So, to fix the failure described above, we need to:

 1. Make runc create/run/exec wait for ForwardLogs to finish;

 2. Make runc init close its log pipe file descriptor (i.e.
    the one which value is passed in _LIBCONTAINER_LOGPIPE
    environment variable).

This is exactly what this commit does:

 1. Amend ForwardLogs to return a channel, and wait for it in start().

 2. In runc init, save the log fd and close it as late as possible.

PS I have to admit I still do not understand why an explicit close of
log pipe fd is required in e.g. (*linuxSetnsInit).Init, right before
the execve which (thanks to CLOEXEC) closes the fd anyway.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-03-25 19:18:55 -07:00
Kir Kolyshkin
38b2dd391d runc exec: report possible OOM kill
An exec may fail due to memory shortage (cgroup memory limits being too
tight), and an error message provided in this case is clueless:

> $ sudo ../runc/runc exec xx56 top
> ERRO[0000] exec failed: container_linux.go:367: starting container process caused: read init-p: connection reset by peer

Same as the previous commit for run/start, check the OOM kill counter
and report an OOM kill.

The differences from run are

1. The container is already running and OOM kill counter might not be
   zero.  This is why we have to read the counter before exec and after
   it failed.

2. An unrelated OOM kill event might occur in parallel with our exec
   (and I see no way to find out which process was killed, except to
   parse kernel logs which seems excessive and not very reliable).
   This is why we report _possible_ OOM kill.

With this commit, the error message looks like:

> ERRO[0000] exec failed: container_linux.go:367: starting container process caused: process_linux.go:105: possibly OOM-killed caused: read init-p: connection reset by peer

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-02-23 16:16:33 -08:00
Kir Kolyshkin
5d0ffbf9c8 runc start/run: report OOM
In some cases, container init fails to start because it is killed by
the kernel OOM killer. The errors returned by runc in such cases are
semi-random and rather cryptic. Below are a few examples.

On cgroup v1 + systemd cgroup driver:

> process_linux.go:348: copying bootstrap data to pipe caused: write init-p: broken pipe

> process_linux.go:352: getting the final child's pid from pipe caused: EOF

On cgroup v2:

> process_linux.go:495: container init caused: read init-p: connection reset by peer

> process_linux.go:484: writing syncT 'resume' caused: write init-p: broken pipe

This commits adds the OOM method to cgroup managers, which tells whether
the container was OOM-killed. In case that has happened, the original error
is discarded (unless --debug is set), and the new OOM error is reported
instead:

> ERRO[0000] container_linux.go:367: starting container process caused: container init was OOM-killed (memory limit too low?)

Also, fix the rootless test cases that are failing because they expect
an error in the first line, and we have an additional warning now:

> unable to get oom kill count" error="no directory specified for memory.oom_control

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-02-23 16:15:33 -08:00
Chaitanya Bandi
0aa0fae393 Kill all processes in cgroup even if init process Wait fails
If the cgroup's init process doesn't complete successfully, Wait returns a
non-nil error. We should still kill all the process in the cgroup if process
namespace is shared. Otherwise, it may result in process leak.

Fixes #2632

Signed-off-by: Chaitanya Bandi <kbandi@cs.stonybrook.edu>
2020-10-08 01:26:34 +00:00