We have quite a few external users of libcontainer/cgroups packages,
and they all have to depend on libcontainer/configs as well.
Let's move cgroup-related configuration to libcontainer/cgroups.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
A shared pid namespace means `runc kill` (or `runc delete -f`) has to
kill all container processes, not just init. To do so, it needs a cgroup
to read the PIDs from.
If there is no cgroup, processes will be leaked, so such a
configuration is bad and should not be allowed. To keep backward
compatibility, though, let's merely warn about this for now.
Alas, the only way to know if cgroup access is available is by returning
an error from Manager.Apply. Amend fs cgroup managers to do so (systemd
doesn't need it, since v1 can't work with rootless, and cgroup v2 does
not have a special rootless case).
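For context, killing all container processes boils down to reading the
pid list from the container's cgroup and signalling each pid; a rough
sketch (package and helper names are made up for the example, this is
not the actual runc code):

package cgkill

import (
    "fmt"
    "os"
    "strconv"
    "strings"
    "syscall"
)

// signalAll sends sig to every pid listed in the container's cgroup.
// cgroupDir is assumed to be the container's cgroup directory.
func signalAll(cgroupDir string, sig syscall.Signal) error {
    data, err := os.ReadFile(cgroupDir + "/cgroup.procs")
    if err != nil {
        return fmt.Errorf("reading cgroup.procs: %w", err)
    }
    for _, s := range strings.Fields(string(data)) {
        pid, err := strconv.Atoi(s)
        if err != nil {
            continue
        }
        // Best effort: a process may have exited already.
        _ = syscall.Kill(pid, sig)
    }
    return nil
}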
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
There's too much logic here figuring out which CPUs to use. Runc is a
low-level tool and is not supposed to be that "smart". What's worse,
this logic is executed on every exec, making it slower. Some of the
logic in (*setnsProcess).start is executed even if no annotation is set,
thus making ALL execs slow.
Also, this should be a property of a process rather than an annotation.
The plan is to rework this.
This reverts commit afc23e3397.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit b6967fa84c moved the functionality of managing cgroup devices
into a separate package, and decoupled libcontainer/cgroups from it.
Yet, some software (e.g. cadvisor) may need to use the libcontainer
package, which imports libcontainer/cgroups/devices, thus making it
impossible to use libcontainer without bringing in the cgroups/devices
dependency.
In fact, we only need to manage devices in the runc binary, so move the
import to main.go (sketched below).
The need to import libct/cg/dev in order to manage devices is already
documented in libcontainer/cgroups, but let's
- update that documentation;
- add a similar note to libcontainer/cgroups/systemd;
- add a note to libct README.
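With this change, the runc main.go side essentially needs just the
import (a sketch; the comment explains why a plain import is enough):

import (
    // Importing this package sets the device-management function
    // variables in libcontainer/cgroups (and .../systemd) from its
    // init(); without it, cgroup managers return ErrDevicesUnsupported
    // when device rules are present.
    _ "github.com/opencontainers/runc/libcontainer/cgroups/devices"
)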
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This handles a corner case when joining a container whose processes all
run exclusively on isolated CPU cores: the kernel is forced to schedule
the runc process on the first CPU core within the cgroup's cpuset.
Kernel commit 46a87b3851f0d6eb05e6d83d5c5a30df0eca8f76 affected this
deterministic scheduling behavior by distributing tasks across CPU
cores within the cgroup's cpuset. Some intensive real-time applications
rely on the deterministic behavior and use the first CPU core to run a
slow thread, while the other CPU cores are fully used by real-time
threads with the SCHED_FIFO policy. Such applications prevent the runc
process from joining a container when the runc process is randomly
scheduled on a CPU core owned by a real-time thread.
This introduces the isolated CPU affinity transition OCI runtime
annotation, org.opencontainers.runc.exec.isolated-cpu-affinity-transition,
to restore that behavior during runc exec.
This also fixes an issue with kernels >= 6.2 not resetting the CPU
affinity of container processes.
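For illustration, the annotation would be set in the container's OCI
spec roughly as follows (the helper name and the value are placeholders
for the sketch; see the runc documentation for accepted values):

package example

import (
    specs "github.com/opencontainers/runtime-spec/specs-go"
)

// enableAffinityTransition marks the container so that runc exec
// applies the isolated CPU affinity transition. The value here is a
// placeholder.
func enableAffinityTransition(spec *specs.Spec) {
    if spec.Annotations == nil {
        spec.Annotations = map[string]string{}
    }
    spec.Annotations["org.opencontainers.runc.exec.isolated-cpu-affinity-transition"] = "temporary"
}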
Signed-off-by: Cédric Clerget <cedric.clerget@gmail.com>
This commit separates the functionality of setting cgroup device rules
out of libct/cgroups into the libct/cgroups/devices package. That
package, if imported, sets the function variables in libct/cgroups and
libct/cgroups/systemd, so that a cgroup manager can use those to manage
devices. If those function variables are nil (i.e. libct/cgroups/devices
is not imported), a cgroup manager returns ErrDevicesUnsupported in
case any device rules are set in Resources.
It also consolidates the code from libct/cgroups/ebpf and
libct/cgroups/ebpf/devicefilter into libct/cgroups/devices.
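The function-variable pattern described above looks roughly like this
(names and types are simplified for the sketch, not the exact runc
declarations):

package cgroups

import "errors"

// ErrDevicesUnsupported is returned when device rules are requested but
// the devices package has not been imported.
var ErrDevicesUnsupported = errors.New("libcontainer/cgroups/devices is not imported")

// Resources is a stand-in for the real resources type carrying device rules.
type Resources struct {
    Devices []string
}

// DevicesSetV1 is nil by default; libct/cgroups/devices assigns the
// real implementation from an init() function when it is imported.
var DevicesSetV1 func(path string, r *Resources) error

func setDevices(path string, r *Resources) error {
    if len(r.Devices) == 0 {
        return nil // no device rules, no devices dependency needed
    }
    if DevicesSetV1 == nil {
        return ErrDevicesUnsupported
    }
    return DevicesSetV1(path, r)
}

The devices package then assigns the real implementations to these
variables from init(), so a plain import is enough to enable device
management.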
Moved some tests in libct/cg/sd that require device management to
libct/sd/devices.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Only some libcontainer packages can be built on non-linux platforms
(not that it makes sense, but at least go build succeeds). Let's call
these "good" packages.
For all other packages (i.e. the ones that fail to build with GOOS
other than linux), it does not make sense to have a linux build tag (as
they are already broken, and thus are not and cannot be used on
anything other than Linux).
Remove the linux build tag from all non-"good" packages.
This was mostly done by the following script, with just a few manual
fixes on top.
function list_good_pkgs() {
    for pkg in $(find . -type d -print); do
        GOOS=freebsd go build $pkg 2>/dev/null \
            && GOOS=solaris go build $pkg 2>/dev/null \
            && echo $pkg
    done | sed -e 's|^./||' | tr '\n' '|' | sed -e 's/|$//'
}
function remove_tag() {
    sed -i -e '\|^// +build linux$|d' $1
    go fmt $1
}
SKIP="^("$(list_good_pkgs)")"
for f in $(git ls-files . | grep .go$); do
    if echo $f | grep -qE "$SKIP"; then
        echo skip $f
        continue
    fi
    echo proc $f
    remove_tag $f
done
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
A cgroup manager's Set method sets cgroup resources, but historically
it accepted a whole configs.Cgroup.
Refactor it to accept resources only. This is an improvement from the
API point of view, as the method cannot change the cgroup configuration
(such as the path to the cgroup, etc.); it can only set (modify) its
resources/limits.
This also lays the foundation for complicated resource updates, as Set
now has two sets of resources -- the one previously specified during
cgroup manager creation (or the previous Set), and the one passed in
the argument -- so it can deduce the difference between them. This is a
long-term goal, though.
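In terms of the Manager interface, the change amounts to roughly this
(a sketch, not the verbatim declaration):

package cgroups

import "github.com/opencontainers/runc/libcontainer/configs"

type Manager interface {
    // Before: Set(container *configs.Cgroup) error
    // After: Set only modifies resources/limits and can no longer
    // change the cgroup configuration itself (path, name, etc.).
    Set(r *configs.Resources) error
}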
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In some cases, container init fails to start because it is killed by
the kernel OOM killer. The errors returned by runc in such cases are
semi-random and rather cryptic. Below are a few examples.
On cgroup v1 + systemd cgroup driver:
> process_linux.go:348: copying bootstrap data to pipe caused: write init-p: broken pipe
> process_linux.go:352: getting the final child's pid from pipe caused: EOF
On cgroup v2:
> process_linux.go:495: container init caused: read init-p: connection reset by peer
> process_linux.go:484: writing syncT 'resume' caused: write init-p: broken pipe
This commit adds the OOM method to cgroup managers, which tells whether
the container was OOM-killed. In case that has happened, the original error
is discarded (unless --debug is set), and the new OOM error is reported
instead:
> ERRO[0000] container_linux.go:367: starting container process caused: container init was OOM-killed (memory limit too low?)
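For illustration, on cgroup v2 such a check boils down to parsing the
oom_kill counter from memory.events; a rough sketch (not the actual
runc code):

package cgroups

import (
    "bufio"
    "os"
    "strconv"
    "strings"
)

// oomKillCount returns the value of the oom_kill field from the
// memory.events file in the given cgroup directory (cgroup v2).
func oomKillCount(cgroupDir string) (uint64, error) {
    f, err := os.Open(cgroupDir + "/memory.events")
    if err != nil {
        return 0, err
    }
    defer f.Close()
    sc := bufio.NewScanner(f)
    for sc.Scan() {
        fields := strings.Fields(sc.Text())
        if len(fields) == 2 && fields[0] == "oom_kill" {
            return strconv.ParseUint(fields[1], 10, 64)
        }
    }
    return 0, sc.Err()
}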
Also, fix the rootless test cases that fail because they expect an
error in the first line, while we now print an additional warning first:
> unable to get oom kill count" error="no directory specified for memory.oom_control
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In most projects, "utils" is a big mess, and this one is not an
exception. Try to clean it up a bit by moving cgroup v1-specific code
to a separate source file.
There are no code changes in this commit; code is merely moved from one
file to another.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
When using cgroups with the systemd driver, the cgroup path is
automatically removed by systemd once all processes have exited. So we
should check that the cgroup path exists whenever we access it (for
example in `kill`/`ps`), or else we will get an error.
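A minimal sketch of such a check (the helper name is illustrative):

package cgroups

import "os"

// pathExists reports whether the cgroup path still exists; with the
// systemd driver, systemd may already have removed it once all the
// container processes exited.
func pathExists(path string) bool {
    _, err := os.Stat(path)
    return err == nil
}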
Signed-off-by: lifubang <lifubang@acmcoder.com>
This is effectively a nicer implementation of the container.isPaused()
helper, but to be used within the cgroup code for handling some fun
issues we have to fix with the systemd cgroup driver.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This unties the Gordian Knot of using GetPaths in cgroupv2 code.
The problem is, the current code uses GetPaths for three kinds of things:
1. Get all the paths to cgroup v1 controllers to save its state (see
(*linuxContainer).currentState(), (*LinuxFactory).loadState()
methods).
2. Get all the paths to cgroup v1 controllers to have the setns process
enter the proper cgroups in `(*setnsProcess).start()`.
3. Get the path to a specific controller (for example,
`m.GetPaths()["devices"]`).
Now, for cgroup v2, instead of a set of per-controller paths, we have a
single unified path, and a dedicated function `GetUnifiedPath()` to get it.
This discrepancy between v1 and v2 cgroupManager API leads to the
following problems with the code:
- multiple if/else code blocks that have to treat v1 and v2 separately;
- backward-compatible GetPaths() methods in v2 controllers;
- repeated writing of the PID into the same cgroup for v2.
Overall, it's hard to write the right code with all this, and the code
that is written is kinda hard to follow.
The solution is to slightly change the API to do the 3 things outlined
above in the same manner for v1 and v2:
1. Use `GetPaths()` for state saving and setns process cgroups entering.
2. Introduce and use Path(subsys string) to obtain a path to a
subsystem. For v2, the argument is ignored and the unified path is
returned.
This commit converts all the controllers to the new API, and modifies
all the users to use it.
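For example, a v1-only lookup such as m.GetPaths()["devices"] becomes a
version-agnostic call (a sketch of the intended usage, not verbatim
runc code):

package example

// manager is a stand-in for the relevant part of the cgroup Manager API.
type manager interface {
    // GetPaths is used for state saving and for entering cgroups in
    // the setns process (per-controller paths on v1, one path on v2).
    GetPaths() map[string]string
    // Path returns the path to the given subsystem; on v2 the argument
    // is ignored and the unified path is returned.
    Path(subsys string) string
}

// devicesPath replaces the v1-only m.GetPaths()["devices"] lookup and
// works for both cgroup v1 and v2.
func devicesPath(m manager) string {
    return m.Path("devices")
}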
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
`configs.Cgroup` contains the configuration used to create cgroups. This
configuration must be saved to disk, since it's required to restore the
cgroup manager that was used to create the cgroups.
Add a method to get the cgroup configuration from a cgroup Manager, to
allow API users to save it to disk and restore a cgroup manager later.
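The added method might look roughly like this (a sketch; see the
Manager interface for the exact signature):

package cgroups

import "github.com/opencontainers/runc/libcontainer/configs"

type Manager interface {
    // GetCgroups returns the configs.Cgroup this manager was created
    // with, so API users can persist it and restore the manager later.
    GetCgroups() (*configs.Cgroup, error)
}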
Fixes #2176
Signed-off-by: Julio Montes <julio.montes@intel.com>
Implemented `runc ps` for cgroup v2, using the newly added method `m.GetUnifiedPath()`.
Unlike the v1 implementation that checks `m.GetPaths()["devices"]`, the v2 implementation does not require the device controller to be available.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>