Burstable CFS controller is introduced in Linux 5.14. This helps with
parallel workloads that might be bursty. They can get throttled even
when their average utilization is under quota. And they may be latency
sensitive at the same time so that throttling them is undesired.
This feature borrows time now against the future underrun, at the cost
of increased interference against the other system users, by introducing
cfs_burst_us into CFS bandwidth control to enact the cap on unused
bandwidth accumulation, which will then used additionally for burst.
The patch adds the support/control for CFS bandwidth burst.
runtime-spec: https://github.com/opencontainers/runtime-spec/pull/1120
Co-authored-by: Akihiro Suda <suda.kyoto@gmail.com>
Co-authored-by: Nadeshiko Manju <me@manjusaka.me>
Signed-off-by: Kailun Qin <kailun.qin@intel.com>
This is aimed at solving the problem of cgroup v2 memory controller
behavior which is not compatible with that of cgroup v1.
In cgroup v1, if the new memory limit being set is lower than the
current usage, setting the new limit fails.
In cgroup v2, same operation succeeds, and the container is OOM killed.
Introduce a new setting, memory.checkBeforeUpdate, and use it to mimic
cgroup v1 behavior.
Note that this is not 100% reliable because of TOCTOU, but this is the
best we can do.
Add some test cases.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Delegating cgroups to the container enables more complex workloads,
including systemd-based workloads. The OCI runtime-spec was
recently updated to explicitly admit such delegation, through
specification of cgroup ownership semantics:
https://github.com/opencontainers/runtime-spec/pull/1123
Pursuant to the updated OCI runtime-spec, change the ownership of
the container's cgroup directory and particular files therein, when
using cgroups v2 and when the cgroupfs is to be mounted read/write.
As a result of this change, systemd workloads can run in isolated
user namespaces on OpenShift when the sandbox's cgroupfs is mounted
read/write.
It might be possible to implement this feature in other cgroup
managers, but that work is deferred.
Signed-off-by: Fraser Tweedale <ftweedal@redhat.com>
1. Make Rootless and Systemd flags part of config.Cgroups.
2. Make all cgroup managers (not just fs2) return error (so it can do
more initialization -- added by the following commits).
3. Replace complicated cgroup manager instantiation in factory_linux
by a single (and simple) libcontainer/cgroups/manager.New() function.
4. getUnifiedPath is simplified to check that only a single path is
supplied (rather than checking that other paths, if supplied,
are the same).
[v2: can't -> cannot]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This is helpful to kubernetes in cases it knows for sure that the freeze
is not required (since it created the systemd unit with no device
restrictions).
As the code is trivial, no tests are required.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This was initially added by commits 41d9d26513 and 4a8f0b4db4,
apparently to implement docker run --cgroup container:ID, which was
never merged. Therefore, this code is not and was never used.
It needs to be removed mainly because having it makes it much harder to
understand how cgroup manager works (because with this in place we have
not one or two but three sets of cgroup paths to think about).
Note if the paths are known and there is a need to add a PID to existing
cgroup, cgroup manager is not needed at all -- something like
cgroups.WriteCgroupProc or cgroups.EnterPid is sufficient (and the
latter is what runc exec uses in (*setnsProcess).start).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The runc update CLI is not able to modify devices, so let's set SkipDevices
(so that a cgroup controller won't try to update devices cgroup).
This helps use cases when some other device management (NVIDIA GPUs)
applies its configuration on top of what runc does.
Make sure we do not save SkipDevices into state.json.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This is somewhat radical approach to deal with kernel memory.
Per-cgroup kernel memory limiting was always problematic. A few
examples:
- older kernels had bugs and were even oopsing sometimes (best example
is RHEL7 kernel);
- kernel is unable to reclaim the kernel memory so once the limit is
hit a cgroup is toasted;
- some kernel memory allocations don't allow failing.
In addition to that,
- users don't have a clue about how to set kernel memory limits
(as the concept is much more complicated than e.g. [user] memory);
- different kernels might have different kernel memory usage,
which is sort of unexpected;
- cgroup v2 do not have a [dedicated] kmem limit knob, and thus
runc silently ignores kernel memory limits for v2;
- kernel v5.4 made cgroup v1 kmem.limit obsoleted (see
https://github.com/torvalds/linux/commit/0158115f702b).
In view of all this, and as the runtime-spec lists memory.kernel
and memory.kernelTCP as OPTIONAL, let's ignore kernel memory
limits (for cgroup v1, same as we're already doing for v2).
This should result in less bugs and better user experience.
The only bad side effect from it might be that stat can show kernel
memory usage as 0 (since the accounting is not enabled).
[v2: add a warning in specconv that limits are ignored]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Move the Device-related types to libcontainer/devices, so that
the package can be used in isolation. Aliases have been created
in libcontainer/configs for backward compatibility.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The kubelet uses libct/cgroups code to set up cgroups. It creates a
parent cgroup (kubepods) to put the containers into.
The problem (for cgroupv2 that uses eBPF for device configuration) is
the hard requirement to have devices cgroup configured results in
leaking an eBPF program upon every kubelet restart. program. If kubelet
is restarted 64+ times, the cgroup can't be configured anymore.
Work around this by adding a SkipDevices flag to Resources.
A check was added so that if SkipDevices is set, such a "container"
can't be started (to make sure it is only used for non-containers).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This (and the converting function) is only used by one of the four
cgroup drivers. The other three do some checking and conversion in
place, so let the fs2 do the same.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Making them the same type is simply confusing, but also means that you
could accidentally use one in the wrong context. This eliminates that
problem. This also includes a whole bunch of cleanups for the types
within DeviceRule, so that they can be used more ergonomically.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
These lists have been in the codebase for a very long time, and have
been unused for a large portion of that time -- specconv doesn't
generate them and the only user of these flags has been tests (which
doesn't inspire much confidence).
In addition, we had an incorrect implementation of a white-list policy.
This wasn't exploitable because all of our users explicitly specify
"deny all" as the first rule, but it was a pretty glaring issue that
came from the "feature" that users can select whether they prefer a
white- or black- list. Fix this by always writing a deny-all rule (which
is what our users were doing anyway, to work around this bug).
This is one of many changes needed to clean up the devices cgroup code.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
In case systemd is used to set cgroups for the container,
it creates a scope unit dedicated to it (usually named
`runc-$ID.scope`).
This patch adds an ability to set arbitrary systemd properties
for the systemd unit via runtime spec annotations.
Initially this was developed as an ability to specify the
`TimeoutStopUSec` property, but later generalized to work with
arbitrary ones.
Example usage: add the following to runtime spec (config.json):
```
"annotations": {
"org.systemd.property.TimeoutStopUSec": "uint64 123456789",
"org.systemd.property.CollectMode":"'inactive-or-failed'"
},
```
and start the container (e.g. `runc --systemd-cgroup run $ID`).
The above will set the following systemd parameters:
* `TimeoutStopSec` to 2 minutes and 3 seconds,
* `CollectMode` to "inactive-or-failed".
The values are in the gvariant format (see [1]). To figure out
which type systemd expects for a particular parameter, see
systemd sources.
In particular, parameters with `USec` suffix require an `uint64`
typed argument, while gvariant assumes int32 for a numeric values,
therefore the explicit type is required.
NOTE that systemd receives the time-typed parameters as *USec
but shows them (in `systemctl show`) as *Sec. For example,
the stop timeout should be set as `TimeoutStopUSec` but
is shown as `TimeoutStopSec`.
[1] https://developer.gnome.org/glib/stable/gvariant-text.html
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
allow to set what subsystems are used by
libcontainer/cgroups/fs.Manager.
subsystemsUnified is used on a system running with cgroups v2 unified
mode.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
FreeBSD does not support cgroups or namespaces, which the code suggested, and is not supported
in runc anyway right now. So clean up the file naming to use `_linux` where appropriate.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>