## systemd cgroup driver

By default, runc creates cgroups and sets cgroup limits on its own (this mode
is known as the fs cgroup driver). When the `--systemd-cgroup` global option is
given (as in e.g. `runc --systemd-cgroup run ...`), runc switches to the systemd
cgroup driver. This document describes its features and peculiarities.

### systemd unit name and placement

When creating a container, runc requests systemd (over dbus) to create
a transient unit for the container, and to place it into a specified slice.

The name of the unit and the containing slice are derived from the container
runtime spec in the following way:

1. If `Linux.CgroupsPath` is set, it is expected to be in the form
   `[slice]:[prefix]:[name]`.

   Here `slice` is a systemd slice under which the container is placed.
   If empty, it defaults to `system.slice`, except when cgroup v2 is
   used and a rootless container is created, in which case it defaults
   to `user.slice`.

   Note that `slice` can contain dashes to denote a sub-slice
   (e.g. `user-1000.slice` is a correct notation, meaning a subslice
   of `user.slice`), but it must not contain slashes (e.g.
   `user.slice/user-1000.slice` is invalid).

   A `slice` of `-` represents a root slice.

   Next, `prefix` and `name` are used to compose the unit name, which
   is `<prefix>-<name>.scope`, unless `name` has a `.slice` suffix, in
   which case `prefix` is ignored and the `name` is used as is.

2. If `Linux.CgroupsPath` is not set or empty, it works the same way as if it
   were set to `:runc:<container-id>`. See the description above to see
   what it transforms to.

As described above, a unit being created can either be a scope or a slice.
For a scope, runc specifies its parent slice via a _Slice=_ systemd property,
and also sets _Delegate=true_. For a slice, runc specifies a weak dependency on
the parent slice via a _Wants=_ property.

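The naming rules above can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not runc's actual code: `resolveUnit`, `unitPlacement`, and the `rootlessV2` flag are hypothetical names introduced here, and the sketch leaves a slice of `-` untouched rather than modelling the root slice.

```go
package main

import (
	"fmt"
	"strings"
)

// unitPlacement is a hypothetical helper type (not part of runc's API)
// holding the slice and the unit name derived from Linux.CgroupsPath.
type unitPlacement struct {
	Slice string
	Unit  string
}

// resolveUnit applies the rules described in this section to a
// "[slice]:[prefix]:[name]" string. rootlessV2 is an assumed flag meaning
// "a rootless container on cgroup v2".
func resolveUnit(cgroupsPath, containerID string, rootlessV2 bool) (unitPlacement, error) {
	// An unset or empty path behaves like ":runc:<container-id>".
	if cgroupsPath == "" {
		cgroupsPath = ":runc:" + containerID
	}
	parts := strings.Split(cgroupsPath, ":")
	if len(parts) != 3 {
		return unitPlacement{}, fmt.Errorf("expected [slice]:[prefix]:[name], got %q", cgroupsPath)
	}
	slice, prefix, name := parts[0], parts[1], parts[2]

	// Sub-slices are written with dashes; slashes are invalid.
	if strings.Contains(slice, "/") {
		return unitPlacement{}, fmt.Errorf("slice %q must not contain slashes", slice)
	}
	if slice == "" {
		slice = "system.slice"
		if rootlessV2 {
			slice = "user.slice"
		}
	}

	// A name with a ".slice" suffix is used as is and the prefix is ignored;
	// otherwise the unit is the scope "<prefix>-<name>.scope".
	unit := name
	if !strings.HasSuffix(name, ".slice") {
		unit = prefix + "-" + name + ".scope"
	}
	return unitPlacement{Slice: slice, Unit: unit}, nil
}

func main() {
	for _, path := range []string{"", "user-1000.slice:runc:web", "system.slice:ignored:batch.slice"} {
		p, err := resolveUnit(path, "ctr1", false)
		fmt.Println(p.Slice, p.Unit, err)
	}
}
```

With an empty path and container ID `ctr1`, the sketch yields the unit `runc-ctr1.scope` under `system.slice`.
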
### Resource limits

runc always enables accounting for all controllers, regardless of any limits
being set. This means it unconditionally sets the following properties for the
systemd unit being created:

 * _CPUAccounting=true_
 * _IOAccounting=true_ (_BlockIOAccounting_ for cgroup v1)
 * _MemoryAccounting=true_
 * _TasksAccounting=true_

The resource limits of the systemd unit are set by runc by translating the
runtime spec resources to systemd unit properties.

Such translation is by no means complete, as there are some cgroup properties
that cannot be set via systemd. Therefore, the runc systemd cgroup driver is
backed by the fs driver (in other words, cgroup limits are first set via systemd
unit properties, and then by writing to cgroupfs files).

The set of runtime spec resources which is translated by runc to systemd unit
properties depends on the kernel cgroup version being used (v1 or v2), and on
the systemd version being run. If an older systemd version (which does not
support some resources) is used, runc does not set those resources.

The following tables summarize which properties are translated.

#### cgroup v1

| runtime spec resource | systemd property name | min systemd version |
|-----------------------|-----------------------|---------------------|
| memory.limit          | MemoryLimit           |                     |
| cpu.shares            | CPUShares             |                     |
| blockIO.weight        | BlockIOWeight         |                     |
| pids.limit            | TasksMax              |                     |
| cpu.cpus              | AllowedCPUs           | v244                |
| cpu.mems              | AllowedMemoryNodes    | v244                |

#### cgroup v2

| runtime spec resource   | systemd property name       | min systemd version |
|-------------------------|-----------------------------|---------------------|
| memory.limit            | MemoryMax                   |                     |
| memory.reservation      | MemoryLow                   |                     |
| memory.swap             | MemorySwapMax               |                     |
| cpu.shares              | CPUWeight                   |                     |
| pids.limit              | TasksMax                    |                     |
| cpu.cpus                | AllowedCPUs                 | v244                |
| cpu.mems                | AllowedMemoryNodes          | v244                |
| unified.cpu.max         | CPUQuota, CPUQuotaPeriodSec | v242                |
| unified.cpu.weight      | CPUWeight                   |                     |
| unified.cpu.idle        | CPUWeight                   | v252                |
| unified.cpuset.cpus     | AllowedCPUs                 | v244                |
| unified.cpuset.mems     | AllowedMemoryNodes          | v244                |
| unified.memory.high     | MemoryHigh                  |                     |
| unified.memory.low      | MemoryLow                   |                     |
| unified.memory.min      | MemoryMin                   |                     |
| unified.memory.max      | MemoryMax                   |                     |
| unified.memory.swap.max | MemorySwapMax               |                     |
| unified.pids.max        | TasksMax                    |                     |

For documentation on systemd unit resource properties, see the
`systemd.resource-control(5)` man page.

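The version-dependent behaviour described above can be illustrated with a small lookup in Go. This is only a sketch: `propertyRule`, `cgroupV2Rules`, and `supportedProperties` are hypothetical names, the map holds just a subset of the cgroup v2 table rows, and runc's real translation logic is more involved.

```go
package main

import "fmt"

// propertyRule is a hypothetical type mirroring one row of the cgroup v2 table.
type propertyRule struct {
	Properties []string // systemd property names
	MinVersion int      // minimum systemd version; 0 means none is listed
}

// cgroupV2Rules copies a few rows from the cgroup v2 table (not the full set).
var cgroupV2Rules = map[string]propertyRule{
	"memory.limit":     {Properties: []string{"MemoryMax"}},
	"pids.limit":       {Properties: []string{"TasksMax"}},
	"cpu.cpus":         {Properties: []string{"AllowedCPUs"}, MinVersion: 244},
	"cpu.mems":         {Properties: []string{"AllowedMemoryNodes"}, MinVersion: 244},
	"unified.cpu.max":  {Properties: []string{"CPUQuota", "CPUQuotaPeriodSec"}, MinVersion: 242},
	"unified.cpu.idle": {Properties: []string{"CPUWeight"}, MinVersion: 252},
}

// supportedProperties returns the systemd properties to set for a runtime spec
// resource, or nil when the resource is unknown to this sketch or the running
// systemd version is older than the listed minimum.
func supportedProperties(resource string, systemdVersion int) []string {
	rule, ok := cgroupV2Rules[resource]
	if !ok || systemdVersion < rule.MinVersion {
		return nil
	}
	return rule.Properties
}

func main() {
	for _, version := range []int{243, 244} {
		fmt.Println(version, supportedProperties("cpu.cpus", version))
	}
}
```

In this sketch, a resource that is skipped for systemd would still be applied by the fs driver when it writes to cgroupfs files, as described above.
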
### Auxiliary properties

Auxiliary properties of a systemd unit (as shown by `systemctl show
<unit-name>` after the container is created) can be set (or overwritten) by
adding annotations to the container runtime spec (`config.json`).

For example:

```json
        "annotations": {
                "org.systemd.property.TimeoutStopUSec": "uint64 123456789",
                "org.systemd.property.CollectMode": "'inactive-or-failed'"
        },
```

The above will set the following properties:

 * `TimeoutStopSec` to about 2 minutes and 3 seconds (123456789 microseconds);
 * `CollectMode` to "inactive-or-failed".

The values must be in the gvariant text format, as described in
[gvariant documentation](https://docs.gtk.org/glib/gvariant-text-format.html).

To find out which type systemd expects for a particular parameter, please
consult systemd sources.

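To make the annotation example more concrete, the Go sketch below shows how such a key could be split into a property name, and how the microsecond value converts to a duration. The helper names `propertyName` and `stopTimeout` are hypothetical, and the `org.systemd.property.` prefix is taken from the example above rather than from runc's source.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// annotationPrefix is the key prefix used in the example annotations.
const annotationPrefix = "org.systemd.property."

// propertyName is a hypothetical helper: it returns the systemd property name
// for an annotation key, and false when the key does not carry the prefix.
func propertyName(key string) (string, bool) {
	if !strings.HasPrefix(key, annotationPrefix) {
		return "", false
	}
	return strings.TrimPrefix(key, annotationPrefix), true
}

// stopTimeout converts a TimeoutStopUSec value (microseconds) to a duration.
func stopTimeout(usec int64) time.Duration {
	return time.Duration(usec) * time.Microsecond
}

func main() {
	name, ok := propertyName("org.systemd.property.TimeoutStopUSec")
	fmt.Println(name, ok)               // TimeoutStopUSec true
	fmt.Println(stopTimeout(123456789)) // 2m3.456789s
}
```

The sketch does not parse the gvariant text value itself (for instance `uint64 123456789`); it only illustrates the key naming and the unit of the timeout.
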