The main update is actually in github.com/opencontainers/cgroups, but we
need to also update runtime-spec to a newer pre-release version to get
the updates from there as well.
In short, the behaviour change is that "0" is now treated as a valid
value to set in "pids.max", "-1" means "max", and unset/nil means "do
nothing". As described in the opencontainers/cgroups PR, this change is
actually backwards compatible because our internal state.json stores
PidsLimit, and that entry is marked as "omitempty". So, an old runc
would omit PidsLimit=0 in state.json, and this will be parsed by a new
runc as being "nil" -- and both would treat this case as "do not set
anything".
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Error out on --l3-cache-schema and --mem-bw-schema if the original
spec didn't specify intelRdt, which also means that no CLOS (resctrl
group) was created for the container.
This prevents serious issues in this corner case.
First, a CLOS was created but the schemata of the CLOS was not
correctly updated. Confusingly, calling runc update twice
did the job: the first call created the resctrl group and the second
call was able to update the schemata. This issue would be relatively
easy to fix, though.
The second, more severe issue is that creating new CLOSes this way
caused them to be orphaned, i.e. not removed when the container exits.
This is caused by runc not capturing the updated state (the original
spec had intelRdt=nil -> no CLOS, but after the update this is no
longer the case).
The most severe problem is that the update only moved (or tried to
move) the pid of the original init process, while all children escaped
the update. Doing this (i.e. migrating all processes of a container
from one CLOS to another) reliably and race-free would probably require
freezing the container.
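The resulting check is roughly the following (variable and field names
are illustrative, not the exact code):
```
// Refuse schema updates when the container was created without intelRdt,
// i.e. when no CLOS (resctrl group) exists for it.
if config.IntelRdt == nil && (l3CacheSchema != "" || memBwSchema != "") {
	return errors.New("updating l3CacheSchema or memBwSchema requires intelRdt to be set in the original spec")
}
```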
Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
Prevent --l3-cache-schema from clearing the intel_rdt.memBwSchema state,
and --mem-bw-schema from clearing l3_cache_schema, respectively.
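Roughly speaking (field names assumed for illustration), only the schema
that was explicitly passed gets overwritten:
```
if l3CacheSchema != "" {
	config.IntelRdt.L3CacheSchema = l3CacheSchema // keep memBwSchema as is
}
if memBwSchema != "" {
	config.IntelRdt.MemBwSchema = memBwSchema // keep l3CacheSchema as is
}
```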
Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
In case there's a duplicate in the device list, the latter entry
overrides the former one.
So, we need to modify the last entry, not the first one. To do that,
use slices.Backward.
Amend the test case to test the fix.
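A sketch of the idea (device fields and variable names are illustrative):
```
// Walk the list from the end so that, when duplicates exist, we modify
// the entry that actually takes effect (the last one).
for i, d := range slices.Backward(config.Devices) {
	if d.Path == dev.Path {
		config.Devices[i] = dev
		break
	}
}
```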
Reported-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This support was missing from runc, and thus the example from
podman-update wasn't working.
To fix, introduce a function to either update or insert new weights and iops.
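A rough sketch of such a helper for weight devices (type and field names
are assumptions; the same idea applies to the IOPS/BPS entries):
```
// upsertWeightDevice updates an existing entry for the same device,
// or appends a new one if no entry exists yet.
func upsertWeightDevice(list []*WeightDevice, wd *WeightDevice) []*WeightDevice {
	for i, d := range list {
		if d.Major == wd.Major && d.Minor == wd.Minor {
			list[i] = wd
			return list
		}
	}
	return append(list, wd)
}
```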
Add integration tests.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This removes the libcontainer/cgroups packages and starts
using those from the github.com/opencontainers/cgroups repo.
Mostly generated by:
git rm -f libcontainer/cgroups
find . -type f -name "*.go" -exec sed -i \
's|github.com/opencontainers/runc/libcontainer/cgroups|github.com/opencontainers/cgroups|g' \
{} +
go get github.com/opencontainers/cgroups@v0.0.1
make vendor
gofumpt -w .
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Prior to this commit, commands like `runc update --cpuset-cpus=1`
were implicitly setting cpu burst to "0" (which does not mean "leave it as is").
This was failing when the kernel does not support cpu burst:
`openat2 /sys/fs/cgroup/runc-cgroups-integration-test/test-cgroup-22167/cpu.max.burst: no such file or directory`
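A sketch of the fix (flag and field names are assumptions): only touch
the burst value when it is explicitly requested, and leave it nil
otherwise, meaning "do not change".
```
if context.IsSet("cpu-burst") {
	v := context.Uint64("cpu-burst")
	r.CpuBurst = &v // nil means "leave the current burst value as is"
}
```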
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
The burstable CFS controller was introduced in Linux 5.14. It helps with
parallel workloads that might be bursty: they can get throttled even
when their average utilization is under quota, and they may be latency
sensitive at the same time, so throttling them is undesired.
This feature borrows time now against the future underrun, at the cost
of increased interference against the other system users, by introducing
cfs_burst_us into CFS bandwidth control to enact the cap on unused
bandwidth accumulation, which can then additionally be used for burst.
The patch adds the support/control for CFS bandwidth burst.
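For cgroup v2 this roughly boils down to writing the new value (a
sketch, assuming the fs2 driver and the existing cgroups.WriteFile
helper):
```
if r.CpuBurst != nil {
	// cpu.max.burst takes the burst time in microseconds.
	if err := cgroups.WriteFile(dirPath, "cpu.max.burst", strconv.FormatUint(*r.CpuBurst, 10)); err != nil {
		return err
	}
}
```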
runtime-spec: https://github.com/opencontainers/runtime-spec/pull/1120
Co-authored-by: Akihiro Suda <suda.kyoto@gmail.com>
Co-authored-by: Nadeshiko Manju <me@manjusaka.me>
Signed-off-by: Kailun Qin <kailun.qin@intel.com>
This is aimed at solving the problem of the cgroup v2 memory controller
behavior, which is not compatible with that of cgroup v1.
In cgroup v1, if the new memory limit being set is lower than the
current usage, setting the new limit fails.
In cgroup v2, the same operation succeeds, and the container is OOM killed.
Introduce a new setting, memory.checkBeforeUpdate, and use it to mimic
cgroup v1 behavior.
Note that this is not 100% reliable because of TOCTOU, but this is the
best we can do.
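A rough sketch of the check in the cgroup v2 memory setter (simplified;
the exact wiring is an assumption):
```
if r.CheckBeforeUpdate && newLimit > 0 {
	usage, err := fscommon.GetCgroupParamUint(dirPath, "memory.current")
	if err != nil {
		return err
	}
	if uint64(newLimit) < usage {
		// Mimic cgroup v1: refuse to set a limit below current usage.
		return fmt.Errorf("rejecting memory limit %d: lower than current usage %d", newLimit, usage)
	}
}
```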
Add some test cases.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This was added by commit 5aa82c950 back in the day when we thought
runc was going to be cross-platform. It is very clear now that it is a
Linux-only package.
While at it, further clarify in the README that we are Linux-only.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The two exceptions I had to add to codespellrc are:
- CLOS (used by intelrdt);
- creat (syscall name used in tests/integration/testdata/seccomp_*.json).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This should result in no change when the error is printed, but makes the
returned errors unwrappable, meaning errors.As and errors.Is will work.
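For example (the error message is illustrative):
```
// Before: fmt.Errorf("unable to freeze: %v", err) -- the underlying error is lost.
// After:  %w keeps it available to errors.Is / errors.As.
return fmt.Errorf("unable to freeze: %w", err)
```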
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit 52390d6804 made these parameters obsolete, but they are
still shown in e.g. runc update --help output.
Hide them (and maybe in 5 years we can remove them).
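With urfave/cli, hiding a flag boils down to setting Hidden (the flag
shown here is illustrative):
```
cli.StringFlag{
	Name:   "kernel-memory",
	Usage:  "(obsoleted; do not use)",
	Hidden: true, // still accepted, but no longer shown in --help
},
```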
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The runc update CLI is not able to modify devices, so let's set SkipDevices
(so that a cgroup controller won't try to update the devices cgroup).
This helps use cases where some other device management solution (e.g.
NVIDIA GPUs) applies its configuration on top of what runc does.
Make sure we do not save SkipDevices into state.json.
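The intent, in a simplified sketch:
```
// Tell the cgroup drivers not to touch the devices controller during
// this update; the flag is excluded from the serialized state so it
// never ends up in state.json.
config.Cgroups.SkipDevices = true
```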
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
gofumpt (mvdan.cc/gofumpt) is a fork of gofmt with stricter rules.
Brought to you by
git ls-files \*.go | grep -v ^vendor/ | xargs gofumpt -s -w
Looking at the diff, all these changes make sense.
Also, replace gofmt with gofumpt in golangci.yml.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This is a somewhat radical approach to dealing with kernel memory.
Per-cgroup kernel memory limiting was always problematic. A few
examples:
- older kernels had bugs and were even oopsing sometimes (the best
example is the RHEL7 kernel);
- the kernel is unable to reclaim kernel memory, so once the limit is
hit, a cgroup is toasted;
- some kernel memory allocations don't allow failing.
In addition to that,
- users don't have a clue about how to set kernel memory limits
(as the concept is much more complicated than e.g. [user] memory);
- different kernels might have different kernel memory usage,
which is sort of unexpected;
- cgroup v2 does not have a [dedicated] kmem limit knob, and thus
runc silently ignores kernel memory limits for v2;
- kernel v5.4 made cgroup v1 kmem.limit obsolete (see
https://github.com/torvalds/linux/commit/0158115f702b).
In view of all this, and as the runtime-spec lists memory.kernel
and memory.kernelTCP as OPTIONAL, let's ignore kernel memory
limits (for cgroup v1, same as we're already doing for v2).
This should result in fewer bugs and a better user experience.
The only bad side effect might be that stats can show kernel
memory usage as 0 (since the accounting is not enabled).
[v2: add a warning in specconv that limits are ignored]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Setting CPU quota and period independently does not make much sense,
but historically runc allowed it, and this needs to be supported
to not break compatibility.
For the systemd cgroup drivers to set CPU quota/period correctly,
they need to know both values. For the fs2 cgroup driver to be
compatible with the fs driver, it also needs to know both values.
Here in update, previously set values are available from the config.
If only one of {quota, period} is set and the other is not, leave
the unset parameter at its old value (don't overwrite the config).
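A sketch of the logic (config paths abbreviated; 0 is treated as "not
set", as described above):
```
// r holds the new values parsed from the command line; config holds the
// container's current configuration.
if r.CpuQuota == 0 {
	r.CpuQuota = config.Cgroups.Resources.CpuQuota // keep the old quota
}
if r.CpuPeriod == 0 {
	r.CpuPeriod = config.Cgroups.Resources.CpuPeriod // keep the old period
}
```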
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This (and the conversion function) is only used by one of the four
cgroup drivers. The other three do some checking and conversion in
place, so let fs2 do the same.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Memory Bandwidth Allocation (MBA) is a resource allocation sub-feature
of Intel Resource Director Technology (RDT) which is supported on some
Intel Xeon platforms. Intel RDT/MBA provides indirect and approximate
throttle over memory bandwidth for the software. A user controls the
resource by indicating the percentage of maximum memory bandwidth.
Hardware details of Intel RDT/MBA can be found in section 17.18 of
Intel Software Developer Manual:
https://software.intel.com/en-us/articles/intel-sdm
In Linux kernel 4.12 and newer, Intel RDT/MBA is enabled by the kernel
config CONFIG_INTEL_RDT. If the hardware supports it, the CPU flags
`rdt_a` and `mba` will be set in /proc/cpuinfo.
Intel RDT "resource control" filesystem hierarchy:
mount -t resctrl resctrl /sys/fs/resctrl
tree /sys/fs/resctrl
/sys/fs/resctrl/
|-- info
|   |-- L3
|   |   |-- cbm_mask
|   |   |-- min_cbm_bits
|   |   |-- num_closids
|   |-- MB
|       |-- bandwidth_gran
|       |-- delay_linear
|       |-- min_bandwidth
|       |-- num_closids
|-- ...
|-- schemata
|-- tasks
|-- <container_id>
    |-- ...
    |-- schemata
    |-- tasks
For MBA support in `runc`, we will reuse the infrastructure and code
base of Intel RDT/CAT, which was implemented in #1279. We can also make
use of the `tasks` and `schemata` configuration for memory bandwidth
resource constraints.
The file `tasks` has a list of tasks that belong to this group (e.g.,
the "<container_id>" group). Tasks can be added to a group by writing
the task ID to the "tasks" file (which will automatically remove them
from the previous group to which they belonged). New tasks created by
fork(2) and clone(2) are added to the same group as their parent.
The file `schemata` has a list of all the resources available to this
group. Each resource (L3 cache, memory bandwidth) has its own line and
format.
Memory bandwidth schema:
It has allocation values for memory bandwidth on each socket; each entry
contains the L3 cache id and the memory bandwidth percentage.
Format: "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..."
The minimum bandwidth percentage value for each CPU model is predefined
and can be looked up through "info/MB/min_bandwidth". The bandwidth
granularity that is allocated is also dependent on the CPU model and
can be looked up at "info/MB/bandwidth_gran". The available bandwidth
control steps are: min_bw + N * bw_gran. Intermediate values are
rounded to the next control step available on the hardware.
For more information about Intel RDT kernel interface:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt
An example for runc:
Consider a two-socket machine with two L3 caches, where the minimum
memory bandwidth is 10% and the memory bandwidth granularity is 10%.
Tasks inside the container may use a maximum memory bandwidth of 20%
on socket 0 and 70% on socket 1.
"linux": {
"intelRdt": {
"memBwSchema": "MB:0=20;1=70"
}
}
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Currently runc already supports setting the realtime runtime and period
before the container processes start; this commit adds update support
for realtime scheduler resources.
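For the cgroup v1 cpu controller this amounts to also writing the values
during update, along the lines of (a sketch, assuming a WriteFile-style
helper):
```
if r.CpuRtPeriod != 0 {
	if err := cgroups.WriteFile(path, "cpu.rt_period_us", strconv.FormatUint(r.CpuRtPeriod, 10)); err != nil {
		return err
	}
}
if r.CpuRtRuntime != 0 {
	if err := cgroups.WriteFile(path, "cpu.rt_runtime_us", strconv.FormatInt(r.CpuRtRuntime, 10)); err != nil {
		return err
	}
}
```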
Signed-off-by: Zhang Wei <zhangwei555@huawei.com>
grep -r "range map" showw 3 parts use map to
range enum types, use slice instead can get
better performance and less memory usage.
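For instance (type and value names are illustrative), keeping a slice of
the known enum values and ranging over that:
```
// Instead of ranging over a map keyed by the enum values, keep a slice
// of the known values: deterministic order, no map allocation.
var knownStatuses = []Status{Created, Running, Pausing, Paused, Stopped}

func isKnownStatus(s Status) bool {
	for _, k := range knownStatuses {
		if k == s {
			return true
		}
	}
	return false
}
```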
Signed-off-by: Peng Gao <peng.gao.dut@gmail.com>
Back quotes are the placeholder feature described here:
https://github.com/urfave/cli#placeholder-values
Without this, cli will take `-1` as the placeholder value:
```
--memory-swap -1 Total memory usage (memory + swap); set `-1` to enable unlimited swap
```
After this patch, it acts correctly:
```
--memory-swap value Total memory usage (memory + swap); set '-1' to enable unlimited swap
```
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>