Commit Graph

57 Commits

Author SHA1 Message Date
Aleksa Sarai
8ab2458bc4 update: switch to generics for mkPtr logic
This is much easier to read and removes the need for explicit per-type
helper functions.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-11 15:16:50 +11:00
Aleksa Sarai
3b75374cc7 runtime-spec: update pids.limit handling to match new guidance
The main update is actually in github.com/opencontainers/cgroups, but we
need to also update runtime-spec to a newer pre-release version to get
the updates from there as well.

In short, the behaviour change is now that "0" is treated as a valid
value to set in "pids.max", "-1" means "max" and unset/nil means "do
nothing". As described in the opencontainers/cgroups PR, this change is
actually backwards compatible because our internal state.json stores
PidsLimit, and that entry is marked as "omitempty". So, an old runc
would omit PidsLimit=0 in state.json, and this will be parsed by a new
runc as being "nil" -- and both would treat this case as "do not set
anything".

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-11-11 15:15:27 +11:00
Markus Lehtonen
85801e845e runc update: refuse to create new rdt group
Error out --l3-cache-schema and --mem-bw-schema if the original
spec didn't specify intelRdt which also means that no CLOS (resctrl
group) was created for the container.

This prevents serious issues in this corner case.

First, a CLOS was created but the schemata of the CLOS was not
correctly updated. Confusingly, calling runc update twice
did the job: the first call created the resctrl group and the seccond
call was able to update the schemata. This issue would be relatively
easily fixable, though.

Second, more severe issue is that creating new CLOSes this way caused
them to be orphaned, not being removed when the container exists. This
is caused by runc not capturing the updated state (original spec was
intelRdt=nil -> no CLOS but after update this is not the case).

The most severe problem is that the update only move (or tried to move)
the original init process pid but all children escaped the update. Doing
this (i.e. migrating all processes of a container from CLOS to another
CLOS) reliably, race-free, would probably require freezing the
container.

Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
2025-08-01 14:36:51 +03:00
Markus Lehtonen
57b6a317bb runc update: don't lose intelRdt state
Prevent --l3-cache-schema from clearing the intel_rdt.memBwSchema state
and --mem-bw-schema clearing l3_cache_schema, respectively.

Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
2025-07-31 17:31:52 +03:00
Kir Kolyshkin
0b01dccfbb runc update: handle duplicated devs properly
In case there's a duplicate in the device list, the latter entry
overrides the former one.

So, we need to modify the last entry, not the first one. To do that,
use slices.Backward.

Amend the test case to test the fix.

Reported-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-06-17 15:16:55 -07:00
Kir Kolyshkin
7696402dac runc update: support per-device weight and iops
This support was missing from runc, and thus the example from the
podman-update wasn't working.

To fix, introduce a function to either update or insert new weights and iops.

Add integration tests.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-06-17 15:16:55 -07:00
Kir Kolyshkin
a75076b4a4 Switch to opencontainers/cgroups
This removes libcontainer/cgroups packages and starts
using those from github.com/opencontainers/cgroups repo.

Mostly generated by:

  git rm -f libcontainer/cgroups

  find . -type f -name "*.go" -exec sed -i \
    's|github.com/opencontainers/runc/libcontainer/cgroups|github.com/opencontainers/cgroups|g' \
    {} +

  go get github.com/opencontainers/cgroups@v0.0.1
  make vendor
  gofumpt -w .

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-28 15:20:33 -08:00
Akihiro Suda
e4bf49ff2c runc update: distinguish nil from zero
Prior to this commit, commands like `runc update --cpuset-cpus=1`
were implying to set cpu burst to "0" (which does not mean "leave it as is").

This was failing when the kernel does not support cpu burst:
`openat2 /sys/fs/cgroup/runc-cgroups-integration-test/test-cgroup-22167/cpu.max.burst: no such file or directory`

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2024-04-04 00:56:19 +08:00
dependabot[bot]
606251ab33 build(deps): bump github.com/opencontainers/runtime-spec
Bumps [github.com/opencontainers/runtime-spec](https://github.com/opencontainers/runtime-spec) from 1.1.1-0.20230823135140-4fec88fd00a4 to 1.2.0.
- [Release notes](https://github.com/opencontainers/runtime-spec/releases)
- [Changelog](https://github.com/opencontainers/runtime-spec/blob/main/ChangeLog)
- [Commits](https://github.com/opencontainers/runtime-spec/commits/v1.2.0)

---
updated-dependencies:
- dependency-name: github.com/opencontainers/runtime-spec
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2024-03-07 14:43:33 +09:00
Kailun Qin
e1584831b6 libct/cg: add CFS bandwidth burst for CPU
Burstable CFS controller is introduced in Linux 5.14. This helps with
parallel workloads that might be bursty. They can get throttled even
when their average utilization is under quota. And they may be latency
sensitive at the same time so that throttling them is undesired.

This feature borrows time now against the future underrun, at the cost
of increased interference against the other system users, by introducing
cfs_burst_us into CFS bandwidth control to enact the cap on unused
bandwidth accumulation, which will then used additionally for burst.

The patch adds the support/control for CFS bandwidth burst.

runtime-spec: https://github.com/opencontainers/runtime-spec/pull/1120

Co-authored-by: Akihiro Suda <suda.kyoto@gmail.com>
Co-authored-by: Nadeshiko Manju <me@manjusaka.me>
Signed-off-by: Kailun Qin <kailun.qin@intel.com>
2023-09-06 23:23:30 +08:00
hang.jiang
937ca107c3 Fix File to Close
Signed-off-by: hang.jiang <hang.jiang@daocloud.io>
2023-09-01 16:17:13 +08:00
wineway
81c379fa8b support SCHED_IDLE for runc cgroupfs
Signed-off-by: wineway <wangyuweihx@gmail.com>
2023-01-31 15:19:05 +08:00
Kir Kolyshkin
6462e9de67 runc update: implement memory.checkBeforeUpdate
This is aimed at solving the problem of cgroup v2 memory controller
behavior which is not compatible with that of cgroup v1.

In cgroup v1, if the new memory limit being set is lower than the
current usage, setting the new limit fails.

In cgroup v2, same operation succeeds, and the container is OOM killed.

Introduce a new setting, memory.checkBeforeUpdate, and use it to mimic
cgroup v1 behavior.

Note that this is not 100% reliable because of TOCTOU, but this is the
best we can do.

Add some test cases.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-11-02 17:15:26 -07:00
Kir Kolyshkin
89733cd055 Format sources using gofumpt 0.2.1
... which adds a wee more whitespace fixes.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-03-07 10:42:01 -08:00
Kir Kolyshkin
c5b0be78e8 Rm build tags from main pkg
This was added by commit 5aa82c950 back in the day when we thought
runc is going to be cross-platform. It's very clear now it's Linux-only
package.

While at it, further clarify it in README that we're Linux only.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-08-30 20:15:01 -07:00
Kir Kolyshkin
75761bccf7 Fix codespell warnings, add codespell to ci
The two exceptions I had to add to codespellrc are:
 - CLOS (used by intelrtd);
 - creat (syscall name used in tests/integration/testdata/seccomp_*.json).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-08-17 16:12:35 -07:00
Kir Kolyshkin
7be93a66b9 *: fmt.Errorf: use %w when appropriate
This should result in no change when the error is printed, but make the
errors returned unwrappable, meaning errors.As and errors.Is will work.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-22 16:09:47 -07:00
Kir Kolyshkin
cf4ecaed08 runc update: hide --kernel* options
Commit 52390d6804 made this parameters obsoleted, but they are
still shown in e.g. runc update --help output.

Hide them (and maybe in 5 years we can remove them).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-15 17:48:40 -07:00
Kir Kolyshkin
bf7492ee5d runc update: skip devices
The runc update CLI is not able to modify devices, so let's set SkipDevices
(so that a cgroup controller won't try to update devices cgroup).

This helps use cases when some other device management (NVIDIA GPUs)
applies its configuration on top of what runc does.

Make sure we do not save SkipDevices into state.json.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-03 10:40:55 -07:00
Kir Kolyshkin
e6048715e4 Use gofumpt to format code
gofumpt (mvdan.cc/gofumpt) is a fork of gofmt with stricter rules.

Brought to you by

	git ls-files \*.go | grep -v ^vendor/ | xargs gofumpt -s -w

Looking at the diff, all these changes make sense.

Also, replace gofmt with gofumpt in golangci.yml.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-01 12:17:27 -07:00
Kir Kolyshkin
52390d6804 Ignore kernel memory settings
This is somewhat radical approach to deal with kernel memory.

Per-cgroup kernel memory limiting was always problematic. A few
examples:

 - older kernels had bugs and were even oopsing sometimes (best example
   is RHEL7 kernel);
 - kernel is unable to reclaim the kernel memory so once the limit is
   hit a cgroup is toasted;
 - some kernel memory allocations don't allow failing.

In addition to that,

 - users don't have a clue about how to set kernel memory limits
   (as the concept is much more complicated than e.g. [user] memory);
 - different kernels might have different kernel memory usage,
   which is sort of unexpected;
 - cgroup v2 do not have a [dedicated] kmem limit knob, and thus
   runc silently ignores kernel memory limits for v2;
 - kernel v5.4 made cgroup v1 kmem.limit obsoleted (see
   https://github.com/torvalds/linux/commit/0158115f702b).

In view of all this, and as the runtime-spec lists memory.kernel
and memory.kernelTCP as OPTIONAL, let's ignore kernel memory
limits (for cgroup v1, same as we're already doing for v2).

This should result in less bugs and better user experience.

The only bad side effect from it might be that stat can show kernel
memory usage as 0 (since the accounting is not enabled).

[v2: add a warning in specconv that limits are ignored]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-04-12 12:18:11 -07:00
Xiaochen Shen
f62ad4a0de libcontainer/intelrdt: rename CAT and MBA enabled flags
Rename CAT and MBA enabled flags to be consistent with others.
No functional change.

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2020-11-10 15:32:01 +08:00
Kir Kolyshkin
390a98f3fd runc update: support unified resources
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-11-05 16:04:41 -08:00
Xiaochen Shen
2c004a101e libcontainer/intelrdt: introduce NewManager()
Introduce NewManager() to wrap up IntelRdtManager initialization. And
call it when required.

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2020-10-17 16:59:20 +08:00
Kir Kolyshkin
32746fb334 update: do not overwrite old cpu quota/period
Seting CPU quota and period independently does not make much sense,
but historically runc allowed it and this needs to be supported
to not break compatibility.

For systemd cgroup drivers to set CPU quota/period correctly,
it needs to know both values. For fs2 cgroup driver to be compatible
with the fs driver, it also needs to know both values.

Here in update, previously set values are available from config.
If only one of {quota,period} is set and the other is not, leave
the unset parameter at the old value (don't overwrite config).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-09 17:15:56 -07:00
Kir Kolyshkin
4189cb65f8 cgroups: remove cgroup.Resources.CpuMax
This (and the converting function) is only used by one of the four
cgroup drivers. The other three do some checking and conversion in
place, so let the fs2 do the same.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-09 17:15:38 -07:00
John Hwang
7fc291fd45 Replace formatted errors when unneeded
Signed-off-by: John Hwang <John.F.Hwang@gmail.com>
2020-05-16 18:13:21 -07:00
Akihiro Suda
aa269315a4 cgroup2: add CpuMax conversion
Fix #2243

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-03-13 02:58:39 +09:00
Boris Popovschi
7c439cc6f6 Added conversion for cpu.weight v2
Signed-off-by: Boris Popovschi <zyqsempai@mail.ru>
2020-02-12 11:32:34 +02:00
Xiaochen Shen
1ed597bfe6 libcontainer: intelrdt: add update command support for Intel RDT/MBA
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2018-10-16 14:29:34 +08:00
Xiaochen Shen
27560ace2f libcontainer: intelrdt: add support for Intel RDT/MBA in runc
Memory Bandwidth Allocation (MBA) is a resource allocation sub-feature
of Intel Resource Director Technology (RDT) which is supported on some
Intel Xeon platforms. Intel RDT/MBA provides indirect and approximate
throttle over memory bandwidth for the software. A user controls the
resource by indicating the percentage of maximum memory bandwidth.

Hardware details of Intel RDT/MBA can be found in section 17.18 of
Intel Software Developer Manual:
https://software.intel.com/en-us/articles/intel-sdm

In Linux 4.12 kernel and newer, Intel RDT/MBA is enabled by kernel
config CONFIG_INTEL_RDT. If hardware support, CPU flags `rdt_a` and
`mba` will be set in /proc/cpuinfo.

Intel RDT "resource control" filesystem hierarchy:
mount -t resctrl resctrl /sys/fs/resctrl
tree /sys/fs/resctrl
/sys/fs/resctrl/
|-- info
|   |-- L3
|   |   |-- cbm_mask
|   |   |-- min_cbm_bits
|   |   |-- num_closids
|   |-- MB
|       |-- bandwidth_gran
|       |-- delay_linear
|       |-- min_bandwidth
|       |-- num_closids
|-- ...
|-- schemata
|-- tasks
|-- <container_id>
    |-- ...
    |-- schemata
    |-- tasks

For MBA support for `runc`, we will reuse the infrastructure and code
base of Intel RDT/CAT which implemented in #1279. We could also make
use of `tasks` and `schemata` configuration for memory bandwidth
resource constraints.

The file `tasks` has a list of tasks that belongs to this group (e.g.,
<container_id>" group). Tasks can be added to a group by writing the
task ID to the "tasks" file (which will automatically remove them from
the previous group to which they belonged). New tasks created by
fork(2) and clone(2) are added to the same group as their parent.

The file `schemata` has a list of all the resources available to this
group. Each resource (L3 cache, memory bandwidth) has its own line and
format.

Memory bandwidth schema:
It has allocation values for memory bandwidth on each socket, which
contains L3 cache id and memory bandwidth percentage.
    Format: "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..."

The minimum bandwidth percentage value for each CPU model is predefined
and can be looked up through "info/MB/min_bandwidth". The bandwidth
granularity that is allocated is also dependent on the CPU model and
can be looked up at "info/MB/bandwidth_gran". The available bandwidth
control steps are: min_bw + N * bw_gran. Intermediate values are
rounded to the next control step available on the hardware.

For more information about Intel RDT kernel interface:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt

An example for runc:
Consider a two-socket machine with two L3 caches where the minimum
memory bandwidth of 10% with a memory bandwidth granularity of 10%.
Tasks inside the container may use a maximum memory bandwidth of 20%
on socket 0 and 70% on socket 1.

"linux": {
    "intelRdt": {
        "memBwSchema": "MB:0=20;1=70"
    }
}

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2018-10-16 14:29:29 +08:00
Xiaochen Shen
65918b02a9 intelrdt: add update command support
Add runc update command support for Intel RDT/CAT.

for example:
runc update --l3-cache-schema "L3:0=f;1=f" <container-id>

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2017-09-20 01:59:06 +08:00
Justin Cormack
3d9074ead3 Update memory specs to use int64 not uint64
replace #1492 #1494
fix #1422

Since https://github.com/opencontainers/runtime-spec/pull/876 the memory
specifications are now `int64`, as that better matches the visible interface where
`-1` is a valid value. Otherwise finding the correct value was difficult as it
was kernel dependent.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
2017-06-27 12:16:07 +01:00
Michael Crosby
854b41d81e Update spec to 239c4e44f2
This provides updates to runc for the spec changes with *Process and
OOMScoreAdj

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2017-06-01 16:29:47 -07:00
Kenfe-Mickael Laventure
1e7e276aff Allow updating container pids limit
Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>
2017-04-28 06:44:44 -07:00
Qiang Huang
8430cc4f48 Use uint64 for resources to keep consistency with runtime-spec
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-03-20 18:51:39 +08:00
Mrunal Patel
4f9cb13b64 Update runtime spec to 1.0.0.rc5
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2017-03-15 11:38:37 -07:00
Qiang Huang
6b1d0e76f2 Merge pull request #1127 from boynux/fix-set-mem-to-unlimited
Fixes set memory to unlimited
2017-02-16 09:51:23 +08:00
Mohammad Arab
18ebc51b3c Reset Swap when memory is set to unlimited (-1)
Kernel validation fails if memory set to -1 which is unlimited but
swap is not set so.

Signed-off-by: Mohammad Arab <boynux@gmail.com>
2017-02-15 08:11:57 +01:00
Mrunal Patel
84a3bd250c Simplify error handling on function return
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2017-01-06 15:57:31 -08:00
Aleksa Sarai
c6d8a2f26f merge branch 'pr-1158'
Closes #1158
LGTMs: @hqhq @cyphar
2016-12-26 13:59:47 +11:00
Zhang Wei
8eea644ccc Bump runtime-spec to v1.0.0-rc3
* Bump underlying runtime-spec to version 1.0.0-rc3
* Fix related changed struct names in config.go

Signed-off-by: Zhang Wei <zhangwei555@huawei.com>
2016-12-17 14:02:35 +08:00
Zhang Wei
b517076907 Check args numbers before application start
Add a general args number validator for all client commands.

Signed-off-by: Zhang Wei <zhangwei555@huawei.com>
2016-11-29 11:18:51 +08:00
Zhang Wei
6cd425be2b Allow update rt_period_us and rt_runtime_us
Currently runc already supports setting realtime runtime and period
before container processes start, this commit will add update support
for realtime scheduler resources.

Signed-off-by: Zhang Wei <zhangwei555@huawei.com>
2016-11-04 18:57:22 +08:00
rajasec
2d0d936b76 Small correction in update resource file usage
Signed-off-by: rajasec <rajasec79@gmail.com>
2016-10-28 22:58:08 +05:30
Peng Gao
c5393da813 Refactor enum map range to slice range
grep -r "range map" showw 3 parts use map to
range enum types, use slice instead can get
better performance and less memory usage.

Signed-off-by: Peng Gao <peng.gao.dut@gmail.com>
2016-09-28 15:36:29 +08:00
Qiang Huang
9ebf816d03 Fix help message for memory-swap
Back quotes are the placeholder feature described here:
https://github.com/urfave/cli#placeholder-values

Without this, cli will take `-1` as default value as:
```
   --memory-swap -1             Total memory usage (memory + swap); set `-1` to enable unlimited swap
```

After this patch, it'll act correctly
```
   --memory-swap value          Total memory usage (memory + swap); set '-1' to enable unlimited swap
```

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2016-07-21 19:37:36 +08:00
Mrunal Patel
a753b06645 Replace github.com/codegangsta/cli by github.com/urfave/cli
The package got moved to a different repository

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2016-06-06 11:47:20 -07:00
Mrunal Patel
0c5e6e5b27 Merge pull request #851 from hqhq/sync_man_page
Update man pages to refect the latest cli change
2016-06-01 13:20:54 -07:00
Qiang Huang
71511dc155 Improve update memory
Support update memory with:
runc update --memory 50M container-id

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2016-05-30 18:56:10 +08:00