zishuo/runc

mirror of https://github.com/opencontainers/runc.git synced 2025-12-24 11:50:58 +08:00

Author	SHA1	Message	Date
Kailun Qin	e1584831b6	libct/cg: add CFS bandwidth burst for CPU Burstable CFS controller is introduced in Linux 5.14. This helps with parallel workloads that might be bursty. They can get throttled even when their average utilization is under quota. And they may be latency sensitive at the same time so that throttling them is undesired. This feature borrows time now against the future underrun, at the cost of increased interference against the other system users, by introducing cfs_burst_us into CFS bandwidth control to enact the cap on unused bandwidth accumulation, which will then be used additionally for burst. The patch adds the support/control for CFS bandwidth burst. runtime-spec: https://github.com/opencontainers/runtime-spec/pull/1120 Co-authored-by: Akihiro Suda <suda.kyoto@gmail.com> Co-authored-by: Nadeshiko Manju <me@manjusaka.me> Signed-off-by: Kailun Qin <kailun.qin@intel.com>	2023-09-06 23:23:30 +08:00
wineway	81c379fa8b	support SCHED_IDLE for runc cgroupfs Signed-off-by: wineway <wangyuweihx@gmail.com>	2023-01-31 15:19:05 +08:00
Kir Kolyshkin	6462e9de67	runc update: implement memory.checkBeforeUpdate This is aimed at solving the problem of cgroup v2 memory controller behavior which is not compatible with that of cgroup v1. In cgroup v1, if the new memory limit being set is lower than the current usage, setting the new limit fails. In cgroup v2, same operation succeeds, and the container is OOM killed. Introduce a new setting, memory.checkBeforeUpdate, and use it to mimic cgroup v1 behavior. Note that this is not 100% reliable because of TOCTOU, but this is the best we can do. Add some test cases. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-11-02 17:15:26 -07:00
Fraser Tweedale	35d20c4e0b	chown cgroup to process uid in container namespace Delegating cgroups to the container enables more complex workloads, including systemd-based workloads. The OCI runtime-spec was recently updated to explicitly admit such delegation, through specification of cgroup ownership semantics: https://github.com/opencontainers/runtime-spec/pull/1123 Pursuant to the updated OCI runtime-spec, change the ownership of the container's cgroup directory and particular files therein, when using cgroups v2 and when the cgroupfs is to be mounted read/write. As a result of this change, systemd workloads can run in isolated user namespaces on OpenShift when the sandbox's cgroupfs is mounted read/write. It might be possible to implement this feature in other cgroup managers, but that work is deferred. Signed-off-by: Fraser Tweedale <ftweedal@redhat.com>	2021-11-30 08:52:59 +10:00
Kir Kolyshkin	097c6d7425	libct/cg: simplify getting cgroup manager 1. Make Rootless and Systemd flags part of config.Cgroups. 2. Make all cgroup managers (not just fs2) return error (so it can do more initialization -- added by the following commits). 3. Replace complicated cgroup manager instantiation in factory_linux by a single (and simple) libcontainer/cgroups/manager.New() function. 4. getUnifiedPath is simplified to check that only a single path is supplied (rather than checking that other paths, if supplied, are the same). [v2: can't -> cannot] Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-09-23 09:11:44 -07:00
Kir Kolyshkin	3c7db3827c	Merge pull request #2883 from flouthoc/master Add support for rdma cgroup introduced in Linux Kernel 4.11	2021-08-30 20:02:04 -07:00
Qiang Huang	b4b797200e	Merge pull request #3136 from kolyshkin/cg-d-c libct/cg: rm dead code to improve clarity	2021-08-25 14:46:27 +08:00
flouthoc	b3d14488b5	Add support for rdma cgroup introduced in Linux Kernel 4.11 Signed-off-by: Aditya Rajan <flouthoc.git@gmail.com>	2021-08-23 12:25:33 +05:30
Kir Kolyshkin	9a095e44db	libct/cg/sd/v1: add SkipFreezeOnSet knob This is helpful to kubernetes in cases it knows for sure that the freeze is not required (since it created the systemd unit with no device restrictions). As the code is trivial, no tests are required. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-08-18 12:43:55 -07:00
Kir Kolyshkin	1cbfe23464	libct/cg: rm dead code This was initially added by commits `41d9d26513` and `4a8f0b4db4`, apparently to implement docker run --cgroup container:ID, which was never merged. Therefore, this code is not and was never used. It needs to be removed mainly because having it makes it much harder to understand how cgroup manager works (because with this in place we have not one or two but three sets of cgroup paths to think about). Note if the paths are known and there is a need to add a PID to existing cgroup, cgroup manager is not needed at all -- something like cgroups.WriteCgroupProc or cgroups.EnterPid is sufficient (and the latter is what runc exec uses in (*setnsProcess).start). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-08-08 13:03:51 -07:00
Kir Kolyshkin	bf7492ee5d	runc update: skip devices The runc update CLI is not able to modify devices, so let's set SkipDevices (so that a cgroup controller won't try to update devices cgroup). This helps use cases when some other device management (NVIDIA GPUs) applies its configuration on top of what runc does. Make sure we do not save SkipDevices into state.json. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-06-03 10:40:55 -07:00
Sebastiaan van Stijn	e204d6a9e7	libcontainer/configs: add / fix godoc (golint) Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2021-06-02 17:44:11 +02:00
Kir Kolyshkin	52390d6804	Ignore kernel memory settings This is a somewhat radical approach to deal with kernel memory. Per-cgroup kernel memory limiting was always problematic. A few examples: - older kernels had bugs and were even oopsing sometimes (best example is RHEL7 kernel); - kernel is unable to reclaim the kernel memory so once the limit is hit a cgroup is toasted; - some kernel memory allocations don't allow failing. In addition to that, - users don't have a clue about how to set kernel memory limits (as the concept is much more complicated than e.g. [user] memory); - different kernels might have different kernel memory usage, which is sort of unexpected; - cgroup v2 does not have a [dedicated] kmem limit knob, and thus runc silently ignores kernel memory limits for v2; - kernel v5.4 made cgroup v1 kmem.limit obsolete (see https://github.com/torvalds/linux/commit/0158115f702b). In view of all this, and as the runtime-spec lists memory.kernel and memory.kernelTCP as OPTIONAL, let's ignore kernel memory limits (for cgroup v1, same as we're already doing for v2). This should result in fewer bugs and better user experience. The only bad side effect from it might be that stat can show kernel memory usage as 0 (since the accounting is not enabled). [v2: add a warning in specconv that limits are ignored] Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-04-12 12:18:11 -07:00
Sebastiaan van Stijn	4fc2de77e9	libcontainer/devices: remove "Device" prefix from types Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2020-12-01 11:11:23 +01:00
Sebastiaan van Stijn	677baf22d2	libcontainer: isolate libcontainer/devices Move the Device-related types to libcontainer/devices, so that the package can be used in isolation. Aliases have been created in libcontainer/configs for backward compatibility. Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2020-12-01 11:11:21 +01:00
Kir Kolyshkin	b006f4a180	libct/cgroups: support Cgroups.Resources.Unified Add support for unified resource map (as per [1]), and add some test cases for the new functionality. [1] https://github.com/opencontainers/runtime-spec/pull/1040 Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-09-24 15:29:35 -07:00
Kir Kolyshkin	108ee85b82	libct/cgroups: add SkipDevices to Resources The kubelet uses libct/cgroups code to set up cgroups. It creates a parent cgroup (kubepods) to put the containers into. The problem (for cgroupv2 that uses eBPF for device configuration) is that the hard requirement to have devices cgroup configured results in leaking an eBPF program upon every kubelet restart. If kubelet is restarted 64+ times, the cgroup can't be configured anymore. Work around this by adding a SkipDevices flag to Resources. A check was added so that if SkipDevices is set, such a "container" can't be started (to make sure it is only used for non-containers). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-07-02 15:19:31 -07:00
Kir Kolyshkin	4189cb65f8	cgroups: remove cgroup.Resources.CpuMax This (and the converting function) is only used by one of the four cgroup drivers. The other three do some checking and conversion in place, so let the fs2 do the same. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-06-09 17:15:38 -07:00
Aleksa Sarai	24388be71e	configs: use different types for .Devices and .Resources.Devices Making them the same type is simply confusing, but also means that you could accidentally use one in the wrong context. This eliminates that problem. This also includes a whole bunch of cleanups for the types within DeviceRule, so that they can be used more ergonomically. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2020-05-13 17:38:45 +10:00
Aleksa Sarai	b2bec9806f	cgroup: devices: eradicate the Allow/Deny lists These lists have been in the codebase for a very long time, and have been unused for a large portion of that time -- specconv doesn't generate them and the only user of these flags has been tests (which doesn't inspire much confidence). In addition, we had an incorrect implementation of a white-list policy. This wasn't exploitable because all of our users explicitly specify "deny all" as the first rule, but it was a pretty glaring issue that came from the "feature" that users can select whether they prefer a white- or black- list. Fix this by always writing a deny-all rule (which is what our users were doing anyway, to work around this bug). This is one of many changes needed to clean up the devices cgroup code. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2020-05-13 17:38:45 +10:00
Akihiro Suda	492d525e55	vendor: update go-systemd and godbus Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>	2020-03-16 13:26:03 +09:00
Kir Kolyshkin	4c5c3fb960	Support for setting systemd properties via annotations In case systemd is used to set cgroups for the container, it creates a scope unit dedicated to it (usually named `runc-$ID.scope`). This patch adds the ability to set arbitrary systemd properties for the systemd unit via runtime spec annotations. Initially this was developed as an ability to specify the `TimeoutStopUSec` property, but later generalized to work with arbitrary ones. Example usage: add the following to runtime spec (config.json): ``` "annotations": { "org.systemd.property.TimeoutStopUSec": "uint64 123456789", "org.systemd.property.CollectMode":"'inactive-or-failed'" }, ``` and start the container (e.g. `runc --systemd-cgroup run $ID`). The above will set the following systemd parameters: * `TimeoutStopSec` to 2 minutes and 3 seconds, * `CollectMode` to "inactive-or-failed". The values are in the gvariant format (see [1]). To figure out which type systemd expects for a particular parameter, see systemd sources. In particular, parameters with `USec` suffix require a `uint64` typed argument, while gvariant assumes int32 for numeric values, therefore the explicit type is required. NOTE that systemd receives the time-typed parameters as USec but shows them (in `systemctl show`) as Sec. For example, the stop timeout should be set as `TimeoutStopUSec` but is shown as `TimeoutStopSec`. [1] https://developer.gnome.org/glib/stable/gvariant-text.html Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-02-17 16:07:19 -08:00
Giuseppe Scrivano	1932917b71	libcontainer: add initial support for cgroups v2 allow to set what subsystems are used by libcontainer/cgroups/fs.Manager. subsystemsUnified is used on a system running with cgroups v2 unified mode. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2019-09-05 13:02:25 +02:00
Justin Cormack	3d9074ead3	Update memory specs to use int64 not uint64 replace #1492 #1494 fix #1422 Since https://github.com/opencontainers/runtime-spec/pull/876 the memory specifications are now `int64`, as that better matches the visible interface where `-1` is a valid value. Otherwise finding the correct value was difficult as it was kernel dependent. Signed-off-by: Justin Cormack <justin.cormack@docker.com>	2017-06-27 12:16:07 +01:00
Justin Cormack	4c67360296	Clean up unix vs linux usage FreeBSD does not support cgroups or namespaces, which the code suggested, and is not supported in runc anyway right now. So clean up the file naming to use `_linux` where appropriate. Signed-off-by: Justin Cormack <justin.cormack@docker.com>	2017-05-12 17:22:09 +01:00

25 Commits