zishuo/runc

mirror of https://github.com/opencontainers/runc.git synced 2025-10-18 05:20:53 +08:00

Author	SHA1	Message	Date
Kir Kolyshkin	13afa58d0e	libct/cg/sd/v2: support cpuset.* / Allowed* * cpuset.cpus -> AllowedCPUs * cpuset.mems -> AllowedMemoryNodes No test for cgroup v2 resources.unified override, as this requires a separate test case, and all the unified resources are handled uniformly so there's little sense to test all parameters. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-11-05 16:04:57 -08:00
Kir Kolyshkin	5be8b97aec	libct/cg/sd/v2: support cpu.weight / CPUWeight Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-11-05 16:04:46 -08:00
Kir Kolyshkin	ab80eb32d2	libct/cg/sd/v2: support cpu.max unified resource Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-11-05 16:04:37 -08:00
Kir Kolyshkin	0cb8bf67a3	Initial v2 resources.unified systemd support In case systemd is used as cgroups manager, and a user sets some resources using unified resource map (as per [1]), systemd is not aware of any parameters, so there will be a discrepancy between the cgroupfs state and systemd unit state. Let's try to fix that by converting known unified resources to systemd properties. Currently, this is only implemented for pids.max as a POC. Some other parameters (that might or might not have systemd unit property equivalents) are: $ ls -l | grep w- -rw-r--r--. 1 root root 0 Oct 10 13:57 cgroup.freeze -rw-r--r--. 1 root root 0 Oct 10 13:57 cgroup.max.depth -rw-r--r--. 1 root root 0 Oct 10 13:57 cgroup.max.descendants -rw-r--r--. 1 root root 0 Oct 10 13:57 cgroup.procs -rw-r--r--. 1 root root 0 Oct 21 09:43 cgroup.subtree_control -rw-r--r--. 1 root root 0 Oct 10 13:57 cgroup.threads -rw-r--r--. 1 root root 0 Oct 10 13:57 cgroup.type -rw-r--r--. 1 root root 0 Oct 22 10:30 cpu.max -rw-r--r--. 1 root root 0 Oct 10 13:57 cpu.pressure -rw-r--r--. 1 root root 0 Oct 22 10:30 cpuset.cpus -rw-r--r--. 1 root root 0 Oct 22 10:30 cpuset.cpus.partition -rw-r--r--. 1 root root 0 Oct 22 10:30 cpuset.mems -rw-r--r--. 1 root root 0 Oct 22 10:30 cpu.weight -rw-r--r--. 1 root root 0 Oct 22 10:30 cpu.weight.nice -rw-r--r--. 1 root root 0 Oct 22 10:30 hugetlb.1GB.max -rw-r--r--. 1 root root 0 Oct 22 10:30 hugetlb.1GB.rsvd.max -rw-r--r--. 1 root root 0 Oct 22 10:30 hugetlb.2MB.max -rw-r--r--. 1 root root 0 Oct 22 10:30 hugetlb.2MB.rsvd.max -rw-r--r--. 1 root root 0 Oct 22 10:30 io.bfq.weight -rw-r--r--. 1 root root 0 Oct 22 10:30 io.latency -rw-r--r--. 1 root root 0 Oct 22 10:30 io.max -rw-r--r--. 1 root root 0 Oct 10 13:57 io.pressure -rw-r--r--. 1 root root 0 Oct 22 10:30 io.weight -rw-r--r--. 1 root root 0 Oct 10 13:57 memory.high -rw-r--r--. 1 root root 0 Oct 10 13:57 memory.low -rw-r--r--. 1 root root 0 Oct 10 13:57 memory.max -rw-r--r--. 1 root root 0 Oct 10 13:57 memory.min -rw-r--r--. 1 root root 0 Oct 10 13:57 memory.oom.group -rw-r--r--. 1 root root 0 Oct 10 13:57 memory.pressure -rw-r--r--. 1 root root 0 Oct 10 13:57 memory.swap.high -rw-r--r--. 1 root root 0 Oct 10 13:57 memory.swap.max Surely, it is a manual conversion for every such case... [1] https://github.com/opencontainers/runtime-spec/pull/1040 Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-11-05 16:04:19 -08:00
Kir Kolyshkin	108ee85b82	libct/cgroups: add SkipDevices to Resources The kubelet uses libct/cgroups code to set up cgroups. It creates a parent cgroup (kubepods) to put the containers into. The problem (for cgroupv2 that uses eBPF for device configuration) is that the hard requirement to have the devices cgroup configured results in leaking an eBPF program upon every kubelet restart. If kubelet is restarted 64+ times, the cgroup can't be configured anymore. Work around this by adding a SkipDevices flag to Resources. A check was added so that if SkipDevices is set, such a "container" can't be started (to make sure it is only used for non-containers). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-07-02 15:19:31 -07:00
Mrunal Patel	406298fdf0	Merge pull request #2466 from kolyshkin/systemd-cpu-quota-period cgroups/systemd: add setting CPUQuotaPeriod prop	2020-06-17 12:03:30 -07:00
Kir Kolyshkin	e751a168dc	cgroups/systemd: add setting CPUQuotaPeriod prop For some reason, runc systemd drivers (both v1 and v2) never set systemd unit property named `CPUQuotaPeriod` (known as `CPUQuotaPeriodUSec` on dbus and in `systemctl show` output). Set it, and add a check to all the integration tests. The check is less than trivial because, when not set, the value is shown as "infinity" but when set to the same (default) value, shown as "100ms", so in case we expect 100ms (period = 100000 us), we have to _also_ check for "infinity". [v2: add systemd version checks since CPUQuotaPeriod requires v242+] Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-06-16 15:48:06 -07:00
Kir Kolyshkin	5b247e739c	Merge pull request #2338 from lifubang/systemdcgroupv2 fix path error in systemd when stopped LGTMs: @mrunalp @AkihiroSuda	2020-06-15 18:01:13 -07:00
Kir Kolyshkin	8b9646775e	cgroups/systemd: unify adding CpuQuota The code that adds CpuQuotaPerSecUSec is the same in v1 and v2 systemd cgroup driver. Move it to common. No functional change. Note that the comment telling that we always set this property contradicts the current code, and therefore it is removed. [v2: drop cgroupv1-specific comment] [v3: drop returning error as it's not used] [v4: remove an obsoleted comment] Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-06-09 17:14:43 -07:00
Kir Kolyshkin	2ce20ed158	cgroups/systemd: simplify gen*ResourcesProperties Use r instead of c.Resources for readability. No functional change. This commit has been brought to you by '<,'>s/c\.Resources\./r./g Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-06-08 13:42:09 -07:00
lifubang	9087f2e827	fix path error in systemd when stopped When we use cgroup with systemd driver, the cgroup path will be auto removed by systemd when all processes have exited. So we should check that the cgroup path exists when we access it, for example in `kill/ps`, or else we will get an error. Signed-off-by: lifubang <lifubang@acmcoder.com>	2020-06-02 18:17:43 +08:00
Kir Kolyshkin	3c6e8ac4d2	cgroupv2: set mem+swap to max if mem set to max ... and mem+swap is not explicitly set otherwise. This ensures compatibility with cgroupv1 controller which interprets things this way. With this fixed, we can finally enable swap tests for cgroupv2. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-05-22 21:32:16 -07:00
Kir Kolyshkin	59897367c4	cgroups/systemd: allow to set -1 as pids.limit Currently, both systemd cgroup drivers (v1 and v2) only set "TasksMax" unit property if the value > 0, so there is no way to update the limit to -1 / unlimited / infinity / max. Since systemd driver is backed by fs driver, and both fs and fs2 set the limit of -1 properly, it works, but systemd still has the old value: # runc --systemd-cgroup update $CT --pids-limit 42 # systemctl show runc-$CT.scope | grep TasksMax TasksMax=42 # cat /sys/fs/cgroup/system.slice/runc-$CT.scope/pids.max 42 # ./runc --systemd-cgroup update $CT --pids-limit -1 # systemctl show runc-$CT.scope | grep TasksMax= TasksMax=42 # cat /sys/fs/cgroup/system.slice/runc-xx77.scope/pids.max max Fix by changing the condition to allow -1 as a valid value. NOTE other negative values are still being ignored by systemd drivers (as it was done before). I am not sure whether this is correct, or should we return an error. A test case is added. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-05-20 13:20:04 -07:00
Kir Kolyshkin	e4a84bea99	cgroupv2+systemd: set MemoryLow For some reason, this was not set before. Test case is added by the next commit. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-05-20 13:15:29 -07:00
Aleksa Sarai	b810da1490	cgroups: systemd: make use of Device= properties It seems we missed that systemd added support for the devices cgroup, as a result systemd would actually write an allow-all rule each time you did 'runc update'* if you used the systemd cgroup driver. This is obviously ... bad and was a clear security bug. Luckily the commits which introduced this were never in an actual runc release. So we simply generate the cgroupv1-style rules (which is what systemd's DeviceAllow wants) and default to a deny-all ruleset. Unfortunately it turns out that systemd is susceptible to the same spurious error failure that we were, so that problem is out of our hands for systemd cgroup users. However, systemd has a similar bug to the one fixed in [1]. It will happily write a disruptive deny-all rule when it is not necessary. Unfortunately, we cannot even use devices.Emulator to generate a minimal set of transition rules because the DBus API is limited (you can only clear or append to the DeviceAllow= list -- so we are forced to always clear it). To work around this, we simply freeze the container during SetUnitProperties. [1]: `afe83489d4` ("cgroupv1: devices: use minimal transition rules with devices.Emulator") Fixes: `1d4ccc8e0c` ("fix data inconsistent when runc update in systemd driven cgroup v1") Fixes: `7682a2b2a5` ("fix data inconsistent when runc update in systemd driven cgroup v2") Signed-off-by: Aleksa Sarai <asarai@suse.de>	2020-05-13 17:43:56 +10:00
Aleksa Sarai	859a780d6f	cgroups: add GetFreezerState() helper to Manager This is effectively a nicer implementation of the container.isPaused() helper, but to be used within the cgroup code for handling some fun issues we have to fix with the systemd cgroup driver. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2020-05-13 17:38:45 +10:00
Kir Kolyshkin	714c91e9f7	Simplify cgroup path handling in v2 via unified API This unties the Gordian Knot of using GetPaths in cgroupv2 code. The problem is, the current code uses GetPaths for three kinds of things: 1. Get all the paths to cgroup v1 controllers to save its state (see (linuxContainer).currentState(), (LinuxFactory).loadState() methods). 2. Get all the paths to cgroup v1 controllers to have the setns process enter the proper cgroups in `(*setnsProcess).start()`. 3. Get the path to a specific controller (for example, `m.GetPaths()["devices"]`). Now, for cgroup v2 instead of a set of per-controller paths, we have only one single unified path, and a dedicated function `GetUnifiedPath()` to get it. This discrepancy between v1 and v2 cgroupManager API leads to the following problems with the code: - multiple if/else code blocks that have to treat v1 and v2 separately; - backward-compatible GetPaths() methods in v2 controllers; - repeated writing of the PID into the same cgroup for v2. Overall, it's hard to write the right code with all this, and the code that is written is kinda hard to follow. The solution is to slightly change the API to do the 3 things outlined above in the same manner for v1 and v2: 1. Use `GetPaths()` for state saving and setns process cgroups entering. 2. Introduce and use Path(subsys string) to obtain a path to a subsystem. For v2, the argument is ignored and the unified path is returned. This commit converts all the controllers to the new API, and modifies all the users to use it. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-05-08 12:04:06 -07:00
Kir Kolyshkin	24f945e08d	libct/cgroups/systemd/v2: return a public interface Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-05-08 10:06:02 -07:00
Akihiro Suda	bf15cc99b1	cgroup v2: support rootless systemd Tested with both Podman (master) and Moby (master), on Ubuntu 19.10 . $ podman --cgroup-manager=systemd run -it --rm --runtime=runc \ --cgroupns=host --memory 42m --cpus 0.42 --pids-limit 42 alpine / # cat /proc/self/cgroup 0::/user.slice/user-1001.slice/user@1001.service/user.slice/libpod-132ff0d72245e6f13a3bbc6cdc5376886897b60ac59eaa8dea1df7ab959cbf1c.scope / # cat /sys/fs/cgroup/user.slice/user-1001.slice/user@1001.service/user.slice/libpod-132ff0d72245e6f13a3bbc6cdc5376886897b60ac59eaa8dea1df7ab959cbf1c.scope/memory.max 44040192 / # cat /sys/fs/cgroup/user.slice/user-1001.slice/user@1001.service/user.slice/libpod-132ff0d72245e6f13a3bbc6cdc5376886897b60ac59eaa8dea1df7ab959cbf1c.scope/cpu.max 42000 100000 / # cat /sys/fs/cgroup/user.slice/user-1001.slice/user@1001.service/user.slice/libpod-132ff0d72245e6f13a3bbc6cdc5376886897b60ac59eaa8dea1df7ab959cbf1c.scope/pids.max 42 Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>	2020-05-08 12:39:20 +09:00
lifubang	a70f354680	let runc disable swap in cgroup v2 In cgroup v2, when memory and memorySwap are set to the same value, which is greater than zero, runc should write zero to `memory.swap.max` to disable swap. Signed-off-by: lifubang <lifubang@acmcoder.com>	2020-05-03 20:57:36 +08:00
lifubang	bfa1b2aab3	check that StartTransientUnit and StopUnit succeed Signed-off-by: lifubang <lifubang@acmcoder.com>	2020-04-28 15:46:28 +08:00
lifubang	7682a2b2a5	fix data inconsistent when runc update in systemd driven cgroup v2 Signed-off-by: lifubang <lifubang@acmcoder.com>	2020-04-23 19:32:07 +08:00
Kir Kolyshkin	4b4bc995ad	CreateCgroupPath: only enable needed controllers 1. Instead of enabling all available controllers, figure out which ones are required, and only enable those. 2. Amend all setFoo() functions to call isFooSet(). While this might seem unnecessary, it might actually help to uncover a bug. Imagine someone: - adds a cgroup.Resources.CpuFoo setting; - modifies setCpu() to apply the new setting; - but forgets to amend isCpuSet() accordingly <-- BUG In this case, a test case modifying CpuFoo will help to uncover the BUG. This is the reason why it's added. This patch could be amended by enabling controllers on a best-effort basis, i.e. : - do not return an error early if we can't enable some controllers; - if we fail to enable all controllers at once (usually because one of them can't be enabled), try enabling them one by one. Currently this is not implemented, and it's not clear whether this would be a good way to go or not. [v2: add/use is${Controller}Set() functions] [v3: document neededControllers()] [v4: drop "best-effort" part] Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-04-19 16:27:40 -07:00
Kir Kolyshkin	bb47e35843	cgroup/systemd: reorganize 1. Rename the files - v1.go: cgroupv1 aka legacy; - v2.go: cgroupv2 aka unified hierarchy; - unsupported.go: when systemd is not available. 2. Move the code that is common between v1 and v2 to common.go Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-04-19 16:27:40 -07:00
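
The resources.unified commits above (pids.max, cpu.max, cpu.weight, cpuset.*) describe converting known cgroup v2 keys into systemd unit properties so that the unit state matches cgroupfs. The following is only a rough sketch of that idea, not runc's actual code: the `Property` type and both function names are hypothetical, values are kept as unit-file style strings rather than D-Bus variants, the 100000 us default period is an assumption, and the rounding and systemd version checks mentioned in the commit messages are left out.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// Property is a hypothetical, simplified stand-in for a systemd unit
// property. Values are unit-file style strings, not D-Bus variants.
type Property struct {
	Name  string
	Value string
}

// unifiedToProperties (hypothetical name) maps the unified keys named in
// the commits above to unit properties. Unknown keys are skipped, so
// they would only ever reach cgroupfs.
func unifiedToProperties(unified map[string]string) ([]Property, error) {
	keys := make([]string, 0, len(unified))
	for k := range unified {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic output order

	var props []Property
	for _, k := range keys {
		v := strings.TrimSpace(unified[k])
		switch k {
		case "pids.max":
			if v == "max" {
				v = "infinity"
			}
			props = append(props, Property{"TasksMax", v})
		case "cpu.weight":
			props = append(props, Property{"CPUWeight", v})
		case "cpuset.cpus":
			props = append(props, Property{"AllowedCPUs", v})
		case "cpuset.mems":
			props = append(props, Property{"AllowedMemoryNodes", v})
		case "cpu.max":
			p, err := cpuMaxToProperties(v)
			if err != nil {
				return nil, err
			}
			props = append(props, p...)
		}
	}
	return props, nil
}

// cpuMaxToProperties parses "QUOTA [PERIOD]", where QUOTA may be "max".
func cpuMaxToProperties(v string) ([]Property, error) {
	fields := strings.Fields(v)
	if len(fields) < 1 || len(fields) > 2 {
		return nil, fmt.Errorf("malformed cpu.max value %q", v)
	}
	period := uint64(100000) // assumed default when PERIOD is omitted
	var props []Property
	if len(fields) == 2 {
		p, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil || p == 0 {
			return nil, fmt.Errorf("bad cpu.max period %q", fields[1])
		}
		period = p
		props = append(props, Property{"CPUQuotaPeriodUSec", fields[1]})
	}
	if fields[0] == "max" {
		return append(props, Property{"CPUQuotaPerSecUSec", "infinity"}), nil
	}
	quota, err := strconv.ParseUint(fields[0], 10, 64)
	if err != nil {
		return nil, fmt.Errorf("bad cpu.max quota %q", fields[0])
	}
	perSec := quota * 1000000 / period // microseconds of CPU time per second
	return append(props, Property{"CPUQuotaPerSecUSec", strconv.FormatUint(perSec, 10)}), nil
}

func main() {
	props, err := unifiedToProperties(map[string]string{
		"pids.max": "max",
		"cpu.max":  "42000 100000",
	})
	if err != nil {
		panic(err)
	}
	for _, p := range props {
		fmt.Printf("%s=%s\n", p.Name, p.Value)
	}
}
```

Skipping unknown keys instead of failing mirrors the commit message's remark that some parameters "might or might not have systemd unit property equivalents"; a real driver would still write every key to cgroupfs.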

24 Commits
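
The two swap-related commits above (3c6e8ac4d2 and a70f354680) state rules for deriving `memory.swap.max` on cgroup v2 from the v1-style pair of memory and memory+swap limits. As a hedged illustration only, the sketch below encodes those two rules in a hypothetical helper, `swapMaxValue`. The general case (subtracting memory from memory+swap) and the error cases are assumptions added for completeness, based on `memory.swap.max` limiting swap alone; they are not taken from the commit messages.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// swapMaxValue (hypothetical helper) returns the string to write to
// memory.swap.max, or "" when nothing should be written.
//
// memory and memorySwap use cgroup v1 semantics: memorySwap is the
// combined memory+swap limit, -1 means unlimited, 0 means not set.
func swapMaxValue(memory, memorySwap int64) (string, error) {
	switch {
	case memorySwap == -1:
		return "max", nil
	case memorySwap == 0 && memory == -1:
		// Rule from 3c6e8ac4d2: memory is "max" and mem+swap is not
		// explicitly set, so swap becomes "max" as well.
		return "max", nil
	case memorySwap == 0:
		return "", nil // not set: leave memory.swap.max untouched
	case memory <= 0:
		// Assumption: a finite mem+swap without a finite memory limit
		// cannot be converted to a swap-only value.
		return "", errors.New("memory+swap limit set without a memory limit")
	case memorySwap < memory:
		return "", errors.New("memory+swap limit is below the memory limit")
	case memorySwap == memory:
		// Rule from a70f354680: equal, positive values disable swap.
		return "0", nil
	default:
		// Assumption: v2 limits swap alone, so subtract the memory part.
		return strconv.FormatInt(memorySwap-memory, 10), nil
	}
}

func main() {
	for _, c := range [][2]int64{{100, 100}, {-1, 0}, {100, 300}} {
		v, err := swapMaxValue(c[0], c[1])
		fmt.Printf("memory=%d memorySwap=%d -> %q (err: %v)\n", c[0], c[1], v, err)
	}
}
```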