Commit Graph

24 Commits

Author SHA1 Message Date
Kir Kolyshkin
13afa58d0e libct/cg/sd/v2: support cpuset.* / Allowed*
* cpuset.cpus -> AllowedCPUs
 * cpuset.mems -> AllowedMemoryNodes

No test for cgroup v2 resources.unified override, as this requires a
separate test case, and all the unified resources are handled uniformly
so there's little sense to test all parameters.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-11-05 16:04:57 -08:00
Kir Kolyshkin
5be8b97aec libct/cg/sd/v2: support cpu.weight / CPUWeight
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-11-05 16:04:46 -08:00
Kir Kolyshkin
ab80eb32d2 libct/cg/sd/v2: support cpu.max unified resource
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-11-05 16:04:37 -08:00
Kir Kolyshkin
0cb8bf67a3 Initial v2 resources.unified systemd support
In case systemd is used as cgroups manager, and a user sets some
resources using unified resource map (as per [1]), systemd is not
aware of any parameters, so there will be a discrepancy between
the cgroupfs state and systemd unit state.

Let's try to fix that by converting known unified resources to systemd
properties.

Currently, this is only implemented for pids.max as a POC.

Some other parameters (that might or might not have systemd unit
property equivalents) are:

$ ls -l | grep w-
-rw-r--r--. 1 root root 0 Oct 10 13:57 cgroup.freeze
-rw-r--r--. 1 root root 0 Oct 10 13:57 cgroup.max.depth
-rw-r--r--. 1 root root 0 Oct 10 13:57 cgroup.max.descendants
-rw-r--r--. 1 root root 0 Oct 10 13:57 cgroup.procs
-rw-r--r--. 1 root root 0 Oct 21 09:43 cgroup.subtree_control
-rw-r--r--. 1 root root 0 Oct 10 13:57 cgroup.threads
-rw-r--r--. 1 root root 0 Oct 10 13:57 cgroup.type
-rw-r--r--. 1 root root 0 Oct 22 10:30 cpu.max
-rw-r--r--. 1 root root 0 Oct 10 13:57 cpu.pressure
-rw-r--r--. 1 root root 0 Oct 22 10:30 cpuset.cpus
-rw-r--r--. 1 root root 0 Oct 22 10:30 cpuset.cpus.partition
-rw-r--r--. 1 root root 0 Oct 22 10:30 cpuset.mems
-rw-r--r--. 1 root root 0 Oct 22 10:30 cpu.weight
-rw-r--r--. 1 root root 0 Oct 22 10:30 cpu.weight.nice
-rw-r--r--. 1 root root 0 Oct 22 10:30 hugetlb.1GB.max
-rw-r--r--. 1 root root 0 Oct 22 10:30 hugetlb.1GB.rsvd.max
-rw-r--r--. 1 root root 0 Oct 22 10:30 hugetlb.2MB.max
-rw-r--r--. 1 root root 0 Oct 22 10:30 hugetlb.2MB.rsvd.max
-rw-r--r--. 1 root root 0 Oct 22 10:30 io.bfq.weight
-rw-r--r--. 1 root root 0 Oct 22 10:30 io.latency
-rw-r--r--. 1 root root 0 Oct 22 10:30 io.max
-rw-r--r--. 1 root root 0 Oct 10 13:57 io.pressure
-rw-r--r--. 1 root root 0 Oct 22 10:30 io.weight
-rw-r--r--. 1 root root 0 Oct 10 13:57 memory.high
-rw-r--r--. 1 root root 0 Oct 10 13:57 memory.low
-rw-r--r--. 1 root root 0 Oct 10 13:57 memory.max
-rw-r--r--. 1 root root 0 Oct 10 13:57 memory.min
-rw-r--r--. 1 root root 0 Oct 10 13:57 memory.oom.group
-rw-r--r--. 1 root root 0 Oct 10 13:57 memory.pressure
-rw-r--r--. 1 root root 0 Oct 10 13:57 memory.swap.high
-rw-r--r--. 1 root root 0 Oct 10 13:57 memory.swap.max

Surely, it is a manual conversion for every such case...

[1] https://github.com/opencontainers/runtime-spec/pull/1040

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-11-05 16:04:19 -08:00
Kir Kolyshkin
108ee85b82 libct/cgroups: add SkipDevices to Resources
The kubelet uses libct/cgroups code to set up cgroups. It creates a
parent cgroup (kubepods) to put the containers into.

The problem (for cgroupv2 that uses eBPF for device configuration) is
the hard requirement to have devices cgroup configured results in
leaking an eBPF program upon every kubelet restart.  program. If kubelet
is restarted 64+ times, the cgroup can't be configured anymore.

Work around this by adding a SkipDevices flag to Resources.

A check was added so that if SkipDevices is set, such a "container"
can't be started (to make sure it is only used for non-containers).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-07-02 15:19:31 -07:00
Mrunal Patel
406298fdf0 Merge pull request #2466 from kolyshkin/systemd-cpu-quota-period
cgroups/systemd: add setting CPUQuotaPeriod prop
2020-06-17 12:03:30 -07:00
Kir Kolyshkin
e751a168dc cgroups/systemd: add setting CPUQuotaPeriod prop
For some reason, runc systemd drivers (both v1 and v2) never set
systemd unit property named `CPUQuotaPeriod` (known as
`CPUQuotaPeriodUSec` on dbus and in `systemctl show` output).

Set it, and add a check to all the integration tests. The check is less
than trivial because, when not set, the value is shown as "infinity" but
when set to the same (default) value, shown as "100ms", so in case we
expect 100ms (period = 100000 us), we have to _also_ check for
"infinity".

[v2: add systemd version checks since CPUQuotaPeriod requires v242+]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-16 15:48:06 -07:00
Kir Kolyshkin
5b247e739c Merge pull request #2338 from lifubang/systemdcgroupv2
fix path error in systemd when stopped

LGTMs: @mrunalp @AkihiroSuda
2020-06-15 18:01:13 -07:00
Kir Kolyshkin
8b9646775e cgroups/systemd: unify adding CpuQuota
The code that adds CpuQuotaPerSecUSec is the same in v1 and v2
systemd cgroup driver. Move it to common.

No functional change.

Note that the comment telling that we always set this property
contradicts with the current code, and therefore it is removed.

[v2: drop cgroupv1-specific comment]
[v3: drop returning error as it's not used]
[v4: remove an obsoleted comment]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-09 17:14:43 -07:00
Kir Kolyshkin
2ce20ed158 cgroups/systemd: simplify gen*ResourcesProperties
Use r instead of c.Resources for readability. No functional change.

This commit has been brought to you by '<,'>s/c\.Resources\./r./g

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-08 13:42:09 -07:00
lifubang
9087f2e827 fix path error in systemd when stopped
When we use cgroup with systemd driver, the cgroup path will be auto removed
by systemd when all processes exited. So we should check cgroup path exists
when we access the cgroup path, for example in `kill/ps`, or else we will
got an error.

Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-06-02 18:17:43 +08:00
Kir Kolyshkin
3c6e8ac4d2 cgroupv2: set mem+swap to max if mem set to max
... and mem+swap is not explicitly set otherwise.

This ensures compatibility with cgroupv1 controller which interprets
things this way.

With this fixed, we can finally enable swap tests for cgroupv2.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-22 21:32:16 -07:00
Kir Kolyshkin
59897367c4 cgroups/systemd: allow to set -1 as pids.limit
Currently, both systemd cgroup drivers (v1 and v2) only set
"TasksMax" unit property if the value > 0, so there is no
way to update the limit to -1 / unlimited / infinity / max.

Since systemd driver is backed by fs driver, and both fs and fs2
set the limit of -1 properly, it works, but systemd still has
the old value:

 # runc --systemd-cgroup update $CT --pids-limit 42
 # systemctl show runc-$CT.scope | grep TasksMax
 TasksMax=42
 # cat /sys/fs/cgroup/system.slice/runc-$CT.scope/pids.max
 42

 # ./runc --systemd-cgroup update $CT --pids-limit -1
 # systemctl show runc-$CT.scope | grep TasksMax=
 TasksMax=42
 # cat /sys/fs/cgroup/system.slice/runc-xx77.scope/pids.max
 max

Fix by changing the condition to allow -1 as a valid value.

NOTE other negative values are still being ignored by systemd drivers
(as it was done before). I am not sure whether this is correct, or
should we return an error.

A test case is added.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-20 13:20:04 -07:00
Kir Kolyshkin
e4a84bea99 cgroupv2+systemd: set MemoryLow
For some reason, this was not set before.

Test case is added by the next commit.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-20 13:15:29 -07:00
Aleksa Sarai
b810da1490 cgroups: systemd: make use of Device*= properties
It seems we missed that systemd added support for the devices cgroup, as
a result systemd would actually *write an allow-all rule each time you
did 'runc update'* if you used the systemd cgroup driver. This is
obviously ... bad and was a clear security bug. Luckily the commits which
introduced this were never in an actual runc release.

So we simply generate the cgroupv1-style rules (which is what systemd's
DeviceAllow wants) and default to a deny-all ruleset. Unfortunately it
turns out that systemd is susceptible to the same spurrious error
failure that we were, so that problem is out of our hands for systemd
cgroup users.

However, systemd has a similar bug to the one fixed in [1]. It will
happily write a disruptive deny-all rule when it is not necessary.
Unfortunately, we cannot even use devices.Emulator to generate a minimal
set of transition rules because the DBus API is limited (you can only
clear or append to the DeviceAllow= list -- so we are forced to always
clear it). To work around this, we simply freeze the container during
SetUnitProperties.

[1]: afe83489d4 ("cgroupv1: devices: use minimal transition rules with devices.Emulator")

Fixes: 1d4ccc8e0c ("fix data inconsistent when runc update in systemd driven cgroup v1")
Fixes: 7682a2b2a5 ("fix data inconsistent when runc update in systemd driven cgroup v2")
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:43:56 +10:00
Aleksa Sarai
859a780d6f cgroups: add GetFreezerState() helper to Manager
This is effectively a nicer implementation of the container.isPaused()
helper, but to be used within the cgroup code for handling some fun
issues we have to fix with the systemd cgroup driver.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:38:45 +10:00
Kir Kolyshkin
714c91e9f7 Simplify cgroup path handing in v2 via unified API
This unties the Gordian Knot of using GetPaths in cgroupv2 code.

The problem is, the current code uses GetPaths for three kinds of things:

1. Get all the paths to cgroup v1 controllers to save its state (see
   (*linuxContainer).currentState(), (*LinuxFactory).loadState()
   methods).

2. Get all the paths to cgroup v1 controllers to have the setns process
    enter the proper cgroups in `(*setnsProcess).start()`.

3. Get the path to a specific controller (for example,
   `m.GetPaths()["devices"]`).

Now, for cgroup v2 instead of a set of per-controller paths, we have only
one single unified path, and a dedicated function `GetUnifiedPath()` to get it.

This discrepancy between v1 and v2 cgroupManager API leads to the
following problems with the code:

 - multiple if/else code blocks that have to treat v1 and v2 separately;

 - backward-compatible GetPaths() methods in v2 controllers;

 -  - repeated writing of the PID into the same cgroup for v2;

Overall, it's hard to write the right code with all this, and the code
that is written is kinda hard to follow.

The solution is to slightly change the API to do the 3 things outlined
above in the same manner for v1 and v2:

1. Use `GetPaths()` for state saving and setns process cgroups entering.

2. Introduce and use Path(subsys string) to obtain a path to a
   subsystem. For v2, the argument is ignored and the unified path is
   returned.

This commit converts all the controllers to the new API, and modifies
all the users to use it.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 12:04:06 -07:00
Kir Kolyshkin
24f945e08d libct/cgroups/systemd/v2: return a public interface
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 10:06:02 -07:00
Akihiro Suda
bf15cc99b1 cgroup v2: support rootless systemd
Tested with both Podman (master) and Moby (master), on Ubuntu 19.10 .

$ podman --cgroup-manager=systemd run -it --rm --runtime=runc \
  --cgroupns=host --memory 42m --cpus 0.42 --pids-limit 42 alpine
/ # cat /proc/self/cgroup
0::/user.slice/user-1001.slice/user@1001.service/user.slice/libpod-132ff0d72245e6f13a3bbc6cdc5376886897b60ac59eaa8dea1df7ab959cbf1c.scope
/ # cat /sys/fs/cgroup/user.slice/user-1001.slice/user@1001.service/user.slice/libpod-132ff0d72245e6f13a3bbc6cdc5376886897b60ac59eaa8dea1df7ab959cbf1c.scope/memory.max
44040192
/ # cat /sys/fs/cgroup/user.slice/user-1001.slice/user@1001.service/user.slice/libpod-132ff0d72245e6f13a3bbc6cdc5376886897b60ac59eaa8dea1df7ab959cbf1c.scope/cpu.max
42000 100000
/ # cat /sys/fs/cgroup/user.slice/user-1001.slice/user@1001.service/user.slice/libpod-132ff0d72245e6f13a3bbc6cdc5376886897b60ac59eaa8dea1df7ab959cbf1c.scope/pids.max
42

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-05-08 12:39:20 +09:00
lifubang
a70f354680 let runc disable swap in cgroup v2
In cgroup v2, when memory and memorySwap set to the same value which is greater than zero,
runc should write zero in `memory.swap.max` to disable swap.

Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-05-03 20:57:36 +08:00
lifubang
bfa1b2aab3 check that StartTransientUnit and StopUnit succeeds
Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-04-28 15:46:28 +08:00
lifubang
7682a2b2a5 fix data inconsistent when runc update in systemd driven cgroup v2
Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-04-23 19:32:07 +08:00
Kir Kolyshkin
4b4bc995ad CreateCgroupPath: only enable needed controllers
1. Instead of enabling all available controllers, figure out which
   ones are required, and only enable those.

2. Amend all setFoo() functions to call isFooSet(). While this might
   seem unnecessary, it might actually help to uncover a bug.
   Imagine someone:
    - adds a cgroup.Resources.CpuFoo setting;
    - modifies setCpu() to apply the new setting;
    - but forgets to amend isCpuSet() accordingly <-- BUG

   In this case, a test case modifying CpuFoo will help
   to uncover the BUG. This is the reason why it's added.

This patch *could be* amended by enabling controllers on a best-effort
basis, i.e. :

 - do not return an error early if we can't enable some controllers;
 - if we fail to enable all controllers at once (usually because one
   of them can't be enabled), try enabling them one by one.

Currently this is not implemented, and it's not clear whether this
would be a good way to go or not.

[v2: add/use is${Controller}Set() functions]
[v3: document neededControllers()]
[v4: drop "best-effort" part]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:27:40 -07:00
Kir Kolyshkin
bb47e35843 cgroup/systemd: reorganize
1. Rename the files
  - v1.go: cgroupv1 aka legacy;
  - v2.go: cgroupv2 aka unified hierarchy;
  - unsupported.go: when systemd is not available.

2. Move the code that is common between v1 and v2 to common.go

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:27:40 -07:00