Commit Graph

51 Commits

Author SHA1 Message Date
Kir Kolyshkin
e6048715e4 Use gofumpt to format code
gofumpt (mvdan.cc/gofumpt) is a fork of gofmt with stricter rules.

Brought to you by

	git ls-files \*.go | grep -v ^vendor/ | xargs gofumpt -s -w

Looking at the diff, all these changes make sense.

Also, replace gofmt with gofumpt in golangci.yml.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-01 12:17:27 -07:00
Aleksa Sarai
63ee74376e merge branch 'pr-2958'
Kir Kolyshkin (2)
  libct/cg/sd: fix SkipDevices for systemd
  libct/cg/sd: add SkipDevices unit test

LGTMs: mrunalp AkihiroSuda cyphar
Closes #2958
2021-05-28 14:24:05 +10:00
Kir Kolyshkin
752e7a8249 libct/cg/sd: fix SkipDevices for systemd
Commit 108ee85b82 adds SkipDevices flag, which is used by kubernetes
to create cgroups for pods.

Unfortunately, the above commit falls short: the systemd DevicePolicy and
DeviceAllow properties are still set, which requires kubernetes to set
an "allow everything" rule.

This commit fixes this: if SkipDevices flag is set, we return
Device* properties to allow all devices.

Fixes: 108ee85b82
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-05-24 17:00:37 -07:00
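As an illustration of the SkipDevices fix in 752e7a8249, here is a minimal sketch of generating the Device* unit properties. The `property` and `resources` types and the `deviceProperties` function are hypothetical stand-ins, not the runc types. The sketch assumes that `DevicePolicy=auto` combined with an empty `DeviceAllow` list makes systemd allow all devices, and it reduces the non-skip branch to a plain deny-all default.

```go
package main

import "fmt"

// property is a hypothetical stand-in for a systemd unit property.
type property struct {
	Name  string
	Value interface{}
}

// resources is a hypothetical, trimmed-down version of the cgroup resources.
type resources struct {
	SkipDevices bool
}

// deviceProperties returns the Device* unit properties for r.
func deviceProperties(r resources) []property {
	if r.SkipDevices {
		// Assumption: policy "auto" with no DeviceAllow entries
		// means that systemd allows access to all devices.
		return []property{
			{Name: "DevicePolicy", Value: "auto"},
			{Name: "DeviceAllow", Value: []string{}},
		}
	}
	// Simplified default: deny everything that is not explicitly allowed.
	// A real driver would append the container's device rules here.
	return []property{
		{Name: "DevicePolicy", Value: "strict"},
		{Name: "DeviceAllow", Value: []string{}},
	}
}

func main() {
	for _, p := range deviceProperties(resources{SkipDevices: true}) {
		fmt.Printf("%s=%v\n", p.Name, p.Value)
	}
}
```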
Kir Kolyshkin
33c9f8b9c7 libct/cg/sd: return error from stopUnit
Historically, we never returned an error from failed startUnit
or stopUnit. The startUnit case was fixed by commit 3844789.

It is time to fix stopUnit, too. The reasons are:

1. Ignoring an error from stopUnit means unexpected trouble down the
   road, for example a failure to create a container with the same name:

   > time="2021-05-07T19:51:27Z" level=error msg="container_linux.go:380: starting container process caused: process_linux.go:385: applying cgroup configuration for process caused: Unit runc-test_busybox.scope already exists."

2. A somewhat short timeout of 1 second means the cgroup might
   actually be removed a few seconds later but we might have a
   race between removing the cgroup and creating another one
   with the same name, resulting in the same error as above.

So, return an error if removal failed, and increase the timeout.

Now, modify the systemd cgroup v1 manager to not mask the error from
stopUnit (stopErr) with the subsequent one from cgroups.RemovePath,
as stopErr is most probably the reason why RemovePath failed.

Note that for v1 we do want to remove the paths even in case of a
failure from stopUnit, as some were not created by systemd.
There's no need to do that for v2, thanks to unified hierarchy,
so no changes there.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-05-12 11:40:21 -07:00
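The error handling described in 33c9f8b9c7 for the v1 manager can be sketched roughly as follows. `destroyV1`, its two callbacks, and the example errors are hypothetical names used only for illustration. The point is the ordering: paths are still removed after a failed stop, but the stop error takes precedence.

```go
package main

import (
	"errors"
	"fmt"
)

// Example errors, used only to demonstrate which one is reported.
var (
	errStop = errors.New("stop timed out")
	errBusy = errors.New("device or resource busy")
)

// destroyV1 stops the unit, then removes the cgroup paths regardless of the
// outcome, and prefers the stop error when both steps fail.
func destroyV1(stopUnit, removePaths func() error) error {
	stopErr := stopUnit()
	// For v1, paths are removed even if stopUnit failed, since some of
	// them were not created by systemd.
	rmErr := removePaths()
	if stopErr != nil {
		// stopErr is most probably why removal failed; do not mask it.
		return stopErr
	}
	return rmErr
}

func main() {
	err := destroyV1(
		func() error { return errStop },
		func() error { return errBusy },
	)
	fmt.Println("reported error:", err)
}
```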
Kir Kolyshkin
3f65946756 libct/cg: make Set accept configs.Resources
A cgroup manager's Set method sets cgroup resources, but historically
it was accepting configs.Cgroups.

Refactor it to accept resources only. This is an improvement from the
API point of view, as the method can not change cgroup configuration
(such as path to the cgroup etc), it can only set (modify) its
resources/limits.

This also lays the foundation for complicated resource updates, as now
Set has two sets of resources -- the one that was previously specified
during cgroup manager creation (or the previous Set), and the one passed
in the argument, so it could deduce the difference between these. This
is a long term goal though.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-04-29 15:24:19 -07:00
Kir Kolyshkin
47ef9a104f libct/cg/sd: retry on dbus disconnect
Instead of reconnecting to dbus after some failed operations, and
returning an error (so a caller has to retry), reconnect AND retry
in place for all such operations.

This should fix issues caused by a stale dbus connection after e.g.
a dbus daemon restart.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-04-27 16:19:49 -07:00
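A minimal sketch of the "reconnect AND retry in place" idea from 47ef9a104f. All names here (`connManager`, `conn`, `retryOnDisconnect`, `errClosed`) are hypothetical; `errClosed` merely stands in for a "connection closed" error such as the `dbus.ErrClosed` mentioned in the next commit. The sketch is single-threaded and omits the locking a shared connection would need.

```go
package main

import (
	"errors"
	"fmt"
)

// errClosed stands in for a "connection closed" error.
var errClosed = errors.New("connection closed")

// conn is a hypothetical connection handle.
type conn struct{ id int }

// connManager hands out a shared connection and can replace a stale one.
type connManager struct {
	cur  *conn
	dial func() (*conn, error)
}

// get returns the current connection, dialing a new one if needed.
func (m *connManager) get() (*conn, error) {
	if m.cur == nil {
		c, err := m.dial()
		if err != nil {
			return nil, err
		}
		m.cur = c
	}
	return m.cur, nil
}

// reset drops the connection, but only if it is still the one that failed.
func (m *connManager) reset(stale *conn) {
	if m.cur == stale {
		m.cur = nil
	}
}

// retryOnDisconnect runs op, reconnecting and retrying for as long as op
// reports errClosed. Any other result (including nil) is returned as is.
func (m *connManager) retryOnDisconnect(op func(*conn) error) error {
	for {
		c, err := m.get()
		if err != nil {
			return err
		}
		err = op(c)
		if !errors.Is(err, errClosed) {
			return err
		}
		m.reset(c)
	}
}

func main() {
	dials := 0
	m := &connManager{dial: func() (*conn, error) {
		dials++
		return &conn{id: dials}, nil
	}}
	err := m.retryOnDisconnect(func(c *conn) error {
		if c.id == 1 {
			return errClosed // pretend the first connection went stale
		}
		return nil
	})
	fmt.Println("error:", err, "dials:", dials)
}
```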
Shiming Zhang
15fee9899f libct/cg/sd: add renew dbus connection
[@kolyshkin: doc nits, use dbus.ErrClosed and isDbusError]

Signed-off-by: Shiming Zhang <wzshiming@foxmail.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-04-27 16:16:05 -07:00
Shiming Zhang
cdbed6f02f libct/cg/sd: add dbus manager
[@kolyshkin: documentation nits]

Signed-off-by: Shiming Zhang <wzshiming@foxmail.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-04-27 16:02:39 -07:00
Qiang Huang
2d38476c96 Merge pull request #2840 from kolyshkin/ignore-kmem
Ignore kernel memory settings
2021-04-13 09:44:14 +08:00
Kir Kolyshkin
52390d6804 Ignore kernel memory settings
This is a somewhat radical approach to dealing with kernel memory.

Per-cgroup kernel memory limiting was always problematic. A few
examples:

 - older kernels had bugs and were even oopsing sometimes (best example
   is RHEL7 kernel);
 - kernel is unable to reclaim the kernel memory so once the limit is
   hit a cgroup is toasted;
 - some kernel memory allocations don't allow failing.

In addition to that,

 - users don't have a clue about how to set kernel memory limits
   (as the concept is much more complicated than e.g. [user] memory);
 - different kernels might have different kernel memory usage,
   which is sort of unexpected;
 - cgroup v2 does not have a [dedicated] kmem limit knob, and thus
   runc silently ignores kernel memory limits for v2;
 - kernel v5.4 made the cgroup v1 kmem.limit obsolete (see
   https://github.com/torvalds/linux/commit/0158115f702b).

In view of all this, and as the runtime-spec lists memory.kernel
and memory.kernelTCP as OPTIONAL, let's ignore kernel memory
limits (for cgroup v1, same as we're already doing for v2).

This should result in fewer bugs and a better user experience.

The only bad side effect from it might be that stat can show kernel
memory usage as 0 (since the accounting is not enabled).

[v2: add a warning in specconv that limits are ignored]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-04-12 12:18:11 -07:00
Kir Kolyshkin
f585cec7dc libct/cg/v2: always enable TasksAccounting
This unconditionally enables TasksAccounting for systemd unified (v2)
cgroup driver, making it work the same way as the legacy (v1) driver.

Practically, it is probably a no-op since DefaultTasksAccounting is
usually true.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-03-09 20:17:21 -08:00
Kir Kolyshkin
5d0ffbf9c8 runc start/run: report OOM
In some cases, container init fails to start because it is killed by
the kernel OOM killer. The errors returned by runc in such cases are
semi-random and rather cryptic. Below are a few examples.

On cgroup v1 + systemd cgroup driver:

> process_linux.go:348: copying bootstrap data to pipe caused: write init-p: broken pipe

> process_linux.go:352: getting the final child's pid from pipe caused: EOF

On cgroup v2:

> process_linux.go:495: container init caused: read init-p: connection reset by peer

> process_linux.go:484: writing syncT 'resume' caused: write init-p: broken pipe

This commit adds the OOM method to cgroup managers, which tells whether
the container was OOM-killed. In case that has happened, the original error
is discarded (unless --debug is set), and the new OOM error is reported
instead:

> ERRO[0000] container_linux.go:367: starting container process caused: container init was OOM-killed (memory limit too low?)

Also, fix the rootless test cases that are failing because they expect
an error in the first line, and we have an additional warning now:

> unable to get oom kill count" error="no directory specified for memory.oom_control

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-02-23 16:15:33 -08:00
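The error substitution described in 5d0ffbf9c8 can be sketched like this. `startError` and the `oomKilled` probe are hypothetical names; the probe stands in for the OOM method added to the cgroup managers. The `--debug` handling is left out, and the sketch assumes that the original error is kept whenever the probe itself fails.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	// errOOM uses the message quoted in the commit above.
	errOOM = errors.New("container init was OOM-killed (memory limit too low?)")
	// errBrokenPipe is an example of a cryptic original error.
	errBrokenPipe = errors.New("write init-p: broken pipe")
)

// startError picks the error to report after a failed container start.
func startError(orig error, oomKilled func() (bool, error)) error {
	if orig == nil {
		return nil
	}
	killed, err := oomKilled()
	if err != nil || !killed {
		// Either the probe failed or there was no OOM kill:
		// keep the original error (assumption of this sketch).
		return orig
	}
	return errOOM
}

func main() {
	err := startError(errBrokenPipe, func() (bool, error) { return true, nil })
	fmt.Println(err)
}
```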
Kir Kolyshkin
af521ed580 libct/cgroups/systemd: don't set limits in Apply
All cgroup managers have Apply() and Set() methods:

 - Apply is used to create a cgroup (and, in case of systemd,
   a systemd unit) and/or put a PID into the cgroup (and unit);

 - Set is used to set various cgroup resources and limits.

The fs/fs2 cgroup manager implements the functionality as described above.

The systemd v1/v2 managers deviate: they set *most* of the cgroup limits
(those that can be projected to systemd unit properties) in Apply(),
and then again *all* cgroup limits in Set (first indirectly via systemd
properties -- same as in Apply, then via cgroupfs).

This commit removes setting the cgroup limits from Apply,
so now the systemd manager behaves the same way as the fs manager.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-02-19 18:53:24 -08:00
Kir Kolyshkin
03b512e511 libc/cg: convert r.CPU.Cpus/Mems to systemd props
Support for systemd properties AllowedCPUs and AllowedMemoryNodes
was added by commit 13afa58d0e, but only for unified resources
of systemd v2 driver.

This adds support for Cpu.Cpus and Cpu.Mems resources to
both systemd v1 and v2 cgroup drivers.

An integration test is added to check that the settings work.

[v2: check for systemd version]
[v3: same in the test]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-01-13 19:28:54 -08:00
Kir Kolyshkin
f0fdde79d2 libct/cg/systemd/v1: fix err check in enableKmem
Commit 27d3dd3df3 ("don't fail when subsystem not mounted") added
ignoring "not found" error to enableKmem, and as a result the function
now tries to call Mkdir with an empty path, which results in a weird
error message. For example, this is a failure from a
libcontainer/integration test:

> === RUN   TestRunWithKernelMemorySystemd
>    exec_test.go:704: runContainer failed with kernel memory limit: container_linux.go:370: starting container process caused: process_linux.go:327: applying cgroup configuration for process caused: mkdir : no such file or directory

I am not entirely sure if it is a good idea to silently ignore set
limits, but at least let's fix the error handling.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-10-05 20:51:02 -07:00
Kir Kolyshkin
c1bba720f7 libct/cg/systemd/v1: do not use c.Path
Commit a1d5398afa ("Respect container's cgroup path") added a
cgroupPath argument to FindCgroupMountpoint to make runc/libcontainer
work in a custom multitenant environment with multiple cgroup mount
points.

It also added passing c.Path as an argument to FindCgroupMountpoint
for systemd (v1) controller. This is wrong, because

1. The systemd controller does not use c.Path at all (and c.Path is never set
   by specconv) -- instead, it uses Name and Parent.

2. c.Path, if set, is not absolute -- it is relative to /sys/fs/cgroup
   -- but it is used as an absolute path here.

Since c.Path is never set, the change did not result in any breakage, so
this code sat quietly for some time and the issue was not
discovered -- until we started running libcontainer/integration tests
in a CentOS 7 VM, which resulted in the following weird error:

> FAIL: TestPidsSystemd: utils_test.go:55: exec_test.go:630: unexpected error: container_linux.go:353: starting container process caused: process_linux.go:326: applying cgroup configuration for process caused: mountpoint for devices not found

The error was "fixed" in commit f57bb2fe3d by changing the tests'
cgroups Path to be "/sys/fs/cgroup/". This actually resulted in
creation of cgroup directories like /sys/fs/cgroup/memory/sys/fs/cgroup,
/sys/fs/cgroup/devices/sys/fs/cgroup and so on.

The proper fix to the test case is implemented in the previous commit,
which sets c.Name and c.Parent.

This commit just removes the invalid use of c.Path, and tells the whole
story.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-10-05 20:51:02 -07:00
Kir Kolyshkin
9e78b66e88 libct/cg/systemd/v1.enableKmem: use fscommon.ReadFile
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-10-05 14:07:15 -07:00
Kir Kolyshkin
b006f4a180 libct/cgroups: support Cgroups.Resources.Unified
Add support for unified resource map (as per [1]), and add some test
cases for the new functionality.

[1] https://github.com/opencontainers/runtime-spec/pull/1040

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-09-24 15:29:35 -07:00
Kir Kolyshkin
940e15479f cgroupv1/systemd: (re)use m.paths
In all these cases, getSubsystemPath() was already called, and its
result stored in m.paths map. It makes no sense to not reuse it.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-08-25 11:42:24 -07:00
Kir Kolyshkin
f075084a47 cgroupv1/systemd: rework Apply/joinCgroups
We call joinCgroups() from Apply, and in there we iterate through the
list of subsystems, calling getSubsystemPath() for each. This is
expensive, since every getSubsystemPath() involves parsing mountinfo.

At the end of Apply(), we iterate through the list of subsystems to fill
the m.paths, again calling getSubsystemPath() for every subsystem.

As a result, we parse mountinfo about 20 times here.

Let's find the paths first and reuse m.paths in joinCgroups().

While at it, since join() is just two calls now, inline it.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-08-25 11:42:21 -07:00
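The "find the paths first, then reuse them" approach from f075084a47 might look roughly like the sketch below. `subsystemPath`, `resolvePaths`, `joinCgroups`, the `lookups` counter, and the example path layout are all hypothetical; the expensive lookup only imitates a function that parses mountinfo.

```go
package main

import "fmt"

// lookups counts calls to the expensive lookup, for illustration only.
var lookups int

// subsystemPath stands in for an expensive per-subsystem lookup.
func subsystemPath(name string) string {
	lookups++
	return "/sys/fs/cgroup/" + name + "/example.scope"
}

// resolvePaths performs the expensive lookup exactly once per subsystem.
func resolvePaths(subsystems []string) map[string]string {
	paths := make(map[string]string, len(subsystems))
	for _, s := range subsystems {
		paths[s] = subsystemPath(s)
	}
	return paths
}

// joinCgroups reuses the cached paths instead of looking them up again.
func joinCgroups(paths map[string]string, pid int) []string {
	var joined []string
	for name, p := range paths {
		// A real implementation would write pid into a file under p.
		joined = append(joined, fmt.Sprintf("%s:%s:%d", name, p, pid))
	}
	return joined
}

func main() {
	paths := resolvePaths([]string{"cpu", "memory", "pids"})
	joined := joinCgroups(paths, 1234)
	fmt.Println("joined:", len(joined), "lookups:", lookups)
}
```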
Kir Kolyshkin
fad92bbffa cgroupv1/Apply: do not overuse d.path/getSubsystemPath
When paths are set, we only need to place the PID into proper
cgroups, and we do know all the paths already.

Both fs/d.path() and systemd/v1/getSubsystemPath() parse
/proc/self/mountinfo, and the only reason they are used
here is to check whether the subsystem is available.

Use a much simpler/faster check instead.

Frankly, I am not sure why the check is needed at all. Maybe it should
be dropped.

Also, for fs driver, since d is no longer used in this code path,
move its initialization to after it.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-08-25 11:40:58 -07:00
Mrunal Patel
a5847db387 Merge pull request #2506 from kolyshkin/cgroup-fixes
cgroupv1 removal nits
2020-08-17 21:13:31 -07:00
Kir Kolyshkin
2a322e91ec cgroupv1: remove subsystemSet.Get()
Instead of iterating over m.paths, iterate over subsystems and look up
the path for each. This is faster since a map lookup is faster than
iterating over the names in Get. A quick benchmark shows that the new
way is 2.5x faster than the old one.

Note though that this is not done to make things faster, as savings are
negligible, but to make things simpler by removing some code.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-07-06 18:31:46 -07:00
Kir Kolyshkin
254d23b964 libc/cgroups: empty map in RemovePaths
RemovePaths() deletes elements from the paths map for paths that have
been successfully removed.

It does not, however, empty the map itself (which is needed because, AFAIK,
the Go garbage collector does not shrink maps), but all its callers do.

Move this operation from callers to RemovePaths.

No functional change, except the old map should be garbage collected now.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-07-06 17:54:44 -07:00
Mrunal Patel
30dc54a995 Merge pull request #2503 from giuseppe/cgroup-fixes
cgroup, systemd: cleanup cgroups
2020-07-06 15:14:29 -07:00
Mrunal Patel
3f81131845 Merge pull request #2490 from kolyshkin/dev-opt
libct/cgroups: add SkipDevices to Resources
2020-07-06 14:28:30 -07:00
Giuseppe Scrivano
32034481ea cgroup, systemd: cleanup cgroups
some hierarchies were created directly by .Apply() on top of systemd
managed cgroups.  systemd doesn't manage these and as a result we leak
these cgroups.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2020-07-06 23:06:16 +02:00
Kir Kolyshkin
cd479f9d14 cgroupv1/freezer: don't use subsystemSet.Get()
Iterating over the list of subsystems and comparing their names to get an
instance of fs.cgroupFreezer is useless and a waste of time, since it is
a shallow type (i.e. does not have any data/state) and we can create an
instance in place.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-07-03 14:00:44 -07:00
Kir Kolyshkin
108ee85b82 libct/cgroups: add SkipDevices to Resources
The kubelet uses libct/cgroups code to set up cgroups. It creates a
parent cgroup (kubepods) to put the containers into.

The problem (for cgroupv2, which uses eBPF for device configuration) is
that the hard requirement to have the devices cgroup configured results in
leaking an eBPF program upon every kubelet restart. If kubelet
is restarted 64+ times, the cgroup can't be configured anymore.

Work around this by adding a SkipDevices flag to Resources.

A check was added so that if SkipDevices is set, such a "container"
can't be started (to make sure it is only used for non-containers).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-07-02 15:19:31 -07:00
Mrunal Patel
406298fdf0 Merge pull request #2466 from kolyshkin/systemd-cpu-quota-period
cgroups/systemd: add setting CPUQuotaPeriod prop
2020-06-17 12:03:30 -07:00
Kir Kolyshkin
e751a168dc cgroups/systemd: add setting CPUQuotaPeriod prop
For some reason, runc systemd drivers (both v1 and v2) never set
systemd unit property named `CPUQuotaPeriod` (known as
`CPUQuotaPeriodUSec` on dbus and in `systemctl show` output).

Set it, and add a check to all the integration tests. The check is less
than trivial because, when not set, the value is shown as "infinity" but
when set to the same (default) value, shown as "100ms", so in case we
expect 100ms (period = 100000 us), we have to _also_ check for
"infinity".

[v2: add systemd version checks since CPUQuotaPeriod requires v242+]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-16 15:48:06 -07:00
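The "less than trivial" check from e751a168dc boils down to accepting two spellings of the default period. The actual check lives in the integration tests; the Go function below is only a sketch of the same logic, and `periodMatches` is a hypothetical name.

```go
package main

import "fmt"

// periodMatches reports whether the CPUQuotaPeriodUSec value shown by
// systemd is acceptable for the expected one.
func periodMatches(shown, expected string) bool {
	if shown == expected {
		return true
	}
	// The default period (100ms) is shown as "infinity" when the
	// property was never set explicitly.
	return expected == "100ms" && shown == "infinity"
}

func main() {
	fmt.Println(periodMatches("infinity", "100ms"))
	fmt.Println(periodMatches("infinity", "50ms"))
}
```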
Kir Kolyshkin
dd2426d067 libct/cgroups: fix m.paths map access
This fixes a few cases of accessing m.paths map directly without holding
the mutex lock.

Fixes: 9087f2e82
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-15 18:30:16 -07:00
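The kind of fix made in dd2426d067 can be illustrated with a small sketch in which every access to the paths map goes through the mutex. The `manager` type and its methods are hypothetical and much simpler than the real managers.

```go
package main

import (
	"fmt"
	"sync"
)

// manager is a hypothetical cgroup manager that guards its paths map.
type manager struct {
	mu    sync.Mutex
	paths map[string]string
}

// path returns the path for a subsystem while holding the lock.
func (m *manager) path(subsys string) string {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.paths[subsys]
}

// setPath stores the path for a subsystem while holding the lock.
func (m *manager) setPath(subsys, p string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.paths[subsys] = p
}

func main() {
	m := &manager{paths: make(map[string]string)}
	var wg sync.WaitGroup
	for _, s := range []string{"cpu", "memory", "pids"} {
		wg.Add(1)
		go func(s string) {
			defer wg.Done()
			m.setPath(s, "/sys/fs/cgroup/"+s+"/example.scope")
		}(s)
	}
	wg.Wait()
	fmt.Println(m.path("memory"))
}
```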
Kir Kolyshkin
5b247e739c Merge pull request #2338 from lifubang/systemdcgroupv2
fix path error in systemd when stopped

LGTMs: @mrunalp @AkihiroSuda
2020-06-15 18:01:13 -07:00
Kir Kolyshkin
8b9646775e cgroups/systemd: unify adding CpuQuota
The code that adds CpuQuotaPerSecUSec is the same in v1 and v2
systemd cgroup driver. Move it to common.

No functional change.

Note that the comment saying that we always set this property
contradicts the current code, and therefore it is removed.

[v2: drop cgroupv1-specific comment]
[v3: drop returning error as it's not used]
[v4: remove an obsoleted comment]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-09 17:14:43 -07:00
Kir Kolyshkin
2ce20ed158 cgroups/systemd: simplify gen*ResourcesProperties
Use r instead of c.Resources for readability. No functional change.

This commit has been brought to you by '<,'>s/c\.Resources\./r./g

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-08 13:42:09 -07:00
lifubang
9087f2e827 fix path error in systemd when stopped
When we use cgroups with the systemd driver, the cgroup path is automatically
removed by systemd once all processes have exited. So we should check that the
cgroup path exists before accessing it, for example in `kill/ps`, or else we
will get an error.

Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-06-02 18:17:43 +08:00
Mrunal Patel
332a84581e Merge pull request #2443 from kolyshkin/kmem-fixup
cgroupv1/systemd.Set: don't enable kernel memory acct
2020-05-31 10:04:45 -07:00
Kir Kolyshkin
3fe6e04510 cgroupv1/systemd.Set: don't enable kernel memory acct
This is a regression from commit 1d4ccc8e0. We only need to enable
kernel memory accounting once, from the (*legacyManager*).Apply(),
and there is no need to do it in (*legacyManager*).Set().

While at it, rename the method to better reflect what it's doing.

This saves 1 call to mountinfo parser.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-29 17:54:50 -07:00
Kir Kolyshkin
3249e2379c cgroupv1: check cpu shares in place
Commit 4e65e0e90a added a check for cpu shares. Apparently, the
kernel allows setting a value higher than max or lower than min without
an error, but the value read back is always within the limits.

The check (which was later moved out to a separate CheckCpushares()
function) is always performed after setting the cpu shares, so let's
move it to the very place where it is set.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-29 16:46:28 -07:00
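The "check in place" idea from 3249e2379c can be sketched as write, read back, compare. The `sharesFile` interface, `clampingFile`, and `setCPUShares` are hypothetical; `clampingFile` only imitates the clamping behavior described in the commit, using the cgroup v1 cpu.shares bounds of 2 and 262144.

```go
package main

import "fmt"

// sharesFile is a hypothetical abstraction of the cpu.shares file.
type sharesFile interface {
	Write(v uint64) error
	Read() (uint64, error)
}

// clampingFile accepts any value but stores it clamped to the valid range.
type clampingFile struct{ v uint64 }

func (f *clampingFile) Write(v uint64) error {
	const minShares, maxShares = 2, 262144
	switch {
	case v < minShares:
		f.v = minShares
	case v > maxShares:
		f.v = maxShares
	default:
		f.v = v
	}
	return nil
}

func (f *clampingFile) Read() (uint64, error) { return f.v, nil }

// setCPUShares writes the value and immediately verifies what was stored.
func setCPUShares(f sharesFile, shares uint64) error {
	if err := f.Write(shares); err != nil {
		return err
	}
	got, err := f.Read()
	if err != nil {
		return err
	}
	if got > shares {
		return fmt.Errorf("cpu shares %d is below the minimum of %d", shares, got)
	}
	if got < shares {
		return fmt.Errorf("cpu shares %d is above the maximum of %d", shares, got)
	}
	return nil
}

func main() {
	f := &clampingFile{}
	fmt.Println(setCPUShares(f, 1024))
	fmt.Println(setCPUShares(f, 1<<20))
}
```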
Kir Kolyshkin
be5467872d cgroupv1: minimal fix for cpu quota regression
This is a quick-n-dirty fix for the regression introduced by commit
06d7c1d, which made it impossible to only set CpuQuota
(without the CpuPeriod). It partially reverts the above commit,
and adds a test case.

The proper fix will follow.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-26 11:02:16 -07:00
Kir Kolyshkin
59897367c4 cgroups/systemd: allow to set -1 as pids.limit
Currently, both systemd cgroup drivers (v1 and v2) only set
"TasksMax" unit property if the value > 0, so there is no
way to update the limit to -1 / unlimited / infinity / max.

Since systemd driver is backed by fs driver, and both fs and fs2
set the limit of -1 properly, it works, but systemd still has
the old value:

 # runc --systemd-cgroup update $CT --pids-limit 42
 # systemctl show runc-$CT.scope | grep TasksMax
 TasksMax=42
 # cat /sys/fs/cgroup/system.slice/runc-$CT.scope/pids.max
 42

 # ./runc --systemd-cgroup update $CT --pids-limit -1
 # systemctl show runc-$CT.scope | grep TasksMax=
 TasksMax=42
 # cat /sys/fs/cgroup/system.slice/runc-xx77.scope/pids.max
 max

Fix by changing the condition to allow -1 as a valid value.

NOTE other negative values are still being ignored by systemd drivers
(as it was done before). I am not sure whether this is correct, or
should we return an error.

A test case is added.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-20 13:20:04 -07:00
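The changed condition from 59897367c4 might look roughly like this. The `property` type and `tasksMax` function are hypothetical. The sketch assumes that -1 is passed to systemd as the maximum unsigned 64-bit value, which systemd treats as "infinity".

```go
package main

import "fmt"

// property is a hypothetical stand-in for a systemd unit property.
type property struct {
	Name  string
	Value uint64
}

// tasksMax returns the TasksMax property for a pids limit. The boolean is
// false when the limit is ignored (zero or a negative value other than -1).
func tasksMax(limit int64) (property, bool) {
	if limit > 0 || limit == -1 {
		// Assumption: converting -1 to uint64 gives the maximum
		// value, which stands for "unlimited".
		return property{Name: "TasksMax", Value: uint64(limit)}, true
	}
	return property{}, false
}

func main() {
	for _, limit := range []int64{42, -1, 0, -5} {
		p, ok := tasksMax(limit)
		fmt.Println(limit, ok, p.Value)
	}
}
```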
Kir Kolyshkin
06d7c1d261 systemd+cgroupv1: fix updating CPUQuotaPerSecUSec
1. do not allow to set quota without period or period without quota, as we
   won't be able to calculate new value for CPUQuotaPerSecUSec otherwise.

2. do not ignore setting quota to -1 when a period is not set.

3. update the test case accordingly.

Note that systemd value checks will be added in the next commit.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-20 13:17:18 -07:00
Aleksa Sarai
b810da1490 cgroups: systemd: make use of Device*= properties
It seems we missed that systemd added support for the devices cgroup; as
a result, systemd would actually *write an allow-all rule each time you
did 'runc update'* if you used the systemd cgroup driver. This is
obviously ... bad and was a clear security bug. Luckily the commits which
introduced this were never in an actual runc release.

So we simply generate the cgroupv1-style rules (which is what systemd's
DeviceAllow wants) and default to a deny-all ruleset. Unfortunately it
turns out that systemd is susceptible to the same spurious error
failure that we were, so that problem is out of our hands for systemd
cgroup users.

However, systemd has a similar bug to the one fixed in [1]. It will
happily write a disruptive deny-all rule when it is not necessary.
Unfortunately, we cannot even use devices.Emulator to generate a minimal
set of transition rules because the DBus API is limited (you can only
clear or append to the DeviceAllow= list -- so we are forced to always
clear it). To work around this, we simply freeze the container during
SetUnitProperties.

[1]: afe83489d4 ("cgroupv1: devices: use minimal transition rules with devices.Emulator")

Fixes: 1d4ccc8e0c ("fix data inconsistent when runc update in systemd driven cgroup v1")
Fixes: 7682a2b2a5 ("fix data inconsistent when runc update in systemd driven cgroup v2")
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:43:56 +10:00
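The freeze workaround from b810da1490 can be sketched as "remember the state, freeze, set properties, restore". The `cgroup` interface, `fakeCgroup`, and `setPropertiesFrozen` are hypothetical; the interface only borrows the method names mentioned in these commits and is not the real manager API.

```go
package main

import "fmt"

type freezerState string

const (
	thawed freezerState = "THAWED"
	frozen freezerState = "FROZEN"
)

// cgroup is a hypothetical interface with only what this sketch needs.
type cgroup interface {
	GetFreezerState() (freezerState, error)
	Freeze(state freezerState) error
	SetUnitProperties(props []string) error
}

// setPropertiesFrozen freezes the cgroup while the properties are replaced,
// then restores whatever freezer state it had before.
func setPropertiesFrozen(cg cgroup, props []string) (retErr error) {
	prev, err := cg.GetFreezerState()
	if err != nil {
		return err
	}
	if err := cg.Freeze(frozen); err != nil {
		return err
	}
	defer func() {
		// Restore the previous state, keeping the first error.
		if err := cg.Freeze(prev); err != nil && retErr == nil {
			retErr = err
		}
	}()
	return cg.SetUnitProperties(props)
}

// fakeCgroup records the order of operations for demonstration.
type fakeCgroup struct {
	state freezerState
	log   []string
}

func (f *fakeCgroup) GetFreezerState() (freezerState, error) { return f.state, nil }

func (f *fakeCgroup) Freeze(s freezerState) error {
	f.state = s
	f.log = append(f.log, "freeze:"+string(s))
	return nil
}

func (f *fakeCgroup) SetUnitProperties(props []string) error {
	f.log = append(f.log, "set")
	return nil
}

func main() {
	f := &fakeCgroup{state: thawed}
	err := setPropertiesFrozen(f, []string{"DeviceAllow="})
	fmt.Println(err, f.log)
}
```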
Aleksa Sarai
859a780d6f cgroups: add GetFreezerState() helper to Manager
This is effectively a nicer implementation of the container.isPaused()
helper, but to be used within the cgroup code for handling some fun
issues we have to fix with the systemd cgroup driver.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:38:45 +10:00
Kir Kolyshkin
714c91e9f7 Simplify cgroup path handling in v2 via unified API
This unties the Gordian Knot of using GetPaths in cgroupv2 code.

The problem is, the current code uses GetPaths for three kinds of things:

1. Get all the paths to cgroup v1 controllers to save its state (see
   (*linuxContainer).currentState(), (*LinuxFactory).loadState()
   methods).

2. Get all the paths to cgroup v1 controllers to have the setns process
    enter the proper cgroups in `(*setnsProcess).start()`.

3. Get the path to a specific controller (for example,
   `m.GetPaths()["devices"]`).

Now, for cgroup v2 instead of a set of per-controller paths, we have only
one single unified path, and a dedicated function `GetUnifiedPath()` to get it.

This discrepancy between v1 and v2 cgroupManager API leads to the
following problems with the code:

 - multiple if/else code blocks that have to treat v1 and v2 separately;

 - backward-compatible GetPaths() methods in v2 controllers;

 - repeated writing of the PID into the same cgroup for v2.

Overall, it's hard to write the right code with all this, and the code
that is written is kinda hard to follow.

The solution is to slightly change the API to do the 3 things outlined
above in the same manner for v1 and v2:

1. Use `GetPaths()` for state saving and setns process cgroups entering.

2. Introduce and use Path(subsys string) to obtain a path to a
   subsystem. For v2, the argument is ignored and the unified path is
   returned.

This commit converts all the controllers to the new API, and modifies
all the users to use it.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 12:04:06 -07:00
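The API shape described in 714c91e9f7 can be sketched as one small interface with a v1 and a v2 implementation. `pathManager`, `v1Manager`, `v2Manager`, and `devicesPath` are hypothetical names; storing the unified path under the empty key in the v2 `GetPaths` is an assumption of this sketch.

```go
package main

import "fmt"

// pathManager is a hypothetical, trimmed-down view of the manager API.
type pathManager interface {
	// Path returns the path to the given subsystem.
	Path(subsys string) string
	// GetPaths returns all paths, for state saving and setns.
	GetPaths() map[string]string
}

// v1Manager keeps one path per controller.
type v1Manager struct{ paths map[string]string }

func (m *v1Manager) Path(subsys string) string   { return m.paths[subsys] }
func (m *v1Manager) GetPaths() map[string]string { return m.paths }

// v2Manager has a single unified path.
type v2Manager struct{ unified string }

// Path ignores its argument and always returns the unified path.
func (m *v2Manager) Path(string) string { return m.unified }

func (m *v2Manager) GetPaths() map[string]string {
	// Assumption: the unified path is stored under the empty key.
	return map[string]string{"": m.unified}
}

// devicesPath works the same way for v1 and v2, with no if/else on version.
func devicesPath(m pathManager) string { return m.Path("devices") }

func main() {
	v1 := &v1Manager{paths: map[string]string{
		"devices": "/sys/fs/cgroup/devices/example",
	}}
	v2 := &v2Manager{unified: "/sys/fs/cgroup/example"}
	fmt.Println(devicesPath(v1))
	fmt.Println(devicesPath(v2))
}
```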
Kir Kolyshkin
51e1a0842d libct/cgroups/systemd/v1: privatize v1 manager
This patch was generated entirely by gorename -- nothing to review here.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 10:09:48 -07:00
Kir Kolyshkin
d827e323b0 libct/cgroups/systemd/v1: add NewLegacyManager
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 10:07:40 -07:00
Akihiro Suda
bf15cc99b1 cgroup v2: support rootless systemd
Tested with both Podman (master) and Moby (master), on Ubuntu 19.10.

$ podman --cgroup-manager=systemd run -it --rm --runtime=runc \
  --cgroupns=host --memory 42m --cpus 0.42 --pids-limit 42 alpine
/ # cat /proc/self/cgroup
0::/user.slice/user-1001.slice/user@1001.service/user.slice/libpod-132ff0d72245e6f13a3bbc6cdc5376886897b60ac59eaa8dea1df7ab959cbf1c.scope
/ # cat /sys/fs/cgroup/user.slice/user-1001.slice/user@1001.service/user.slice/libpod-132ff0d72245e6f13a3bbc6cdc5376886897b60ac59eaa8dea1df7ab959cbf1c.scope/memory.max
44040192
/ # cat /sys/fs/cgroup/user.slice/user-1001.slice/user@1001.service/user.slice/libpod-132ff0d72245e6f13a3bbc6cdc5376886897b60ac59eaa8dea1df7ab959cbf1c.scope/cpu.max
42000 100000
/ # cat /sys/fs/cgroup/user.slice/user-1001.slice/user@1001.service/user.slice/libpod-132ff0d72245e6f13a3bbc6cdc5376886897b60ac59eaa8dea1df7ab959cbf1c.scope/pids.max
42

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-05-08 12:39:20 +09:00
lifubang
bfa1b2aab3 check that StartTransientUnit and StopUnit succeeds
Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-04-28 15:46:28 +08:00
lifubang
1d4ccc8e0c fix data inconsistent when runc update in systemd driven cgroup v1
Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-04-23 19:32:57 +08:00