gofumpt (mvdan.cc/gofumpt) is a fork of gofmt with stricter rules.
Brought to you by
git ls-files \*.go | grep -v ^vendor/ | xargs gofumpt -s -w
Looking at the diff, all these changes make sense.
Also, replace gofmt with gofumpt in golangci.yml.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit 108ee85b82 adds SkipDevices flag, which is used by kubernetes
to create cgroups for pods.
Unfortunately the above commit falls short, and systemd DevicePolicy and
DeviceAllow properties are still set, which requires kubernetes to set
an "allow everything" rule.
This commit fixes this: if the SkipDevices flag is set, the Device*
properties we return allow all devices.
Fixes: 108ee85b82
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Historically, we never returned an error from failed startUnit
or stopUnit. The startUnit case was fixed by commit 3844789.
It is time to fix stopUnit, too. The reasons are:
1. Ignoring an error from stopUnit means an unexpected trouble down the
road, for example a failure to create a container with the same name:
> time="2021-05-07T19:51:27Z" level=error msg="container_linux.go:380: starting container process caused: process_linux.go:385: applying cgroup configuration for process caused: Unit runc-test_busybox.scope already exists."
2. A somewhat short timeout of 1 second means the cgroup might
actually be removed a few seconds later, but we might have a
race between removing the cgroup and creating another one
with the same name, resulting in the same error as above.
So, return an error if removal failed, and increase the timeout.
Now, modify the systemd cgroup v1 manager to not mask the error from
stopUnit (stopErr) with the subsequent one from cgroups.RemovePath,
as stopErr is most probably the reason why RemovePath failed.
Note that for v1 we do want to remove the paths even in case of a
failure from stopUnit, as some were not created by systemd.
There's no need to do that for v2, thanks to unified hierarchy,
so no changes there.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
A cgroup manager's Set method sets cgroup resources, but historically
it was accepting configs.Cgroups.
Refactor it to accept resources only. This is an improvement from the
API point of view, as the method cannot change the cgroup configuration
(such as path to the cgroup etc), it can only set (modify) its
resources/limits.
This also lays the foundation for complicated resource updates, as now
Set has two sets of resources -- the one that was previously specified
during cgroup manager creation (or the previous Set), and the one passed
in the argument, so it could deduce the difference between these. This
is a long term goal though.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Instead of reconnecting to dbus after some failed operations, and
returning an error (so a caller has to retry), reconnect AND retry
in place for all such operations.
This should fix issues caused by a stale dbus connection after e.g.
a dbus daemon restart.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
[@kolyshkin: doc nits, use dbus.ErrClosed and isDbusError]
Signed-off-by: Shiming Zhang <wzshiming@foxmail.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This is a somewhat radical approach to dealing with kernel memory.
Per-cgroup kernel memory limiting was always problematic. A few
examples:
- older kernels had bugs and were even oopsing sometimes (best example
is RHEL7 kernel);
- kernel is unable to reclaim the kernel memory so once the limit is
hit a cgroup is toasted;
- some kernel memory allocations don't allow failing.
In addition to that,
- users don't have a clue about how to set kernel memory limits
(as the concept is much more complicated than e.g. [user] memory);
- different kernels might have different kernel memory usage,
which is sort of unexpected;
- cgroup v2 does not have a [dedicated] kmem limit knob, and thus
runc silently ignores kernel memory limits for v2;
- kernel v5.4 made cgroup v1 kmem.limit obsolete (see
https://github.com/torvalds/linux/commit/0158115f702b).
In view of all this, and as the runtime-spec lists memory.kernel
and memory.kernelTCP as OPTIONAL, let's ignore kernel memory
limits (for cgroup v1, same as we're already doing for v2).
This should result in fewer bugs and a better user experience.
The only bad side effect from it might be that stat can show kernel
memory usage as 0 (since the accounting is not enabled).
[v2: add a warning in specconv that limits are ignored]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This unconditionally enables TasksAccounting for systemd unified (v2)
cgroup driver, making it work the same way as the legacy (v1) driver.
Practically, it is probably a no-op since DefaultTasksAccounting is
usually true.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In some cases, container init fails to start because it is killed by
the kernel OOM killer. The errors returned by runc in such cases are
semi-random and rather cryptic. Below are a few examples.
On cgroup v1 + systemd cgroup driver:
> process_linux.go:348: copying bootstrap data to pipe caused: write init-p: broken pipe
> process_linux.go:352: getting the final child's pid from pipe caused: EOF
On cgroup v2:
> process_linux.go:495: container init caused: read init-p: connection reset by peer
> process_linux.go:484: writing syncT 'resume' caused: write init-p: broken pipe
This commit adds the OOM method to cgroup managers, which tells whether
the container was OOM-killed. In case that has happened, the original error
is discarded (unless --debug is set), and the new OOM error is reported
instead:
> ERRO[0000] container_linux.go:367: starting container process caused: container init was OOM-killed (memory limit too low?)
Also, fix the rootless test cases that are failing because they expect
an error in the first line, and we have an additional warning now:
> unable to get oom kill count" error="no directory specified for memory.oom_control
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
All cgroup managers have Apply() and Set() methods:
- Apply is used to create a cgroup (and, in case of systemd,
a systemd unit) and/or put a PID into the cgroup (and unit);
- Set is used to set various cgroup resources and limits.
The fs/fs2 cgroup manager implements the functionality as described above.
The systemd v1/v2 managers deviate -- they set *most* of the cgroup limits
(those that can be projected to systemd unit properties) in Apply(),
and then again *all* cgroup limits in Set (first indirectly via systemd
properties -- same as in Apply, then via cgroupfs).
This commit removes setting the cgroup limits from Apply,
so now the systemd manager behaves the same way as the fs manager.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Support for systemd properties AllowedCPUs and AllowedMemoryNodes
was added by commit 13afa58d0e, but only for unified resources
of systemd v2 driver.
This adds support for Cpu.Cpus and Cpu.Mems resources to
both systemd v1 and v2 cgroup drivers.
An integration test is added to check that the settings work.
[v2: check for systemd version]
[v3: same in the test]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit 27d3dd3df3 ("don't fail when subsystem not mounted") added
ignoring "not found" error to enableKmem, and as a result the function
now tries to call Mkdir with an empty path, which results in a weird
error message. For example, this is a failure from a
libcontainer/integration test:
> === RUN TestRunWithKernelMemorySystemd
> exec_test.go:704: runContainer failed with kernel memory limit: container_linux.go:370: starting container process caused: process_linux.go:327: applying cgroup configuration for process caused: mkdir : no such file or directory
I am not entirely sure if it is a good idea to silently ignore set
limits, but at least let's fix the error handling.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit a1d5398afa ("Respect container's cgroup path") added a
cgroupPath argument to FindCgroupMountpoint to make runc/libcontainer
work in a custom multitenant environment with multiple cgroup mount
points.
It also added passing c.Path as an argument to FindCgroupMountpoint
for systemd (v1) controller. This is wrong, because
1. the systemd controller does not use c.Path at all (and c.Path is never set
by specconv) -- instead, it uses Name and Parent.
2. c.Path, if set, is not absolute -- it is relative to /sys/fs/cgroup
-- but it is used as an absolute path here.
Since c.Path is never set, the change did not result in any breakage, so
this code sat quietly for some time and the issue might not have been
discovered -- until we started running libcontainer/integration tests
in a CentOS 7 VM, which resulted in the following weird error:
> FAIL: TestPidsSystemd: utils_test.go:55: exec_test.go:630: unexpected error: container_linux.go:353: starting container process caused: process_linux.go:326: applying cgroup configuration for process caused: mountpoint for devices not found
The error was "fixed" in commit f57bb2fe3d by changing the tests'
cgroups Path to be "/sys/fs/cgroup/". This actually resulted in
creation of cgroup directories like /sys/fs/cgroup/memory/sys/fs/cgroup,
/sys/fs/cgroup/devices/sys/fs/cgroup and so on.
The proper fix to the test case is implemented in the previous commit,
which sets c.Name and c.Parent.
This commit just removes the invalid use of c.Path, and tells the whole
story.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In all these cases, getSubsystemPath() was already called, and its
result stored in m.paths map. It makes no sense to not reuse it.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
We call joinCgroups() from Apply, and in there we iterate through the
list of subsystems, calling getSubsystemPath() for each. This is
expensive, since every getSubsystemPath() involves parsing mountinfo.
At the end of Apply(), we iterate through the list of subsystems to fill
the m.paths, again calling getSubsystemPath() for every subsystem.
As a result, we parse mountinfo about 20 times here.
Let's find the paths first and reuse m.paths in joinCgroups().
While at it, since join() is just two calls now, inline it.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
When paths are set, we only need to place the PID into proper
cgroups, and we do know all the paths already.
Both fs/d.path() and systemd/v1/getSubsystemPath() parse
/proc/self/mountinfo, and the only reason they are used
here is to check whether the subsystem is available.
Use a much simpler/faster check instead.
Frankly, I am not sure why the check is needed at all. Maybe it should
be dropped.
Also, for fs driver, since d is no longer used in this code path,
move its initialization to after it.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Instead of iterating over m.paths, iterate over subsystems and look up
the path for each. This is faster since a map lookup is faster than
iterating over the names in Get. A quick benchmark shows that the new
way is 2.5x faster than the old one.
Note though that this is not done to make things faster, as savings are
negligible, but to make things simpler by removing some code.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
RemovePaths() deletes elements from the paths map for paths that have
been successfully removed.
It does not, however, empty the map itself (which is needed because,
AFAIK, the Go garbage collector does not shrink maps), but all its callers do.
Move this operation from callers to RemovePaths.
No functional change, except the old map should be garbage collected now.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Some hierarchies were created directly by .Apply() on top of
systemd-managed cgroups. systemd doesn't manage these and as a result we leak
these cgroups.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Iterating over the list of subsystems and comparing their names to get an
instance of fs.cgroupFreezer is useless and a waste of time, since it is
a shallow type (i.e. does not have any data/state) and we can create an
instance in place.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The kubelet uses libct/cgroups code to set up cgroups. It creates a
parent cgroup (kubepods) to put the containers into.
The problem (for cgroup v2, which uses eBPF for device configuration) is
that the hard requirement to have the devices cgroup configured results
in leaking an eBPF program upon every kubelet restart. If the kubelet
is restarted 64+ times, the cgroup can't be configured anymore.
Work around this by adding a SkipDevices flag to Resources.
A check was added so that if SkipDevices is set, such a "container"
can't be started (to make sure it is only used for non-containers).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
For some reason, runc systemd drivers (both v1 and v2) never set
systemd unit property named `CPUQuotaPeriod` (known as
`CPUQuotaPeriodUSec` on dbus and in `systemctl show` output).
Set it, and add a check to all the integration tests. The check is less
than trivial because, when not set, the value is shown as "infinity" but
when set to the same (default) value, shown as "100ms", so in case we
expect 100ms (period = 100000 us), we have to _also_ check for
"infinity".
[v2: add systemd version checks since CPUQuotaPeriod requires v242+]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This fixes a few cases of accessing m.paths map directly without holding
the mutex lock.
Fixes: 9087f2e82
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The code that adds CpuQuotaPerSecUSec is the same in v1 and v2
systemd cgroup driver. Move it to common.
No functional change.
Note that the comment saying we always set this property
contradicts the current code, and is therefore removed.
[v2: drop cgroupv1-specific comment]
[v3: drop returning error as it's not used]
[v4: remove an obsoleted comment]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Use r instead of c.Resources for readability. No functional change.
This commit has been brought to you by '<,'>s/c\.Resources\./r./g
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
When we use cgroups with the systemd driver, the cgroup path will be
automatically removed by systemd once all the processes have exited. So
we should check that the cgroup path exists when we access it, for
example in `kill/ps`, or else we will get an error.
Signed-off-by: lifubang <lifubang@acmcoder.com>
This is a regression from commit 1d4ccc8e0. We only need to enable
kernel memory accounting once, from (*legacyManager).Apply(),
and there is no need to do it in (*legacyManager).Set().
While at it, rename the method to better reflect what it's doing.
This saves 1 call to mountinfo parser.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit 4e65e0e90a added a check for cpu shares. Apparently, the
kernel allows setting a value higher than max or lower than min without
an error, but the value read back is always within the limits.
The check (which was later moved out to a separate CheckCpushares()
function) is always performed after setting the cpu shares, so let's
move it to the very place where it is set.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This is a quick-n-dirty fix for the regression introduced by commit
06d7c1d, which made it impossible to set only CpuQuota
(without CpuPeriod). It partially reverts the above commit,
and adds a test case.
The proper fix will follow.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Currently, both systemd cgroup drivers (v1 and v2) only set
"TasksMax" unit property if the value > 0, so there is no
way to update the limit to -1 / unlimited / infinity / max.
Since systemd driver is backed by fs driver, and both fs and fs2
set the limit of -1 properly, it works, but systemd still has
the old value:
# runc --systemd-cgroup update $CT --pids-limit 42
# systemctl show runc-$CT.scope | grep TasksMax
TasksMax=42
# cat /sys/fs/cgroup/system.slice/runc-$CT.scope/pids.max
42
# ./runc --systemd-cgroup update $CT --pids-limit -1
# systemctl show runc-$CT.scope | grep TasksMax=
TasksMax=42
# cat /sys/fs/cgroup/system.slice/runc-xx77.scope/pids.max
max
Fix by changing the condition to allow -1 as a valid value.
NOTE other negative values are still being ignored by systemd drivers
(as was done before). I am not sure whether this is correct, or
whether we should return an error.
A test case is added.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. do not allow setting quota without period or period without quota, as we
won't be able to calculate new value for CPUQuotaPerSecUSec otherwise.
2. do not ignore setting quota to -1 when a period is not set.
3. update the test case accordingly.
Note that systemd value checks will be added in the next commit.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
It seems we missed that systemd added support for the devices cgroup, as
a result systemd would actually *write an allow-all rule each time you
did 'runc update'* if you used the systemd cgroup driver. This is
obviously ... bad and was a clear security bug. Luckily the commits which
introduced this were never in an actual runc release.
So we simply generate the cgroupv1-style rules (which is what systemd's
DeviceAllow wants) and default to a deny-all ruleset. Unfortunately it
turns out that systemd is susceptible to the same spurious error
failure that we were, so that problem is out of our hands for systemd
cgroup users.
However, systemd has a similar bug to the one fixed in [1]. It will
happily write a disruptive deny-all rule when it is not necessary.
Unfortunately, we cannot even use devices.Emulator to generate a minimal
set of transition rules because the DBus API is limited (you can only
clear or append to the DeviceAllow= list -- so we are forced to always
clear it). To work around this, we simply freeze the container during
SetUnitProperties.
[1]: afe83489d4 ("cgroupv1: devices: use minimal transition rules with devices.Emulator")
Fixes: 1d4ccc8e0c ("fix data inconsistent when runc update in systemd driven cgroup v1")
Fixes: 7682a2b2a5 ("fix data inconsistent when runc update in systemd driven cgroup v2")
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This is effectively a nicer implementation of the container.isPaused()
helper, but to be used within the cgroup code for handling some fun
issues we have to fix with the systemd cgroup driver.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This unties the Gordian Knot of using GetPaths in cgroupv2 code.
The problem is, the current code uses GetPaths for three kinds of things:
1. Get all the paths to cgroup v1 controllers to save its state (see
(*linuxContainer).currentState(), (*LinuxFactory).loadState()
methods).
2. Get all the paths to cgroup v1 controllers to have the setns process
enter the proper cgroups in `(*setnsProcess).start()`.
3. Get the path to a specific controller (for example,
`m.GetPaths()["devices"]`).
Now, for cgroup v2 instead of a set of per-controller paths, we have only
one single unified path, and a dedicated function `GetUnifiedPath()` to get it.
This discrepancy between v1 and v2 cgroupManager API leads to the
following problems with the code:
- multiple if/else code blocks that have to treat v1 and v2 separately;
- backward-compatible GetPaths() methods in v2 controllers;
- repeated writing of the PID into the same cgroup for v2;
Overall, it's hard to write the right code with all this, and the code
that is written is kinda hard to follow.
The solution is to slightly change the API to do the 3 things outlined
above in the same manner for v1 and v2:
1. Use `GetPaths()` for state saving and setns process cgroups entering.
2. Introduce and use Path(subsys string) to obtain a path to a
subsystem. For v2, the argument is ignored and the unified path is
returned.
This commit converts all the controllers to the new API, and modifies
all the users to use it.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>