Using strings.CutPrefix (available since Go 1.20) instead of
strings.HasPrefix and/or strings.TrimPrefix makes the code
a tad more straightforward.
No functional change.
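For illustration only (this is not the actual diff), the pattern looks
roughly like the sketch below. The helper names and the "0::" prefix are
made up for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// oldStyle checks for the prefix first and then trims it,
// so the prefix is effectively matched twice.
func oldStyle(line string) (string, bool) {
	if !strings.HasPrefix(line, "0::") {
		return line, false
	}
	return strings.TrimPrefix(line, "0::"), true
}

// newStyle does the same with a single call (requires Go 1.20 or later).
func newStyle(line string) (string, bool) {
	return strings.CutPrefix(line, "0::")
}

func main() {
	fmt.Println(oldStyle("0::/system.slice"))
	fmt.Println(newStyle("0::/system.slice"))
}
```

Both helpers return the remainder and a flag telling whether the prefix
was found.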
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
If the sub-cgroup RemovePath has failed for any reason, return the
error right away. This way, we don't have to check for err != nil
before retrying rmdir.
This is a cosmetic change and should not change any functionality.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
An issue with runc 1.2.0 was reported to buildkit, in which
runc delete returns with an error, with the log saying:
> unable to destroy container: unable to remove container's cgroup: open /sys/fs/cgroup/snschvixiy3s74w74fjantrdg: no such file or directory
Apparently, what happens is runc is running with no cgroup access
(because /sys/fs/cgroup is mounted read-only). In this case, an error from
creating a cgroup path (in runc create/run) is ignored, but an error from
cgroup removal (in runc delete) is not.
This is caused by commit d3d7f7d, which changes the cgroup removal
logic in RemovePath. In the current code, if the initial rmdir has
failed (in this case with EROFS), but the subsequent os.ReadDir returns
ENOENT, that ENOENT is returned (instead of being ignored -- as the path
does not exist and so there is nothing to remove).
Here is the minimal fix for the issue.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This makes it possible to run
    runc update $ID --memory=-1 --memory-swap=$VAL
for cgroup v2, i.e. set memory to unlimited and swap to a specific
value.
This was not possible because ConvertMemorySwapToCgroupV2Value rejected
memory=-1 ("unlimited"). In hindsight, it was a mistake, because if
memory limit is unlimited, we should treat memory+swap limit as just swap
limit.
Revise the unit test; add description to each case.
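The conversion rules can be sketched as follows. This is an illustration
under assumptions, not the actual function: swapV2Value is a hypothetical
name, and the exact set and order of the sanity checks are guesses. -1
means "unlimited" and 0 means "unset".

```go
package main

import (
	"errors"
	"fmt"
)

// swapV2Value is a hypothetical stand-in for
// ConvertMemorySwapToCgroupV2Value. It converts a memory+swap limit
// (cgroup v1 semantics) to a swap-only limit (cgroup v2 semantics).
func swapV2Value(memorySwap, memory int64) (int64, error) {
	switch {
	case memory == -1 && memorySwap == 0:
		// Memory is unlimited and swap is unset: make swap unlimited too.
		return -1, nil
	case memorySwap == -1, memorySwap == 0:
		// Unlimited or unset: pass the value through as is.
		return memorySwap, nil
	case memorySwap < 0:
		return 0, fmt.Errorf("invalid memory+swap value: %d", memorySwap)
	case memory == -1:
		// Memory is unlimited: treat memory+swap as just swap.
		return memorySwap, nil
	case memory == 0:
		return 0, errors.New("unable to set swap limit without memory limit")
	case memory < 0:
		return 0, fmt.Errorf("invalid memory value: %d", memory)
	case memorySwap < memory:
		return 0, errors.New("memory+swap limit should be >= memory limit")
	}
	return memorySwap - memory, nil
}

func main() {
	// Unlimited memory with a specific swap value.
	fmt.Println(swapV2Value(1<<30, -1))
}
```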
Fixes: c86be8a2 ("cgroupv2: fix setting MemorySwap")
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Improve readability of ConvertMemorySwapToCgroupV2Value by switching
from a bunch of if statements to a switch, and adding a comment
describing each case.
No functional change.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The userns package was moved to the moby/sys/userns module
at commit 3778ae603c.
This patch deprecates the old location, and adds it as an alias
for the moby/sys/userns package.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
There's too much logic here figuring out which CPUs to use. Runc is a
low level tool and is not supposed to be that "smart". What's worse,
this logic is executed on every exec, making it slower. Some of the
logic in (*setnsProcess).start is executed even if no annotation is set,
thus making ALL execs slow.
Also, this should be a property of a process, rather than an annotation.
The plan is to rework this.
This reverts commit afc23e3397.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The current code is only doing retries in RemovePaths, which is only
used for cgroup v1 (cgroup v2 uses RemovePath, which makes no retries).
Let's remove all retry logic and logging from RemovePaths, together
with:
- os.Stat check from RemovePaths (its usage probably made sense before
commit 19be8e5ba5 but not after);
- error/warning logging from RemovePaths (this was added by commit
19be8e5ba5 in 2020 and so far we've seen no errors other
than EBUSY, so reporting the actual error proved to be useless).
Add the retry logic to rmdir, and the second retry bool argument.
Decrease the initial delay and increase the number of retries from the
old implementation so it can take up to ~1 sec before returning EBUSY
(was about 0.3 sec).
Hopefully, as a result, we'll have fewer "failed to remove cgroup paths"
errors.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
golangci-lint v1.54.2 comes with errorlint v1.4.4, which contains
the fix [1] whitelisting all errno comparisons for errors coming from
x/sys/unix.
Thus, these annotations are no longer necessary. Hooray!
[1] https://github.com/polyfloyd/go-errorlint/pull/47
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Replace a panic with a warning, unless it's ENOENT and we're running in
a user namespace. In the latter case, do the same as before, i.e. report
the error but using a Debug logging level.
This prevents software that uses libcontainer from panicking in
some exotic setups.
This will also print a warning on some very old systems which do not
use /sys/fs/cgroup as the cgroup mount point. My bet is such systems no
longer exist.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Since commit 39914db679 this function is not used by runc (see
that commit to learn why this function is not that good).
I was not able to find any external users either.
Since it's not a good function, with no users, and it is rather trivial,
let's remove it right away (rather than mark as deprecated).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Since Go 1.19, godoc recognizes lists, code blocks, headings etc. It
also reformats the sources making it more apparent that these features
are used.
Fix a few places where it misinterpreted the formatting (such as
indented vs unindented), and format the result using the gofumpt
from HEAD, which already incorporates gofmt 1.19 changes.
Some more fixes (and enhancements) might be required.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In case statfs("/sys/fs/cgroup/unified") fails with any error other
than ENOENT, current code panics. As IsCgroup2HybridMode is called from
libcontainer/cgroups/fs's init function, this means that any user of
libcontainer may panic during initialization, which is ugly.
Avoid panicking; instead, do not enable hybrid hierarchy support and
report the error (under debug level, not to confuse anyone).
Basically, replace the panic with "turn off hybrid mode support"
(which makes total sense since we were unable to statfs its root).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. Instead of distinguishing between errors and warnings, let's treat all
errors as warnings, thus simplifying the code. This changes the
function behaviour for input like hugepages-BadNumberKb --
previously, the error from Atoi("BadNumber") was considered fatal,
now it's just another warning.
2. Move the warning logging to HugePageSizes, thus simplifying the test
case, which no longer needs to read what logrus writes. Note that we
do not want to log all the warnings (as chances are very low we'll
get any, and if we do, this means the code needs to be updated), only
the first one.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
I have noticed that libct/cg/fs allocates 8K during init on every runc
execution:
> init github.com/opencontainers/runc/libcontainer/cgroups/fs @1.5 ms, 0.028 ms clock, 8512 bytes, 13 allocs
Apparently this is caused by global HugePageSizes variable init, which
is only used from GetStats (i.e. it is never used by runc itself).
Remove it, and use HugePageSizes() directly instead. Make it init-once,
so that GetStats (which, I guess, is periodically called by kubernetes)
does not re-read huge page sizes over and over.
This also removes 12 allocs and 8K from libct/cg/fs init section:
> $ time GODEBUG=inittrace=1 ./runc --help 2>&1 | grep cgroups/fs
> init github.com/opencontainers/runc/libcontainer/cgroups/fs @1.5 ms, 0.003 ms clock, 16 bytes, 1 allocs
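The init-once part can be sketched with sync.Once, as below. The names
are hypothetical, and readSizes merely stands in for reading
/sys/kernel/mm/hugepages.

```go
package main

import (
	"fmt"
	"sync"
)

var (
	sizesOnce sync.Once
	sizes     []string
	reads     int // how many times the sizes were actually read
)

// readSizes is a hypothetical stand-in for reading the huge page sizes
// from the system.
func readSizes() []string {
	reads++
	return []string{"2MB", "1GB"}
}

// hugePageSizes reads the sizes on first use only, so nothing is read
// or allocated during package init.
func hugePageSizes() []string {
	sizesOnce.Do(func() { sizes = readSizes() })
	return sizes
}

func main() {
	first := hugePageSizes()
	second := hugePageSizes()
	fmt.Println(first, second, reads)
}
```

A program that never calls hugePageSizes pays nothing; one that calls it
periodically pays for the read only once.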
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. Since GetHugePageSize does not have any external users (checked by
sourcegraph), and no internal user ever uses its second return value
(the error), let's drop it.
2. Rename GetHugePageSize -> HugePageSizes (drop the Get prefix as per
Go guidelines, add suffix since we return many sizes).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Currently the parent process of the container is moved to the right
cgroup v2 tree when systemd is using a hybrid model (last line with 0::):
$ runc --systemd-cgroup run myid
/ # cat /proc/self/cgroup
12:cpuset:/system.slice/runc-myid.scope
11:blkio:/system.slice/runc-myid.scope
10:devices:/system.slice/runc-myid.scope
9:hugetlb:/system.slice/runc-myid.scope
8:memory:/system.slice/runc-myid.scope
7:rdma:/
6:perf_event:/system.slice/runc-myid.scope
5:net_cls,net_prio:/system.slice/runc-myid.scope
4:freezer:/system.slice/runc-myid.scope
3:pids:/system.slice/runc-myid.scope
2:cpu,cpuacct:/system.slice/runc-myid.scope
1:name=systemd:/system.slice/runc-myid.scope
0::/system.slice/runc-myid.scope
However, if a second process is executed in the same container, it is
not moved to the right cgroup v2 tree:
$ runc exec myid /bin/sh -c 'cat /proc/self/cgroup'
12:cpuset:/system.slice/runc-myid.scope
11:blkio:/system.slice/runc-myid.scope
10:devices:/system.slice/runc-myid.scope
9:hugetlb:/system.slice/runc-myid.scope
8:memory:/system.slice/runc-myid.scope
7:rdma:/
6:perf_event:/system.slice/runc-myid.scope
5:net_cls,net_prio:/system.slice/runc-myid.scope
4:freezer:/system.slice/runc-myid.scope
3:pids:/system.slice/runc-myid.scope
2:cpu,cpuacct:/system.slice/runc-myid.scope
1:name=systemd:/system.slice/runc-myid.scope
0::/user.slice/user-1000.slice/session-8.scope
This commit makes sure that processes executed with exec are placed into
the right cgroup v2 tree. The implementation checks whether systemd is
using hybrid mode (by checking if cgroup v2 is mounted at
/sys/fs/cgroup/unified); if so, the path of the cgroup v2 slice for
this container is saved into the cgroup path list.
The fs cgroup driver has a similar issue: in this case, neither runc run
nor runc exec puts the process in the right cgroup v2. This commit also
fixes that.
Having the processes of the container in its own cgroup v2 is useful
for any BPF programs that rely on bpf_get_current_cgroup_id(), like
https://github.com/kinvolk/inspektor-gadget/ for instance.
[@kolyshkin: rebased]
Signed-off-by: Mauricio Vásquez <mauricio@kinvolk.io>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
No need to add a file name to the error messages, as errors from
OpenFile and (*os.File).Write both contain the file name already.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Only some libcontainer packages can be built on non-linux platforms
(not that it makes sense, but at least go build succeeds). Let's call
these "good" packages.
For all other packages (i.e. ones that fail to build with GOOS other
than linux), it does not make sense to have a linux build tag (as they
are broken already, and thus are not and can not be used on anything
other than Linux).
Remove the linux build tag from all non-"good" packages.
This was mostly done by the following script, with just a few manual
fixes on top.
    function list_good_pkgs() {
        for pkg in $(find . -type d -print); do
            GOOS=freebsd go build $pkg 2>/dev/null \
            && GOOS=solaris go build $pkg 2>/dev/null \
            && echo $pkg
        done | sed -e 's|^./||' | tr '\n' '|' | sed -e 's/|$//'
    }
    function remove_tag() {
        sed -i -e '\|^// +build linux$|d' $1
        go fmt $1
    }
    SKIP="^("$(list_good_pkgs)")"
    for f in $(git ls-files . | grep .go$); do
        if echo $f | grep -qE "$SKIP"; then
            echo skip $f
            continue
        fi
        echo proc $f
        remove_tag $f
    done
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Since every cgroup directory is guaranteed to have a cgroup.procs file,
we don't have to do filename comparison in GetAllPids() and can just
read cgroup.procs in every directory.
While at it, switch readProcsFile to use our own OpenFile.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Errors from unix.* are always bare and thus can be used directly.
Add //nolint:errorlint annotation to ignore errors such as these:
libcontainer/system/xattrs_linux.go:18:7: comparing with == will fail on wrapped errors. Use errors.Is to check for a specific error (errorlint)
case errno == unix.ERANGE:
^
libcontainer/container_linux.go:1259:9: comparing with != will fail on wrapped errors. Use errors.Is to check for a specific error (errorlint)
if e != unix.EINVAL {
^
libcontainer/rootfs_linux.go:919:7: comparing with != will fail on wrapped errors. Use errors.Is to check for a specific error (errorlint)
if err != unix.EINVAL && err != unix.EPERM {
^
libcontainer/rootfs_linux.go:1002:4: switch on an error will fail on wrapped errors. Use errors.Is to check for specific errors (errorlint)
switch err {
^
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This should result in no change when the error is printed, but make the
errors returned unwrappable, meaning errors.As and errors.Is will work.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This is a better place as cgroups itself is using these.
Should help with moving more code common to fs and fs2 into fscommon.
Looks big, but this is just moving the code around:
fscommon/{fscommon,open}.go -> cgroups/file.go
fscommon/fscommon_test.go -> cgroups/file_test.go
and fixes for TestMode moved to a different package.
There's no functional change.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
gofumpt (mvdan.cc/gofumpt) is a fork of gofmt with stricter rules.
Brought to you by
git ls-files \*.go | grep -v ^vendor/ | xargs gofumpt -s -w
Looking at the diff, all these changes make sense.
Also, replace gofmt with gofumpt in golangci.yml.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Moving these utilities to a separate package, so that consumers of this
package don't have to pull in the whole "system" package.
Looking at uses of these utilities (outside of runc itself),
`RunningInUserNS()` is used by [various external consumers][1],
so add a "Deprecated" alias for it.
[1]: https://grep.app/search?current=2&q=.RunningInUserNS
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
If bfq is not loaded, then io.bfq.weight is not available. io.weight
should always be available and is the next best equivalent.
Signed-off-by: Daniel Dao <dqminh89@gmail.com>
bfq weight controller (i.e. io.bfq.weight if present) is still using the
same bfq weight scheme (i.e. 1->1000, see [1]). Unfortunately, the
documentation for this was wrong, and was only fixed recently [2].
Therefore, if we map blkio weight to io.bfq.weight, there's no need to
do any conversion. Otherwise, we will try to write an invalid value,
which results in an error such as:
```
time="2021-02-03T14:55:30Z" level=error msg="container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: process_linux.go:458: setting cgroup config for procHooks process caused: failed to write \"7475\": write /sys/fs/cgroup/runc-cgroups-integration-test/test-cgroup/io.bfq.weight: numerical result out of range"
```
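To illustrate, here is a sketch of the two mappings, assuming a linear
conversion from the blkio range (10..1000) to the io.weight range
(1..10000). The function names are hypothetical. Under this formula a
blkio weight of 750 becomes 7475, the value seen in the log above, which
is out of range for io.bfq.weight.

```go
package main

import "fmt"

// ioWeight converts a cgroup v1 blkio weight (assumed to be in the
// 10..1000 range) to a cgroup v2 io.weight value (1..10000).
// 0 means "unset" and is passed through.
func ioWeight(blkioWeight uint16) uint64 {
	if blkioWeight == 0 {
		return 0
	}
	return 1 + (uint64(blkioWeight)-10)*9999/990
}

// bfqWeight is for io.bfq.weight, which keeps the 1..1000 scheme,
// so no conversion is needed.
func bfqWeight(blkioWeight uint16) uint64 {
	return uint64(blkioWeight)
}

func main() {
	fmt.Println(ioWeight(750), bfqWeight(750))
}
```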
[1] https://github.com/torvalds/linux/blob/master/Documentation/block/bfq-iosched.rst
[2] 65752aef0a
Signed-off-by: Daniel Dao <dqminh89@gmail.com>
> libcontainer/cgroups/utils.go:282:4: SA4006: this value of `paths` is never used (staticcheck)
> paths = make(map[string]string)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This is a function to convert huge page sizes (obtained by reading
/sys/kernel/mm/hugepages directory entries) to strings used for hugetlb
cgroup controller resource files. Those strings are then used to get the
hugetlb resource statistics.
This function used an external library and floating point numbers, and
could (theoretically) produce invalid values, since the kernel only uses
KB, MB, and GB suffixes.
Rewrite it to produce the same strings as used in the kernel (see [1]).
As a result, it's also faster, more future-proof (entries that do not
start with "hugepages-" and/or have an incorrect suffix are skipped), and
does more input sanity checks. As a side effect, libcontainer no longer
depends on docker/go-units.
While at it, add more test cases.
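The conversion can be sketched as below. This is a simplified,
per-entry illustration with a hypothetical name, not the rewritten
function itself, and it uses strings.CutPrefix and strings.CutSuffix
(Go 1.20 or later) for brevity.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// pageSize converts a directory name such as "hugepages-2048kB" to the
// string used by the hugetlb controller, such as "2MB". The second
// return value is false for names that do not match.
func pageSize(name string) (string, bool) {
	val, ok := strings.CutPrefix(name, "hugepages-")
	if !ok {
		return "", false
	}
	val, ok = strings.CutSuffix(val, "kB")
	if !ok {
		return "", false
	}
	kb, err := strconv.Atoi(val)
	if err != nil || kb <= 0 {
		return "", false
	}
	// The size is in KB already; pick the largest suffix that fits.
	switch {
	case kb >= 1<<20:
		return strconv.Itoa(kb>>20) + "GB", true
	case kb >= 1<<10:
		return strconv.Itoa(kb>>10) + "MB", true
	}
	return strconv.Itoa(kb) + "KB", true
}

func main() {
	for _, n := range []string{"hugepages-2048kB", "hugepages-1048576kB", "bogus"} {
		fmt.Println(pageSize(n))
	}
}
```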
Before:
BenchmarkGetHugePageSize-8 187452 6265 ns/op
BenchmarkGetHugePageSizeImpl-8 396769 2998 ns/op
After:
BenchmarkGetHugePageSize-8 222898 4554 ns/op
BenchmarkGetHugePageSizeImpl-8 4738924 241 ns/op
NOTE on removing HugePageSizeUnitList -- this was added by commit
6f77e35da and was used by kubernetes code in [2], which was later
superseded by [3], so there are (hopefully) no external users.
If there are any, they should not be doing that.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/hugetlb_cgroup.c?id=eff48ddeab782e35e58ccc8853f7386bbae9dec4#n574
[2] https://github.com/kubernetes/kubernetes/pull/78495
[3] https://github.com/kubernetes/kubernetes/pull/84154
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
ioutil.ReadDir does an lstat() on every entry and returns a slice of
os.FileInfo structures. What we need here is just a file name.
This change both simplifies and speeds up the code a bit.
Before:
BenchmarkGetHugePageSize-8 115213 9400 ns/op
After:
BenchmarkGetHugePageSize-8 190326 6187 ns/op
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Using os.RemoveAll has the following two issues:
1. it tries to remove all files, which does not make sense for cgroups;
2. it tries unlink(2), which fails on directories, and then rmdir(2).
Let's reuse our RemovePath instead, and add warning and error logging.
PS I am somewhat hesitant to remove the weird checking by means of stat,
as it might break something. Unfortunately, neither commit 6feb7bda04
nor the PR that contains it [1] explains what kind of weird errors were
seen from os.RemoveAll. Most probably our code won't return any bogus
errors, but let's keep the old code to be on the safe side.
[1] https://github.com/docker-archive/libcontainer/pull/308
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
RemovePaths() deletes elements from the paths map for paths that have
been successfully removed.
It does not, however, empty the map itself (which is needed because,
AFAIK, the Go garbage collector does not shrink the map), but all its
callers do.
Move this operation from callers to RemovePaths.
No functional change, except the old map should be garbage collected now.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Not sure why but the errors from scanner were ignored. Such errors
can happen if open(2) has succeeded but the subsequent read(2) fails.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In most projects, "utils" is a big mess, and this is not an exception.
Try to clean it up a bit by moving cgroup v1 specific code to a separate
source file.
There are no code changes in this commit, just moving it from one file
to another.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This function is cgroupv1-specific, is only used once, and its name
is very close to the name of another function, FindCgroupMountpoint.
Inline it into the (only) caller.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This function is only called from cgroupv1 code, so there is no need
for it to implement cgroupv2 stuff.
Make it v1-specific, and panic if it is called from v2 code (since this
is an internal function, the panic would mean incorrect runc code).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>