This is a function to convert huge page sizes (obtained by reading
/sys/kernel/mm/hugepages directory entries) to strings used for hugetlb
cgroup controller resource files. Those strings are then used to get the
hugetlb resource statistics.
This function used an external library and floating point numbers, and
could (theoretically) produce invalid values, since the kernel only uses
KB, MB, and GB suffixes.
Rewrite it to produce the same strings as used in the kernel (see [1]).
As a result, it's also faster, more future-proof (entries that do not
start with "hugepages-" and/or have an incorrect suffix are skipped), and
does more input sanity checks. As a side effect, libcontainer no longer
depends on docker/go-units.
While at it, add more test cases.
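For illustration, the conversion described above can be sketched roughly as follows. This is not the actual runc code: hugePageSizeName is a hypothetical helper, and it assumes that every directory entry has the form "hugepages-<N>kB".

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// hugePageSizeName is a hypothetical helper (not the actual runc function).
// It converts a /sys/kernel/mm/hugepages entry name such as
// "hugepages-2048kB" to a string with a KB, MB, or GB suffix.
func hugePageSizeName(entry string) (string, error) {
	const prefix, suffix = "hugepages-", "kB"
	// Entries with an unexpected prefix or suffix are rejected, so the
	// caller can skip them.
	if !strings.HasPrefix(entry, prefix) || !strings.HasSuffix(entry, suffix) {
		return "", fmt.Errorf("unexpected entry name %q", entry)
	}
	num := strings.TrimSuffix(strings.TrimPrefix(entry, prefix), suffix)
	kb, err := strconv.ParseUint(num, 10, 64)
	if err != nil {
		return "", fmt.Errorf("bad size in %q: %w", entry, err)
	}
	// Integer arithmetic only: pick the largest unit that fits.
	switch {
	case kb >= 1<<20:
		return fmt.Sprintf("%dGB", kb>>20), nil
	case kb >= 1<<10:
		return fmt.Sprintf("%dMB", kb>>10), nil
	default:
		return fmt.Sprintf("%dKB", kb), nil
	}
}

func main() {
	for _, e := range []string{"hugepages-64kB", "hugepages-2048kB", "hugepages-1048576kB"} {
		s, err := hugePageSizeName(e)
		fmt.Println(e, s, err)
	}
}
```

The sketch uses only integer shifts, so no floating point rounding is involved.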
Before:
BenchmarkGetHugePageSize-8 187452 6265 ns/op
BenchmarkGetHugePageSizeImpl-8 396769 2998 ns/op
After:
BenchmarkGetHugePageSize-8 222898 4554 ns/op
BenchmarkGetHugePageSizeImpl-8 4738924 241 ns/op
NOTE on removing HugePageSizeUnitList -- this was added by commit
6f77e35da and was used by kubernetes code in [2], which was later
superseded by [3], so there are (hopefully) no external users.
If there are any, they should not be doing that.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/hugetlb_cgroup.c?id=eff48ddeab782e35e58ccc8853f7386bbae9dec4#n574
[2] https://github.com/kubernetes/kubernetes/pull/78495
[3] https://github.com/kubernetes/kubernetes/pull/84154
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
ioutil.ReadDir does a stat() on every entry and returns a slice of
os.FileInfo structures. What we need here is just the file names.
This change both simplifies and speeds up the code a bit.
Before:
BenchmarkGetHugePageSize-8 115213 9400 ns/op
After:
BenchmarkGetHugePageSize-8 190326 6187 ns/op
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Using os.RemoveAll has the following two issues:
1. it tries to remove all files, which does not make sense for cgroups;
2. it tries unlink(2), which fails on directories, and only then rmdir(2).
Let's reuse our RemovePath instead, and add warnings and errors logging.
PS I am somewhat hesitant to remove the weird checking by means of stat,
as it might break something. Unfortunately, neither commit 6feb7bda04
nor the PR that contains it [1] explains what kind of weird errors were
seen from os.RemoveAll. Most probably our code won't return any bogus
errors, but let's keep the old code to be on the safe side.
[1] https://github.com/docker-archive/libcontainer/pull/308
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
RemovePaths() deletes elements from the paths map for paths that have
been successfully removed.
However, it does not empty the map itself (which is needed because,
AFAIK, the Go garbage collector does not shrink maps); all its callers do.
Move this operation from callers to RemovePaths.
No functional change, except the old map should be garbage collected now.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Not sure why, but the errors from the scanner were ignored. Such errors
can happen if open(2) has succeeded but the subsequent read(2) fails.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In most projects, "utils" is a big mess, and this is not an exception.
Try to clean it up a bit by moving cgroup v1 specific code to a separate
source file.
There are no code changes in this commit, just moving it from one file
to another.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This function is cgroupv1-specific, is only used once, and its name
is very close to the name of another function, FindCgroupMountpoint.
Inline it into the (only) caller.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This function is only called from cgroupv1 code, so there is no need
for it to implement cgroupv2 stuff.
Make it v1-specific, and panic if it is called from v2 code (since this
is an internal function, the panic would mean incorrect runc code).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
It's bad and wrong to use these functions for any cgroupv2 code,
and there are no existing users (in runc, at least).
Make them return an error in such case.
Also, remove the cgroupv2-specific handling from
findCgroupMountpointAndRootFromReader().
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This function should not really be used for cgroupv2 code.
Currently it is used in kubernetes code, so we can't remove
the v2 case yet.
Add a TODO item to remove v2 code once kubernetes is converted
to not use it, and separate out v1 code.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This function is not used and was never used in any cgroupv2 code.
To have it stay that way, let it return an error in case it's called
for v2.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This (and the converting function) is only used by one of the four
cgroup drivers. The other three do some checking and conversion in
place, so let fs2 do the same.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
... and mem+swap is not explicitly set otherwise.
This ensures compatibility with cgroupv1 controller which interprets
things this way.
With this fixed, we can finally enable swap tests for cgroupv2.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The function GetClosestMountpointAncestor is not very efficient,
does not really belong to cgroup package, and is only used once
(from fs/cpuset.go).
Remove it, replacing it with an implementation based on the
moby/sys/mountinfo parser.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This function is not very efficient, does not really belong to cgroup
package, and is only used once (from fs/cpuset.go).
Prepare to remove it by replacing it with an implementation based on
the parser from github.com/moby/sys/mountinfo.
This commit is here to make sure the proposed replacement passes the
unit test.
Funny, but the unit test needs to be slightly modified since it
supplies the wrong mountinfo (space as the first character, empty line
at the end).
Validated by
$ go test -v -run Ance
=== RUN TestGetClosestMountpointAncestor
--- PASS: TestGetClosestMountpointAncestor (0.00s)
PASS
ok github.com/opencontainers/runc/libcontainer/cgroups 0.002s
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In cgroup v2, when memory and memorySwap are set to the same value, which
is greater than zero, runc should write zero to `memory.swap.max` to
disable swap.
Signed-off-by: lifubang <lifubang@acmcoder.com>
The resources.MemorySwap field from OCI is memory+swap, while cgroupv2
has a separate swap limit, so subtract memory from the limit (and make
sure values are set and sane).
Make sure to set MemorySwapMax for systemd, too. Since systemd does not
have MemorySwapMax for cgroupv1, it is only needed for v2 driver.
[v2: return -1 on any negative value, add unit test]
[v3: treat any negative value other than -1 as error]
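A rough sketch of the conversion, based only on the rules stated in this message. swapLimitV2 is a hypothetical name and signature, not the function added by this commit, and the exact sanity checks (for example, requiring a positive memory limit) are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// swapLimitV2 is a hypothetical helper. memorySwap is the OCI memory+swap
// limit and memory is the memory limit. It returns the separate swap limit
// for cgroup v2 and whether a limit should be written at all.
// A returned value of -1 means "unlimited".
func swapLimitV2(memorySwap, memory int64) (swap int64, set bool, err error) {
	switch {
	case memorySwap == 0:
		return 0, false, nil // not set, nothing to write
	case memorySwap == -1:
		return -1, true, nil // unlimited
	case memorySwap < -1:
		return 0, false, errors.New("invalid memory+swap value")
	case memory <= 0:
		// Assumption: the subtraction needs a sane memory limit.
		return 0, false, errors.New("memory+swap is set, but memory is not")
	case memorySwap < memory:
		return 0, false, errors.New("memory+swap is less than memory")
	}
	// Equal values yield 0, which disables swap.
	return memorySwap - memory, true, nil
}

func main() {
	fmt.Println(swapLimitV2(200, 100))
	fmt.Println(swapLimitV2(100, 100))
	fmt.Println(swapLimitV2(-1, 100))
	fmt.Println(swapLimitV2(-2, 100))
}
```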
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Make use of errors.Is() and errors.As() where appropriate to check
the underlying error. The biggest motivation is to simplify the code.
The feature requires Go 1.13, but since merging #2256 we already do
not support Go 1.12 (which is an unsupported release anyway).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Using errors.Unwrap() is not the best thing to do, since it returns
nil in case of an error which was not wrapped. Moreover, the
errors package provides more elegant ways to check for underlying
errors, such as errors.As() and errors.Is().
This reverts commit f8e138855d, reversing
changes made to 6ca9d8e6da.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. Return earlier if there is an error.
2. Do not use filepath.Split on every entry, use info.Name() instead.
3. Make readProcsFile() accept file name as an argument, to avoid
unnecessary file name and directory splitting and merging.
4. Skip on info.IsDir() -- this avoids an error when cgroup name is
set to "cgroup.procs".
This is still not very good since filepath.Walk() performs an unnecessary
stat(2) on every entry, but better than before.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
* ConvertCPUSharesToCgroupV2Value(0) was returning 70369281052672, while the correct value is 0
* ConvertBlkIOToCgroupV2Value(0) was returning 32, while the correct value is 0
* ConvertBlkIOToCgroupV2Value(1000) was returning 4, while the correct value is 10000
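
A sketch of conversions that produce the corrected values listed above. The function names are hypothetical, and the input ranges (2..262144 for CPU shares, 10..1000 for blkio weight), the output range 1..10000, and the clamping of out-of-range input are assumptions of this example.

```go
package main

import "fmt"

// clamp limits v to the range [lo, hi].
func clamp(v, lo, hi uint64) uint64 {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// cpuSharesToWeight is a hypothetical helper mapping cgroup v1 CPU shares
// (assumed range 2..262144) linearly to 1..10000. Zero means "not set".
func cpuSharesToWeight(shares uint64) uint64 {
	if shares == 0 {
		return 0
	}
	shares = clamp(shares, 2, 262144)
	return 1 + (shares-2)*9999/262142
}

// blkioWeightToV2 is a hypothetical helper mapping a cgroup v1 blkio
// weight (assumed range 10..1000) linearly to 1..10000. The arithmetic
// is done in uint64, since (1000-10)*9999 does not fit into uint16.
func blkioWeightToV2(weight uint16) uint64 {
	if weight == 0 {
		return 0
	}
	w := clamp(uint64(weight), 10, 1000)
	return 1 + (w-10)*9999/990
}

func main() {
	fmt.Println(cpuSharesToWeight(0), cpuSharesToWeight(1024))
	fmt.Println(blkioWeightToV2(0), blkioWeightToV2(1000))
}
```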
Fix #2244
Follow-up to #2212 #2213
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Allow setting which subsystems are used by
libcontainer/cgroups/fs.Manager.
subsystemsUnified is used on a system running in cgroups v2 unified
mode.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Cgroup namespace can be configured in `config.json` like other
namespaces. Here is an example:
```
"namespaces": [
    {
        "type": "pid"
    },
    {
        "type": "network"
    },
    {
        "type": "ipc"
    },
    {
        "type": "uts"
    },
    {
        "type": "mount"
    },
    {
        "type": "cgroup"
    }
],
```
Note that if you want to run a container which shares its cgroup ns with
another container, it is strongly recommended that you set a proper
`CgroupsPath` for both containers (the second container's cgroup path
must be a subdirectory of the first one). Otherwise, there might be
some unexpected results.
Signed-off-by: Yuanhong Peng <pengyuanhong@huawei.com>
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Respect the container's cgroup path when finding the container's
cgroup mount point. This is useful in multi-tenant environments, where
containers have their own unique cgroup mounts.
Signed-off-by: Danail Branekov <danailster@gmail.com>
Signed-off-by: Oliver Stenbom <ostenbom@pivotal.io>
Signed-off-by: Giuseppe Capizzi <gcapizzi@pivotal.io>
Fix duplicate entries and missing entries in getCgroupMountsHelper
Add test for testing cgroup mounts on bedrock linux
Stop relying on number of subsystems for cgroups
LGTMs: @crosbymichael @cyphar
Closes #1817
When there are complicated mount setups, there can be multiple mount
points which have the subsystem we are looking for. Instead of
counting the mountpoints, tick off subsystems until we have found them
all.
Without the 'all' flag, ignore duplicate subsystems after the first.
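The idea can be sketched as follows. The mount type and the selectMounts function are simplified stand-ins invented for this example; they are not the types used by getCgroupMountsHelper.

```go
package main

import "fmt"

// mount is a simplified, hypothetical description of a cgroup mount.
type mount struct {
	Mountpoint string
	Subsystems []string
}

// selectMounts returns the mounts that carry any of the wanted subsystems.
// Without all, each subsystem is ticked off once found, later duplicates
// are ignored, and the scan stops when nothing is left to find.
func selectMounts(mounts []mount, wanted []string, all bool) []mount {
	remaining := make(map[string]bool, len(wanted))
	for _, s := range wanted {
		remaining[s] = true
	}
	var res []mount
	for _, m := range mounts {
		if !all && len(remaining) == 0 {
			break // every subsystem has been ticked off
		}
		var subs []string
		for _, s := range m.Subsystems {
			if !remaining[s] {
				continue // not wanted, or a duplicate already ticked off
			}
			subs = append(subs, s)
			if !all {
				delete(remaining, s)
			}
		}
		if len(subs) > 0 {
			res = append(res, mount{Mountpoint: m.Mountpoint, Subsystems: subs})
		}
	}
	return res
}

func main() {
	mounts := []mount{
		{"/sys/fs/cgroup/cpu,cpuacct", []string{"cpu", "cpuacct"}},
		{"/sys/fs/cgroup/memory", []string{"memory"}},
		{"/other/cgroup/cpu,cpuacct", []string{"cpu", "cpuacct"}},
	}
	wanted := []string{"cpu", "cpuacct", "memory"}
	fmt.Println(len(selectMounts(mounts, wanted, false)))
	fmt.Println(len(selectMounts(mounts, wanted, true)))
}
```

Here the number of mount points never has to match the number of subsystems, which is the point of the change.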
Signed-off-by: Daniel Dao <dqminh89@gmail.com>
The rootless cgroup manager acts as a noop for all set and apply
operations. It is just used for rootless setups. Currently this is far
too simple (we need to add opportunistic cgroup management), but is good
enough as a first-pass at a noop cgroup manager.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Runc needs to copy certain files from the top of the cgroup cpuset hierarchy
into the container's cpuset cgroup directory. Currently, runc determines
which directory is the top of the hierarchy by using the parent dir of
the first entry in /proc/self/mountinfo of type cgroup.
This creates problems when cgroup subsystems are mounted arbitrarily in
different dirs on the host.
Now, we use the most deeply nested mountpoint that contains the
container's cpuset cgroup directory.
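A minimal sketch of picking the most deeply nested mountpoint. closestMountpoint is a hypothetical helper that assumes clean absolute paths and takes the list of mountpoints as input instead of parsing /proc/self/mountinfo.

```go
package main

import (
	"fmt"
	"strings"
)

// isAncestorOrSelf reports whether dir is mp itself or lies below it.
// Comparing against mp+"/" avoids matching "/a/cpusetfoo" for "/a/cpuset".
func isAncestorOrSelf(mp, dir string) bool {
	if mp == dir || mp == "/" {
		return true
	}
	return strings.HasPrefix(dir, mp+"/")
}

// closestMountpoint is a hypothetical helper. It returns the longest
// (most deeply nested) mountpoint that contains dir, or "" if none does.
func closestMountpoint(dir string, mountpoints []string) string {
	best := ""
	for _, mp := range mountpoints {
		if isAncestorOrSelf(mp, dir) && len(mp) > len(best) {
			best = mp
		}
	}
	return best
}

func main() {
	mountpoints := []string{
		"/sys",
		"/sys/fs/cgroup",
		"/sys/fs/cgroup/cpuset",
		"/mnt/cgroup/cpuset",
	}
	fmt.Println(closestMountpoint("/sys/fs/cgroup/cpuset/my/container", mountpoints))
}
```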
Signed-off-by: Konstantinos Karampogias <konstantinos.karampogias@swisscom.com>
Signed-off-by: Will Martin <wmartin@pivotal.io>