Export getIntelRdtRoot function as Root.
This is needed by google/cadvisor, which is (ab)using GetIntelRdtPath,
removed by commit 7296dc1712.
While at it, do some minimal refactoring to always use Root()
internally, not relying on variable value. Other than that it's just
some renaming.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Only some libcontainer packages can be built on non-linux platforms
(not that it make sense, but at least go build succeeds). Let's call
these "good" packages.
For all other packages (i.e. ones that fail to build with GOOS other
than linux), it does not make sense to have linux build tag (as they
are broken already, and thus are not and can not be used on anything
other than Linux).
Remove linux build tag for all non-"good" packages.
This was mostly done by the following script, with just a few manual
fixes on top.
function list_good_pkgs() {
for pkg in $(find . -type d -print); do
GOOS=freebsd go build $pkg 2>/dev/null \
&& GOOS=solaris go build $pkg 2>/dev/null \
&& echo $pkg
done | sed -e 's|^./||' | tr '\n' '|' | sed -e 's/|$//'
}
function remove_tag() {
sed -i -e '\|^// +build linux$|d' $1
go fmt $1
}
SKIP="^("$(list_good_pkgs)")"
for f in $(git ls-files . | grep .go$); do
if echo $f | grep -qE "$SKIP"; then
echo skip $f
continue
fi
echo proc $f
remove_tag $f
done
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Use the term "clos group" instead of "container_id group" as the group
that a container belongs to is not necessarily tied to its container id.
Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
Check that the ClosID directory pre-exists if no L3 or MB schema has
been specified. Conform with the following line from runtime-spec
(config-linux):
If closID is set, and neither of l3CacheSchema and memBwSchema are
set, runtime MUST check if corresponding pre-configured directory
closID is present in mounted resctrl. If such pre-configured directory
closID exists, runtime MUST assign container to this closID and
generate an error if directory does not exist.
Add a TODO note for verifying existing schemata against L3/MB
parameters.
Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
Handle ClosID parameter of IntelRdt. Makes it possible to use
pre-configured classes/ClosIDs and avoid running out of available IDs
which easily happens with per-container classes.
Remove validator checks for empty L3CacheSchema and MemBwSchema fields
in order to be able to leave them empty, and only specify ClosID for
a pre-configured class.
Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>
These are not used anywhere outside of the package
(I have also checked the only external user of the package
(github.com/google/cadvisor).
No changes other than changing the case. The following
identifiers are now private:
* IntelRdtTasks
* NewLastCmdError
* NewStats
Brought to you by gorename.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
... the stack, so every caller will automatically benefit from it.
The only change that it causes is the user in
libcontainer/process_linux.go will get a better error message.
[v2: typo fix]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
For errors that only have a string and an underlying error, using
fmt.Errorf with %w to wrap an error is sufficient.
In this particular case, the code is simplified, and now we have
unwrappable errors as a bonus (same could be achieved by adding
(*LastCmdError).Unwrap() method, but that's adding more code).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Initially, this was copied over from libcontainer/cgroups, where it made
sense as for cgroup v1 we have multiple controllers and mount points.
Here, we only have a single mount, so there's no need for the whole
type.
Replace all that with a simple error (which is currently internal since
the only user is our own test case).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In case getIntelRdtData() returns an error, d is set to nil.
In case the error returned is of NotFoundError type (which happens
if resctlr mount is not found in /proc/self/mountinfo), the function
proceeds to call d.join(), resulting in a nil deref and a panic.
In practice, this never happens in runc because of the checks in
intelrdt() function in libcontainer/configs/validate, which raises
an error in case any of the parameters are set in config but
the IntelRTD itself is not available (that includes checking
that the mount point is there).
Nevertheless, the code is wrong, and can result in nil dereference
if some external users uses Apply on a system without resctrl mount.
Fix this by removing the exclusion.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Do this for all errors except one from unix.*.
This fixes a bunch of errorlint warnings, like these
libcontainer/generic_error.go:25:15: type assertion on error will fail on wrapped errors. Use errors.As to check for specific errors (errorlint)
if le, ok := err.(Error); ok {
^
libcontainer/factory_linux_test.go:145:14: type assertion on error will fail on wrapped errors. Use errors.As to check for specific errors (errorlint)
lerr, ok := err.(Error)
^
libcontainer/state_linux_test.go:28:11: type assertion on error will fail on wrapped errors. Use errors.As to check for specific errors (errorlint)
_, ok := err.(*stateTransitionError)
^
libcontainer/seccomp/patchbpf/enosys_linux.go:88:4: switch on an error will fail on wrapped errors. Use errors.Is to check for specific errors (errorlint)
switch err {
^
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This one is tough as errorlint insists on using errors.Is, and the
latter is known to not work for Go 1.13 which we still support.
So, add a nolint annotation to suppress the warning, and a TODO to
address it later.
For intelrdt, we can do the same, but it is easier to reuse the very
same function from fscommon (note we can't use fscommon for other stuff
as it expects cgroupfs).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This should result in no change when the error is printed, but make the
errors returned unwrappable, meaning errors.As and errors.Is will work.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
An errror from ioutil.WriteFile already contains file name, so there is
no need to duplicate that information.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This commit adjusts the file mode to use the latest golang style
and also changes the file mode value in accordance with default.
Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>
Use sync.Once to init Intel RDT when needed for a small speedup to
operations which do not require Intel RDT.
Simplify IntelRdtManager initialization in LinuxFactory.
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Intel RDT sub-features can be selectively disabled or enabled by kernel
command line. See "rdt=" option details in kernel document:
https://www.kernel.org/doc/Documentation/admin-guide/kernel-parameters.txt
But Cache Monitoring Technology (CMT) feature is not correctly checked
in init() and getCMTNumaNodeStats() now. If CMT is disabled by kernel
command line (e.g., rdt=!cmt,mbmtotal,mbmlocal,l3cat,mba) while hardware
supports CMT, we may get following error when getting Intel RDT stats:
runc run c1
runc events c1
ERRO[0005] container_linux.go:200: getting container's Intel RDT stats
caused: open /sys/fs/resctrl/c1/mon_data/mon_L3_00/llc_occupancy: no
such file or directory
Fix CMT feature check in init() and GetStats() call paths.
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
It might be a tad slower but it surely more correct and well maintained,
so it's better to use it than rely on a custom implementation which is
kind of hard to get entirely right.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The line we are parsing looks like this
> flags : fpu vme de pse <...>
so look for "flags" as a prefix, not substring.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The Err() method should be called after the Scan() loop, not inside it.
Found by
git grep -A3 -F '.Scan()' | grep Err
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This patch fixes a corner case when destroy a container:
If we start a container without 'intelRdt' config set, and then we run
“runc update --l3-cache-schema/--mem-bw-schema” to add 'intelRdt' config
implicitly.
Now if we enter "exit" from the container inside, we will pass through
linuxContainer.Destroy() -> state.destroy() -> intelRdtManager.Destroy().
But in IntelRdtManager.Destroy(), IntelRdtManager.Path is still null
string, it hasn’t been initialized yet. As a result, the created rdt
group directory during "runc update" will not be removed as expected.
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
MBA Software Controller feature is introduced in Linux kernel v4.18.
It is a software enhancement to mitigate some limitations in MBA which
describes in kernel documentation. It also makes the interface more user
friendly - we could specify memory bandwidth in "MBps" (Mega Bytes per
second) as well as in "percentages".
The kernel underneath would use a software feedback mechanism or a
"Software Controller" which reads the actual bandwidth using MBM
counters and adjust the memory bandwidth percentages to ensure:
"actual memory bandwidth < user specified memory bandwidth".
We could enable this feature through mount option "-o mba_MBps":
mount -t resctrl resctrl -o mba_MBps /sys/fs/resctrl
In runc, we handle both memory bandwidth schemata in unified format:
"MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..."
The unit of memory bandwidth is specified in "percentages" by default,
and in "MBps" if MBA Software Controller is enabled.
For more information about Intel RDT and MBA Software Controller:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Linux kernel v4.15 introduces better diagnostics for Intel RDT operation
errors. If any error returns when making new directories or writing to
any of the control file in resctrl filesystem, reading file
/sys/fs/resctrl/info/last_cmd_status could provide more information that
can be conveyed in the error returns from file operations.
Some examples:
echo "L3:0=f3;1=ff" > /sys/fs/resctrl/container_id/schemata
-bash: echo: write error: Invalid argument
cat /sys/fs/resctrl/info/last_cmd_status
mask f3 has non-consecutive 1-bits
echo "MB:0=0;1=110" > /sys/fs/resctrl/container_id/schemata
-bash: echo: write error: Invalid argument
cat /sys/fs/resctrl/info/last_cmd_status
MB value 0 out of range [10,100]
cd /sys/fs/resctrl
mkdir 1 2 3 4 5 6 7 8
mkdir: cannot create directory '8': No space left on device
cat /sys/fs/resctrl/info/last_cmd_status
out of CLOSIDs
See 'last_cmd_status' for more details in kernel documentation:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt
In runc, we could append the diagnostics information to the error
message of Intel RDT operation errors to provide more user-friendly
information.
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Double check if Intel RDT sub-features are available in "resource
control" filesystem. Intel RDT sub-features can be selectively disabled
or enabled by kernel command line (e.g., rdt=!l3cat,mba) in 4.14 and
newer kernel.
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Memory Bandwidth Allocation (MBA) is a resource allocation sub-feature
of Intel Resource Director Technology (RDT) which is supported on some
Intel Xeon platforms. Intel RDT/MBA provides indirect and approximate
throttle over memory bandwidth for the software. A user controls the
resource by indicating the percentage of maximum memory bandwidth.
Hardware details of Intel RDT/MBA can be found in section 17.18 of
Intel Software Developer Manual:
https://software.intel.com/en-us/articles/intel-sdm
In Linux 4.12 kernel and newer, Intel RDT/MBA is enabled by kernel
config CONFIG_INTEL_RDT. If hardware support, CPU flags `rdt_a` and
`mba` will be set in /proc/cpuinfo.
Intel RDT "resource control" filesystem hierarchy:
mount -t resctrl resctrl /sys/fs/resctrl
tree /sys/fs/resctrl
/sys/fs/resctrl/
|-- info
| |-- L3
| | |-- cbm_mask
| | |-- min_cbm_bits
| | |-- num_closids
| |-- MB
| |-- bandwidth_gran
| |-- delay_linear
| |-- min_bandwidth
| |-- num_closids
|-- ...
|-- schemata
|-- tasks
|-- <container_id>
|-- ...
|-- schemata
|-- tasks
For MBA support for `runc`, we will reuse the infrastructure and code
base of Intel RDT/CAT which implemented in #1279. We could also make
use of `tasks` and `schemata` configuration for memory bandwidth
resource constraints.
The file `tasks` has a list of tasks that belongs to this group (e.g.,
<container_id>" group). Tasks can be added to a group by writing the
task ID to the "tasks" file (which will automatically remove them from
the previous group to which they belonged). New tasks created by
fork(2) and clone(2) are added to the same group as their parent.
The file `schemata` has a list of all the resources available to this
group. Each resource (L3 cache, memory bandwidth) has its own line and
format.
Memory bandwidth schema:
It has allocation values for memory bandwidth on each socket, which
contains L3 cache id and memory bandwidth percentage.
Format: "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..."
The minimum bandwidth percentage value for each CPU model is predefined
and can be looked up through "info/MB/min_bandwidth". The bandwidth
granularity that is allocated is also dependent on the CPU model and
can be looked up at "info/MB/bandwidth_gran". The available bandwidth
control steps are: min_bw + N * bw_gran. Intermediate values are
rounded to the next control step available on the hardware.
For more information about Intel RDT kernel interface:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt
An example for runc:
Consider a two-socket machine with two L3 caches where the minimum
memory bandwidth of 10% with a memory bandwidth granularity of 10%.
Tasks inside the container may use a maximum memory bandwidth of 20%
on socket 0 and 70% on socket 1.
"linux": {
"intelRdt": {
"memBwSchema": "MB:0=20;1=70"
}
}
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
This fixes a GetStats() issue introduced in #1590:
If Intel RDT is enabled by hardware and kernel, but intelRdt is not
specified in original config, GetStats() will return error unexpectedly
because we haven't called Apply() to create intelrdt group or attach
tasks for this container. As a result, runc events command will have no
output.
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
In current implementation:
Either Intel RDT is not enabled by hardware and kernel, or intelRdt is
not specified in original config, we don't init IntelRdtManager in the
container to handle intelrdt constraint. It is a tradeoff that Intel RDT
has hardware limitation to support only limited number of groups.
This patch makes a minor change to support update command:
Whether or not intelRdt is specified in config, we always init
IntelRdtManager in the container if Intel RDT is enabled. If intelRdt is
not specified in original config, we just don't Apply() to create
intelrdt group or attach tasks for this container.
In update command, we could re-enable through IntelRdtManager.Apply()
and then update intelrdt constraint.
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
This is the follow-up PR of #1279 to fix remaining issues:
Use init() to avoid race condition in IsIntelRdtEnabled().
Add also rename some variables and functions.
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
About Intel RDT/CAT feature:
Intel platforms with new Xeon CPU support Intel Resource Director Technology
(RDT). Cache Allocation Technology (CAT) is a sub-feature of RDT, which
currently supports L3 cache resource allocation.
This feature provides a way for the software to restrict cache allocation to a
defined 'subset' of L3 cache which may be overlapping with other 'subsets'.
The different subsets are identified by class of service (CLOS) and each CLOS
has a capacity bitmask (CBM).
For more information about Intel RDT/CAT can be found in the section 17.17
of Intel Software Developer Manual.
About Intel RDT/CAT kernel interface:
In Linux 4.10 kernel or newer, the interface is defined and exposed via
"resource control" filesystem, which is a "cgroup-like" interface.
Comparing with cgroups, it has similar process management lifecycle and
interfaces in a container. But unlike cgroups' hierarchy, it has single level
filesystem layout.
Intel RDT "resource control" filesystem hierarchy:
mount -t resctrl resctrl /sys/fs/resctrl
tree /sys/fs/resctrl
/sys/fs/resctrl/
|-- info
| |-- L3
| |-- cbm_mask
| |-- min_cbm_bits
| |-- num_closids
|-- cpus
|-- schemata
|-- tasks
|-- <container_id>
|-- cpus
|-- schemata
|-- tasks
For runc, we can make use of `tasks` and `schemata` configuration for L3 cache
resource constraints.
The file `tasks` has a list of tasks that belongs to this group (e.g.,
<container_id>" group). Tasks can be added to a group by writing the task ID
to the "tasks" file (which will automatically remove them from the previous
group to which they belonged). New tasks created by fork(2) and clone(2) are
added to the same group as their parent. If a pid is not in any sub group, it
Is in root group.
The file `schemata` has allocation bitmasks/values for L3 cache on each socket,
which contains L3 cache id and capacity bitmask (CBM).
Format: "L3:<cache_id0>=<cbm0>;<cache_id1>=<cbm1>;..."
For example, on a two-socket machine, L3's schema line could be `L3:0=ff;1=c0`
which means L3 cache id 0's CBM is 0xff, and L3 cache id 1's CBM is 0xc0.
The valid L3 cache CBM is a *contiguous bits set* and number of bits that can
be set is less than the max bit. The max bits in the CBM is varied among
supported Intel Xeon platforms. In Intel RDT "resource control" filesystem
layout, the CBM in a group should be a subset of the CBM in root. Kernel will
check if it is valid when writing. e.g., 0xfffff in root indicates the max bits
of CBM is 20 bits, which mapping to entire L3 cache capacity. Some valid CBM
values to set in a group: 0xf, 0xf0, 0x3ff, 0x1f00 and etc.
For more information about Intel RDT/CAT kernel interface:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt
An example for runc:
Consider a two-socket machine with two L3 caches where the default CBM is
0xfffff and the max CBM length is 20 bits. With this configuration, tasks
inside the container only have access to the "upper" 80% of L3 cache id 0 and
the "lower" 50% L3 cache id 1:
"linux": {
"intelRdt": {
"l3CacheSchema": "L3:0=ffff0;1=3ff"
}
}
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>