Commit Graph

26 Commits

Author SHA1 Message Date
Kir Kolyshkin
e6048715e4 Use gofumpt to format code
gofumpt (mvdan.cc/gofumpt) is a fork of gofmt with stricter rules.

Brought to you by

	git ls-files \*.go | grep -v ^vendor/ | xargs gofumpt -s -w

Looking at the diff, all these changes make sense.

Also, replace gofmt with gofumpt in golangci.yml.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-01 12:17:27 -07:00
Kir Kolyshkin
2192670a24 libct/configs/validate: validate mounts
Add a check that mount destination is absolute (as per OCI spec).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-04-20 11:48:44 -07:00
Akihiro Suda
913b9f14e8 Merge pull request #2886 from thaJeztah/check_cleanup 2021-04-06 03:28:45 +09:00
Sebastiaan van Stijn
f1586dbd7a libcontainer/configs/validate: make Validate() less DRY
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-04-02 11:41:19 +02:00
Kir Kolyshkin
b118430231 libct/configs/validator: add some cgroup support
Add some minimal validation for cgroups. The following checks
are implemented:

 - cgroup name and/or prefix (or path) is set;
 - for cgroup v1, unified resources are not set;
 - for cgroup v2, if memorySwap is set, memory is also set,
   and memorySwap > memory.

This makes some invalid configurations fail earlier (before runc init
is started), which is better.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-03-31 14:36:52 -07:00
Kir Kolyshkin
8e8661e124 libct/configs/validate/sysctl: fix repeated netns checks
In case many net.* sysctls are provided, and we're not running
in the host netns, the function keep repeating isNetNS check
for every such sysctl. This is a waste of resources.

Do the isNetNS check only once, and only if needed.

Note that using sync.Once() is not really needed here; we could
have used a boolean variable to skip the repeated check, but
it looks more idiomatic that way.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-12-18 09:32:39 -08:00
Kir Kolyshkin
2dce06995a libct/configs/validate: fix host netns check
In case nsfs mount (such as /run/docker/netns/xxxx) is provided as
the netns path, the current way of determining whether path is of
host netns or not is not working.

The proper way to check is to do stat(2) and compare dev_t and
inode fields, which is what this commit does.

This is a minimal fix which does not try to optimize repeated
check in case more than one net.* sysctl is given and there is
no error.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-12-18 09:29:13 -08:00
Xiaochen Shen
f62ad4a0de libcontainer/intelrdt: rename CAT and MBA enabled flags
Rename CAT and MBA enabled flags to be consistent with others.
No functional change.

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2020-11-10 15:32:01 +08:00
Amim Knabben
978fa6e906 Fixing some lint issues
Signed-off-by: Amim Knabben <amim.knabben@gmail.com>
2020-10-06 14:44:14 -04:00
John Hwang
7fc291fd45 Replace formatted errors when unneeded
Signed-off-by: John Hwang <John.F.Hwang@gmail.com>
2020-05-16 18:13:21 -07:00
Yuanhong Peng
df3fa115f9 Add support for cgroup namespace
Cgroup namespace can be configured in `config.json` as other
namespaces. Here is an example:

```
"namespaces": [
	{
		"type": "pid"
	},
	{
		"type": "network"
	},
	{
		"type": "ipc"
	},
	{
		"type": "uts"
	},
	{
		"type": "mount"
	},
	{
		"type": "cgroup"
	}
],

```

Note that if you want to run a container which has shared cgroup ns with
another container, then it's strongly recommended that you set
proper `CgroupsPath` of both containers(the second container's cgroup
path must be the subdirectory of the first one). Or there might be
some unexpected results.

Signed-off-by: Yuanhong Peng <pengyuanhong@huawei.com>
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2018-10-31 10:51:43 -04:00
Xiaochen Shen
27560ace2f libcontainer: intelrdt: add support for Intel RDT/MBA in runc
Memory Bandwidth Allocation (MBA) is a resource allocation sub-feature
of Intel Resource Director Technology (RDT) which is supported on some
Intel Xeon platforms. Intel RDT/MBA provides indirect and approximate
throttle over memory bandwidth for the software. A user controls the
resource by indicating the percentage of maximum memory bandwidth.

Hardware details of Intel RDT/MBA can be found in section 17.18 of
Intel Software Developer Manual:
https://software.intel.com/en-us/articles/intel-sdm

In Linux 4.12 kernel and newer, Intel RDT/MBA is enabled by kernel
config CONFIG_INTEL_RDT. If hardware support, CPU flags `rdt_a` and
`mba` will be set in /proc/cpuinfo.

Intel RDT "resource control" filesystem hierarchy:
mount -t resctrl resctrl /sys/fs/resctrl
tree /sys/fs/resctrl
/sys/fs/resctrl/
|-- info
|   |-- L3
|   |   |-- cbm_mask
|   |   |-- min_cbm_bits
|   |   |-- num_closids
|   |-- MB
|       |-- bandwidth_gran
|       |-- delay_linear
|       |-- min_bandwidth
|       |-- num_closids
|-- ...
|-- schemata
|-- tasks
|-- <container_id>
    |-- ...
    |-- schemata
    |-- tasks

For MBA support for `runc`, we will reuse the infrastructure and code
base of Intel RDT/CAT which implemented in #1279. We could also make
use of `tasks` and `schemata` configuration for memory bandwidth
resource constraints.

The file `tasks` has a list of tasks that belongs to this group (e.g.,
<container_id>" group). Tasks can be added to a group by writing the
task ID to the "tasks" file (which will automatically remove them from
the previous group to which they belonged). New tasks created by
fork(2) and clone(2) are added to the same group as their parent.

The file `schemata` has a list of all the resources available to this
group. Each resource (L3 cache, memory bandwidth) has its own line and
format.

Memory bandwidth schema:
It has allocation values for memory bandwidth on each socket, which
contains L3 cache id and memory bandwidth percentage.
    Format: "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..."

The minimum bandwidth percentage value for each CPU model is predefined
and can be looked up through "info/MB/min_bandwidth". The bandwidth
granularity that is allocated is also dependent on the CPU model and
can be looked up at "info/MB/bandwidth_gran". The available bandwidth
control steps are: min_bw + N * bw_gran. Intermediate values are
rounded to the next control step available on the hardware.

For more information about Intel RDT kernel interface:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt

An example for runc:
Consider a two-socket machine with two L3 caches where the minimum
memory bandwidth of 10% with a memory bandwidth granularity of 10%.
Tasks inside the container may use a maximum memory bandwidth of 20%
on socket 0 and 70% on socket 1.

"linux": {
    "intelRdt": {
        "memBwSchema": "MB:0=20;1=70"
    }
}

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2018-10-16 14:29:29 +08:00
Akihiro Suda
06f789cf26 Disable rootless mode except RootlessCgMgr when executed as the root in userns
This PR decomposes `libcontainer/configs.Config.Rootless bool` into `RootlessEUID bool` and
`RootlessCgroups bool`, so as to make "runc-in-userns" to be more compatible with "rootful" runc.

`RootlessEUID` denotes that runc is being executed as a non-root user (euid != 0) in
the current user namespace. `RootlessEUID` is almost identical to the former `Rootless`
except cgroups stuff.

`RootlessCgroups` denotes that runc is unlikely to have the full access to cgroups.
`RootlessCgroups` is set to false if runc is executed as the root (euid == 0) in the initial namespace.
Otherwise `RootlessCgroups` is set to true.
(Hint: if `RootlessEUID` is true, `RootlessCgroups` becomes true as well)

When runc is executed as the root (euid == 0) in an user namespace (e.g. by Docker-in-LXD, Podman, Usernetes),
`RootlessEUID` is set to false but `RootlessCgroups` is set to true.
So, "runc-in-userns" behaves almost same as "rootful" runc except that cgroups errors are ignored.

This PR does not have any impact on CLI flags and `state.json`.

Note about CLI:
* Now `runc --rootless=(auto|true|false)` CLI flag is only used for setting `RootlessCgroups`.
* Now `runc spec --rootless` is only required when `RootlessEUID` is set to true.
  For runc-in-userns, `runc spec`  without `--rootless` should work, when sufficient numbers of
  UID/GID are mapped.

Note about `$XDG_RUNTIME_DIR` (e.g. `/run/user/1000`):
* `$XDG_RUNTIME_DIR` is ignored if runc is being executed as the root (euid == 0) in the initial namespace, for backward compatibility.
  (`/run/runc` is used)
* If runc is executed as the root (euid == 0) in an user namespace, `$XDG_RUNTIME_DIR` is honored if `$USER != "" && $USER != "root"`.
  This allows unprivileged users to allow execute runc as the root in userns, without mounting writable `/run/runc`.

Note about `state.json`:
* `rootless` is set to true when `RootlessEUID == true && RootlessCgroups == true`.

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-09-07 15:05:03 +09:00
Aleksa Sarai
823c06eae9 libcontainer: improve "kernel.{domainname,hostname}" sysctl handling
These sysctls are namespaced by CLONE_NEWUTS, and we need to use
"kernel.domainname" if we want users to be able to set an NIS domainname
on Linux. However we disallow "kernel.hostname" because it would
conflict with the "hostname" field and cause confusion (but we include a
helpful message to make it clearer to the user).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2018-06-18 21:48:04 +10:00
Xiaochen Shen
88d22fde40 libcontainer: intelrdt: use init() to avoid race condition
This is the follow-up PR of #1279 to fix remaining issues:

Use init() to avoid race condition in IsIntelRdtEnabled().
Add also rename some variables and functions.

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2017-09-08 17:15:31 +08:00
Xiaochen Shen
692f6e1e27 libcontainer: add support for Intel RDT/CAT in runc
About Intel RDT/CAT feature:
Intel platforms with new Xeon CPU support Intel Resource Director Technology
(RDT). Cache Allocation Technology (CAT) is a sub-feature of RDT, which
currently supports L3 cache resource allocation.

This feature provides a way for the software to restrict cache allocation to a
defined 'subset' of L3 cache which may be overlapping with other 'subsets'.
The different subsets are identified by class of service (CLOS) and each CLOS
has a capacity bitmask (CBM).

For more information about Intel RDT/CAT can be found in the section 17.17
of Intel Software Developer Manual.

About Intel RDT/CAT kernel interface:
In Linux 4.10 kernel or newer, the interface is defined and exposed via
"resource control" filesystem, which is a "cgroup-like" interface.

Comparing with cgroups, it has similar process management lifecycle and
interfaces in a container. But unlike cgroups' hierarchy, it has single level
filesystem layout.

Intel RDT "resource control" filesystem hierarchy:
mount -t resctrl resctrl /sys/fs/resctrl
tree /sys/fs/resctrl
/sys/fs/resctrl/
|-- info
|   |-- L3
|       |-- cbm_mask
|       |-- min_cbm_bits
|       |-- num_closids
|-- cpus
|-- schemata
|-- tasks
|-- <container_id>
    |-- cpus
    |-- schemata
    |-- tasks

For runc, we can make use of `tasks` and `schemata` configuration for L3 cache
resource constraints.

The file `tasks` has a list of tasks that belongs to this group (e.g.,
<container_id>" group). Tasks can be added to a group by writing the task ID
to the "tasks" file  (which will automatically remove them from the previous
group to which they belonged). New tasks created by fork(2) and clone(2) are
added to the same group as their parent. If a pid is not in any sub group, it
Is in root group.

The file `schemata` has allocation bitmasks/values for L3 cache on each socket,
which contains L3 cache id and capacity bitmask (CBM).
	Format: "L3:<cache_id0>=<cbm0>;<cache_id1>=<cbm1>;..."
For example, on a two-socket machine, L3's schema line could be `L3:0=ff;1=c0`
which means L3 cache id 0's CBM is 0xff, and L3 cache id 1's CBM is 0xc0.

The valid L3 cache CBM is a *contiguous bits set* and number of bits that can
be set is less than the max bit. The max bits in the CBM is varied among
supported Intel Xeon platforms. In Intel RDT "resource control" filesystem
layout, the CBM in a group should be a subset of the CBM in root. Kernel will
check if it is valid when writing. e.g., 0xfffff in root indicates the max bits
of CBM is 20 bits, which mapping to entire L3 cache capacity. Some valid CBM
values to set in a group: 0xf, 0xf0, 0x3ff, 0x1f00 and etc.

For more information about Intel RDT/CAT kernel interface:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt

An example for runc:
Consider a two-socket machine with two L3 caches where the default CBM is
0xfffff and the max CBM length is 20 bits. With this configuration, tasks
inside the container only have access to the "upper" 80% of L3 cache id 0 and
the "lower" 50% L3 cache id 1:

"linux": {
	"intelRdt": {
		"l3CacheSchema": "L3:0=ffff0;1=3ff"
	}
}

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2017-09-01 14:26:33 +08:00
Daniel, Dao Quang Minh
13a8c5d140 Merge pull request #1365 from hqhq/use_go_selinux
Use opencontainers/selinux package
2017-04-15 14:22:32 +01:00
Aleksa Sarai
d2f49696b0 runc: add support for rootless containers
This enables the support for the rootless container mode. There are many
restrictions on what rootless containers can do, so many different runC
commands have been disabled:

* runc checkpoint
* runc events
* runc pause
* runc ps
* runc restore
* runc resume
* runc update

The following commands work:

* runc create
* runc delete
* runc exec
* runc kill
* runc list
* runc run
* runc spec
* runc state

In addition, any specification options that imply joining cgroups have
also been disabled. This is due to support for unprivileged subtree
management not being available from Linux upstream.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-03-23 20:45:24 +11:00
Qiang Huang
5e7b48f7c0 Use opencontainers/selinux package
It's splitted as a separate project.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-03-23 08:21:19 +08:00
Samuel Ortiz
f19aa2d04d validate: Check that the given namespace path is a symlink
When checking if the provided networking namespace is the host
one or not, we should first check if it's a symbolic link or not
as in some cases we can use persistent networking namespace under
e.g. /var/run/netns/.

Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
2016-12-10 11:14:49 +01:00
Qiang Huang
81d6088c8f Unify rootfs validation
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2016-10-29 10:31:44 +08:00
Aleksa Sarai
2a94c3651b validator: unbreak sysctl net.* validation
When changing this validation, the code actually allowing the validation
to pass was removed. This meant that any net.* sysctl would always fail
to validate.

Fixes: bc84f83344 ("fix docker/docker#27484")
Reported-by: Justin Cormack <justin.cormack@docker.com>
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2016-10-26 22:58:51 +11:00
Ce Gao
bc84f83344 fix docker/docker#27484
Signed-off-by: Ce Gao <ce.gao@outlook.com>
2016-10-22 11:22:52 +08:00
rajasec
d0bf80e481 Adding selinux check during container start
Signed-off-by: rajasec <rajasec79@gmail.com>

Fixed review comments and rebased

Signed-off-by: rajasec <rajasec79@gmail.com>

updated the message as per review comment

Signed-off-by: Rajasekaran <rajasec79@gmail.com>
2016-04-19 22:22:04 +05:30
Mrunal Patel
5640330693 Fix for runc failing when rootfs has a traling slash
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2016-04-11 09:50:28 -07:00
Alberto Leal
dca2d12760 Add unit tests for validate.Validator
Signed-off-by: Alberto Leal <albertonb@gmail.com>
2016-04-06 11:18:11 +01:00