Since LinuxFactory has become the means to specify the container state
top directory (aka --root), and is only used by two methods (Create
and Load), it is easier to pass root to them directly.
Modify all the users and the docs accordingly.
While at it, fix Create and Load docs (those that were originally moved
from the Factory interface docs).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The value of root is already an absolute path since commit
ede8a86ec1, so it does not make sense to call filepath.Abs()
again.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Remove intelrdt.Manager interface, since we only have a single
implementation, and do not expect another one.
Rename intelRdtManager to Manager, and modify its users accordingly.
Remove NewIntelRdtManager from factory.
Remove IntelRdtfs. Instead, make intelrdt.NewManager return nil if the
feature is not available.
Remove TestFactoryNewIntelRdt as it is now identical to TestFactoryNew.
Add internal function newManager to be used for tests (to make sure
some testing is done even when the feature is not available in
kernel/hardware).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
These were introduced in commit d8b669400 back in 2017, with a TODO
of "make binary names configurable". Apparently, everyone is happy with
the hardcoded names. In fact, they *are* configurable (by prepending the
PATH with a directory containing your own version of newuidmap/newgidmap).
Now, these binaries are only needed in a few specific cases (when
rootless is set etc.), so let's look them up only when needed.
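A minimal sketch of the "look up only when needed" idea follows. The
helper name lookupTools and the command-line trigger are hypothetical
and only illustrate deferring exec.LookPath until a code path actually
requires the binaries; this is not the code from this commit.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// lookupTools resolves each name with the supplied lookup function and
// returns a name -> path map. It is a hypothetical helper; passing the
// lookup function in makes the sketch testable without touching $PATH.
func lookupTools(look func(string) (string, error), names ...string) (map[string]string, error) {
	paths := make(map[string]string, len(names))
	for _, name := range names {
		p, err := look(name)
		if err != nil {
			return nil, fmt.Errorf("looking up %s: %w", name, err)
		}
		paths[name] = p
	}
	return paths, nil
}

func main() {
	// Hypothetical trigger: only a "rootless" run needs the helpers.
	rootless := len(os.Args) > 1 && os.Args[1] == "rootless"
	if !rootless {
		fmt.Println("ID-mapping helpers not needed; skipping lookup")
		return
	}
	tools, err := lookupTools(exec.LookPath, "newuidmap", "newgidmap")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("newuidmap:", tools["newuidmap"])
	fmt.Println("newgidmap:", tools["newgidmap"])
}
```

Because the lookup goes through $PATH, prepending a directory to PATH
is enough to substitute custom binaries.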
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This was introduced in an initial commit, back in the day when criu was
a highly experimental thing. Today it's not; most users who need it have
it packaged by their distro vendor.
The usual way to run a binary is to look it up in directories listed in
$PATH. This is flexible enough and allows for multiple scenarios (custom
binaries, extra binaries, etc.). This is the way criu should be run.
Make --criu a hidden option (thus removing it from help). Remove the
option from man pages, integration tests, etc. Remove all traces of
CriuPath from data structures.
Add a warning that --criu is ignored and will be removed.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Those were never used (ctx was added by the initial commit, and
error was added by commit 25fd4a6757).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In some setups, multiple cgroups are used inside a container,
and sometimes there is a need to execute a process in a particular
sub-cgroup (in case of cgroup v1, for a particular controller).
This is what this commit implements.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. Make Rootless and Systemd flags part of config.Cgroups.
2. Make all cgroup managers (not just fs2) return an error (so they can do
more initialization -- added by the following commits).
3. Replace complicated cgroup manager instantiation in factory_linux
by a single (and simple) libcontainer/cgroups/manager.New() function.
4. getUnifiedPath is simplified to check that only a single path is
supplied (rather than checking that other paths, if supplied,
are the same).
[v2: can't -> cannot]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
All three callers* of startContainer call revisePidFile and createSpec
before calling it, so it makes sense to move those calls inside
startContainer, and drop the spec argument.
* -- in fact restore does not call revisePidFile, but it should.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Instead of passing _LIBCONTAINER_LOGLEVEL as a string
(like "debug" or "info"), use a numeric value.
Also, simplify the init log level passing code -- since we actually use
the same level as the runc binary, just get it from logrus.
This is a preparation for the next commit.
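A rough sketch of passing the level as a number: the parent encodes
its level as a decimal string and the child parses it back. The level
type and both helper names are hypothetical stand-ins (the commit
itself takes the level from logrus); only the variable name
_LIBCONTAINER_LOGLEVEL comes from this message.

```go
package main

import (
	"fmt"
	"strconv"
)

// level is a stand-in for a logging library's numeric level type.
type level uint32

// envLogLevel is the environment variable named in the commit message.
const envLogLevel = "_LIBCONTAINER_LOGLEVEL"

// encodeLevel renders the level as the decimal value to put into the
// child's environment.
func encodeLevel(l level) string {
	return strconv.FormatUint(uint64(l), 10)
}

// decodeLevel parses the numeric value on the receiving side and
// rejects anything that is not a number (such as "debug").
func decodeLevel(s string) (level, error) {
	n, err := strconv.ParseUint(s, 10, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid %s %q: %w", envLogLevel, s, err)
	}
	return level(n), nil
}

func main() {
	parent := level(5) // arbitrary example value
	entry := envLogLevel + "=" + encodeLevel(parent)
	fmt.Println("environment entry for the child:", entry)

	child, err := decodeLevel(encodeLevel(parent))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("child uses the same level:", child == parent)
}
```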
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
For files that end with _linux.go or _linux_test.go, there is no need to
specify linux build tag, as it is assumed from the file name.
In addition, rename libcontainer/notify_linux_v2.go -> libcontainer/notify_v2_linux.go
for the file name to make sense.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
gofumpt (mvdan.cc/gofumpt) is a fork of gofmt with stricter rules.
Brought to you by
git ls-files \*.go | grep -v ^vendor/ | xargs gofumpt -s -w
Looking at the diff, all these changes make sense.
Also, replace gofmt with gofumpt in golangci.yml.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Use sync.Once to init Intel RDT when needed for a small speedup to
operations which do not require Intel RDT.
Simplify IntelRdtManager initialization in LinuxFactory.
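The lazy initialization pattern can be sketched as below. The probe
function and the variable names are hypothetical placeholders for the
real detection work; the point is only that sync.Once runs the
expensive step on first use and never again.

```go
package main

import (
	"fmt"
	"sync"
)

var (
	initOnce  sync.Once
	initCalls int  // counts how often the expensive probe ran
	enabled   bool // cached probe result
)

// probe is a placeholder for the expensive feature detection.
func probe() bool {
	initCalls++
	return true
}

// featureEnabled runs probe at most once, on first use. Callers that
// never ask about the feature never pay for the probe.
func featureEnabled() bool {
	initOnce.Do(func() { enabled = probe() })
	return enabled
}

func main() {
	first := featureEnabled()
	second := featureEnabled()
	fmt.Println("enabled:", first, second)
	fmt.Println("probe calls:", initCalls)
}
```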
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
This fixes the following failure:
> sudo runc run -b bundle ctr </dev/null
> WARN[0000] exit status 2
> ERRO[0000] container_linux.go:367: starting container process caused: process_linux.go:459: container init caused:
The "exit status 2" with no error message is caused by SIGHUP
which is sent to init by the kernel when we are losing the
controlling terminal. If we choose to ignore that, we'll get
panic in console.Current(), which is addressed by [1].
Otherwise, the issue here is simple: the code assumes stdin
is opened to a terminal, and fails to work otherwise. Some
standard Linux tools (e.g. stty, top) do the same (modulo panic),
while some others (reset, tput) use the trick of trying
all the three std streams (starting with stderr as it is least likely
to be redirected), and if all three fail, open /dev/tty.
This commit does a similar thing (see initHostConsole).
It also replaces the call to console.Current(), which may panic
(see [1]), by reusing the t.hostConsole.
Finally, a simple test case is added.
Fixes: https://github.com/opencontainers/runc/issues/2485
[1] https://github.com/containerd/console/pull/37
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
If NOTIFY_SOCKET is used, do not block the main runc process waiting
for events on the notify socket. Bind mount the parent directory of
the notify socket, so that "start" can create the socket and it is
still accessible from the container.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Memory Bandwidth Allocation (MBA) is a resource allocation sub-feature
of Intel Resource Director Technology (RDT) which is supported on some
Intel Xeon platforms. Intel RDT/MBA provides indirect and approximate
throttle over memory bandwidth for the software. A user controls the
resource by indicating the percentage of maximum memory bandwidth.
Hardware details of Intel RDT/MBA can be found in section 17.18 of
Intel Software Developer Manual:
https://software.intel.com/en-us/articles/intel-sdm
In Linux 4.12 kernel and newer, Intel RDT/MBA is enabled by kernel
config CONFIG_INTEL_RDT. If the hardware supports it, CPU flags `rdt_a` and
`mba` will be set in /proc/cpuinfo.
Intel RDT "resource control" filesystem hierarchy:
mount -t resctrl resctrl /sys/fs/resctrl
tree /sys/fs/resctrl
/sys/fs/resctrl/
|-- info
|   |-- L3
|   |   |-- cbm_mask
|   |   |-- min_cbm_bits
|   |   |-- num_closids
|   |-- MB
|       |-- bandwidth_gran
|       |-- delay_linear
|       |-- min_bandwidth
|       |-- num_closids
|-- ...
|-- schemata
|-- tasks
|-- <container_id>
    |-- ...
    |-- schemata
    |-- tasks
For MBA support for `runc`, we will reuse the infrastructure and code
base of Intel RDT/CAT which was implemented in #1279. We could also make
use of `tasks` and `schemata` configuration for memory bandwidth
resource constraints.
The file `tasks` has a list of tasks that belong to this group (e.g.,
"<container_id>" group). Tasks can be added to a group by writing the
task ID to the "tasks" file (which will automatically remove them from
the previous group to which they belonged). New tasks created by
fork(2) and clone(2) are added to the same group as their parent.
The file `schemata` has a list of all the resources available to this
group. Each resource (L3 cache, memory bandwidth) has its own line and
format.
Memory bandwidth schema:
It has allocation values for memory bandwidth on each socket, which
contains L3 cache id and memory bandwidth percentage.
Format: "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..."
The minimum bandwidth percentage value for each CPU model is predefined
and can be looked up through "info/MB/min_bandwidth". The bandwidth
granularity that is allocated is also dependent on the CPU model and
can be looked up at "info/MB/bandwidth_gran". The available bandwidth
control steps are: min_bw + N * bw_gran. Intermediate values are
rounded to the next control step available on the hardware.
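The control step arithmetic can be illustrated with a small sketch.
nextStep is a hypothetical helper that reads "next control step" as
rounding up; clamping requests below the minimum to min_bw is an
assumption of this sketch, and no upper bound is enforced.

```go
package main

import "fmt"

// nextStep returns the smallest value of the form minBW + N*gran
// (N >= 0) that is not below requested. gran must be positive.
// Requests below minBW are clamped to minBW (an assumption).
func nextStep(requested, minBW, gran int) int {
	if requested <= minBW {
		return minBW
	}
	// Integer ceiling of (requested - minBW) / gran.
	n := (requested - minBW + gran - 1) / gran
	return minBW + n*gran
}

func main() {
	// min_bandwidth = 10 and bandwidth_gran = 10, as in the runc
	// example in this message.
	for _, req := range []int{5, 20, 25} {
		fmt.Printf("requested %d%% -> %d%%\n", req, nextStep(req, 10, 10))
	}
}
```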
For more information about Intel RDT kernel interface:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt
An example for runc:
Consider a two-socket machine with two L3 caches where the minimum
memory bandwidth is 10% with a memory bandwidth granularity of 10%.
Tasks inside the container may use a maximum memory bandwidth of 20%
on socket 0 and 70% on socket 1.
"linux": {
    "intelRdt": {
        "memBwSchema": "MB:0=20;1=70"
    }
}
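To make the schema format concrete, here is a rough parser for a
memBwSchema string. parseMemBwSchema is a hypothetical helper written
for illustration (runc passes the schema to the kernel rather than
necessarily parsing it this way).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMemBwSchema turns "MB:<id>=<pct>;<id>=<pct>;..." into a map
// from cache id to bandwidth percentage.
func parseMemBwSchema(s string) (map[int]int, error) {
	if !strings.HasPrefix(s, "MB:") {
		return nil, fmt.Errorf("schema %q does not start with MB:", s)
	}
	out := make(map[int]int)
	for _, part := range strings.Split(strings.TrimPrefix(s, "MB:"), ";") {
		kv := strings.SplitN(part, "=", 2)
		if len(kv) != 2 {
			return nil, fmt.Errorf("malformed entry %q", part)
		}
		id, err := strconv.Atoi(strings.TrimSpace(kv[0]))
		if err != nil {
			return nil, fmt.Errorf("bad cache id in %q: %w", part, err)
		}
		pct, err := strconv.Atoi(strings.TrimSpace(kv[1]))
		if err != nil {
			return nil, fmt.Errorf("bad percentage in %q: %w", part, err)
		}
		out[id] = pct
	}
	return out, nil
}

func main() {
	m, err := parseMemBwSchema("MB:0=20;1=70")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("socket 0: %d%%, socket 1: %d%%\n", m[0], m[1])
}
```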
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
This PR decomposes `libcontainer/configs.Config.Rootless bool` into `RootlessEUID bool` and
`RootlessCgroups bool`, so as to make "runc-in-userns" more compatible with "rootful" runc.
`RootlessEUID` denotes that runc is being executed as a non-root user (euid != 0) in
the current user namespace. `RootlessEUID` is almost identical to the former `Rootless`
except for the cgroups handling.
`RootlessCgroups` denotes that runc is unlikely to have full access to cgroups.
`RootlessCgroups` is set to false if runc is executed as the root (euid == 0) in the initial namespace.
Otherwise `RootlessCgroups` is set to true.
(Hint: if `RootlessEUID` is true, `RootlessCgroups` becomes true as well)
When runc is executed as the root (euid == 0) in a user namespace (e.g. by Docker-in-LXD, Podman, Usernetes),
`RootlessEUID` is set to false but `RootlessCgroups` is set to true.
So, "runc-in-userns" behaves almost the same as "rootful" runc except that cgroups errors are ignored.
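The rules above can be summarized in a short sketch. The type and
function names are hypothetical, and the sketch models only the
automatic case; it ignores any override of `RootlessCgroups` from the
command line.

```go
package main

import "fmt"

// rootlessFlags mirrors the two booleans described above.
type rootlessFlags struct {
	EUID    bool // runc runs as a non-root user (euid != 0)
	Cgroups bool // runc is unlikely to have full cgroups access
}

// decide derives both flags from the effective UID and from whether
// the process runs in a non-initial user namespace.
func decide(euid int, inUserNS bool) rootlessFlags {
	return rootlessFlags{
		EUID: euid != 0,
		// Only root in the initial namespace gets Cgroups == false.
		Cgroups: euid != 0 || inUserNS,
	}
}

func main() {
	fmt.Printf("root, initial ns: %+v\n", decide(0, false))
	fmt.Printf("root, in userns:  %+v\n", decide(0, true))
	fmt.Printf("uid 1000:         %+v\n", decide(1000, false))
}
```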
This PR does not have any impact on CLI flags and `state.json`.
Note about CLI:
* Now `runc --rootless=(auto|true|false)` CLI flag is only used for setting `RootlessCgroups`.
* Now `runc spec --rootless` is only required when `RootlessEUID` is set to true.
For runc-in-userns, `runc spec` without `--rootless` should work, when sufficient numbers of
UID/GID are mapped.
Note about `$XDG_RUNTIME_DIR` (e.g. `/run/user/1000`):
* `$XDG_RUNTIME_DIR` is ignored if runc is being executed as the root (euid == 0) in the initial namespace, for backward compatibility.
(`/run/runc` is used)
* If runc is executed as the root (euid == 0) in a user namespace, `$XDG_RUNTIME_DIR` is honored if `$USER != "" && $USER != "root"`.
This allows unprivileged users to execute runc as the root in userns, without mounting writable `/run/runc`.
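The `$XDG_RUNTIME_DIR` rules can be sketched as a single predicate.
honorXDGRuntimeDir is a hypothetical name, and honoring the variable
for non-root users (euid != 0) is an assumption not spelled out in the
notes above.

```go
package main

import "fmt"

// honorXDGRuntimeDir reports whether $XDG_RUNTIME_DIR should be used,
// given the effective UID, whether the process is in a non-initial
// user namespace, and the value of $USER.
func honorXDGRuntimeDir(euid int, inUserNS bool, user string) bool {
	if euid != 0 {
		return true // assumption: non-root users honor the variable
	}
	if !inUserNS {
		return false // root in the initial namespace uses /run/runc
	}
	return user != "" && user != "root"
}

func main() {
	fmt.Println("root, initial ns:        ", honorXDGRuntimeDir(0, false, "root"))
	fmt.Println("root in userns, USER=bob:", honorXDGRuntimeDir(0, true, "bob"))
	fmt.Println("root in userns, USER=root:", honorXDGRuntimeDir(0, true, "root"))
}
```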
Note about `state.json`:
* `rootless` is set to true when `RootlessEUID == true && RootlessCgroups == true`.
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
There is a race in runc exec when the init process stops just before
the check for the container status. It is then wrongly assumed that
we are trying to start an init process instead of an exec process.
This commit adds an Init field to libcontainer Process to distinguish
between init and exec processes to prevent this race.
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
In some cases, /sys/fs/cgroup is mounted read-only. In rootless
containers we can consider this effectively identical to having cgroups
that we don't have write permission to -- because the user isn't
responsible for the read-only setup and cannot modify it. The rules are
identical to when /sys/fs/cgroup is not writable by the unprivileged
user.
An example of this is the default configuration of Docker, where cgroups
are mounted as read-only as a preventative security measure.
Reported-by: Vladimir Rutsky <rutsky@google.com>
Signed-off-by: Aleksa Sarai <asarai@suse.de>