Commit Graph

90 Commits

Author SHA1 Message Date
Kir Kolyshkin
6a3fe1618f libcontainer: remove LinuxFactory
Since LinuxFactory has become the means to specify containers state
top directory (aka --root), and is only used by two methods (Create
and Load), it is easier to pass root to them directly.

Modify all the users and the docs accordingly.

While at it, fix Create and Load docs (those that were originally moved
from the Factory interface docs).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-03-22 23:44:31 -07:00
Kir Kolyshkin
40b0088681 loadFactory: remove
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-02-18 16:05:29 -08:00
Kir Kolyshkin
36786c361a list, utils: remove redundant code
The value of root is already an absolute path since commit
ede8a86ec1, so it does not make sense to call filepath.Abs()
again.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-02-18 16:05:29 -08:00
Kir Kolyshkin
dbd990d555 libct: rm intelrtd.Manager interface, NewIntelRdtManager
Remove intelrtd.Manager interface, since we only have a single
implementation, and do not expect another one.

Rename intelRdtManager to Manager, and modify its users accordingly.

Remove NewIntelRdtManager from factory.

Remove IntelRdtfs. Instead, make intelrdt.NewManager return nil if the
feature is not available.

Remove TestFactoryNewIntelRdt as it is now identical to TestFactoryNew.

Add internal function newManager to be used for tests (to make sure
some testing is done even when the feature is not available in
kernel/hardware).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-02-03 17:33:03 -08:00
Kir Kolyshkin
39bd7b7217 libct: Container, Factory: rm newuidmap/newgidmap
These were introduced in commit d8b669400 back in 2017, with a TODO
of "make binary names configurable". Apparently, everyone is happy with
the hardcoded names. In fact, they *are* configurable (by prepending the
PATH with a directory containing own version of newuidmap/newgidmap).

Now, these binaries are only needed in a few specific cases (when
rootless is set etc.), so let's look them up only when needed.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-02-03 11:40:29 -08:00
Kir Kolyshkin
6e1d476aad runc: remove --criu option
This was introduced in an initial commit, back in the day when criu was
a highly experimental thing. Today it's not; most users who need it have
it packaged by their distro vendor.

The usual way to run a binary is to look it up in directories listed in
$PATH. This is flexible enough and allows for multiple scenarios (custom
binaries, extra binaries, etc.). This is the way criu should be run.

Make --criu a hidden option (thus removing it from help). Remove the
option from man pages, integration tests, etc. Remove all traces of
CriuPath from data structures.

Add a warning that --criu is ignored and will be removed.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-01-26 20:25:56 -08:00
Kir Kolyshkin
86733013cc notify_socket: setupSpec: drop ctx arg and return value
Those were never used (ctx was added by the initial commit, and
error was added by commit 25fd4a6757).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-11-29 20:10:22 -08:00
Kir Kolyshkin
3648346572 tty: ClosePostStart: rm return value
It is not and was not ever used.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-11-29 20:10:22 -08:00
Kir Kolyshkin
f3f4b6d155 tty: recvtty: rm process arg
It is not used since commit 00a0ecf554.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-11-29 20:10:22 -08:00
Kir Kolyshkin
e63186351b tty: rm inheritStdio return value
Since commit eebdb644f9 this function never returns any error.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-11-29 20:10:22 -08:00
Kir Kolyshkin
d23b810927 checkpoint: rm getDefaultImagePath arg
It was never used.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-11-29 20:10:22 -08:00
Kir Kolyshkin
0202c398ff runc exec: implement --cgroup
In some setups, multiple cgroups are used inside a container,
and sometime there is a need to execute a process in a particular
sub-cgroup (in case of cgroup v1, for a particular controller).
This is what this commit implements.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-09-27 10:25:42 -07:00
Kir Kolyshkin
097c6d7425 libct/cg: simplify getting cgroup manager
1. Make Rootless and Systemd flags part of config.Cgroups.

2. Make all cgroup managers (not just fs2) return error (so it can do
   more initialization -- added by the following commits).

3. Replace complicated cgroup manager instantiation in factory_linux
   by a single (and simple) libcontainer/cgroups/manager.New() function.

4. getUnifiedPath is simplified to check that only a single path is
   supplied (rather than checking that other paths, if supplied,
   are the same).

[v2: can't -> cannot]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-09-23 09:11:44 -07:00
Kir Kolyshkin
9ba2f65d6b startContainer: minor refactor
All three callers* of startContainer call revisePidFile and createSpec
before calling it, so it makes sense to move those calls to inside of
the startContainer, and drop the spec argument.

* -- in fact restore does not call revisePidFile, but it should.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-09-14 10:53:11 -07:00
Kir Kolyshkin
6c4a3b13d1 runc init: pass _LIBCONTAINER_LOGLEVEL as int
Instead of passing _LIBCONTAINER_LOGLEVEL as a string
(like "debug" or "info"), use a numeric value.

Also, simplify the init log level passing code -- since we actually use
the same level as the runc binary, just get it from logrus.

This is a preparation for the next commit.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-09-09 14:57:20 -07:00
Kir Kolyshkin
0a3577c680 utils_linux: simplify newProcess
newProcess do not need those extra arguments, they can be handled
in the caller.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-09-09 14:57:20 -07:00
Akihiro Suda
5fb9b2a006 Merge pull request #3185 from kolyshkin/go117-build-tags
Add go:build tags
2021-09-02 13:35:33 +09:00
Kir Kolyshkin
9ff64c3d97 *: rm redundant linux build tag
For files that end with _linux.go or _linux_test.go, there is no need to
specify linux build tag, as it is assumed from the file name.

In addition, rename libcontainer/notify_linux_v2.go -> libcontainer/notify_v2_linux.go
for the file name to make sense.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-08-30 20:15:00 -07:00
lifubang
cb824629ba proposal: add --keep to runc run
Signed-off-by: lifubang <lifubang@acmcoder.com>
2021-08-02 12:51:36 -07:00
Kir Kolyshkin
a7cfb23b88 *: stop using pkg/errors
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-22 16:09:47 -07:00
Kir Kolyshkin
e6048715e4 Use gofumpt to format code
gofumpt (mvdan.cc/gofumpt) is a fork of gofmt with stricter rules.

Brought to you by

	git ls-files \*.go | grep -v ^vendor/ | xargs gofumpt -s -w

Looking at the diff, all these changes make sense.

Also, replace gofmt with gofumpt in golangci.yml.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-01 12:17:27 -07:00
Kir Kolyshkin
719d70d2e3 setupIO: simplify code
There's no need to check err for nil.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-01-07 13:33:41 -08:00
Xiaochen Shen
325a74ddec libcontainer/intelrdt: rm init() from intelrdt.go
Use sync.Once to init Intel RDT when needed for a small speedup to
operations which do not require Intel RDT.

Simplify IntelRdtManager initialization in LinuxFactory.

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2020-12-16 23:37:31 +08:00
Xiaochen Shen
f62ad4a0de libcontainer/intelrdt: rename CAT and MBA enabled flags
Rename CAT and MBA enabled flags to be consistent with others.
No functional change.

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2020-11-10 15:32:01 +08:00
Sebastiaan van Stijn
8bf216728c use string-concatenation instead of sprintf for simple cases
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-09-30 10:51:59 +02:00
Kir Kolyshkin
b79cb04886 runc run/exec: fix terminal wrt stdin redirection
This fixes the following failure:

> sudo runc run -b bundle ctr </dev/null
> WARN[0000] exit status 2
> ERRO[0000] container_linux.go:367: starting container process caused: process_linux.go:459: container init caused:

The "exit status 2" with no error message is caused by SIGHUP
which is sent to init by the kernel when we are losing the
controlling terminal. If we choose to ignore that, we'll get
panic in console.Current(), which is addressed by [1].

Otherwise, the issue here is simple: the code assumes stdin
is opened to a terminal, and fails to work otherwise. Some
standard Linux tools (e.g. stty, top) do the same (modulo panic),
while some others (reset, tput) use the trick of trying
all the three std streams (starting with stderr as it is least likely
to be redirected), and if all three fails, open /dev/tty.

This commit does a similar thing (see initHostConsole).

It also replaces the call to console.Current(), which may panic
(see [1]), by reusing the t.hostConsole.

Finally, a simple test case is added.

Fixes: https://github.com/opencontainers/runc/issues/2485

[1] https://github.com/containerd/console/pull/37

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-08-20 07:56:20 -07:00
zvier
92e2175de1 cleancode: clean code for utils_linux.go
Signed-off-by: Jeff Zvier <zvier20@gmail.com>
2020-07-23 06:12:27 +08:00
John Hwang
5aa0601a59 validateProcessSpec: prevent SEGV when config is valid json, but invalid.
Signed-off-by: John Hwang <John.F.Hwang@gmail.com>
2020-05-18 09:38:22 -07:00
John Hwang
7fc291fd45 Replace formatted errors when unneeded
Signed-off-by: John Hwang <John.F.Hwang@gmail.com>
2020-05-16 18:13:21 -07:00
Kir Kolyshkin
2b31437caa Merge pull request #2281 from AkihiroSuda/rootless-systemd
cgroup v2: support rootless systemd

LGTMs: kolyshkin, mrunalp
2020-05-07 21:45:52 -07:00
Akihiro Suda
bf15cc99b1 cgroup v2: support rootless systemd
Tested with both Podman (master) and Moby (master), on Ubuntu 19.10 .

$ podman --cgroup-manager=systemd run -it --rm --runtime=runc \
  --cgroupns=host --memory 42m --cpus 0.42 --pids-limit 42 alpine
/ # cat /proc/self/cgroup
0::/user.slice/user-1001.slice/user@1001.service/user.slice/libpod-132ff0d72245e6f13a3bbc6cdc5376886897b60ac59eaa8dea1df7ab959cbf1c.scope
/ # cat /sys/fs/cgroup/user.slice/user-1001.slice/user@1001.service/user.slice/libpod-132ff0d72245e6f13a3bbc6cdc5376886897b60ac59eaa8dea1df7ab959cbf1c.scope/memory.max
44040192
/ # cat /sys/fs/cgroup/user.slice/user-1001.slice/user@1001.service/user.slice/libpod-132ff0d72245e6f13a3bbc6cdc5376886897b60ac59eaa8dea1df7ab959cbf1c.scope/cpu.max
42000 100000
/ # cat /sys/fs/cgroup/user.slice/user-1001.slice/user@1001.service/user.slice/libpod-132ff0d72245e6f13a3bbc6cdc5376886897b60ac59eaa8dea1df7ab959cbf1c.scope/pids.max
42

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-05-08 12:39:20 +09:00
Kir Kolyshkin
c52a598d74 Remove fatalf()
It was only used in one place, all others are happy with
`fatal(fmt.Errorf())`.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-02 16:19:14 -07:00
Mrunal Patel
33c6125da6 systemd: Export IsSystemdRunning() function
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2020-03-30 15:24:06 -07:00
Akihiro Suda
cc183ca662 Merge pull request #2242 from AkihiroSuda/vendor-systemd
vendor: update go-systemd and godbus
2020-03-25 02:40:22 +09:00
Ted Yu
0a7762c664 Avoid duplicate calls to runner#destroy
Signed-off-by: Ted Yu <yuzhihong@gmail.com>
2020-03-23 09:04:38 -07:00
Akihiro Suda
492d525e55 vendor: update go-systemd and godbus
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-03-16 13:26:03 +09:00
Giuseppe Scrivano
25fd4a6757 sd-notify: do not hang when NOTIFY_SOCKET is used with create
if NOTIFY_SOCKET is used, do not block the main runc process waiting
for events on the notify socket.  Bind mount the parent directory of
the notify socket, so that "start" can create the socket and it is
still accessible from the container.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2020-03-12 21:21:05 +01:00
Mrunal Patel
eb4aeed24f Merge pull request #2038 from imxyb/defer-destroy
`r.destroy` can defer exec in `runner.run` method.
2019-05-07 15:48:14 -07:00
Georgi Sabev
ba3cabf932 Improve nsexec logging
* Simplify logging function
* Logs contain __FUNCTION__:__LINE__
* Bail uses write_log

Co-authored-by: Julia Nedialkova <julianedialkova@hotmail.com>
Co-authored-by: Danail Branekov <danailster@gmail.com>
Signed-off-by: Georgi Sabev <georgethebeatle@gmail.com>
2019-04-22 17:53:52 +03:00
Xiao YongBiao
da5a2dd456 r.destroy can defer exec in runner.run method.
Signed-off-by: Xiao YongBiao <xyb4638@gmail.com>
2019-04-10 23:25:03 +08:00
lifubang
3e6688f5c9 add selinux label for runc exec
Signed-off-by: lifubang <lifubang@acmcoder.com>
2019-04-03 12:09:06 +08:00
lifubang
7cb3cde1f4 fix preserve-fds flag may cause runc hang
Signed-off-by: lifubang <lifubang@acmcoder.com>
2019-03-01 17:15:17 +08:00
Xiaochen Shen
27560ace2f libcontainer: intelrdt: add support for Intel RDT/MBA in runc
Memory Bandwidth Allocation (MBA) is a resource allocation sub-feature
of Intel Resource Director Technology (RDT) which is supported on some
Intel Xeon platforms. Intel RDT/MBA provides indirect and approximate
throttle over memory bandwidth for the software. A user controls the
resource by indicating the percentage of maximum memory bandwidth.

Hardware details of Intel RDT/MBA can be found in section 17.18 of
Intel Software Developer Manual:
https://software.intel.com/en-us/articles/intel-sdm

In Linux 4.12 kernel and newer, Intel RDT/MBA is enabled by kernel
config CONFIG_INTEL_RDT. If hardware support, CPU flags `rdt_a` and
`mba` will be set in /proc/cpuinfo.

Intel RDT "resource control" filesystem hierarchy:
mount -t resctrl resctrl /sys/fs/resctrl
tree /sys/fs/resctrl
/sys/fs/resctrl/
|-- info
|   |-- L3
|   |   |-- cbm_mask
|   |   |-- min_cbm_bits
|   |   |-- num_closids
|   |-- MB
|       |-- bandwidth_gran
|       |-- delay_linear
|       |-- min_bandwidth
|       |-- num_closids
|-- ...
|-- schemata
|-- tasks
|-- <container_id>
    |-- ...
    |-- schemata
    |-- tasks

For MBA support for `runc`, we will reuse the infrastructure and code
base of Intel RDT/CAT which implemented in #1279. We could also make
use of `tasks` and `schemata` configuration for memory bandwidth
resource constraints.

The file `tasks` has a list of tasks that belongs to this group (e.g.,
<container_id>" group). Tasks can be added to a group by writing the
task ID to the "tasks" file (which will automatically remove them from
the previous group to which they belonged). New tasks created by
fork(2) and clone(2) are added to the same group as their parent.

The file `schemata` has a list of all the resources available to this
group. Each resource (L3 cache, memory bandwidth) has its own line and
format.

Memory bandwidth schema:
It has allocation values for memory bandwidth on each socket, which
contains L3 cache id and memory bandwidth percentage.
    Format: "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..."

The minimum bandwidth percentage value for each CPU model is predefined
and can be looked up through "info/MB/min_bandwidth". The bandwidth
granularity that is allocated is also dependent on the CPU model and
can be looked up at "info/MB/bandwidth_gran". The available bandwidth
control steps are: min_bw + N * bw_gran. Intermediate values are
rounded to the next control step available on the hardware.

For more information about Intel RDT kernel interface:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt

An example for runc:
Consider a two-socket machine with two L3 caches where the minimum
memory bandwidth of 10% with a memory bandwidth granularity of 10%.
Tasks inside the container may use a maximum memory bandwidth of 20%
on socket 0 and 70% on socket 1.

"linux": {
    "intelRdt": {
        "memBwSchema": "MB:0=20;1=70"
    }
}

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2018-10-16 14:29:29 +08:00
Akihiro Suda
06f789cf26 Disable rootless mode except RootlessCgMgr when executed as the root in userns
This PR decomposes `libcontainer/configs.Config.Rootless bool` into `RootlessEUID bool` and
`RootlessCgroups bool`, so as to make "runc-in-userns" to be more compatible with "rootful" runc.

`RootlessEUID` denotes that runc is being executed as a non-root user (euid != 0) in
the current user namespace. `RootlessEUID` is almost identical to the former `Rootless`
except cgroups stuff.

`RootlessCgroups` denotes that runc is unlikely to have the full access to cgroups.
`RootlessCgroups` is set to false if runc is executed as the root (euid == 0) in the initial namespace.
Otherwise `RootlessCgroups` is set to true.
(Hint: if `RootlessEUID` is true, `RootlessCgroups` becomes true as well)

When runc is executed as the root (euid == 0) in an user namespace (e.g. by Docker-in-LXD, Podman, Usernetes),
`RootlessEUID` is set to false but `RootlessCgroups` is set to true.
So, "runc-in-userns" behaves almost same as "rootful" runc except that cgroups errors are ignored.

This PR does not have any impact on CLI flags and `state.json`.

Note about CLI:
* Now `runc --rootless=(auto|true|false)` CLI flag is only used for setting `RootlessCgroups`.
* Now `runc spec --rootless` is only required when `RootlessEUID` is set to true.
  For runc-in-userns, `runc spec`  without `--rootless` should work, when sufficient numbers of
  UID/GID are mapped.

Note about `$XDG_RUNTIME_DIR` (e.g. `/run/user/1000`):
* `$XDG_RUNTIME_DIR` is ignored if runc is being executed as the root (euid == 0) in the initial namespace, for backward compatibility.
  (`/run/runc` is used)
* If runc is executed as the root (euid == 0) in an user namespace, `$XDG_RUNTIME_DIR` is honored if `$USER != "" && $USER != "root"`.
  This allows unprivileged users to allow execute runc as the root in userns, without mounting writable `/run/runc`.

Note about `state.json`:
* `rootless` is set to true when `RootlessEUID == true && RootlessCgroups == true`.

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-09-07 15:05:03 +09:00
Mrunal Patel
bd3c4f844a Fix race in runc exec
There is a race in runc exec when the init process stops just before
the check for the container status. It is then wrongly assumed that
we are trying to start an init process instead of an exec process.

This commit add an Init field to libcontainer Process to distinguish
between init and exec processes to prevent this race.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2018-06-01 16:25:58 -07:00
Akihiro Suda
63bb0fe9d0 Fix merge conflict
Caused by:
* #1688 0e561642f8
* #1759 dd67ab10d7

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-05-30 11:25:43 +09:00
Michael Crosby
0e561642f8 Merge pull request #1688 from AkihiroSuda/unshare-m-r
main: support rootless mode in userns
2018-05-29 15:41:17 -04:00
Akihiro Suda
cdb7f23d21 main: add condition to isRootless()
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-05-10 12:17:37 +09:00
Akihiro Suda
f103de57ec main: support rootless mode in userns
Running rootless containers in userns is useful for mounting
filesystems (e.g. overlay) with mapped euid 0, but without actual root
privilege.

Usage: (Note that `unshare --mount` requires `--map-root-user`)

  user$ mkdir lower upper work rootfs
  user$ curl http://dl-cdn.alpinelinux.org/alpine/v3.7/releases/x86_64/alpine-minirootfs-3.7.0-x86_64.tar.gz | tar Cxz ./lower || ( true; echo "mknod errors were ignored" )
  user$ unshare --mount --map-root-user
  mappedroot# runc spec --rootless
  mappedroot# sed -i 's/"readonly": true/"readonly": false/g' config.json
  mappedroot# mount -t overlay -o lowerdir=./lower,upperdir=./upper,workdir=./work overlayfs ./rootfs
  mappedroot# runc run foo

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-05-10 12:16:43 +09:00
Aleksa Sarai
03e585985f rootless: cgroup: treat EROFS as a skippable error
In some cases, /sys/fs/cgroups is mounted read-only. In rootless
containers we can consider this effectively identical to having cgroups
that we don't have write permission to -- because the user isn't
responsible for the read-only setup and cannot modify it. The rules are
identical to when /sys/fs/cgroups is not writable by the unprivileged
user.

An example of this is the default configuration of Docker, where cgroups
are mounted as read-only as a preventative security measure.

Reported-by: Vladimir Rutsky <rutsky@google.com>
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2018-03-17 13:53:42 +11:00