Commit Graph

109 Commits

Author SHA1 Message Date
Kir Kolyshkin
dbd990d555 libct: rm intelrtd.Manager interface, NewIntelRdtManager
Remove intelrtd.Manager interface, since we only have a single
implementation, and do not expect another one.

Rename intelRdtManager to Manager, and modify its users accordingly.

Remove NewIntelRdtManager from factory.

Remove IntelRdtfs. Instead, make intelrdt.NewManager return nil if the
feature is not available.

Remove TestFactoryNewIntelRdt as it is now identical to TestFactoryNew.

Add internal function newManager to be used for tests (to make sure
some testing is done even when the feature is not available in
kernel/hardware).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-02-03 17:33:03 -08:00
Kir Kolyshkin
39bd7b7217 libct: Container, Factory: rm newuidmap/newgidmap
These were introduced in commit d8b669400 back in 2017, with a TODO
of "make binary names configurable". Apparently, everyone is happy with
the hardcoded names. In fact, they *are* configurable (by prepending the
PATH with a directory containing own version of newuidmap/newgidmap).

Now, these binaries are only needed in a few specific cases (when
rootless is set etc.), so let's look them up only when needed.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-02-03 11:40:29 -08:00
Kir Kolyshkin
0d21515038 libct: remove Validator interface
We only have one implementation of config validator, which is always
used. It makes no sense to have Validator interface.

Having validate.Validator field in Factory does not make sense for all
the same reasons.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-02-03 11:40:29 -08:00
Kir Kolyshkin
630c0d7e8c libct: Container, Factory: rm InitPath, InitArgs
Those are *always* /proc/self/exe init, and it does not make sense
to ever change these. More to say, if InitArgs option func (removed
by this commit) is used to change these parameters, it will break
things, since "init" is hardcoded elsewhere.

Remove this.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-02-03 11:40:29 -08:00
Akihiro Suda
e9190d3ae1 Merge pull request #3353 from kolyshkin/rm-criu-opt
runc: remove --criu option
2022-02-01 08:28:14 +09:00
Kir Kolyshkin
6e1d476aad runc: remove --criu option
This was introduced in an initial commit, back in the day when criu was
a highly experimental thing. Today it's not; most users who need it have
it packaged by their distro vendor.

The usual way to run a binary is to look it up in directories listed in
$PATH. This is flexible enough and allows for multiple scenarios (custom
binaries, extra binaries, etc.). This is the way criu should be run.

Make --criu a hidden option (thus removing it from help). Remove the
option from man pages, integration tests, etc. Remove all traces of
CriuPath from data structures.

Add a warning that --criu is ignored and will be removed.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-01-26 20:25:56 -08:00
Kir Kolyshkin
485e6c84e7 Fix some revive warnings
This is needed since the future commits will touch this code, and then
the lint-extra CI job complains.

> libcontainer/factory.go#L245
> var-naming: var fdsJson should be fdsJSON (revive)

and

> libcontainer/init_linux.go#L181
> error-strings: error strings should not be capitalized or end with punctuation or a newline (revive)

and

> notify_socket.go#L94
> receiver-naming: receiver name n should be consistent with previous receiver name s for notifySocket (revive)

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-01-26 19:14:14 -08:00
Mrunal Patel
403cda19e4 Merge pull request #3326 from kolyshkin/go118beta
ci: add go 1.18beta1
2022-01-25 16:01:25 -08:00
Kir Kolyshkin
907aefd43c libct: StartInitialization: fix %w related warning
(on Go 1.18 this is actually an error)

> libcontainer/factory_linux.go:341:10: fmt.Errorf format %w has arg e of wrong type interface{}

Unfortunately, fixing it results in an errorlint warning:

> libcontainer/factory_linux.go#L344 non-wrapping format verb for fmt.Errorf. Use `%w` to format errors (errorlint)

so we have to silence that one.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-12-15 17:20:09 -08:00
Kir Kolyshkin
024adbb1b9 libct: Create: rm unneeded chown
Commit 7cfb107f2c started using unix.Geteuid(), unix.Getegid() as
uid/gid argument for chown. It seems that it should have removed chown
entirely, since, according to mkdir(2),

> The newly created directory will be owned by the effective user ID of
> the process. If the directory containing the file has the
> set-group-ID bit set, or if the filesystem is mounted with BSD
> group semantics (mount -o bsdgroups or, synonymously mount -o grpid),
> the new directory will inherit the group ownership from its parent;
> otherwise it will be > owned by the effective group ID of the process.

So, the only effect of the chown after mkdir is ignoring the sgid bit
on the parent directory (which is probably not the right thing to do).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-12-10 18:49:25 -08:00
Kir Kolyshkin
b357bc1349 libct/factory: rm id param from loadState
It is not used since commit e918d0213.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-11-29 20:10:22 -08:00
Kir Kolyshkin
68c2b6a7d9 runc run: refuse a frozen cgroup
Sometimes a container cgroup already exists but is frozen.
When this happens, runc init hangs, and it's not clear what is going on.

Refuse to run in a frozen cgroup; add a test case.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-11-03 20:36:27 -07:00
Kir Kolyshkin
d08bc0c1b3 runc run: warn on non-empty cgroup
Currently runc allows multiple containers to share the same cgroup (for
example, by having the same cgroupPath in config.json). While such
shared configuration might be OK, there are some issues:

 - When each container has its own resource limits, the order of
   containers start determines whose limits will be effectively applied.

 - When one of containers is paused, all others are paused, too.

 - When a container is paused, any attempt to do runc create/run/exec
   end up with runc init stuck inside a frozen cgroup.

 - When a systemd cgroup manager is used, this becomes even worse -- such
   as, stop (or even failed start) of any container results in
   "stopTransientUnit" command being sent to systemd, and so (depending on
   unit properties) other containers can receive SIGTERM, be killed after a
   timeout etc.

Any of the above may lead to various hard-to-debug situations in production
(runc init stuck, cgroup removal error, wrong resource limits, init not
reaping zombies etc.).

One obvious solution is to refuse a non-empty cgroup when starting a new
container. This would be a breaking change though, so let's make it in
steps, with the first step is issue a warning and a deprecated notice
about a non-empty cgroup.

Later (in runc 1.2) we will replace this warning with an error.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-11-03 20:36:27 -07:00
Alban Crequy
9c444070ec Open bind mount sources from the host userns
The source of the bind mount might not be accessible in a different user
namespace because a component of the source path might not be traversed
under the users and groups mapped inside the user namespace. This caused
errors such as the following:

  # time="2020-06-22T13:48:26Z" level=error msg="container_linux.go:367:
  starting container process caused: process_linux.go:459:
  container init caused: rootfs_linux.go:58:
  mounting \"/tmp/busyboxtest/source-inaccessible/dir\"
  to rootfs at \"/tmp/inaccessible\" caused:
  stat /tmp/busyboxtest/source-inaccessible/dir: permission denied"

To solve this problem, this patch performs the following:

1. in nsexec.c, it opens the source path in the host userns (so we have
   the right permissions to open it) but in the container mntns (so the
   kernel cross mntns mount check let us mount it later:
   https://github.com/torvalds/linux/blob/v5.8/fs/namespace.c#L2312).

2. in nsexec.c, it passes the file descriptors of the source to the
   child process with SCM_RIGHTS.

3. In runc-init in Golang, it finishes the mounts while inside the
   userns even without access to the some components of the source
   paths.

Passing the fds with SCM_RIGHTS is necessary because once the child
process is in the container mntns, it is already in the container userns
so it cannot temporarily join the host mntns.

This patch uses the existing mechanism with _LIBCONTAINER_* environment
variables to pass the file descriptors from runc to runc init.

This patch uses the existing mechanism with the Netlink-style bootstrap
to pass information about the list of source mounts to nsexec.c.

Rootless containers don't use this bind mount sources fdpassing
mechanism because we can't setns() to the target mntns in a rootless
container (we don't have the privileges when we are in the host userns).

This patch takes care of using O_CLOEXEC on mount fds, and close them
early.

Fixes: #2484.

Signed-off-by: Alban Crequy <alban@kinvolk.io>
Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>
2021-10-12 15:13:45 +02:00
Kir Kolyshkin
097c6d7425 libct/cg: simplify getting cgroup manager
1. Make Rootless and Systemd flags part of config.Cgroups.

2. Make all cgroup managers (not just fs2) return error (so it can do
   more initialization -- added by the following commits).

3. Replace complicated cgroup manager instantiation in factory_linux
   by a single (and simple) libcontainer/cgroups/manager.New() function.

4. getUnifiedPath is simplified to check that only a single path is
   supplied (rather than checking that other paths, if supplied,
   are the same).

[v2: can't -> cannot]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-09-23 09:11:44 -07:00
Alban Crequy
2b025c0173 Implement Seccomp Notify
This commit implements support for the SCMP_ACT_NOTIFY action. It
requires libseccomp-2.5.0 to work but runc still works with older
libseccomp if the seccomp policy does not use the SCMP_ACT_NOTIFY
action.

A new synchronization step between runc[INIT] and runc run is introduced
to pass the seccomp fd. runc run fetches the seccomp fd with pidfd_get
from the runc[INIT] process and sends it to the seccomp agent using
SCM_RIGHTS.

As suggested by @kolyshkin, we also make writeSync() a wrapper of
writeSyncWithFd() and wrap the error there. To avoid pointless errors,
we made some existing code paths just return the error instead of
re-wrapping it. If we don't do it, error will look like:

	writing syncT <act>: writing syncT: <err>

By adjusting the code path, now they just look like this
	writing syncT <act>: <err>

Signed-off-by: Alban Crequy <alban@kinvolk.io>
Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>
2021-09-07 13:04:24 +02:00
xiadanni
64358c4de9 optimize log: move WriteJSON defer as early as possible
if function returns error before WriteJSON defer, error will not be
printed out, so move this defer as early as possible and use logrus to
print out error if returns before it.

Signed-off-by: xiadanni <xiadanni1@huawei.com>
2021-09-07 07:00:57 +08:00
Akihiro Suda
bd75bc2dc6 Merge pull request #3176 from kolyshkin/rm-config-error-alt
libct/error.go: rm ConfigError (alt)
2021-09-02 14:34:32 +09:00
Kir Kolyshkin
9ff64c3d97 *: rm redundant linux build tag
For files that end with _linux.go or _linux_test.go, there is no need to
specify linux build tag, as it is assumed from the file name.

In addition, rename libcontainer/notify_linux_v2.go -> libcontainer/notify_v2_linux.go
for the file name to make sense.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-08-30 20:15:00 -07:00
Kir Kolyshkin
538ba846dd libct/error.go: rm ConfigError
ConfigError was added by commit e918d02139, while removing runc own
error system, to preserve a way for a libcontainer user to distinguish
between a configuration error and something else.

The way ConfigError is implemented requires a different type of check
(compared to all other errors defined by error.go). An attempt was made
to rectify this, but the resulting code became even more complicated.

As no one is using this functionality (of differentiating a "bad config"
type of error from other errors), let's just drop the ConfigError type.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-08-23 18:56:08 -07:00
Kir Kolyshkin
e918d02139 libcontainer: rm own error system
This removes libcontainer's own error wrapping system, consisting of a
few types and functions, aimed at typization, wrapping and unwrapping
of errors, as well as saving error stack traces.

Since Go 1.13 now provides its own error wrapping mechanism and a few
related functions, it makes sense to switch to it.

While doing that, improve some error messages so that they start
with "error", "unable to", or "can't".

A few things that are worth mentioning:

1. We lose stack traces (which were never shown anyway).

2. Users of libcontainer that relied on particular errors (like
   ContainerNotExists) need to switch to using errors.Is with
   the new errors defined in error.go.

3. encoding/json is unable to unmarshal the built-in error type,
   so we have to introduce initError and wrap the errors into it
   (basically passing the error as a string). This is the same
   as it was before, just a tad simpler (actually the initError
   is a type that got removed in commit afa844311; also suddenly
   ierr variable name makes sense now).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-24 10:21:04 -07:00
Kir Kolyshkin
a7cfb23b88 *: stop using pkg/errors
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-22 16:09:47 -07:00
Kir Kolyshkin
7be93a66b9 *: fmt.Errorf: use %w when appropriate
This should result in no change when the error is printed, but make the
errors returned unwrappable, meaning errors.As and errors.Is will work.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-22 16:09:47 -07:00
Kir Kolyshkin
36aefad45d libct: wrap unix.Mount/Unmount errors
Errors returned by unix are bare. In some cases it's impossible to find
out what went wrong because there's is not enough context.

Add a mountError type (mostly copy-pasted from github.com/moby/sys/mount),
and mount/unmount helpers. Use these where appropriate, and convert error
checks to use errors.Is.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-22 16:09:37 -07:00
Kir Kolyshkin
563225d55f libct/StartInitialization: fix errors
Errors from strconv.Atoi are already descriptive enough, and contain the
value being converted, so our error messages do not need to contain it.

While at it, use %w to wrap errors.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-22 11:43:44 -07:00
Kir Kolyshkin
627a06ad92 Replace fmt.Errorf w/o %-style to errors.New
Using fmt.Errorf for errors that do not have %-style formatting
directives is an overkill. Switch to errors.New.

Found by

	git grep fmt.Errorf | grep -v ^vendor | grep -v '%'

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-22 11:42:07 -07:00
Kir Kolyshkin
e6048715e4 Use gofumpt to format code
gofumpt (mvdan.cc/gofumpt) is a fork of gofmt with stricter rules.

Brought to you by

	git ls-files \*.go | grep -v ^vendor/ | xargs gofumpt -s -w

Looking at the diff, all these changes make sense.

Also, replace gofmt with gofumpt in golangci.yml.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-01 12:17:27 -07:00
Sebastiaan van Stijn
b45fbd43b8 errcheck: libcontainer
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-05-20 14:19:26 +02:00
Kir Kolyshkin
201d60c51d runc run/start/exec: fix init log forwarding race
Sometimes debug.bats test cases are failing like this:

> not ok 27 global --debug to --log --log-format 'json'
> # (in test file tests/integration/debug.bats, line 77)
> #   `[[ "${output}" == *"child process in init()"* ]]' failed

It happens more when writing to disk.

This issue is caused by the fact that runc spawns log forwarding goroutine
(ForwardLogs) but does not wait for it to finish, resulting in missing
debug lines from nsexec.

ForwardLogs itself, though, never finishes, because it reads from a
reading side of a pipe which writing side is not closed. This is
especially true in case of runc create, which spawns runc init and
exits; meanwhile runc init waits on exec fifo for arbitrarily long
time before doing execve.

So, to fix the failure described above, we need to:

 1. Make runc create/run/exec wait for ForwardLogs to finish;

 2. Make runc init close its log pipe file descriptor (i.e.
    the one which value is passed in _LIBCONTAINER_LOGPIPE
    environment variable).

This is exactly what this commit does:

 1. Amend ForwardLogs to return a channel, and wait for it in start().

 2. In runc init, save the log fd and close it as late as possible.

PS I have to admit I still do not understand why an explicit close of
log pipe fd is required in e.g. (*linuxSetnsInit).Init, right before
the execve which (thanks to CLOEXEC) closes the fd anyway.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-03-25 19:18:55 -07:00
Kenta Tada
8a3484b736 libcontainer/factory*: adjust the file mode
This commit adjusts the file mode to use the latest golang style
Related to #2625

Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>
2021-03-02 16:26:36 +09:00
Iceber Gu
916654fffa libcontainer: fix LinuxFactory comments
Signed-off-by: Iceber Gu <wei.cai-nat@daocloud.io>
2021-02-24 15:51:06 +08:00
Xiaochen Shen
325a74ddec libcontainer/intelrdt: rm init() from intelrdt.go
Use sync.Once to init Intel RDT when needed for a small speedup to
operations which do not require Intel RDT.

Simplify IntelRdtManager initialization in LinuxFactory.

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2020-12-16 23:37:31 +08:00
Xiaochen Shen
f62ad4a0de libcontainer/intelrdt: rename CAT and MBA enabled flags
Rename CAT and MBA enabled flags to be consistent with others.
No functional change.

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2020-11-10 15:32:01 +08:00
Xiaochen Shen
2c004a101e libcontainer/intelrdt: introduce NewManager()
Introduce NewManager() to wrap up IntelRdtManager initialization. And
call it when required.

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2020-10-17 16:59:20 +08:00
Jeff Zvier
1e9f80598a cleancode: adjust and make it more readability
Signed-off-by: Jeff Zvier <zvier20@gmail.com>
2020-08-05 23:03:42 +08:00
Kir Kolyshkin
d827e323b0 libct/cgroups/systemd/v1: add NewLegacyManager
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 10:07:40 -07:00
Kir Kolyshkin
5935bf8c21 libct/cgroups/fs: introduce NewManager()
...and use it from libcontainer/factory_linux.go.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 10:06:05 -07:00
Akihiro Suda
bf15cc99b1 cgroup v2: support rootless systemd
Tested with both Podman (master) and Moby (master), on Ubuntu 19.10 .

$ podman --cgroup-manager=systemd run -it --rm --runtime=runc \
  --cgroupns=host --memory 42m --cpus 0.42 --pids-limit 42 alpine
/ # cat /proc/self/cgroup
0::/user.slice/user-1001.slice/user@1001.service/user.slice/libpod-132ff0d72245e6f13a3bbc6cdc5376886897b60ac59eaa8dea1df7ab959cbf1c.scope
/ # cat /sys/fs/cgroup/user.slice/user-1001.slice/user@1001.service/user.slice/libpod-132ff0d72245e6f13a3bbc6cdc5376886897b60ac59eaa8dea1df7ab959cbf1c.scope/memory.max
44040192
/ # cat /sys/fs/cgroup/user.slice/user-1001.slice/user@1001.service/user.slice/libpod-132ff0d72245e6f13a3bbc6cdc5376886897b60ac59eaa8dea1df7ab959cbf1c.scope/cpu.max
42000 100000
/ # cat /sys/fs/cgroup/user.slice/user-1001.slice/user@1001.service/user.slice/libpod-132ff0d72245e6f13a3bbc6cdc5376886897b60ac59eaa8dea1df7ab959cbf1c.scope/pids.max
42

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-05-08 12:39:20 +09:00
Kir Kolyshkin
60eaed2ed6 cgroupv2: move sanity path check to common code
The fs2 cgroup driver has a sanity check for path.
Since systemd driver is relying on the same path,
it makes sense to move this check to the common code.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:27:40 -07:00
Kir Kolyshkin
9c80cd672d cgroupv2: rm legacy Paths from systemd driver
Having map of per-subsystem paths in systemd unified cgroups
driver does not make sense and makes the code less readable.

To get rid of it, move the systemd v1-or-v2 init code to
libcontainer/factory_linux.go which already has a function
to deduce unified path out of paths map.

End result is much cleaner code. Besides, we no longer write pid
to the same cgroup file 7 times in Apply() like we did before.

While at it
 - add `rootless` flag which is passed on to fs2 manager
 - merge getv2Path() into GetUnifiedPath(), don't overwrite
   path if it is set during initialization (on Load).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:19:51 -07:00
Kir Kolyshkin
c7ab2c036b libcontainer: switch to moby/sys/mountinfo package
Delete libcontainer/mount in favor of github.com/moby/sys/mountinfo,
which is fast mountinfo parser.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-21 10:33:43 -07:00
Akihiro Suda
88e8350de2 cgroup2: split fs2 from fs
split fs2 package from fs, as mixing up fs and fs2 is very likely to result in
unmaintainable code.

Inspired by containerd/cgroups#109

Fix #2157

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-12-06 15:42:10 +09:00
Filipe Brandenburger
46351eb3d1 Move systemd.Manager initialization into a function in that module
This will permit us to extend the internals of systemd.Manager to include
further information about the system, such as whether cgroupv1, cgroupv2 or
both are in effect.

Furthermore, it allows a future refactor of moving more of UseSystemd() code
into the factory initialization function.

Signed-off-by: Filipe Brandenburger <filbranden@gmail.com>
2019-05-01 13:22:19 -07:00
Xiaochen Shen
27560ace2f libcontainer: intelrdt: add support for Intel RDT/MBA in runc
Memory Bandwidth Allocation (MBA) is a resource allocation sub-feature
of Intel Resource Director Technology (RDT) which is supported on some
Intel Xeon platforms. Intel RDT/MBA provides indirect and approximate
throttle over memory bandwidth for the software. A user controls the
resource by indicating the percentage of maximum memory bandwidth.

Hardware details of Intel RDT/MBA can be found in section 17.18 of
Intel Software Developer Manual:
https://software.intel.com/en-us/articles/intel-sdm

In Linux 4.12 kernel and newer, Intel RDT/MBA is enabled by kernel
config CONFIG_INTEL_RDT. If hardware support, CPU flags `rdt_a` and
`mba` will be set in /proc/cpuinfo.

Intel RDT "resource control" filesystem hierarchy:
mount -t resctrl resctrl /sys/fs/resctrl
tree /sys/fs/resctrl
/sys/fs/resctrl/
|-- info
|   |-- L3
|   |   |-- cbm_mask
|   |   |-- min_cbm_bits
|   |   |-- num_closids
|   |-- MB
|       |-- bandwidth_gran
|       |-- delay_linear
|       |-- min_bandwidth
|       |-- num_closids
|-- ...
|-- schemata
|-- tasks
|-- <container_id>
    |-- ...
    |-- schemata
    |-- tasks

For MBA support for `runc`, we will reuse the infrastructure and code
base of Intel RDT/CAT which implemented in #1279. We could also make
use of `tasks` and `schemata` configuration for memory bandwidth
resource constraints.

The file `tasks` has a list of tasks that belongs to this group (e.g.,
<container_id>" group). Tasks can be added to a group by writing the
task ID to the "tasks" file (which will automatically remove them from
the previous group to which they belonged). New tasks created by
fork(2) and clone(2) are added to the same group as their parent.

The file `schemata` has a list of all the resources available to this
group. Each resource (L3 cache, memory bandwidth) has its own line and
format.

Memory bandwidth schema:
It has allocation values for memory bandwidth on each socket, which
contains L3 cache id and memory bandwidth percentage.
    Format: "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..."

The minimum bandwidth percentage value for each CPU model is predefined
and can be looked up through "info/MB/min_bandwidth". The bandwidth
granularity that is allocated is also dependent on the CPU model and
can be looked up at "info/MB/bandwidth_gran". The available bandwidth
control steps are: min_bw + N * bw_gran. Intermediate values are
rounded to the next control step available on the hardware.

For more information about Intel RDT kernel interface:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt

An example for runc:
Consider a two-socket machine with two L3 caches where the minimum
memory bandwidth of 10% with a memory bandwidth granularity of 10%.
Tasks inside the container may use a maximum memory bandwidth of 20%
on socket 0 and 70% on socket 1.

"linux": {
    "intelRdt": {
        "memBwSchema": "MB:0=20;1=70"
    }
}

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2018-10-16 14:29:29 +08:00
Akihiro Suda
06f789cf26 Disable rootless mode except RootlessCgMgr when executed as the root in userns
This PR decomposes `libcontainer/configs.Config.Rootless bool` into `RootlessEUID bool` and
`RootlessCgroups bool`, so as to make "runc-in-userns" to be more compatible with "rootful" runc.

`RootlessEUID` denotes that runc is being executed as a non-root user (euid != 0) in
the current user namespace. `RootlessEUID` is almost identical to the former `Rootless`
except cgroups stuff.

`RootlessCgroups` denotes that runc is unlikely to have the full access to cgroups.
`RootlessCgroups` is set to false if runc is executed as the root (euid == 0) in the initial namespace.
Otherwise `RootlessCgroups` is set to true.
(Hint: if `RootlessEUID` is true, `RootlessCgroups` becomes true as well)

When runc is executed as the root (euid == 0) in an user namespace (e.g. by Docker-in-LXD, Podman, Usernetes),
`RootlessEUID` is set to false but `RootlessCgroups` is set to true.
So, "runc-in-userns" behaves almost same as "rootful" runc except that cgroups errors are ignored.

This PR does not have any impact on CLI flags and `state.json`.

Note about CLI:
* Now `runc --rootless=(auto|true|false)` CLI flag is only used for setting `RootlessCgroups`.
* Now `runc spec --rootless` is only required when `RootlessEUID` is set to true.
  For runc-in-userns, `runc spec`  without `--rootless` should work, when sufficient numbers of
  UID/GID are mapped.

Note about `$XDG_RUNTIME_DIR` (e.g. `/run/user/1000`):
* `$XDG_RUNTIME_DIR` is ignored if runc is being executed as the root (euid == 0) in the initial namespace, for backward compatibility.
  (`/run/runc` is used)
* If runc is executed as the root (euid == 0) in an user namespace, `$XDG_RUNTIME_DIR` is honored if `$USER != "" && $USER != "root"`.
  This allows unprivileged users to allow execute runc as the root in userns, without mounting writable `/run/runc`.

Note about `state.json`:
* `rootless` is set to true when `RootlessEUID == true && RootlessCgroups == true`.

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-09-07 15:05:03 +09:00
Lifubang
4eb30fcdbe code optimization: use securejoin.SecureJoin and CleanPath
Signed-off-by: Lifubang <lifubang@acmcoder.com>
2018-09-04 09:02:18 +08:00
Lifubang
4fae8fcce2 code optimization after review
Signed-off-by: Lifubang <lifubang@acmcoder.com>
2018-09-03 23:27:31 +08:00
Lifubang
d2d226e8f9 fix unexpected delete bug when container id is ..
Signed-off-by: Lifubang <lifubang@acmcoder.com>
2018-08-31 11:17:42 +08:00
Aleksa Sarai
03e585985f rootless: cgroup: treat EROFS as a skippable error
In some cases, /sys/fs/cgroups is mounted read-only. In rootless
containers we can consider this effectively identical to having cgroups
that we don't have write permission to -- because the user isn't
responsible for the read-only setup and cannot modify it. The rules are
identical to when /sys/fs/cgroups is not writable by the unprivileged
user.

An example of this is the default configuration of Docker, where cgroups
are mounted as read-only as a preventative security measure.

Reported-by: Vladimir Rutsky <rutsky@google.com>
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2018-03-17 13:53:42 +11:00
Vincent Demeester
3ca4c78b1a Import docker/docker/pkg/mount into runc
This will help get rid of docker/docker dependency in runc 👼

Signed-off-by: Vincent Demeester <vincent@sbr.pm>
2017-11-08 16:25:58 +01:00