This removes libcontainer/cgroups packages and starts
using those from github.com/opencontainers/cgroups repo.
Mostly generated by:
git rm -f libcontainer/cgroups
find . -type f -name "*.go" -exec sed -i \
's|github.com/opencontainers/runc/libcontainer/cgroups|github.com/opencontainers/cgroups|g' \
{} +
go get github.com/opencontainers/cgroups@v0.0.1
make vendor
gofumpt -w .
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
`signalAllProcesses()` depends on the cgroup and is expected to fail
when runc is running in rootless without an access to the cgroup.
When `RootlessCgroups` is set to `true`, runc just ignores the error
from `signalAllProcesses` and may leak some processes running.
(See the comments in PR 4395)
In the future, runc should walk the process tree to avoid such a leak.
Note that `RootlessCgroups` is a misnomer; it is set to `false` despite
the name when cgroup v2 delegation is configured.
This is expected to be renamed in a separate commit.
Fix issue 4394
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
For some reason, container destroy operation removes container's state
directory even if cgroup removal fails (and then still returns an
error). It has been that way since commit 5c246d038f, which added
cgroup removal.
This is problematic because once the container state dir is removed, we
no longer know container's cgroup and thus can't remove it.
Let's return the error early and fail if cgroup can't be removed.
Same for other operations: do not proceed if we fail.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(For a container with no private PID namespace, that is).
When runc delete (or container.Destroy) is called on a stopped
container without private PID namespace and there are processes
in its cgroup, kill those.
Add a test case.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The semantics of runType is slightly complicated, and the only place
where we need to distinguish between Created and Running is
refreshState.
Replace runType with simpler hasInit, simplifying its users (except the
refreshState, which now figures out on its own whether the container is
Created or Running).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The name "root" (or "containerRoot") is confusing; one might think it is
the root of container's file system (the directory we chroot into).
Rename to stateDir for clarity.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
When a hook has failed, the error message looks like this:
> error running hook: error running hook #1: exit status 1, stdout: ...
The two problems here are:
1. it is impossible to know what kind of hook it was;
2. "error running hook" stuttering;
Change that to
> error running createContainer hook #1: exit status 1, stdout: ...
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
By default, the container has its own PID namespace, and killing (with
SIGKILL) its init process from the parent PID namespace also kills all
the other processes.
Obviously, it does not work that way when the container is sharing its
PID namespace with the host or another container, since init is no
longer special (it's not PID 1). In this case, killing container's init
will result in a bunch of other processes left running (and thus the
inability to remove the cgroup).
The solution to the above problem is killing all the container
processes, not just init.
The problem with the current implementation is, the killing logic is
implemented in libcontainer's initProcess.wait, and thus only available
to libcontainer users, but not the runc kill command (which uses
nonChildProcess.kill and does not use wait at all). So, some workarounds
exist:
- func destroy(c *Container) calls signalAllProcesses;
- runc kill implements -a flag.
This code became very tangled over time. Let's simplify things by moving
the killing all processes from initProcess.wait to container.Signal,
and documents the new behavior.
In essence, this also makes `runc kill` to automatically kill all container
processes when the container does not have its own PID namespace.
Document that as well.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The only implementation of these is linuxContainer. It does not make
sense to have an interface with a single implementation, and we do not
foresee other types of containers being added to runc.
Remove BaseContainer and Container interfaces, moving their methods
documentation to linuxContainer.
Rename linuxContainer to Container.
Adopt users from using interface to using struct.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
For files that end with _linux.go or _linux_test.go, there is no need to
specify linux build tag, as it is assumed from the file name.
In addition, rename libcontainer/notify_linux_v2.go -> libcontainer/notify_v2_linux.go
for the file name to make sense.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This removes libcontainer's own error wrapping system, consisting of a
few types and functions, aimed at typization, wrapping and unwrapping
of errors, as well as saving error stack traces.
Since Go 1.13 now provides its own error wrapping mechanism and a few
related functions, it makes sense to switch to it.
While doing that, improve some error messages so that they start
with "error", "unable to", or "can't".
A few things that are worth mentioning:
1. We lose stack traces (which were never shown anyway).
2. Users of libcontainer that relied on particular errors (like
ContainerNotExists) need to switch to using errors.Is with
the new errors defined in error.go.
3. encoding/json is unable to unmarshal the built-in error type,
so we have to introduce initError and wrap the errors into it
(basically passing the error as a string). This is the same
as it was before, just a tad simpler (actually the initError
is a type that got removed in commit afa844311; also suddenly
ierr variable name makes sense now).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Using fmt.Errorf for errors that do not have %-style formatting
directives is an overkill. Switch to errors.New.
Found by
git grep fmt.Errorf | grep -v ^vendor | grep -v '%'
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
It's expect that signalAllProcesses is invoked when container shares
pid namespace. share pid ns contains the following conditions:
{
// no specify pid ns
}
{
"type": "pid",
"path": "/proc/${num}/ns/pid"
}
Signed-off-by: Shukui Yang <jryangshukui@jd.com>
Signed-off-by: Shukui Yang <keloyangsk@gmail.com>
Finish off the work started in a344b2d6 (sync up `HookState` with OCI
spec `State`, 2016-12-19, #1201).
And drop HookState, since there's no need for a local alias for
specs.State.
Also set c.initProcess in newInitProcess to support OCIState calls
from within initProcess.start(). I think the cyclic references
between linuxContainer and initProcess are unfortunate, but didn't
want to address that here.
I've also left the timing of the Prestart hooks alone, although the
spec calls for them to happen before start (not as part of creation)
[1,2]. Once the timing gets fixed we can drop the
initProcessStartTime hacks which initProcess.start currently needs.
I'm not sure why we trigger the prestart hooks in response to both
procReady and procHooks. But we've had two prestart rounds in
initProcess.start since 2f276498 (Move pre-start hooks after container
mounts, 2016-02-17, #568). I've left that alone too.
I really think we should have len() guards to avoid computing the
state when .Hooks is non-nil but the particular phase we're looking at
is empty. Aleksa, however, is adamantly against them [3] citing a
risk of sloppy copy/pastes causing the hook slice being len-guarded to
diverge from the hook slice being iterated over within the guard. I
think that ort of thing is very lo-risk, because:
* We shouldn't be copy/pasting this, right? DRY for the win :).
* There's only ever a few lines between the guard and the guarded
loop. That makes broken copy/pastes easy to catch in review.
* We should have test coverage for these. Guarding with the wrong
slice is certainly not the only thing you can break with a sloppy
copy/paste.
But I'm not a maintainer ;).
[1]: https://github.com/opencontainers/runtime-spec/blob/v1.0.0/config.md#prestart
[2]: https://github.com/opencontainers/runc/issues/1710
[3]: https://github.com/opencontainers/runc/pull/1741#discussion_r233331570
Signed-off-by: W. Trevor King <wking@tremily.us>
Annotations weren't passed to hooks. This patch fixes that by passing
annotations to stdin for hooks.
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
About Intel RDT/CAT feature:
Intel platforms with new Xeon CPU support Intel Resource Director Technology
(RDT). Cache Allocation Technology (CAT) is a sub-feature of RDT, which
currently supports L3 cache resource allocation.
This feature provides a way for the software to restrict cache allocation to a
defined 'subset' of L3 cache which may be overlapping with other 'subsets'.
The different subsets are identified by class of service (CLOS) and each CLOS
has a capacity bitmask (CBM).
For more information about Intel RDT/CAT can be found in the section 17.17
of Intel Software Developer Manual.
About Intel RDT/CAT kernel interface:
In Linux 4.10 kernel or newer, the interface is defined and exposed via
"resource control" filesystem, which is a "cgroup-like" interface.
Comparing with cgroups, it has similar process management lifecycle and
interfaces in a container. But unlike cgroups' hierarchy, it has single level
filesystem layout.
Intel RDT "resource control" filesystem hierarchy:
mount -t resctrl resctrl /sys/fs/resctrl
tree /sys/fs/resctrl
/sys/fs/resctrl/
|-- info
| |-- L3
| |-- cbm_mask
| |-- min_cbm_bits
| |-- num_closids
|-- cpus
|-- schemata
|-- tasks
|-- <container_id>
|-- cpus
|-- schemata
|-- tasks
For runc, we can make use of `tasks` and `schemata` configuration for L3 cache
resource constraints.
The file `tasks` has a list of tasks that belongs to this group (e.g.,
<container_id>" group). Tasks can be added to a group by writing the task ID
to the "tasks" file (which will automatically remove them from the previous
group to which they belonged). New tasks created by fork(2) and clone(2) are
added to the same group as their parent. If a pid is not in any sub group, it
Is in root group.
The file `schemata` has allocation bitmasks/values for L3 cache on each socket,
which contains L3 cache id and capacity bitmask (CBM).
Format: "L3:<cache_id0>=<cbm0>;<cache_id1>=<cbm1>;..."
For example, on a two-socket machine, L3's schema line could be `L3:0=ff;1=c0`
which means L3 cache id 0's CBM is 0xff, and L3 cache id 1's CBM is 0xc0.
The valid L3 cache CBM is a *contiguous bits set* and number of bits that can
be set is less than the max bit. The max bits in the CBM is varied among
supported Intel Xeon platforms. In Intel RDT "resource control" filesystem
layout, the CBM in a group should be a subset of the CBM in root. Kernel will
check if it is valid when writing. e.g., 0xfffff in root indicates the max bits
of CBM is 20 bits, which mapping to entire L3 cache capacity. Some valid CBM
values to set in a group: 0xf, 0xf0, 0x3ff, 0x1f00 and etc.
For more information about Intel RDT/CAT kernel interface:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt
An example for runc:
Consider a two-socket machine with two L3 caches where the default CBM is
0xfffff and the max CBM length is 20 bits. With this configuration, tasks
inside the container only have access to the "upper" 80% of L3 cache id 0 and
the "lower" 50% L3 cache id 1:
"linux": {
"intelRdt": {
"l3CacheSchema": "L3:0=ffff0;1=3ff"
}
}
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Updated logrus to use v1 which includes a breaking name change Sirupsen -> sirupsen.
This includes a manual edit of the docker term package to also correct the name there too.
Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>
Since syscall is outdated and broken for some architectures,
use x/sys/unix instead.
There are still some dependencies on the syscall package that will
remain in syscall for the forseeable future:
Errno
Signal
SysProcAttr
Additionally:
- os still uses syscall, so it needs to be kept for anything
returning *os.ProcessState, such as process.Wait.
Signed-off-by: Christy Perez <christy@linux.vnet.ibm.com>
`HookState` struct should follow definition of `State` in runtime-spec:
* modify json name of `version` to `ociVersion`.
* Remove redundant `Rootfs` field as rootfs can be retrived from
`bundlePath/config.json`.
Signed-off-by: Zhang Wei <zhangwei555@huawei.com>
This allows a user to send a signal to all the processes in the
container within a single atomic action to avoid new processes being
forked off before the signal can be sent.
This is basically taking functionality that we already use being
`delete` and exposing it ok the `kill` command by adding a flag.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
By adding detach to runc the container process is the only thing running
on the system is the containers process.
This allows better usage of memeory and no runc process being long
lived. With this addition you also need a delete command because the
detached container will not be able to remove state and the left over
cgroups directories.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
There were issues where a process could die before pausing completed
leaving the container in an inconsistent state and unable to be
destoryed. This makes sure that if the container is paused and the
process is dead it will unfreeze the cgroup before removing them.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Add state status() method
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Allow multiple checkpoint on restore
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Handle leave-running state
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Fix state transitions for inprocess
Because the tests use libcontainer in process between the various states
we need to ensure that that usecase works as well as the out of process
one.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Remove isDestroyed method
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Handling Pausing from freezer state
Signed-off-by: Rajasekaran <rajasec79@gmail.com>
freezer status
Signed-off-by: Rajasekaran <rajasec79@gmail.com>
Fixing review comments
Signed-off-by: Rajasekaran <rajasec79@gmail.com>
Added comment when freezer not available
Signed-off-by: Rajasekaran <rajasec79@gmail.com>
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Conflicts:
libcontainer/container_linux.go
Change checkFreezer logic to isPaused()
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Remove state base and factor out destroy func
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Add unit test for state transitions
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>