Commit 770728e1 added Scheduler field into both Config and Process,
but forgot to add a mechanism to actually use Process.Scheduler.
As a result, runc exec does not set Process.Scheduler ever.
Fix it, and a test case (which fails before the fix).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit bfbd0305b added IOPriority field into both Config and Process,
but forgot to add a mechanism to actually use Process.IOPriority.
As a result, runc exec does not set Process.IOPriority ever.
Fix it, and a test case (which fails before the fix).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This is one of the dark corners of runc / libcontainer, so let's shed
some light on it.
initConfig is a structure which is filled in [mostly] by newInitConfig,
and one of its hidden aspects is it contains a process config which is
the result of merge between the container and the process configs.
Let's document how all this happens, where the fields are coming from,
which one has a preference, and how it all works.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This addresses the following TODO in the code (added back in 2015
by commit 845fc65e5):
> // TODO: fix libcontainer's API to better support uid/gid in a typesafe way.
Historically, libcontainer internally uses strings for user, group, and
additional (aka supplementary) groups.
Yet, runc receives those credentials as part of runtime-spec's process,
which uses integers for all of them (see [1], [2]).
What happens next is:
1. runc start/run/exec converts those credentials to strings (a User
string containing "UID:GID", and a []string for additional GIDs) and
passes those onto runc init.
2. runc init converts them back to int, in the most complicated way
possible (parsing container's /etc/passwd and /etc/group).
All this conversion and, especially, parsing is totally unnecessary,
but is performed on every container exec (and start).
The only benefit of all this is, a libcontainer user could use user and
group names instead of numeric IDs (but runc itself is not using this
feature, and we don't know if there are any other users of this).
Let's remove this back and forth translation, hopefully increasing
runc exec performance.
The only remaining need to parse /etc/passwd is to set HOME environment
variable for a specified UID, in case $HOME is not explicitly set in
process.Env. This can now be done right in prepareEnv, which simplifies
the code flow a lot. Alas, we can not use standard os/user.LookupId, as
it could cache host's /etc/passwd or the current user (even with the
osusergo tag).
PS Note that the structures being changed (initConfig and Process) are
never saved to disk as JSON by runc, so there is no compatibility issue
for runc users.
Still, this is a breaking change in libcontainer, but we never promised
that libcontainer API will be stable (and there's a special package
that can handle it -- github.com/moby/sys/user). Reflect this in
CHANGELOG.
For 3998.
[1]: https://github.com/opencontainers/runtime-spec/blob/v1.0.2/config.md#posix-platform-user
[2]: https://github.com/opencontainers/runtime-spec/blob/v1.0.2/specs-go/config.go#L86
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
There is a typo in the comment (ClonedBinary should be CloneBinary), and
the code has changed a bit since then, and it makes more sense to refer
to CloneSelfExe now.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The container manager like containerd-shim can't use cgroup.kill feature or
freeze all the processes in cgroup to terminate the exec init process.
It's unsafe to call kill(2) since the pid can be recycled. It's good to
provide the pidfd of init process through the pidfd-socket. It's similar to
the console-socket. With the pidfd, the container manager like containerd-shim
can send the signal to target process safely.
And for the standard init process, we can have polling support to get
exit event instead of blocking on wait4.
Signed-off-by: Wei Fu <fuweid89@gmail.com>
This allow us to remove the amount of C code in runc quite
substantially, as well as removing a whole execve(2) from the nsexec
path because we no longer spawn "runc init" only to re-exec "runc init"
after doing the clone.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
This field used to hold a string representation of log level (like
"debug" or "info"). Since commit 6c4a3b13d1 this is now a string
holding a numeric representation of log level (e.g. "4"). This was done
in preparation to commit f1b703fc45, and simplifies things in
a few places.
Let's document it.
Fixes: 6c4a3b13d1
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In some setups, multiple cgroups are used inside a container,
and sometime there is a need to execute a process in a particular
sub-cgroup (in case of cgroup v1, for a particular controller).
This is what this commit implements.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This removes libcontainer's own error wrapping system, consisting of a
few types and functions, aimed at typization, wrapping and unwrapping
of errors, as well as saving error stack traces.
Since Go 1.13 now provides its own error wrapping mechanism and a few
related functions, it makes sense to switch to it.
While doing that, improve some error messages so that they start
with "error", "unable to", or "can't".
A few things that are worth mentioning:
1. We lose stack traces (which were never shown anyway).
2. Users of libcontainer that relied on particular errors (like
ContainerNotExists) need to switch to using errors.Is with
the new errors defined in error.go.
3. encoding/json is unable to unmarshal the built-in error type,
so we have to introduce initError and wrap the errors into it
(basically passing the error as a string). This is the same
as it was before, just a tad simpler (actually the initError
is a type that got removed in commit afa844311; also suddenly
ierr variable name makes sense now).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Using fmt.Errorf for errors that do not have %-style formatting
directives is an overkill. Switch to errors.New.
Found by
git grep fmt.Errorf | grep -v ^vendor | grep -v '%'
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
There is a race in runc exec when the init process stops just before
the check for the container status. It is then wrongly assumed that
we are trying to start an init process instead of an exec process.
This commit add an Init field to libcontainer Process to distinguish
between init and exec processes to prevent this race.
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
This fixes all of the tests that were broken as part of the console
rewrite. This includes fixing the integration tests that used TTY
handling inside libcontainer, as well as the bats integration tests that
needed to be rewritten to use recvtty (as they rely on detached
containers that are running).
This patch is part of the console rewrite patchset.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This implements {createTTY, detach} and all of the combinations and
negations of the two that were previously implemented. There are some
valid questions about out-of-OCI-scope topics like !createTTY and how
things should be handled (why do we dup the current stdio to the
process, and how is that not a security issue). However, these will be
dealt with in a separate patchset.
In order to allow for late console setup, split setupRootfs into the
"preparation" section where all of the mounts are created and the
"finalize" section where we pivot_root and set things as ro. In between
the two we can set up all of the console mountpoints and symlinks we
need.
We use two-stage synchronisation to ensures that when the syscalls are
reordered in a suboptimal way, an out-of-place read() on the parentPipe
will not gobble the ancilliary information.
This patch is part of the console rewrite patchset.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
In user namespaces, we need to make sure we don't chown() the console to
unmapped users. This means we need to get both the UID and GID of the
root user in the container when changing the owner.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This updates runc and libcontainer to handle rlimits per process and set
them correctly for the container.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
This commit adds support to libcontainer to allow caps, no new privs,
apparmor, and selinux process label to the process struct so that it can
be used together of override the base settings on the container config
per individual process.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>