While it does not make much sense practically, as runc is going to exit
soon and all fds will be closed anyway, various linters (including
SVACE) keep reporting this.
Let's make them happy.
Reported-by: Tigran Sogomonian <tsogomonian@astralinux.ru>
Reported-by: Mikhail Dmitrichenko <m.dmitrichenko222@gmail.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This issue was originally reported in podman PR 25792.
When calling runc pause/unpause for an ordinary user, podman do not
provide --systemd-cgroups option, and shouldUseRootlessCgroupManager
returns true. This results in a warning:
$ podman pause sleeper
WARN[0000] runc pause may fail if you don't have the full access to cgroups
sleeper
Actually, it does not make sense to call shouldUseRootlessCgroupManager
at this point, because we already know if we're rootless or not, from
the container state.json (same for systemd).
Also, busctl binary is not available either in this context, so
shouldUseRootlessCgroupManager would not work properly.
Finally, it doesn't really matter if we use systemd or not, because we
use fs/fs2 manager to freeze/unfreeze, and it will return something like
EPERM (or tell that cgroups is not configured, for a true rootless
container).
So, let's only print the warning after pause/unpause failed,
if the error returned looks like a permission error.
Same applies to "runc ps".
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
As per
- https://github.com/opencontainers/runtime-spec/pull/1253
- https://github.com/opencontainers/runtime-spec/pull/1261
CPU affinity can be set in two ways:
1. When creating/starting a container, in config.json's
Process.ExecCPUAffinity, which is when applied to all execs.
2. When running an exec, in process.json's CPUAffinity, which
applied to a given exec and overrides the value from (1).
Add some basic tests.
Note that older kernels (RHEL8, Ubuntu 20.04) change CPU affinity of a
process to that of a container's cgroup, as soon as it is moved to that
cgroup, while newer kernels (Ubuntu 24.04, Fedora 41) don't do that.
Because of the above,
- it's impossible to really test initial CPU affinity without adding
debug logging to libcontainer/nsenter;
- for older kernels, there can be a brief moment when exec's affinity
is different than either initial or final affinity being set;
- exec's final CPU affinity, if not specified, can be different
depending on the kernel, therefore we don't test it.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
There is a big loop(at least 65 times) in `signal.Notify`, it costs as much
time as `runc init`, so we can call it in parallel ro reduce the container
start time. In a general test, it can be reduced about 38.70%.
Signed-off-by: lifubang <lifubang@acmcoder.com>
(cyphar: move signal channel definition inside goroutine)
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
1. Pass an argument as a pointer rather than copying the whole structure.
It was a pointer initially, but this has changed in commit b2d9d996
without giving a reason why.
2. The newProcess description was added by commit 9fac18329 (yes, the
very first one) and hasn't changed since. As of commit 29b139f7,
the part of it which says "and stdio from the current process"
is no longer valid.
Remove it, and while at it, rewrite the description entirely.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The rootuid and rootgid are only needed when detach and createTTY are
both false. We also call c.Config() twice, every time creating a copy
of struct Config.
Solve both issues by passing container pointer to setupIO, and get
rootuid/rootgid only when we need those.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This addresses the following TODO in the code (added back in 2015
by commit 845fc65e5):
> // TODO: fix libcontainer's API to better support uid/gid in a typesafe way.
Historically, libcontainer internally uses strings for user, group, and
additional (aka supplementary) groups.
Yet, runc receives those credentials as part of runtime-spec's process,
which uses integers for all of them (see [1], [2]).
What happens next is:
1. runc start/run/exec converts those credentials to strings (a User
string containing "UID:GID", and a []string for additional GIDs) and
passes those onto runc init.
2. runc init converts them back to int, in the most complicated way
possible (parsing container's /etc/passwd and /etc/group).
All this conversion and, especially, parsing is totally unnecessary,
but is performed on every container exec (and start).
The only benefit of all this is, a libcontainer user could use user and
group names instead of numeric IDs (but runc itself is not using this
feature, and we don't know if there are any other users of this).
Let's remove this back and forth translation, hopefully increasing
runc exec performance.
The only remaining need to parse /etc/passwd is to set HOME environment
variable for a specified UID, in case $HOME is not explicitly set in
process.Env. This can now be done right in prepareEnv, which simplifies
the code flow a lot. Alas, we can not use standard os/user.LookupId, as
it could cache host's /etc/passwd or the current user (even with the
osusergo tag).
PS Note that the structures being changed (initConfig and Process) are
never saved to disk as JSON by runc, so there is no compatibility issue
for runc users.
Still, this is a breaking change in libcontainer, but we never promised
that libcontainer API will be stable (and there's a special package
that can handle it -- github.com/moby/sys/user). Reflect this in
CHANGELOG.
For 3998.
[1]: https://github.com/opencontainers/runtime-spec/blob/v1.0.2/config.md#posix-platform-user
[2]: https://github.com/opencontainers/runtime-spec/blob/v1.0.2/specs-go/config.go#L86
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This code is not in libcontainer, meaning it is only used by a short lived
binary (runc start/run/exec). Unlike code in libcontainer (see
CreateLibcontainerConfig), here we don't have to care about copying the
structures supplied as input, meaning we can just reuse the pointers
directly.
Fixes: bfbd0305 ("Add I/O priority")
Fixes: 770728e1 ("Support `process.scheduler`")
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
With the idmap work, we will have a tainted Go thread in our
thread-group that has a different mount namespace to the other threads.
It seems that (due to some bad luck) the Go scheduler tends to make this
thread the thread-group leader in our tests, which results in very
baffling failures where /proc/self/mountinfo produces gibberish results.
In order to avoid this, switch to using /proc/thread-self for everything
that is thread-local. This primarily includes switching all file
descriptor paths (CLONE_FS), all of the places that check the current
cgroup (technically we never will run a single runc thread in a separate
cgroup, but better to be safe than sorry), and the aforementioned
mountinfo code. We don't need to do anything for the following because
the results we need aren't thread-local:
* Checks that certain namespaces are supported by stat(2)ing
/proc/self/ns/...
* /proc/self/exe and /proc/self/cmdline are not thread-local.
* While threads can be in different cgroups, we do not do this for the
runc binary (or libcontainer) and thus we do not need to switch to
the thread-local version of /proc/self/cgroups.
* All of the CLONE_NEWUSER files are not thread-local because you
cannot set the usernamespace of a single thread (setns(CLONE_NEWUSER)
is blocked for multi-threaded programs).
Note that we have to use runtime.LockOSThread when we have an open
handle to a tid-specific procfs file that we are operating on multiple
times. Go can reschedule us such that we are running on a different
thread and then kill the original thread (causing -ENOENT or similarly
confusing errors). This is not strictly necessary for most usages of
/proc/thread-self (such as using /proc/thread-self/fd/$n directly) since
only operating on the actual inodes associated with the tid requires
this locking, but because of the pre-3.17 fallback for CentOS, we have
to do this in most cases.
In addition, CentOS's kernel is too old for /proc/thread-self, which
requires us to emulate it -- however in rootfs_linux.go, we are in the
container pid namespace but /proc is the host's procfs. This leads to
the incredibly frustrating situation where there is no way (on pre-4.1
Linux) to figure out which /proc/self/task/... entry refers to the
current tid. We can just use /proc/self in this case.
Yes this is all pretty ugly. I also wish it wasn't necessary.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
If container.Destroy() has failed, runc destroy still return 0, which is
wrong and can result in other issues down the line.
Let's always return error from destroy in runc delete.
For runc checkpoint and runc run, we still treat it as a warning.
Co-authored-by: Zhang Tianyang <burning9699@gmail.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The container manager like containerd-shim can't use cgroup.kill feature or
freeze all the processes in cgroup to terminate the exec init process.
It's unsafe to call kill(2) since the pid can be recycled. It's good to
provide the pidfd of init process through the pidfd-socket. It's similar to
the console-socket. With the pidfd, the container manager like containerd-shim
can send the signal to target process safely.
And for the standard init process, we can have polling support to get
exit event instead of blocking on wait4.
Signed-off-by: Wei Fu <fuweid89@gmail.com>
The original reasoning for this option was to avoid having mount options
be overwritten by runc. However, adding command-line arguments has
historically been a bad idea because it forces strict-runc-compatible
OCI runtimes to copy out-of-spec features directly from runc and these
flags are usually quite difficult to enable by users when using runc
through several layers of engines and orchestrators.
A far more preferable solution is to have a heuristic which detects
whether copying the original mount's mount options would override an
explicit mount option specified by the user. In this case, we should
return an error. You only end up in this path in the userns case, if you
have a bind-mount source with locked flags.
During the course of writing this patch, I discovered that several
aspects of our handling of flags for bind-mounts left much to be
desired. We have completely botched the handling of explicitly cleared
flags since commit 97f5ee4e6a ("Only remount if requested flags differ
from current"), with our behaviour only becoming increasingly more weird
with 50105de1d8 ("Fix failure with rw bind mount of a ro fuse") and
da780e4d27 ("Fix bind mounts of filesystems with certain options
set"). In short, we would only clear flags explicitly request by the
user purely by chance, in ways that it really should've been reported to
us by now. The most egregious is that mounts explicitly marked "rw" were
actually mounted "ro" if the bind-mount source was "ro" and no other
special flags were included. In addition, our handling of atime was
completely broken -- mostly due to how subtle the semantics of atime are
on Linux.
Unfortunately, while the runtime-spec requires us to implement
mount(8)'s behaviour, several aspects of the util-linux mount(8)'s
behaviour are broken and thus copying them makes little sense. Since the
runtime-spec behaviour for this case (should mount options for a "bind"
mount use the "mount --bind -o ..." or "mount --bind -o remount,..."
semantics? Is the fallback code we have for userns actually
spec-compliant?) and the mount(8) behaviour (see [1]) are not
well-defined, this commit simply fixes the most obvious aspects of the
behaviour that are broken while keeping the current spirit of the
implementation.
NOTE: The handling of atime in the base case is left for a future PR to
deal with. This means that the atime of the source mount will be
silently left alone unless the fallback path needs to be taken, and any
flags not explicitly set will be cleared in the base case. Whether we
should always be operating as "mount --bind -o remount,..." (where we
default to the original mount source flags) is a topic for a separate PR
and (probably) associated runtime-spec PR.
So, to resolve this:
* We store which flags were explicitly requested to be cleared by the
user, so that we can detect whether the userns fallback path would end
up setting a flag the user explicitly wished to clear. If so, we
return an error because we couldn't fulfil the configuration settings.
* Revert 97f5ee4e6a ("Only remount if requested flags differ from
current"), as missing flags do not mean we can skip MS_REMOUNT (in
fact, missing flags are how you indicate a flag needs to be cleared
with mount(2)). The original purpose of the patch was to fix the
userns issue, but as mentioned above the correct mechanism is to do a
fallback mount that copies the lockable flags from statfs(2).
* Improve handling of atime in the fallback case by:
- Correctly handling the returned flags in statfs(2).
- Implement the MNT_LOCK_ATIME checks in our code to ensure we
produce errors rather than silently producing incorrect atime
mounts.
* Improve the tests so we correctly detect all of these contingencies,
including a general "bind-mount atime handling" test to ensure that
the behaviour described here is accurate.
This change also inlines the remount() function -- it was only ever used
for the bind-mount remount case, and its behaviour is very bind-mount
specific.
[1]: https://github.com/util-linux/util-linux/issues/2433
Reverts: 97f5ee4e6a ("Only remount if requested flags differ from current")
Fixes: 50105de1d8 ("Fix failure with rw bind mount of a ro fuse")
Fixes: da780e4d27 ("Fix bind mounts of filesystems with certain options set")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Currently bind mounts of filesystems with nodev, nosuid, noexec,
noatime, relatime, strictatime, nodiratime options set fail in rootless
mode if the same options are not set for the bind mount.
For ro filesystems this was resolved by #2570 by remounting again
with ro set.
Follow the same approach for nodev, nosuid, noexec, noatime, relatime,
strictatime, nodiratime but allow to revert back to the old behaviour
via the new `--no-mount-fallback` command line option.
Add a testcase to verify that bind mounts of filesystems with nodev,
nosuid, noexec, noatime options set work in rootless mode.
Add a testcase that mounts a nodev, nosuid, noexec, noatime filesystem
with a ro flag.
Add two further testcases that ensure that the above testcases would
fail if the `--no-mount-fallback` command line option is set.
* contrib/completions/bash/runc:
Add `--no-mount-fallback` command line option for bash completion.
* create.go:
Add `--no-mount-fallback` command line option.
* restore.go:
Add `--no-mount-fallback` command line option.
* run.go:
Add `--no-mount-fallback` command line option.
* libcontainer/configs/config.go:
Add `NoMountFallback` field to the `Config` struct to store
the command line option value.
* libcontainer/specconv/spec_linux.go:
Add `NoMountFallback` field to the `CreateOpts` struct to store
the command line option value and store it in the libcontainer
config.
* utils_linux.go:
Store the command line option value in the `CreateOpts` struct.
* libcontainer/rootfs_linux.go:
In case that `--no-mount-fallback` is not set try to remount the
bind filesystem again with the options nodev, nosuid, noexec,
noatime, relatime, strictatime or nodiratime if they are set on
the source filesystem.
* tests/integration/mounts_sshfs.bats:
Add testcases and rework sshfs setup to allow specifying
different mount options depending on the test case.
Signed-off-by: Ruediger Pluem <ruediger.pluem@vodafone.com>
The only implementation of these is linuxContainer. It does not make
sense to have an interface with a single implementation, and we do not
foresee other types of containers being added to runc.
Remove BaseContainer and Container interfaces, moving their methods
documentation to linuxContainer.
Rename linuxContainer to Container.
Adopt users from using interface to using struct.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Since LinuxFactory has become the means to specify containers state
top directory (aka --root), and is only used by two methods (Create
and Load), it is easier to pass root to them directly.
Modify all the users and the docs accordingly.
While at it, fix Create and Load docs (those that were originally moved
from the Factory interface docs).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The value of root is already an absolute path since commit
ede8a86ec1, so it does not make sense to call filepath.Abs()
again.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Remove intelrtd.Manager interface, since we only have a single
implementation, and do not expect another one.
Rename intelRdtManager to Manager, and modify its users accordingly.
Remove NewIntelRdtManager from factory.
Remove IntelRdtfs. Instead, make intelrdt.NewManager return nil if the
feature is not available.
Remove TestFactoryNewIntelRdt as it is now identical to TestFactoryNew.
Add internal function newManager to be used for tests (to make sure
some testing is done even when the feature is not available in
kernel/hardware).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
These were introduced in commit d8b669400 back in 2017, with a TODO
of "make binary names configurable". Apparently, everyone is happy with
the hardcoded names. In fact, they *are* configurable (by prepending the
PATH with a directory containing own version of newuidmap/newgidmap).
Now, these binaries are only needed in a few specific cases (when
rootless is set etc.), so let's look them up only when needed.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This was introduced in an initial commit, back in the day when criu was
a highly experimental thing. Today it's not; most users who need it have
it packaged by their distro vendor.
The usual way to run a binary is to look it up in directories listed in
$PATH. This is flexible enough and allows for multiple scenarios (custom
binaries, extra binaries, etc.). This is the way criu should be run.
Make --criu a hidden option (thus removing it from help). Remove the
option from man pages, integration tests, etc. Remove all traces of
CriuPath from data structures.
Add a warning that --criu is ignored and will be removed.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Those were never used (ctx was added by the initial commit, and
error was added by commit 25fd4a6757).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In some setups, multiple cgroups are used inside a container,
and sometime there is a need to execute a process in a particular
sub-cgroup (in case of cgroup v1, for a particular controller).
This is what this commit implements.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. Make Rootless and Systemd flags part of config.Cgroups.
2. Make all cgroup managers (not just fs2) return error (so it can do
more initialization -- added by the following commits).
3. Replace complicated cgroup manager instantiation in factory_linux
by a single (and simple) libcontainer/cgroups/manager.New() function.
4. getUnifiedPath is simplified to check that only a single path is
supplied (rather than checking that other paths, if supplied,
are the same).
[v2: can't -> cannot]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
All three callers* of startContainer call revisePidFile and createSpec
before calling it, so it makes sense to move those calls to inside of
the startContainer, and drop the spec argument.
* -- in fact restore does not call revisePidFile, but it should.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Instead of passing _LIBCONTAINER_LOGLEVEL as a string
(like "debug" or "info"), use a numeric value.
Also, simplify the init log level passing code -- since we actually use
the same level as the runc binary, just get it from logrus.
This is a preparation for the next commit.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
For files that end with _linux.go or _linux_test.go, there is no need to
specify linux build tag, as it is assumed from the file name.
In addition, rename libcontainer/notify_linux_v2.go -> libcontainer/notify_v2_linux.go
for the file name to make sense.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
gofumpt (mvdan.cc/gofumpt) is a fork of gofmt with stricter rules.
Brought to you by
git ls-files \*.go | grep -v ^vendor/ | xargs gofumpt -s -w
Looking at the diff, all these changes make sense.
Also, replace gofmt with gofumpt in golangci.yml.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Use sync.Once to init Intel RDT when needed for a small speedup to
operations which do not require Intel RDT.
Simplify IntelRdtManager initialization in LinuxFactory.
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
This fixes the following failure:
> sudo runc run -b bundle ctr </dev/null
> WARN[0000] exit status 2
> ERRO[0000] container_linux.go:367: starting container process caused: process_linux.go:459: container init caused:
The "exit status 2" with no error message is caused by SIGHUP
which is sent to init by the kernel when we are losing the
controlling terminal. If we choose to ignore that, we'll get
panic in console.Current(), which is addressed by [1].
Otherwise, the issue here is simple: the code assumes stdin
is opened to a terminal, and fails to work otherwise. Some
standard Linux tools (e.g. stty, top) do the same (modulo panic),
while some others (reset, tput) use the trick of trying
all the three std streams (starting with stderr as it is least likely
to be redirected), and if all three fails, open /dev/tty.
This commit does a similar thing (see initHostConsole).
It also replaces the call to console.Current(), which may panic
(see [1]), by reusing the t.hostConsole.
Finally, a simple test case is added.
Fixes: https://github.com/opencontainers/runc/issues/2485
[1] https://github.com/containerd/console/pull/37
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>