With the idmap work, we will have a tainted Go thread in our
thread-group that has a different mount namespace to the other threads.
It seems that (due to some bad luck) the Go scheduler tends to make this
thread the thread-group leader in our tests, which results in very
baffling failures where /proc/self/mountinfo produces gibberish results.
In order to avoid this, switch to using /proc/thread-self for everything
that is thread-local. This primarily includes switching all file
descriptor paths (CLONE_FS), all of the places that check the current
cgroup (technically we never will run a single runc thread in a separate
cgroup, but better to be safe than sorry), and the aforementioned
mountinfo code. We don't need to do anything for the following because
the results we need aren't thread-local:
* Checks that certain namespaces are supported by stat(2)ing
/proc/self/ns/...
* /proc/self/exe and /proc/self/cmdline are not thread-local.
* While threads can be in different cgroups, we do not do this for the
runc binary (or libcontainer) and thus we do not need to switch to
the thread-local version of /proc/self/cgroups.
* All of the CLONE_NEWUSER files are not thread-local because you
cannot set the usernamespace of a single thread (setns(CLONE_NEWUSER)
is blocked for multi-threaded programs).
Note that we have to use runtime.LockOSThread when we have an open
handle to a tid-specific procfs file that we are operating on multiple
times. Go can reschedule us such that we are running on a different
thread and then kill the original thread (causing -ENOENT or similarly
confusing errors). This is not strictly necessary for most usages of
/proc/thread-self (such as using /proc/thread-self/fd/$n directly) since
only operating on the actual inodes associated with the tid requires
this locking, but because of the pre-3.17 fallback for CentOS, we have
to do this in most cases.
In addition, CentOS's kernel is too old for /proc/thread-self, which
requires us to emulate it -- however in rootfs_linux.go, we are in the
container pid namespace but /proc is the host's procfs. This leads to
the incredibly frustrating situation where there is no way (on pre-4.1
Linux) to figure out which /proc/self/task/... entry refers to the
current tid. We can just use /proc/self in this case.
Yes this is all pretty ugly. I also wish it wasn't necessary.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
If container.Destroy() has failed, runc destroy still return 0, which is
wrong and can result in other issues down the line.
Let's always return error from destroy in runc delete.
For runc checkpoint and runc run, we still treat it as a warning.
Co-authored-by: Zhang Tianyang <burning9699@gmail.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The container manager like containerd-shim can't use cgroup.kill feature or
freeze all the processes in cgroup to terminate the exec init process.
It's unsafe to call kill(2) since the pid can be recycled. It's good to
provide the pidfd of init process through the pidfd-socket. It's similar to
the console-socket. With the pidfd, the container manager like containerd-shim
can send the signal to target process safely.
And for the standard init process, we can have polling support to get
exit event instead of blocking on wait4.
Signed-off-by: Wei Fu <fuweid89@gmail.com>
The original reasoning for this option was to avoid having mount options
be overwritten by runc. However, adding command-line arguments has
historically been a bad idea because it forces strict-runc-compatible
OCI runtimes to copy out-of-spec features directly from runc and these
flags are usually quite difficult to enable by users when using runc
through several layers of engines and orchestrators.
A far more preferable solution is to have a heuristic which detects
whether copying the original mount's mount options would override an
explicit mount option specified by the user. In this case, we should
return an error. You only end up in this path in the userns case, if you
have a bind-mount source with locked flags.
During the course of writing this patch, I discovered that several
aspects of our handling of flags for bind-mounts left much to be
desired. We have completely botched the handling of explicitly cleared
flags since commit 97f5ee4e6a ("Only remount if requested flags differ
from current"), with our behaviour only becoming increasingly more weird
with 50105de1d8 ("Fix failure with rw bind mount of a ro fuse") and
da780e4d27 ("Fix bind mounts of filesystems with certain options
set"). In short, we would only clear flags explicitly request by the
user purely by chance, in ways that it really should've been reported to
us by now. The most egregious is that mounts explicitly marked "rw" were
actually mounted "ro" if the bind-mount source was "ro" and no other
special flags were included. In addition, our handling of atime was
completely broken -- mostly due to how subtle the semantics of atime are
on Linux.
Unfortunately, while the runtime-spec requires us to implement
mount(8)'s behaviour, several aspects of the util-linux mount(8)'s
behaviour are broken and thus copying them makes little sense. Since the
runtime-spec behaviour for this case (should mount options for a "bind"
mount use the "mount --bind -o ..." or "mount --bind -o remount,..."
semantics? Is the fallback code we have for userns actually
spec-compliant?) and the mount(8) behaviour (see [1]) are not
well-defined, this commit simply fixes the most obvious aspects of the
behaviour that are broken while keeping the current spirit of the
implementation.
NOTE: The handling of atime in the base case is left for a future PR to
deal with. This means that the atime of the source mount will be
silently left alone unless the fallback path needs to be taken, and any
flags not explicitly set will be cleared in the base case. Whether we
should always be operating as "mount --bind -o remount,..." (where we
default to the original mount source flags) is a topic for a separate PR
and (probably) associated runtime-spec PR.
So, to resolve this:
* We store which flags were explicitly requested to be cleared by the
user, so that we can detect whether the userns fallback path would end
up setting a flag the user explicitly wished to clear. If so, we
return an error because we couldn't fulfil the configuration settings.
* Revert 97f5ee4e6a ("Only remount if requested flags differ from
current"), as missing flags do not mean we can skip MS_REMOUNT (in
fact, missing flags are how you indicate a flag needs to be cleared
with mount(2)). The original purpose of the patch was to fix the
userns issue, but as mentioned above the correct mechanism is to do a
fallback mount that copies the lockable flags from statfs(2).
* Improve handling of atime in the fallback case by:
- Correctly handling the returned flags in statfs(2).
- Implement the MNT_LOCK_ATIME checks in our code to ensure we
produce errors rather than silently producing incorrect atime
mounts.
* Improve the tests so we correctly detect all of these contingencies,
including a general "bind-mount atime handling" test to ensure that
the behaviour described here is accurate.
This change also inlines the remount() function -- it was only ever used
for the bind-mount remount case, and its behaviour is very bind-mount
specific.
[1]: https://github.com/util-linux/util-linux/issues/2433
Reverts: 97f5ee4e6a ("Only remount if requested flags differ from current")
Fixes: 50105de1d8 ("Fix failure with rw bind mount of a ro fuse")
Fixes: da780e4d27 ("Fix bind mounts of filesystems with certain options set")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Currently bind mounts of filesystems with nodev, nosuid, noexec,
noatime, relatime, strictatime, nodiratime options set fail in rootless
mode if the same options are not set for the bind mount.
For ro filesystems this was resolved by #2570 by remounting again
with ro set.
Follow the same approach for nodev, nosuid, noexec, noatime, relatime,
strictatime, nodiratime but allow to revert back to the old behaviour
via the new `--no-mount-fallback` command line option.
Add a testcase to verify that bind mounts of filesystems with nodev,
nosuid, noexec, noatime options set work in rootless mode.
Add a testcase that mounts a nodev, nosuid, noexec, noatime filesystem
with a ro flag.
Add two further testcases that ensure that the above testcases would
fail if the `--no-mount-fallback` command line option is set.
* contrib/completions/bash/runc:
Add `--no-mount-fallback` command line option for bash completion.
* create.go:
Add `--no-mount-fallback` command line option.
* restore.go:
Add `--no-mount-fallback` command line option.
* run.go:
Add `--no-mount-fallback` command line option.
* libcontainer/configs/config.go:
Add `NoMountFallback` field to the `Config` struct to store
the command line option value.
* libcontainer/specconv/spec_linux.go:
Add `NoMountFallback` field to the `CreateOpts` struct to store
the command line option value and store it in the libcontainer
config.
* utils_linux.go:
Store the command line option value in the `CreateOpts` struct.
* libcontainer/rootfs_linux.go:
In case that `--no-mount-fallback` is not set try to remount the
bind filesystem again with the options nodev, nosuid, noexec,
noatime, relatime, strictatime or nodiratime if they are set on
the source filesystem.
* tests/integration/mounts_sshfs.bats:
Add testcases and rework sshfs setup to allow specifying
different mount options depending on the test case.
Signed-off-by: Ruediger Pluem <ruediger.pluem@vodafone.com>
The only implementation of these is linuxContainer. It does not make
sense to have an interface with a single implementation, and we do not
foresee other types of containers being added to runc.
Remove BaseContainer and Container interfaces, moving their methods
documentation to linuxContainer.
Rename linuxContainer to Container.
Adopt users from using interface to using struct.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Since LinuxFactory has become the means to specify containers state
top directory (aka --root), and is only used by two methods (Create
and Load), it is easier to pass root to them directly.
Modify all the users and the docs accordingly.
While at it, fix Create and Load docs (those that were originally moved
from the Factory interface docs).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The value of root is already an absolute path since commit
ede8a86ec1, so it does not make sense to call filepath.Abs()
again.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Remove intelrtd.Manager interface, since we only have a single
implementation, and do not expect another one.
Rename intelRdtManager to Manager, and modify its users accordingly.
Remove NewIntelRdtManager from factory.
Remove IntelRdtfs. Instead, make intelrdt.NewManager return nil if the
feature is not available.
Remove TestFactoryNewIntelRdt as it is now identical to TestFactoryNew.
Add internal function newManager to be used for tests (to make sure
some testing is done even when the feature is not available in
kernel/hardware).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
These were introduced in commit d8b669400 back in 2017, with a TODO
of "make binary names configurable". Apparently, everyone is happy with
the hardcoded names. In fact, they *are* configurable (by prepending the
PATH with a directory containing own version of newuidmap/newgidmap).
Now, these binaries are only needed in a few specific cases (when
rootless is set etc.), so let's look them up only when needed.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This was introduced in an initial commit, back in the day when criu was
a highly experimental thing. Today it's not; most users who need it have
it packaged by their distro vendor.
The usual way to run a binary is to look it up in directories listed in
$PATH. This is flexible enough and allows for multiple scenarios (custom
binaries, extra binaries, etc.). This is the way criu should be run.
Make --criu a hidden option (thus removing it from help). Remove the
option from man pages, integration tests, etc. Remove all traces of
CriuPath from data structures.
Add a warning that --criu is ignored and will be removed.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Those were never used (ctx was added by the initial commit, and
error was added by commit 25fd4a6757).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In some setups, multiple cgroups are used inside a container,
and sometime there is a need to execute a process in a particular
sub-cgroup (in case of cgroup v1, for a particular controller).
This is what this commit implements.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. Make Rootless and Systemd flags part of config.Cgroups.
2. Make all cgroup managers (not just fs2) return error (so it can do
more initialization -- added by the following commits).
3. Replace complicated cgroup manager instantiation in factory_linux
by a single (and simple) libcontainer/cgroups/manager.New() function.
4. getUnifiedPath is simplified to check that only a single path is
supplied (rather than checking that other paths, if supplied,
are the same).
[v2: can't -> cannot]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
All three callers* of startContainer call revisePidFile and createSpec
before calling it, so it makes sense to move those calls to inside of
the startContainer, and drop the spec argument.
* -- in fact restore does not call revisePidFile, but it should.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Instead of passing _LIBCONTAINER_LOGLEVEL as a string
(like "debug" or "info"), use a numeric value.
Also, simplify the init log level passing code -- since we actually use
the same level as the runc binary, just get it from logrus.
This is a preparation for the next commit.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
For files that end with _linux.go or _linux_test.go, there is no need to
specify linux build tag, as it is assumed from the file name.
In addition, rename libcontainer/notify_linux_v2.go -> libcontainer/notify_v2_linux.go
for the file name to make sense.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
gofumpt (mvdan.cc/gofumpt) is a fork of gofmt with stricter rules.
Brought to you by
git ls-files \*.go | grep -v ^vendor/ | xargs gofumpt -s -w
Looking at the diff, all these changes make sense.
Also, replace gofmt with gofumpt in golangci.yml.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Use sync.Once to init Intel RDT when needed for a small speedup to
operations which do not require Intel RDT.
Simplify IntelRdtManager initialization in LinuxFactory.
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
This fixes the following failure:
> sudo runc run -b bundle ctr </dev/null
> WARN[0000] exit status 2
> ERRO[0000] container_linux.go:367: starting container process caused: process_linux.go:459: container init caused:
The "exit status 2" with no error message is caused by SIGHUP
which is sent to init by the kernel when we are losing the
controlling terminal. If we choose to ignore that, we'll get
panic in console.Current(), which is addressed by [1].
Otherwise, the issue here is simple: the code assumes stdin
is opened to a terminal, and fails to work otherwise. Some
standard Linux tools (e.g. stty, top) do the same (modulo panic),
while some others (reset, tput) use the trick of trying
all the three std streams (starting with stderr as it is least likely
to be redirected), and if all three fails, open /dev/tty.
This commit does a similar thing (see initHostConsole).
It also replaces the call to console.Current(), which may panic
(see [1]), by reusing the t.hostConsole.
Finally, a simple test case is added.
Fixes: https://github.com/opencontainers/runc/issues/2485
[1] https://github.com/containerd/console/pull/37
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
if NOTIFY_SOCKET is used, do not block the main runc process waiting
for events on the notify socket. Bind mount the parent directory of
the notify socket, so that "start" can create the socket and it is
still accessible from the container.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>