Commit Graph

19 Commits

Author SHA1 Message Date
lfbzhm
95a93c132c Merge pull request #4045 from fuweid/support-pidfd-socket
[feature request] *: introduce pidfd-socket flag
2023-11-22 09:13:55 +08:00
Wei Fu
94505a046a *: introduce pidfd-socket flag
The container manager like containerd-shim can't use cgroup.kill feature or
freeze all the processes in cgroup to terminate the exec init process.
It's unsafe to call kill(2) since the pid can be recycled. It's good to
provide the pidfd of init process through the pidfd-socket. It's similar to
the console-socket. With the pidfd, the container manager like containerd-shim
can send the signal to target process safely.

And for the standard init process, we can have polling support to get
exit event instead of blocking on wait4.

Signed-off-by: Wei Fu <fuweid89@gmail.com>
2023-11-21 18:28:50 +08:00
Aleksa Sarai
7c71a22705 rootfs: remove --no-mount-fallback and finally fix MS_REMOUNT
The original reasoning for this option was to avoid having mount options
be overwritten by runc. However, adding command-line arguments has
historically been a bad idea because it forces strict-runc-compatible
OCI runtimes to copy out-of-spec features directly from runc and these
flags are usually quite difficult to enable by users when using runc
through several layers of engines and orchestrators.

A far more preferable solution is to have a heuristic which detects
whether copying the original mount's mount options would override an
explicit mount option specified by the user. In this case, we should
return an error. You only end up in this path in the userns case, if you
have a bind-mount source with locked flags.

During the course of writing this patch, I discovered that several
aspects of our handling of flags for bind-mounts left much to be
desired. We have completely botched the handling of explicitly cleared
flags since commit 97f5ee4e6a ("Only remount if requested flags differ
from current"), with our behaviour only becoming increasingly more weird
with 50105de1d8 ("Fix failure with rw bind mount of a ro fuse") and
da780e4d27 ("Fix bind mounts of filesystems with certain options
set"). In short, we would only clear flags explicitly request by the
user purely by chance, in ways that it really should've been reported to
us by now. The most egregious is that mounts explicitly marked "rw" were
actually mounted "ro" if the bind-mount source was "ro" and no other
special flags were included. In addition, our handling of atime was
completely broken -- mostly due to how subtle the semantics of atime are
on Linux.

Unfortunately, while the runtime-spec requires us to implement
mount(8)'s behaviour, several aspects of the util-linux mount(8)'s
behaviour are broken and thus copying them makes little sense. Since the
runtime-spec behaviour for this case (should mount options for a "bind"
mount use the "mount --bind -o ..." or "mount --bind -o remount,..."
semantics? Is the fallback code we have for userns actually
spec-compliant?) and the mount(8) behaviour (see [1]) are not
well-defined, this commit simply fixes the most obvious aspects of the
behaviour that are broken while keeping the current spirit of the
implementation.

NOTE: The handling of atime in the base case is left for a future PR to
deal with. This means that the atime of the source mount will be
silently left alone unless the fallback path needs to be taken, and any
flags not explicitly set will be cleared in the base case. Whether we
should always be operating as "mount --bind -o remount,..." (where we
default to the original mount source flags) is a topic for a separate PR
and (probably) associated runtime-spec PR.

So, to resolve this:

* We store which flags were explicitly requested to be cleared by the
  user, so that we can detect whether the userns fallback path would end
  up setting a flag the user explicitly wished to clear. If so, we
  return an error because we couldn't fulfil the configuration settings.

* Revert 97f5ee4e6a ("Only remount if requested flags differ from
  current"), as missing flags do not mean we can skip MS_REMOUNT (in
  fact, missing flags are how you indicate a flag needs to be cleared
  with mount(2)). The original purpose of the patch was to fix the
  userns issue, but as mentioned above the correct mechanism is to do a
  fallback mount that copies the lockable flags from statfs(2).

* Improve handling of atime in the fallback case by:
    - Correctly handling the returned flags in statfs(2).
    - Implement the MNT_LOCK_ATIME checks in our code to ensure we
      produce errors rather than silently producing incorrect atime
      mounts.

* Improve the tests so we correctly detect all of these contingencies,
  including a general "bind-mount atime handling" test to ensure that
  the behaviour described here is accurate.

This change also inlines the remount() function -- it was only ever used
for the bind-mount remount case, and its behaviour is very bind-mount
specific.

[1]: https://github.com/util-linux/util-linux/issues/2433

Reverts: 97f5ee4e6a ("Only remount if requested flags differ from current")
Fixes: 50105de1d8 ("Fix failure with rw bind mount of a ro fuse")
Fixes: da780e4d27 ("Fix bind mounts of filesystems with certain options set")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-10-24 17:28:25 +11:00
Ruediger Pluem
da780e4d27 Fix bind mounts of filesystems with certain options set
Currently bind mounts of filesystems with nodev, nosuid, noexec,
noatime, relatime, strictatime, nodiratime options set fail in rootless
mode if the same options are not set for the bind mount.
For ro filesystems this was resolved by #2570 by remounting again
with ro set.

Follow the same approach for nodev, nosuid, noexec, noatime, relatime,
strictatime, nodiratime but allow to revert back to the old behaviour
via the new `--no-mount-fallback` command line option.

Add a testcase to verify that bind mounts of filesystems with nodev,
nosuid, noexec, noatime options set work in rootless mode.
Add a testcase that mounts a nodev, nosuid, noexec, noatime filesystem
with a ro flag.
Add two further testcases that ensure that the above testcases would
fail if the `--no-mount-fallback` command line option is set.

* contrib/completions/bash/runc:
      Add `--no-mount-fallback` command line option for bash completion.

* create.go:
      Add `--no-mount-fallback` command line option.

* restore.go:
      Add `--no-mount-fallback` command line option.

* run.go:
      Add `--no-mount-fallback` command line option.

* libcontainer/configs/config.go:
      Add `NoMountFallback` field to the `Config` struct to store
      the command line option value.

* libcontainer/specconv/spec_linux.go:
      Add `NoMountFallback` field to the `CreateOpts` struct to store
      the command line option value and store it in the libcontainer
      config.

* utils_linux.go:
      Store the command line option value in the `CreateOpts` struct.

* libcontainer/rootfs_linux.go:
      In case that `--no-mount-fallback` is not set try to remount the
      bind filesystem again with the options nodev, nosuid, noexec,
      noatime, relatime, strictatime or nodiratime if they are set on
      the source filesystem.

* tests/integration/mounts_sshfs.bats:
      Add testcases and rework sshfs setup to allow specifying
      different mount options depending on the test case.

Signed-off-by: Ruediger Pluem <ruediger.pluem@vodafone.com>
2023-07-28 16:32:02 -07:00
Kir Kolyshkin
d974b22ac4 create, run: amend final errors
As the error may contain anything, it may not be clear to a user that
the whole (create or run) operation failed. Amend the errors.

Also, change the code flow in create to match that of run, so we don't
have to add the fake "return nil" at the end.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-09-14 10:53:11 -07:00
Kir Kolyshkin
9ba2f65d6b startContainer: minor refactor
All three callers* of startContainer call revisePidFile and createSpec
before calling it, so it makes sense to move those calls to inside of
the startContainer, and drop the spec argument.

* -- in fact restore does not call revisePidFile, but it should.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-09-14 10:53:11 -07:00
Andrei Vagin
a4fcbfb704 Prepare startContainer() to have more action
Currently startContainer() is used to create and to run a container.
In the next patch it will be used to restore a container.

Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
2017-05-01 21:55:57 +03:00
Ian Campbell
f5adb05bce Add --preserve-fds=N to create and run
This preserves the given number of file descriptors on top of the 3 stdio and
the socket activation ($LISTEN_FDS=M) fds.

If LISTEN_FDS is not set then [3..3+N) would be preserved by --preserve-fds=N.

Given LISTEN_FDS=3 and --preserve-fds=5 then we would preserve fds [3, 11) (in
addition to stdio).  That's 3, 4 & 5 from LISTEN_FDS=3 and 6, 7, 8, 9 & 10 from
--preserve-fds=5.

Signed-off-by: Ian Campbell <ian.campbell@docker.com>
2017-02-20 11:50:18 +00:00
Aleksa Sarai
c6d8a2f26f merge branch 'pr-1158'
Closes #1158
LGTMs: @hqhq @cyphar
2016-12-26 13:59:47 +11:00
Aleksa Sarai
7df64f8886 runc: implement --console-socket
This allows for higher-level orchestrators to be able to have access to
the master pty file descriptor without keeping the runC process running.
This is key to having (detach && createTTY) with a _real_ pty created
inside the container, which is then sent to a higher level orchestrator
over an AF_UNIX socket.

This patch is part of the console rewrite patchset.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2016-12-01 15:49:36 +11:00
Aleksa Sarai
244c9fc426 *: console rewrite
This implements {createTTY, detach} and all of the combinations and
negations of the two that were previously implemented. There are some
valid questions about out-of-OCI-scope topics like !createTTY and how
things should be handled (why do we dup the current stdio to the
process, and how is that not a security issue). However, these will be
dealt with in a separate patchset.

In order to allow for late console setup, split setupRootfs into the
"preparation" section where all of the mounts are created and the
"finalize" section where we pivot_root and set things as ro. In between
the two we can set up all of the console mountpoints and symlinks we
need.

We use two-stage synchronisation to ensures that when the syscalls are
reordered in a suboptimal way, an out-of-place read() on the parentPipe
will not gobble the ancilliary information.

This patch is part of the console rewrite patchset.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2016-12-01 15:49:36 +11:00
Zhang Wei
b517076907 Check args numbers before application start
Add a general args number validator for all client commands.

Signed-off-by: Zhang Wei <zhangwei555@huawei.com>
2016-11-29 11:18:51 +08:00
Wang Long
8676c75442 Fix the pid-file option for runc run/exec/create command
Signed-off-by: Wang Long <long.wanglong@huawei.com>
2016-11-02 14:08:32 +08:00
Wang Long
ba1c0b4fa3 check the arguments for runc create
This patch checks the arguments for command  `runc create`.
the `create` command requires exactly one argument

eg:

root@ubuntu:~# runc create -b /mycontainer/ a
root@ubuntu:~# runc list
ID          PID         STATUS      BUNDLE         CREATED
a           61637       created     /mycontainer   2016-10-20T08:21:20.169810942Z
root@ubuntu:~# runc create -b /mycontainer/ a b
runc: "create" requires exactly one argument
root@ubuntu:~# runc create -b /mycontainer/
runc: "create" requires exactly one argument

Signed-off-by: Wang Long <long.wanglong@huawei.com>
2016-10-24 11:09:06 +08:00
Aleksa Sarai
0636bdd45b Merge pull request #874 from crosbymichael/keyring
Add option to disable new session keys
2016-06-12 21:44:45 +10:00
Mrunal Patel
a753b06645 Replace github.com/codegangsta/cli by github.com/urfave/cli
The package got moved to a different repository

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2016-06-06 11:47:20 -07:00
Michael Crosby
8c9db3a7a5 Add option to disable new session keys
This adds an `--no-new-keyring` flag to run and create so that a new
session keyring is not created for the container and the calling
processes keyring is inherited.

Fixes #818

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-06-03 11:53:07 -07:00
Michael Crosby
6eba9b8ffb Fix SystemError and env lookup
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-05-31 11:10:47 -07:00
Michael Crosby
3fe7d7f31e Add create and start command for container lifecycle
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-05-31 11:06:41 -07:00