Commit Graph

33 Commits

Author SHA1 Message Date
lfbzhm
95a93c132c Merge pull request #4045 from fuweid/support-pidfd-socket
[feature request] *: introduce pidfd-socket flag
2023-11-22 09:13:55 +08:00
Wei Fu
94505a046a *: introduce pidfd-socket flag
The container manager like containerd-shim can't use cgroup.kill feature or
freeze all the processes in cgroup to terminate the exec init process.
It's unsafe to call kill(2) since the pid can be recycled. It's good to
provide the pidfd of init process through the pidfd-socket. It's similar to
the console-socket. With the pidfd, the container manager like containerd-shim
can send the signal to target process safely.

And for the standard init process, we can have polling support to get
exit event instead of blocking on wait4.

Signed-off-by: Wei Fu <fuweid89@gmail.com>
2023-11-21 18:28:50 +08:00
Aleksa Sarai
7c71a22705 rootfs: remove --no-mount-fallback and finally fix MS_REMOUNT
The original reasoning for this option was to avoid having mount options
be overwritten by runc. However, adding command-line arguments has
historically been a bad idea because it forces strict-runc-compatible
OCI runtimes to copy out-of-spec features directly from runc and these
flags are usually quite difficult to enable by users when using runc
through several layers of engines and orchestrators.

A far more preferable solution is to have a heuristic which detects
whether copying the original mount's mount options would override an
explicit mount option specified by the user. In this case, we should
return an error. You only end up in this path in the userns case, if you
have a bind-mount source with locked flags.

During the course of writing this patch, I discovered that several
aspects of our handling of flags for bind-mounts left much to be
desired. We have completely botched the handling of explicitly cleared
flags since commit 97f5ee4e6a ("Only remount if requested flags differ
from current"), with our behaviour only becoming increasingly more weird
with 50105de1d8 ("Fix failure with rw bind mount of a ro fuse") and
da780e4d27 ("Fix bind mounts of filesystems with certain options
set"). In short, we would only clear flags explicitly request by the
user purely by chance, in ways that it really should've been reported to
us by now. The most egregious is that mounts explicitly marked "rw" were
actually mounted "ro" if the bind-mount source was "ro" and no other
special flags were included. In addition, our handling of atime was
completely broken -- mostly due to how subtle the semantics of atime are
on Linux.

Unfortunately, while the runtime-spec requires us to implement
mount(8)'s behaviour, several aspects of the util-linux mount(8)'s
behaviour are broken and thus copying them makes little sense. Since the
runtime-spec behaviour for this case (should mount options for a "bind"
mount use the "mount --bind -o ..." or "mount --bind -o remount,..."
semantics? Is the fallback code we have for userns actually
spec-compliant?) and the mount(8) behaviour (see [1]) are not
well-defined, this commit simply fixes the most obvious aspects of the
behaviour that are broken while keeping the current spirit of the
implementation.

NOTE: The handling of atime in the base case is left for a future PR to
deal with. This means that the atime of the source mount will be
silently left alone unless the fallback path needs to be taken, and any
flags not explicitly set will be cleared in the base case. Whether we
should always be operating as "mount --bind -o remount,..." (where we
default to the original mount source flags) is a topic for a separate PR
and (probably) associated runtime-spec PR.

So, to resolve this:

* We store which flags were explicitly requested to be cleared by the
  user, so that we can detect whether the userns fallback path would end
  up setting a flag the user explicitly wished to clear. If so, we
  return an error because we couldn't fulfil the configuration settings.

* Revert 97f5ee4e6a ("Only remount if requested flags differ from
  current"), as missing flags do not mean we can skip MS_REMOUNT (in
  fact, missing flags are how you indicate a flag needs to be cleared
  with mount(2)). The original purpose of the patch was to fix the
  userns issue, but as mentioned above the correct mechanism is to do a
  fallback mount that copies the lockable flags from statfs(2).

* Improve handling of atime in the fallback case by:
    - Correctly handling the returned flags in statfs(2).
    - Implement the MNT_LOCK_ATIME checks in our code to ensure we
      produce errors rather than silently producing incorrect atime
      mounts.

* Improve the tests so we correctly detect all of these contingencies,
  including a general "bind-mount atime handling" test to ensure that
  the behaviour described here is accurate.

This change also inlines the remount() function -- it was only ever used
for the bind-mount remount case, and its behaviour is very bind-mount
specific.

[1]: https://github.com/util-linux/util-linux/issues/2433

Reverts: 97f5ee4e6a ("Only remount if requested flags differ from current")
Fixes: 50105de1d8 ("Fix failure with rw bind mount of a ro fuse")
Fixes: da780e4d27 ("Fix bind mounts of filesystems with certain options set")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-10-24 17:28:25 +11:00
Ruediger Pluem
da780e4d27 Fix bind mounts of filesystems with certain options set
Currently bind mounts of filesystems with nodev, nosuid, noexec,
noatime, relatime, strictatime, nodiratime options set fail in rootless
mode if the same options are not set for the bind mount.
For ro filesystems this was resolved by #2570 by remounting again
with ro set.

Follow the same approach for nodev, nosuid, noexec, noatime, relatime,
strictatime, nodiratime but allow to revert back to the old behaviour
via the new `--no-mount-fallback` command line option.

Add a testcase to verify that bind mounts of filesystems with nodev,
nosuid, noexec, noatime options set work in rootless mode.
Add a testcase that mounts a nodev, nosuid, noexec, noatime filesystem
with a ro flag.
Add two further testcases that ensure that the above testcases would
fail if the `--no-mount-fallback` command line option is set.

* contrib/completions/bash/runc:
      Add `--no-mount-fallback` command line option for bash completion.

* create.go:
      Add `--no-mount-fallback` command line option.

* restore.go:
      Add `--no-mount-fallback` command line option.

* run.go:
      Add `--no-mount-fallback` command line option.

* libcontainer/configs/config.go:
      Add `NoMountFallback` field to the `Config` struct to store
      the command line option value.

* libcontainer/specconv/spec_linux.go:
      Add `NoMountFallback` field to the `CreateOpts` struct to store
      the command line option value and store it in the libcontainer
      config.

* utils_linux.go:
      Store the command line option value in the `CreateOpts` struct.

* libcontainer/rootfs_linux.go:
      In case that `--no-mount-fallback` is not set try to remount the
      bind filesystem again with the options nodev, nosuid, noexec,
      noatime, relatime, strictatime or nodiratime if they are set on
      the source filesystem.

* tests/integration/mounts_sshfs.bats:
      Add testcases and rework sshfs setup to allow specifying
      different mount options depending on the test case.

Signed-off-by: Ruediger Pluem <ruediger.pluem@vodafone.com>
2023-07-28 16:32:02 -07:00
Kir Kolyshkin
d974b22ac4 create, run: amend final errors
As the error may contain anything, it may not be clear to a user that
the whole (create or run) operation failed. Amend the errors.

Also, change the code flow in create to match that of run, so we don't
have to add the fake "return nil" at the end.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-09-14 10:53:11 -07:00
Kir Kolyshkin
9ba2f65d6b startContainer: minor refactor
All three callers* of startContainer call revisePidFile and createSpec
before calling it, so it makes sense to move those calls to inside of
the startContainer, and drop the spec argument.

* -- in fact restore does not call revisePidFile, but it should.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-09-14 10:53:11 -07:00
Akihiro Suda
5fb9b2a006 Merge pull request #3185 from kolyshkin/go117-build-tags
Add go:build tags
2021-09-02 13:35:33 +09:00
Kir Kolyshkin
c5b0be78e8 Rm build tags from main pkg
This was added by commit 5aa82c950 back in the day when we thought
runc is going to be cross-platform. It's very clear now it's Linux-only
package.

While at it, further clarify it in README that we're Linux only.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-08-30 20:15:01 -07:00
lifubang
cb824629ba proposal: add --keep to runc run
Signed-off-by: lifubang <lifubang@acmcoder.com>
2021-08-02 12:51:36 -07:00
Andrei Vagin
a4fcbfb704 Prepare startContainer() to have more action
Currently startContainer() is used to create and to run a container.
In the next patch it will be used to restore a container.

Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
2017-05-01 21:55:57 +03:00
Ian Campbell
f5adb05bce Add --preserve-fds=N to create and run
This preserves the given number of file descriptors on top of the 3 stdio and
the socket activation ($LISTEN_FDS=M) fds.

If LISTEN_FDS is not set then [3..3+N) would be preserved by --preserve-fds=N.

Given LISTEN_FDS=3 and --preserve-fds=5 then we would preserve fds [3, 11) (in
addition to stdio).  That's 3, 4 & 5 from LISTEN_FDS=3 and 6, 7, 8, 9 & 10 from
--preserve-fds=5.

Signed-off-by: Ian Campbell <ian.campbell@docker.com>
2017-02-20 11:50:18 +00:00
Aleksa Sarai
c6d8a2f26f merge branch 'pr-1158'
Closes #1158
LGTMs: @hqhq @cyphar
2016-12-26 13:59:47 +11:00
Aleksa Sarai
7df64f8886 runc: implement --console-socket
This allows for higher-level orchestrators to be able to have access to
the master pty file descriptor without keeping the runC process running.
This is key to having (detach && createTTY) with a _real_ pty created
inside the container, which is then sent to a higher level orchestrator
over an AF_UNIX socket.

This patch is part of the console rewrite patchset.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2016-12-01 15:49:36 +11:00
Aleksa Sarai
244c9fc426 *: console rewrite
This implements {createTTY, detach} and all of the combinations and
negations of the two that were previously implemented. There are some
valid questions about out-of-OCI-scope topics like !createTTY and how
things should be handled (why do we dup the current stdio to the
process, and how is that not a security issue). However, these will be
dealt with in a separate patchset.

In order to allow for late console setup, split setupRootfs into the
"preparation" section where all of the mounts are created and the
"finalize" section where we pivot_root and set things as ro. In between
the two we can set up all of the console mountpoints and symlinks we
need.

We use two-stage synchronisation to ensures that when the syscalls are
reordered in a suboptimal way, an out-of-place read() on the parentPipe
will not gobble the ancilliary information.

This patch is part of the console rewrite patchset.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2016-12-01 15:49:36 +11:00
Zhang Wei
b517076907 Check args numbers before application start
Add a general args number validator for all client commands.

Signed-off-by: Zhang Wei <zhangwei555@huawei.com>
2016-11-29 11:18:51 +08:00
Wang Long
8676c75442 Fix the pid-file option for runc run/exec/create command
Signed-off-by: Wang Long <long.wanglong@huawei.com>
2016-11-02 14:08:32 +08:00
Aleksa Sarai
0636bdd45b Merge pull request #874 from crosbymichael/keyring
Add option to disable new session keys
2016-06-12 21:44:45 +10:00
Mrunal Patel
a753b06645 Replace github.com/codegangsta/cli by github.com/urfave/cli
The package got moved to a different repository

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2016-06-06 11:47:20 -07:00
Michael Crosby
8c9db3a7a5 Add option to disable new session keys
This adds an `--no-new-keyring` flag to run and create so that a new
session keyring is not created for the container and the calling
processes keyring is inherited.

Fixes #818

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-06-03 11:53:07 -07:00
Michael Crosby
6eba9b8ffb Fix SystemError and env lookup
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-05-31 11:10:47 -07:00
Michael Crosby
efcd73fb5b Fix signal handling for unit tests
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-05-31 11:10:47 -07:00
Michael Crosby
3fe7d7f31e Add create and start command for container lifecycle
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-05-31 11:06:41 -07:00
Michael Crosby
75fb70be01 Rename start to run
`runc run` is the command that will create and start a container in one
single command.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-05-31 11:06:41 -07:00
Doug Davis
714ae2acc9 Add a 'start' command
When any non-global-flag parameter appears on the command line make sure
there's a "command" even in the 'start' (run) case to ensure its not
ambiguous as to what the arg is. For example, w/o this fix its not
clear if
   runc foo
means 'foo' is the name of a config file or an unknown command.  Or worse,
you can't name a config file the same a ANY command, even future (yet to
be created) commands.

We should fix this now before we ship 1.0 and are forced to support this
ambiguous case for a long time.

Signed-off-by: Doug Davis <dug@us.ibm.com>
2015-08-21 15:26:34 -07:00
Fabio Kung
85f40c2bc7 container id is the cgroup name
Without this, multiple runc containers can accidentally share the same cgroup(s)
(and change each other's limits), when runc is invoked from the same directory
(i.e.: same cwd on multiple runc executions).

After these changes, each runc container will run on its own cgroup(s). Before,
the only workaround was to invoke runc from an unique (temporary?) cwd for each
container.

Common cgroup configuration (and hierarchical limits) can be set by having
multiple runc containers share the same cgroup parent, which is the cgroup of
the process executing runc.

Signed-off-by: Fabio Kung <fabio.kung@gmail.com>
2015-08-10 16:41:39 -07:00
Shishir Mahajan
27bfd1e2d1 systemd integration with container runtime for supporting sd_notify protocol
Signed-off-by: Shishir Mahajan <shishir.mahajan@redhat.com>
2015-07-22 15:53:26 -04:00
Jin-Hwan Jeong
cbee9e5050 wrong grammar: should never been --> should have never been 2015-07-08 16:55:23 +09:00
Michael Crosby
f4c35e70d1 Depend on Spec types from specs repository
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2015-07-02 13:45:27 -07:00
Marianna
5aa82c950d Enable build on unsupported platforms
Should compile now without errors but changes needed to be added for each system so it actually works.
main_unsupported.go is a new file with all the unsupported commands
Fixes #9

Signed-off-by: Marianna <mtesselh@gmail.com>
2015-06-29 17:03:44 -07:00
Michael Crosby
b2d9d99610 Only define a single process
This removes the Processes slice and only allows for one process of the
container.  It also renames TTY to Terminal for a cross platform
meaning.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2015-06-29 13:30:35 -07:00
Michael Crosby
1fa65466ea Move linux specific options to subsection
This moves the linux specific options into a "linux" {} section on the
config.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2015-06-29 13:30:35 -07:00
Phil Estes
9c56596f24 Make startup errors a bit friendlier
A couple minor changes to error handling in startup:
1. Don't dump full help/usage text when the only problem is `runc` wasn't started under
root privileges
2. Check for rootfs and make error clear to user when it doesn't exist
3. Change fatal to logrus.Fatal to get nicer output with simple error
message

Docker-DCO-1.1-Signed-off-by: Phil Estes <estesp@linux.vnet.ibm.com> (github: estesp)
2015-06-25 00:19:44 -07:00
Michael Crosby
9fac183294 Initial commit of runc binary
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2015-06-21 19:34:13 -07:00