Commit Graph

39 Commits

Author SHA1 Message Date
Kir Kolyshkin
83350c24a9 libct/system: rm Fexecve
This helper was added for runc-dmz in commit dac417174, but runc-dmz was
later removed in commit 871057d, which forgot to remove the helper.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-01-03 13:57:05 -08:00
Aleksa Sarai
dd827f7b71 utils: switch to securejoin.MkdirAllHandle
filepath-securejoin has a bunch of extra hardening features and is very
well-tested, so we should use it instead of our own homebrew solution.

A lot of rootfs_linux.go callers pass a SecureJoin'd path, which means
we need to keep the wrapper helpers in utils, but at least the core
logic is no longer in runc. In future we will want to remove this dodgy
logic and just use file handles for everything (using libpathrs,
ideally).

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2024-09-03 23:06:47 +10:00
Aleksa Sarai
63c2908164 rootfs: try to scope MkdirAll to stay inside the rootfs
While we use SecureJoin to try to make all of our target paths inside
the container safe, SecureJoin is not safe against an attacker than can
change the path after we "resolve" it.

os.MkdirAll can inadvertently follow symlinks and thus an attacker could
end up tricking runc into creating empty directories on the host (note
that the container doesn't get access to these directories, and the host
just sees empty directories). However, this could potentially cause DoS
issues by (for instance) creating a directory in a conf.d directory for
a daemon that doesn't handle subdirectories properly.

In addition, the handling for creating file bind-mounts did a plain
open(O_CREAT) on the SecureJoin'd path, which is even more obviously
unsafe (luckily we didn't use O_TRUNC, or this bug could've allowed an
attacker to cause data loss...). Regardless of the symlink issue,
opening an untrusted file could result in a DoS if the file is a hung
tty or some other "nasty" file. We can use mknodat to safely create a
regular file without opening anything anyway (O_CREAT|O_EXCL would also
work but it makes the logic a bit more complicated, and we don't want to
open the file for any particular reason anyway).

libpathrs[1] is the long-term solution for these kinds of problems, but
for now we can patch this particular issue by creating a more restricted
MkdirAll that refuses to resolve symlinks and does the creation using
file descriptors. This is loosely based on a more secure version that
filepath-securejoin now has[2] and will be added to libpathrs soon[3].

[1]: https://github.com/openSUSE/libpathrs
[2]: https://github.com/cyphar/filepath-securejoin/releases/tag/v0.3.0
[3]: https://github.com/openSUSE/libpathrs/issues/10

Fixes: CVE-2024-45310
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2024-09-03 02:34:13 +10:00
Sebastiaan van Stijn
c14213399a remove pre-go1.17 build-tags
Removed pre-go1.17 build-tags with go fix;

    go fix -mod=readonly ./...

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2024-06-29 15:45:25 +02:00
Kir Kolyshkin
584afc6756 libct/system: ClearRlimitNofileCache for go 1.23
Go 1.23 tightens access to internal symbols, and even puts runc into
"hall of shame" for using an internal symbol (recently added by commit
da68c8e3). So, while not impossible, it becomes harder to access those
internal symbols, and it is a bad idea in general.

Since Go 1.23 includes https://go.dev/cl/588076, we can clean the
internal rlimit cache by setting the RLIMIT_NOFILE for ourselves,
essentially disabling the rlimit cache.

Once Go 1.22 is no longer supported, we will remove the go:linkname hack.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-06-01 13:02:29 -07:00
lifubang
a35f7d8093 fix comments for ClearRlimitNofileCache
Signed-off-by: lifubang <lifubang@acmcoder.com>
2024-05-16 20:34:43 +08:00
ls-ggg
da68c8e37b libct: clean cached rlimit nofile in go runtime
As reported in issue #4195, the new version(since 1.19) of go runtime
will cache rlimit-nofile. Before executing execve, the rlimit-nofile
of the process will be restored with the cache. In runc, this will
cause the rlimit-nofile set by the parent process for the container
to become invalid. It can be solved by clearing the cache.

Signed-off-by: ls-ggg <335814617@qq.com>
(cherry picked from commit f9f8abf310)
Signed-off-by: lifubang <lifubang@acmcoder.com>
2024-05-08 10:40:13 +00:00
Kir Kolyshkin
dbd0c3349f libct/system: rm Execv
This is not used since commit dac41717.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-05-07 14:09:22 -07:00
Zheao.Li
98511bb40e linux: Support setting execution domain via linux personality
carry #3126

Co-authored-by: Aditya R <arajan@redhat.com>
Signed-off-by: Zheao.Li <me@manjusaka.me>
2023-10-27 19:33:37 +08:00
Aleksa Sarai
90c8d36afe dmz: use sendfile(2) when cloning /proc/self/exe
This results in a 5-20% speedup of dmz.CloneBinary(), depending on the
machine.

io.Copy:

  goos: linux
  goarch: amd64
  pkg: github.com/opencontainers/runc/libcontainer/dmz
  cpu: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
  BenchmarkCloneBinary
  BenchmarkCloneBinary-8               139           8075074 ns/op
  PASS
  ok      github.com/opencontainers/runc/libcontainer/dmz 2.286s

unix.Sendfile:

  goos: linux
  goarch: amd64
  pkg: github.com/opencontainers/runc/libcontainer/dmz
  cpu: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
  BenchmarkCloneBinary
  BenchmarkCloneBinary-8               192           6382121 ns/op
  PASS
  ok      github.com/opencontainers/runc/libcontainer/dmz 2.415s

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-09-22 15:51:36 +10:00
lifubang
dac4171746 runc-dmz: reduce memfd binary cloning cost with small C binary
The idea is to remove the need for cloning the entire runc binary by
replacing the final execve() call of the container process with an
execve() call to a clone of a small C binary which just does an execve()
of its arguments.

This provides similar protection against CVE-2019-5736 but without
requiring a >10MB binary copy for each "runc init". When compiled with
musl, runc-dmz is 13kB (though unfortunately with glibc, it is 1.1MB
which is still quite large).

It should be noted that there is still a window where the container
processes could get access to the host runc binary, but because we set
ourselves as non-dumpable the container would need CAP_SYS_PTRACE (which
is not enabled by default in Docker) in order to get around the
proc_fd_access_allowed() checks. In addition, since Linux 4.10[1] the
kernel blocks access entirely for user namespaced containers in this
scenario. For those cases we cannot use runc-dmz, but most containers
won't have this issue.

This new runc-dmz binary can be opted out of at compile time by setting
the "runc_nodmz" buildtag, and at runtime by setting the RUNC_DMZ=legacy
environment variable. In both cases, runc will fall back to the classic
/proc/self/exe-based cloning trick. If /proc/self/exe is already a
sealed memfd (namely if the user is using contrib/cmd/memfd-bind to
create a persistent sealed memfd for runc), neither runc-dmz nor
/proc/self/exe cloning will be used because they are not necessary.

[1]: bfedb58925

Co-authored-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: lifubang <lifubang@acmcoder.com>
[cyphar: address various review nits]
[cyphar: fix runc-dmz cross-compilation]
[cyphar: embed runc-dmz into runc binary and clone in Go code]
[cyphar: make runc-dmz optional, with fallback to /proc/self/exe cloning]
[cyphar: do not use runc-dmz when the container has certain privs]
Co-authored-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-09-22 15:38:19 +10:00
Aleksa Sarai
0e9a3358f8 nsexec: migrate memfd /proc/self/exe logic to Go code
This allow us to remove the amount of C code in runc quite
substantially, as well as removing a whole execve(2) from the nsexec
path because we no longer spawn "runc init" only to re-exec "runc init"
after doing the clone.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-09-22 15:13:18 +10:00
Kir Kolyshkin
f62f0bdfbf Remove nolint annotations for unix errno comparisons
golangci-lint v1.54.2 comes with errorlint v1.4.4, which contains
the fix [1] whitelisting all errno comparisons for errors coming from
x/sys/unix.

Thus, these annotations are no longer necessary. Hooray!

[1] https://github.com/polyfloyd/go-errorlint/pull/47
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-08-24 17:28:10 -07:00
Kir Kolyshkin
8491d33482 Fix runc run "permission denied" when rootless
Since commit 957d97bcf4 was made to fix issue [7],
a few things happened:

- a similar functionality appeared in go 1.20 [1], so the issue
  mentioned in the comment (being removed) is no longer true;
- a bug in runc was found [2], which also affects go [3];
- the bug was fixed in go 1.21 [4] and 1.20.2 [5];
- a similar fix was made to x/sys/unix.Faccessat [6].

The essense of [2] is, even if a (non-root) user that the container is
run as does not have execute permission bit set for the executable, it
should still work in case runc has the CAP_DAC_OVERRIDE capability set.

To fix this [2] without reintroducing the older bug [7]:
- drop own Eaccess implementation;
- use the one from x/sys/unix for Go 1.19 (depends on [6]);
- do not use anything when Go 1.20+ is used.

NOTE it is virtually impossible to fix the bug [2] when Go 1.20 or Go
1.20.1 is used because of [3].

A test case is added by a separate commit.

Fixes: #3715.

[1] https://go-review.googlesource.com/c/go/+/414824
[2] https://github.com/opencontainers/runc/issues/3715
[3] https://go.dev/issue/58552
[4] https://go-review.googlesource.com/c/go/+/468735
[5] https://go-review.googlesource.com/c/go/+/469956
[6] https://go-review.googlesource.com/c/sys/+/468877
[7] https://github.com/opencontainers/runc/issues/3520

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-03-27 15:15:48 -07:00
Kir Kolyshkin
957d97bcf4 Fix error from runc run on noexec fs
When starting a new container, and the very last step of executing of a
user process fails (last lines of (*linuxStandardInit).Init), it is too
late to print a proper error since both the log pipe and the init pipe
are closed.

This is partially mitigated by using exec.LookPath() which is supposed
to say whether we will be able to execute or not. Alas, it fails to do
so when the binary to be executed resides on a filesystem mounted with
noexec flag.

A workaround would be to use access(2) with X_OK flag. Alas, it is not
working when runc itself is a setuid (or setgid) binary. In this case,
faccessat2(2) with AT_EACCESS can be used, but it is only available
since Linux v5.8.

So, use faccessat2(2) with AT_EACCESS if available. If not, fall back to
access(2) for non-setuid runc, and do nothing for setuid runc (as there
is nothing we can do). Note that this check if in addition to whatever
exec.LookPath does.

Fixes https://github.com/opencontainers/runc/issues/3520

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-07-01 10:02:42 -07:00
Kir Kolyshkin
db4ad6a7f1 libcontainer/system: rm Prlimit
It is now available from golang.org/x/sys/unix
(https://go-review.googlesource.com/c/sys/+/332029)

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-11-11 19:57:56 -08:00
Kir Kolyshkin
794cd66df8 libct/system: Exec: wrap the error
If the container binary to be run is removed in between runc create
and runc start, the latter spits the following error:

> can't exec user process: no such file or directory

This is a bit confusing since we don't see what file is missing.

Wrap the unix.Exec error into os.PathError, like in many other cases,
to provide some context. Remove the error wrapping from
(*linuxStandardInit).Init as it is now redundant.

With this patch, the error is now:

> exec /bin/false: no such file or directory

Reported-by: Daniel J Walsh <dwalsh@redhat.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-10-07 11:09:08 -07:00
Kir Kolyshkin
d8da00355e *: add go-1.17+ go:build tags
Go 1.17 introduce this new (and better) way to specify build tags.
For more info, see https://golang.org/design/draft-gobuild.

As a way to seamlessly switch from old to new build tags, gofmt (and
gopls) from go 1.17 adds the new tags along with the old ones.

Later, when go < 1.17 is no longer supported, the old build tags
can be removed.

Now, as I started to use latest gopls (v0.7.1), it adds these tags
while I edit. Rather than to randomly add new build tags, I guess
it is better to do it once for all files.

Mind that previous commits removed some tags that were useless,
so this one only touches packages that can at least be built
on non-linux.

Brought to you by

        go1.17 fmt ./...

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-08-30 20:58:22 -07:00
Maksim An
e39ad65059 retry unix.EINTR for container init process
When running a script from an azure file share interrupted syscall
occurs quite frequently, to remedy this add retries around execve
syscall, when EINTR is returned.

Signed-off-by: Maksim An <maksiman@microsoft.com>
2021-06-30 22:22:31 -07:00
Sebastiaan van Stijn
4316df8b53 libcontainer/system: move userns utilities to separate package
Moving these utilities to a separate package, so that consumers of this
package don't have to pull in the whole "system" package.

Looking at uses of these utilities (outside of runc itself);

`RunningInUserNS()` is used by [various external consumers][1],
so adding a "Deprecated" alias for this.

[1]: https://grep.app/search?current=2&q=.RunningInUserNS

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-04-04 22:42:03 +02:00
Sebastiaan van Stijn
e7fd383bce libcontainer/system: un-export UIDMapInUserNS()
`UIDMapInUserNS()` is not used anywhere, only internally. so un-export it.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-04-04 22:42:02 +02:00
Sebastiaan van Stijn
249356a1a4 libcontainer/system: remove unused GetParentNSeuid()
This function was added in f103de57ec, but no
longer used since 06f789cf26 (v1.0.0-rc6)

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-04-04 22:41:59 +02:00
Sebastiaan van Stijn
9df0b5e268 libcontainer: RunningInUserNS() use sync.Once
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-05-04 15:53:33 +02:00
Kenta Tada
4474795388 libcontainer: use x/sys/unix instead of the hardcoded value
PR_SET_CHILD_SUBREAPER is defined in x/sys/unix.

Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>
2020-04-23 10:49:51 +09:00
Kir Kolyshkin
af6b9e7fa9 nit: do not use syscall package
In many places (not all of them though) we can use `unix.`
instead of `syscall.` as these are indentical.

In particular, x/sys/unix defines:

```go
type Signal = syscall.Signal
type Errno = syscall.Errno
type SysProcAttr = syscall.SysProcAttr

const ENODEV      = syscall.Errno(0x13)
```

and unix.Exec() calls syscall.Exec().

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-18 16:16:49 -07:00
Tibor Vass
c205e9fb64 libcontainer: fix compilation on GOARCH=arm GOARM=6 (32 bits)
This fixes the following compilation error on 32bit ARM:
```
$ GOARCH=arm GOARCH=6 go build ./libcontainer/system/
libcontainer/system/linux.go:119:89: constant 4294967295 overflows int
```

Signed-off-by: Tibor Vass <tibor@docker.com>
2018-06-14 18:33:14 +00:00
Akihiro Suda
f103de57ec main: support rootless mode in userns
Running rootless containers in userns is useful for mounting
filesystems (e.g. overlay) with mapped euid 0, but without actual root
privilege.

Usage: (Note that `unshare --mount` requires `--map-root-user`)

  user$ mkdir lower upper work rootfs
  user$ curl http://dl-cdn.alpinelinux.org/alpine/v3.7/releases/x86_64/alpine-minirootfs-3.7.0-x86_64.tar.gz | tar Cxz ./lower || ( true; echo "mknod errors were ignored" )
  user$ unshare --mount --map-root-user
  mappedroot# runc spec --rootless
  mappedroot# sed -i 's/"readonly": true/"readonly": false/g' config.json
  mappedroot# mount -t overlay -o lowerdir=./lower,upperdir=./upper,workdir=./work overlayfs ./rootfs
  mappedroot# runc run foo

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-05-10 12:16:43 +09:00
Akihiro Suda
9c7d8bc1fd libcontainer: add parser for /etc/sub{u,g}id and /proc/PID/{u,g}id_map
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-05-10 12:16:43 +09:00
Sebastien Boeuf
bb912eb00c libcontainer: Do not wait for signalled processes if subreaper is set
When a subreaper is enabled, it might expect to reap a process and
retrieve its exit code. That's the reason why this patch is giving
the possibility to define the usage of a subreaper as a consumer of
libcontainer. Relying on this information, libcontainer will not
wait for signalled processes in case a subreaper has been set.

Fixes #1677

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2017-12-14 10:37:38 -08:00
Tobias Klauser
078e903296 libcontainer: use ioctl wrappers from x/sys/unix
Use IoctlGetInt and IoctlGetTermios/IoctlSetTermios instead of manually
reimplementing them.

Because of unlockpt, the ioctl wrapper is still needed as it needs to
pass a pointer to a value, which is not supported by any ioctl function
in x/sys/unix yet.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-07-10 10:56:58 +02:00
Tobias Klauser
a380fae959 libcontainer: use Prctl() from x/sys/unix
Use unix.Prctl() instead of manually reimplementing it using
unix.RawSyscall. Also use unix.SECCOMP_MODE_FILTER instead of locally
defining it.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-07-10 10:56:58 +02:00
Tobias Klauser
553016d7da Use Prctl() from x/sys/unix instead of own wrapper
Use unix.Prctl() instead of reimplemnting it as system.Prctl().

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-06-07 15:03:15 +02:00
Christy Perez
3d7cb4293c Move libcontainer to x/sys/unix
Since syscall is outdated and broken for some architectures,
use x/sys/unix instead.

There are still some dependencies on the syscall package that will
remain in syscall for the forseeable future:

Errno
Signal
SysProcAttr

Additionally:
- os still uses syscall, so it needs to be kept for anything
returning *os.ProcessState, such as process.Wait.

Signed-off-by: Christy Perez <christy@linux.vnet.ibm.com>
2017-05-22 17:35:20 -05:00
Akihiro Suda
1829531241 Fix trivial style errors reported by go vet and golint
No substantial code change.
Note that some style errors reported by `golint` are not fixed due to possible compatibility issues.

Signed-off-by: Akihiro Suda <suda.kyoto@gmail.com>
2016-04-12 08:13:16 +00:00
Julian Friedman
e91b2b8aca Set rlimits using prlimit in parent
Fixes #680

This changes setupRlimit to use the Prlimit syscall (rather than
Setrlimit) and moves the call to the parent process. This is necessary
because Setrlimit would affect the libcontainer consumer if called in
the parent, and would fail if called from the child if the
child process is in a user namespace and the requested rlimit is higher
than that in the parent.

Signed-off-by: Julian Friedman <julz.friedman@uk.ibm.com>
2016-03-25 15:11:44 +00:00
Michael Crosby
fdb100d247 Destroy container along with processes before stdio
We need to make sure the container is destroyed before closing the stdio
for the container.  This becomes a big issues when running in the host's
pid namespace because the other processes could have inherited the stdio
of the initial process.  The call to close will just block as they still
have the io open.

Calling destroy before closing io, especially in the host pid namespace
will cause all additional processes to be killed in the container's
cgroup.  This will allow the io to be closed successfuly.

This change makes sure the order for destroy and close is correct as
well as ensuring that if any errors encoutered during start or exec will
be handled by terminating the process and destroying the container.  We
cannot use defers here because we need to enforce the correct ordering
on destroy.

This also sets the subreaper setting for runc so that when running in
pid host, runc can wait on the addiontal processes launched by the
container, useful on destroy, but also good for reaping the additional
processes that were launched.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-03-15 13:17:11 -07:00
Mrunal Patel
38b39645d9 Implement NoNewPrivileges support in libcontainer
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2016-02-16 06:57:50 -08:00
Serge Hallyn
c0ad40c5e6 Do not create devices when in user namespace
When we launch a container in a new user namespace, we cannot create
devices, so we bind mount the host's devices into place instead.

If we are running in a user namespace (i.e. nested in a container),
then we need to do the same thing.  Add a function to detect that
and check for it before doing mknod.

Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
---
 Changelog - add a comment clarifying what's going on with the
	     uidmap file.
2016-01-08 12:54:08 -08:00
Michael Crosby
8f97d39dd2 Move libcontainer into subdirectory
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2015-06-21 19:29:15 -07:00