This helper was added for runc-dmz in commit dac417174, but runc-dmz was
later removed in commit 871057d, which forgot to remove the helper.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
filepath-securejoin has a bunch of extra hardening features and is very
well-tested, so we should use it instead of our own homebrew solution.
A lot of rootfs_linux.go callers pass a SecureJoin'd path, which means
we need to keep the wrapper helpers in utils, but at least the core
logic is no longer in runc. In future we will want to remove this dodgy
logic and just use file handles for everything (using libpathrs,
ideally).
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
While we use SecureJoin to try to make all of our target paths inside
the container safe, SecureJoin is not safe against an attacker than can
change the path after we "resolve" it.
os.MkdirAll can inadvertently follow symlinks and thus an attacker could
end up tricking runc into creating empty directories on the host (note
that the container doesn't get access to these directories, and the host
just sees empty directories). However, this could potentially cause DoS
issues by (for instance) creating a directory in a conf.d directory for
a daemon that doesn't handle subdirectories properly.
In addition, the handling for creating file bind-mounts did a plain
open(O_CREAT) on the SecureJoin'd path, which is even more obviously
unsafe (luckily we didn't use O_TRUNC, or this bug could've allowed an
attacker to cause data loss...). Regardless of the symlink issue,
opening an untrusted file could result in a DoS if the file is a hung
tty or some other "nasty" file. We can use mknodat to safely create a
regular file without opening anything anyway (O_CREAT|O_EXCL would also
work but it makes the logic a bit more complicated, and we don't want to
open the file for any particular reason anyway).
libpathrs[1] is the long-term solution for these kinds of problems, but
for now we can patch this particular issue by creating a more restricted
MkdirAll that refuses to resolve symlinks and does the creation using
file descriptors. This is loosely based on a more secure version that
filepath-securejoin now has[2] and will be added to libpathrs soon[3].
[1]: https://github.com/openSUSE/libpathrs
[2]: https://github.com/cyphar/filepath-securejoin/releases/tag/v0.3.0
[3]: https://github.com/openSUSE/libpathrs/issues/10
Fixes: CVE-2024-45310
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Go 1.23 tightens access to internal symbols, and even puts runc into
"hall of shame" for using an internal symbol (recently added by commit
da68c8e3). So, while not impossible, it becomes harder to access those
internal symbols, and it is a bad idea in general.
Since Go 1.23 includes https://go.dev/cl/588076, we can clean the
internal rlimit cache by setting the RLIMIT_NOFILE for ourselves,
essentially disabling the rlimit cache.
Once Go 1.22 is no longer supported, we will remove the go:linkname hack.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
As reported in issue #4195, the new version(since 1.19) of go runtime
will cache rlimit-nofile. Before executing execve, the rlimit-nofile
of the process will be restored with the cache. In runc, this will
cause the rlimit-nofile set by the parent process for the container
to become invalid. It can be solved by clearing the cache.
Signed-off-by: ls-ggg <335814617@qq.com>
(cherry picked from commit f9f8abf310)
Signed-off-by: lifubang <lifubang@acmcoder.com>
The idea is to remove the need for cloning the entire runc binary by
replacing the final execve() call of the container process with an
execve() call to a clone of a small C binary which just does an execve()
of its arguments.
This provides similar protection against CVE-2019-5736 but without
requiring a >10MB binary copy for each "runc init". When compiled with
musl, runc-dmz is 13kB (though unfortunately with glibc, it is 1.1MB
which is still quite large).
It should be noted that there is still a window where the container
processes could get access to the host runc binary, but because we set
ourselves as non-dumpable the container would need CAP_SYS_PTRACE (which
is not enabled by default in Docker) in order to get around the
proc_fd_access_allowed() checks. In addition, since Linux 4.10[1] the
kernel blocks access entirely for user namespaced containers in this
scenario. For those cases we cannot use runc-dmz, but most containers
won't have this issue.
This new runc-dmz binary can be opted out of at compile time by setting
the "runc_nodmz" buildtag, and at runtime by setting the RUNC_DMZ=legacy
environment variable. In both cases, runc will fall back to the classic
/proc/self/exe-based cloning trick. If /proc/self/exe is already a
sealed memfd (namely if the user is using contrib/cmd/memfd-bind to
create a persistent sealed memfd for runc), neither runc-dmz nor
/proc/self/exe cloning will be used because they are not necessary.
[1]: bfedb58925
Co-authored-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: lifubang <lifubang@acmcoder.com>
[cyphar: address various review nits]
[cyphar: fix runc-dmz cross-compilation]
[cyphar: embed runc-dmz into runc binary and clone in Go code]
[cyphar: make runc-dmz optional, with fallback to /proc/self/exe cloning]
[cyphar: do not use runc-dmz when the container has certain privs]
Co-authored-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
This allow us to remove the amount of C code in runc quite
substantially, as well as removing a whole execve(2) from the nsexec
path because we no longer spawn "runc init" only to re-exec "runc init"
after doing the clone.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
golangci-lint v1.54.2 comes with errorlint v1.4.4, which contains
the fix [1] whitelisting all errno comparisons for errors coming from
x/sys/unix.
Thus, these annotations are no longer necessary. Hooray!
[1] https://github.com/polyfloyd/go-errorlint/pull/47
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
When starting a new container, and the very last step of executing of a
user process fails (last lines of (*linuxStandardInit).Init), it is too
late to print a proper error since both the log pipe and the init pipe
are closed.
This is partially mitigated by using exec.LookPath() which is supposed
to say whether we will be able to execute or not. Alas, it fails to do
so when the binary to be executed resides on a filesystem mounted with
noexec flag.
A workaround would be to use access(2) with X_OK flag. Alas, it is not
working when runc itself is a setuid (or setgid) binary. In this case,
faccessat2(2) with AT_EACCESS can be used, but it is only available
since Linux v5.8.
So, use faccessat2(2) with AT_EACCESS if available. If not, fall back to
access(2) for non-setuid runc, and do nothing for setuid runc (as there
is nothing we can do). Note that this check if in addition to whatever
exec.LookPath does.
Fixes https://github.com/opencontainers/runc/issues/3520
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
If the container binary to be run is removed in between runc create
and runc start, the latter spits the following error:
> can't exec user process: no such file or directory
This is a bit confusing since we don't see what file is missing.
Wrap the unix.Exec error into os.PathError, like in many other cases,
to provide some context. Remove the error wrapping from
(*linuxStandardInit).Init as it is now redundant.
With this patch, the error is now:
> exec /bin/false: no such file or directory
Reported-by: Daniel J Walsh <dwalsh@redhat.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Go 1.17 introduce this new (and better) way to specify build tags.
For more info, see https://golang.org/design/draft-gobuild.
As a way to seamlessly switch from old to new build tags, gofmt (and
gopls) from go 1.17 adds the new tags along with the old ones.
Later, when go < 1.17 is no longer supported, the old build tags
can be removed.
Now, as I started to use latest gopls (v0.7.1), it adds these tags
while I edit. Rather than to randomly add new build tags, I guess
it is better to do it once for all files.
Mind that previous commits removed some tags that were useless,
so this one only touches packages that can at least be built
on non-linux.
Brought to you by
go1.17 fmt ./...
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
When running a script from an azure file share interrupted syscall
occurs quite frequently, to remedy this add retries around execve
syscall, when EINTR is returned.
Signed-off-by: Maksim An <maksiman@microsoft.com>
Moving these utilities to a separate package, so that consumers of this
package don't have to pull in the whole "system" package.
Looking at uses of these utilities (outside of runc itself);
`RunningInUserNS()` is used by [various external consumers][1],
so adding a "Deprecated" alias for this.
[1]: https://grep.app/search?current=2&q=.RunningInUserNS
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
In many places (not all of them though) we can use `unix.`
instead of `syscall.` as these are indentical.
In particular, x/sys/unix defines:
```go
type Signal = syscall.Signal
type Errno = syscall.Errno
type SysProcAttr = syscall.SysProcAttr
const ENODEV = syscall.Errno(0x13)
```
and unix.Exec() calls syscall.Exec().
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This fixes the following compilation error on 32bit ARM:
```
$ GOARCH=arm GOARCH=6 go build ./libcontainer/system/
libcontainer/system/linux.go:119:89: constant 4294967295 overflows int
```
Signed-off-by: Tibor Vass <tibor@docker.com>
When a subreaper is enabled, it might expect to reap a process and
retrieve its exit code. That's the reason why this patch is giving
the possibility to define the usage of a subreaper as a consumer of
libcontainer. Relying on this information, libcontainer will not
wait for signalled processes in case a subreaper has been set.
Fixes#1677
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Use IoctlGetInt and IoctlGetTermios/IoctlSetTermios instead of manually
reimplementing them.
Because of unlockpt, the ioctl wrapper is still needed as it needs to
pass a pointer to a value, which is not supported by any ioctl function
in x/sys/unix yet.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Use unix.Prctl() instead of manually reimplementing it using
unix.RawSyscall. Also use unix.SECCOMP_MODE_FILTER instead of locally
defining it.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Since syscall is outdated and broken for some architectures,
use x/sys/unix instead.
There are still some dependencies on the syscall package that will
remain in syscall for the forseeable future:
Errno
Signal
SysProcAttr
Additionally:
- os still uses syscall, so it needs to be kept for anything
returning *os.ProcessState, such as process.Wait.
Signed-off-by: Christy Perez <christy@linux.vnet.ibm.com>
No substantial code change.
Note that some style errors reported by `golint` are not fixed due to possible compatibility issues.
Signed-off-by: Akihiro Suda <suda.kyoto@gmail.com>
Fixes#680
This changes setupRlimit to use the Prlimit syscall (rather than
Setrlimit) and moves the call to the parent process. This is necessary
because Setrlimit would affect the libcontainer consumer if called in
the parent, and would fail if called from the child if the
child process is in a user namespace and the requested rlimit is higher
than that in the parent.
Signed-off-by: Julian Friedman <julz.friedman@uk.ibm.com>
We need to make sure the container is destroyed before closing the stdio
for the container. This becomes a big issues when running in the host's
pid namespace because the other processes could have inherited the stdio
of the initial process. The call to close will just block as they still
have the io open.
Calling destroy before closing io, especially in the host pid namespace
will cause all additional processes to be killed in the container's
cgroup. This will allow the io to be closed successfuly.
This change makes sure the order for destroy and close is correct as
well as ensuring that if any errors encoutered during start or exec will
be handled by terminating the process and destroying the container. We
cannot use defers here because we need to enforce the correct ordering
on destroy.
This also sets the subreaper setting for runc so that when running in
pid host, runc can wait on the addiontal processes launched by the
container, useful on destroy, but also good for reaping the additional
processes that were launched.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
When we launch a container in a new user namespace, we cannot create
devices, so we bind mount the host's devices into place instead.
If we are running in a user namespace (i.e. nested in a container),
then we need to do the same thing. Add a function to detect that
and check for it before doing mknod.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
---
Changelog - add a comment clarifying what's going on with the
uidmap file.