This simplifies the code flow and basically removes the last
filepath.Clean, which is not necessary in either case:
- for absolute path, single filepath.Clean is enough (as it is
guaranteed to remove all dot and dot-dot elements);
- for relative path, filepath.Rel calls Clean at the end
(which is even documented).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Using strings.CutPrefix (available since Go 1.20) instead of
strings.HasPrefix and/or strings.TrimPrefix makes the code
a tad more straightforward.
No functional change.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Using strings.HasPrefix with strings.TrimPrefix results in doing the
same thing (checking if prefix exists) twice. In this case, using
strings.TrimPrefix right away is sufficient.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
It is available since Go 1.21 and is defined during compile time
(i.e. based on GOARCH during build).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
With the idmap work, we will have a tainted Go thread in our
thread-group that has a different mount namespace to the other threads.
It seems that (due to some bad luck) the Go scheduler tends to make this
thread the thread-group leader in our tests, which results in very
baffling failures where /proc/self/mountinfo produces gibberish results.
In order to avoid this, switch to using /proc/thread-self for everything
that is thread-local. This primarily includes switching all file
descriptor paths (CLONE_FS), all of the places that check the current
cgroup (technically we never will run a single runc thread in a separate
cgroup, but better to be safe than sorry), and the aforementioned
mountinfo code. We don't need to do anything for the following because
the results we need aren't thread-local:
* Checks that certain namespaces are supported by stat(2)ing
/proc/self/ns/...
* /proc/self/exe and /proc/self/cmdline are not thread-local.
* While threads can be in different cgroups, we do not do this for the
runc binary (or libcontainer) and thus we do not need to switch to
the thread-local version of /proc/self/cgroups.
* All of the CLONE_NEWUSER files are not thread-local because you
cannot set the usernamespace of a single thread (setns(CLONE_NEWUSER)
is blocked for multi-threaded programs).
Note that we have to use runtime.LockOSThread when we have an open
handle to a tid-specific procfs file that we are operating on multiple
times. Go can reschedule us such that we are running on a different
thread and then kill the original thread (causing -ENOENT or similarly
confusing errors). This is not strictly necessary for most usages of
/proc/thread-self (such as using /proc/thread-self/fd/$n directly) since
only operating on the actual inodes associated with the tid requires
this locking, but because of the pre-3.17 fallback for CentOS, we have
to do this in most cases.
In addition, CentOS's kernel is too old for /proc/thread-self, which
requires us to emulate it -- however in rootfs_linux.go, we are in the
container pid namespace but /proc is the host's procfs. This leads to
the incredibly frustrating situation where there is no way (on pre-4.1
Linux) to figure out which /proc/self/task/... entry refers to the
current tid. We can just use /proc/self in this case.
Yes this is all pretty ugly. I also wish it wasn't necessary.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
We have different requirements for the initial configuration and
initWaiter pipe (just send netlink and JSON blobs with no complicated
handling needed for message coalescing) and the packet-based
synchronisation pipe.
Tests with switching everything to SOCK_SEQPACKET lead to endless issues
with runc hanging on start-up because random things would try to do
short reads (which SOCK_SEQPACKET will not allow and the Go stdlib
explicitly treats as a streaming source), so splitting it was the only
reasonable solution. Even doing somewhat dodgy tricks such as adding a
Read() wrapper which actually calls ReadPacket() and makes it seem like
a stream source doesn't work -- and is a bit too magical.
One upside is that doing it this way makes the difference between the
modes clearer -- INITPIPE is still used for initWaiter syncrhonisation
but aside from that all other synchronisation is done by SYNCPIPE.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Using strings.Split generates temporary strings for GC to collect.
Rewrite the function to not do that.
Also, add a second return value, so that the caller can distinguish
between an empty value found and no key found cases.
Fix the test accordingly.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Since commit 8850636eb3 (February 2015) this function is no longer
used (replaced by (*ConfigValidator).rootfs), so let's remove it,
together with its unit tests (which were added by commit 917c1f6d6 in
April 2016).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This should result in no change when the error is printed, but make the
errors returned unwrappable, meaning errors.As and errors.Is will work.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Because the target of a mount is inside a container (which may be a
volume that is shared with another container), there exists a race
condition where the target of the mount may change to a path containing
a symlink after we have sanitised the path -- resulting in us
inadvertently mounting the path outside of the container.
This is not immediately useful because we are in a mount namespace with
MS_SLAVE mount propagation applied to "/", so we cannot mount on top of
host paths in the host namespace. However, if any subsequent mountpoints
in the configuration use a subdirectory of that host path as a source,
those subsequent mounts will use an attacker-controlled source path
(resolved within the host rootfs) -- allowing the bind-mounting of "/"
into the container.
While arguably configuration issues like this are not entirely within
runc's threat model, within the context of Kubernetes (and possibly
other container managers that provide semi-arbitrary container creation
privileges to untrusted users) this is a legitimate issue. Since we
cannot block mounting from the host into the container, we need to block
the first stage of this attack (mounting onto a path outside the
container).
The long-term plan to solve this would be to migrate to libpathrs, but
as a stop-gap we implement libpathrs-like path verification through
readlink(/proc/self/fd/$n) and then do mount operations through the
procfd once it's been verified to be inside the container. The target
could move after we've checked it, but if it is inside the container
then we can assume that it is safe for the same reason that libpathrs
operations would be safe.
A slight wrinkle is the "copyup" functionality we provide for tmpfs,
which is the only case where we want to do a mount on the host
filesystem. To facilitate this, I split out the copy-up functionality
entirely so that the logic isn't interspersed with the regular tmpfs
logic. In addition, all dependencies on m.Destination being overwritten
have been removed since that pattern was just begging to be a source of
more mount-target bugs (we do still have to modify m.Destination for
tmpfs-copyup but we only do it temporarily).
Fixes: CVE-2021-30465
Reported-by: Etienne Champetier <champetier.etienne@gmail.com>
Co-authored-by: Noah Meyerhans <nmeyerha@amazon.com>
Reviewed-by: Samuel Karp <skarp@amazon.com>
Reviewed-by: Kir Kolyshkin <kolyshkin@gmail.com> (@kolyshkin)
Reviewed-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Having -EPERM is the default was a fairly significant mistake from a
future-proofing standpoint in that it makes any new syscall return a
non-ignorable error (from glibc's point of view). We need to correct
this now because faccessat2(2) is something glibc critically needs to
have support for, but they're blocked on container runtimes because we
return -EPERM unconditionally (leading to confusion in glibc). This is
also a problem we're probably going to keep running into in the future.
Unfortunately there are several issues which stop us from having a clean
solution to this problem:
1. libseccomp has several limitations which require us to emulate
behaviour we want:
a. We cannot do logic based on syscall number, meaning we cannot
specify a "largest known syscall number";
b. libseccomp doesn't know in which kernel version a syscall was
added, and has no API for "minimum kernel version" so we cannot
simply ask libseccomp to generate sane -ENOSYS rules for us.
c. Additional seccomp rules for the same syscall are not treated as
distinct rules -- if rules overlap, seccomp will merge them. This
means we cannot add per-syscall -EPERM fallbacks;
d. There is no inverse operation for SCMP_CMP_MASKED_EQ;
e. libseccomp does not allow you to specify multiple rules for a
single argument, making it impossible to invert OR rules for
arguments.
2. The runtime-spec does not have any way of specifying:
a. The errno for the default action;
b. The minimum kernel version or "newest syscall at time of profile
creation"; nor
c. Which syscalls were intentionally excluded from the allow list
(weird syscalls that are no longer used were excluded entirely,
but Docker et al expect those syscalls to get EPERM not ENOSYS).
3. Certain syscalls should not return -ENOSYS (especially only for
certain argument combinations) because this could also trigger glibc
confusion. This means we have to return -EPERM for certain syscalls
but not as a global default.
4. There is not an obvious (and reasonable) upper limit to syscall
numbers, so we cannot create a set of rules for each syscall above
the largest syscall number in libseccomp. This means we must handle
inverse rules as described below.
5. Any syscall can be specified multiple times, which can make
generation of hotfix rules much harder.
As a result, we have to work around all of these things by coming up
with a heuristic to stop the bleeding. In the future we could hopefully
improve the situation in the runtime-spec and libseccomp.
The solution applied here is to prepend a "stub" filter which returns
-ENOSYS if the requested syscall has a larger syscall number than any
syscall mentioned in the filter. The reason for this specific rule is
that syscall numbers are (roughly) allocated sequentially and thus newer
syscalls will (usually) have a larger syscall number -- thus causing our
filters to produce -ENOSYS if the filter was written before the syscall
existed.
Sadly this is not a perfect solution because syscalls can be added
out-of-order and the syscall table can contain holes for several
releases. Unfortuntely we do not have a nicer solution at the moment
because there is no library which provides information about which Linux
version a syscall was introduced in. Until that exists, this workaround
will have to be good enough.
The above behaviour only happens if the default action is a blocking
action (in other words it is not SCMP_ACT_LOG or SCMP_ACT_ALLOW). If the
default action is permissive then we don't do any patching.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Since syscall is outdated and broken for some architectures,
use x/sys/unix instead.
There are still some dependencies on the syscall package that will
remain in syscall for the forseeable future:
Errno
Signal
SysProcAttr
Additionally:
- os still uses syscall, so it needs to be kept for anything
returning *os.ProcessState, such as process.Wait.
Signed-off-by: Christy Perez <christy@linux.vnet.ibm.com>
Ensure that path safety is maintained, this essentially reapplies
c0cad6aa5e ("cgroups: fs: fix cgroup.Parent path sanitisation"), which
was accidentally removed in 256f3a8ebc ("Add support for CgroupsPath
field").
Signed-off-by: Aleksa Sarai <asarai@suse.com>
Marshall the raw objects for the sync pipes so that no new line chars
are left behind in the pipe causing errors.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>