Commit Graph

26 Commits

Author SHA1 Message Date
Rodrigo Campos
b0b186e64d Merge pull request #4630 from kolyshkin/clean-path
libc/utils: simplify CleanPath
2025-02-13 13:59:23 -03:00
Kir Kolyshkin
8db6ffbeef libc/utils: simplify CleanPath
This simplifies the code flow and basically removes the last
filepath.Clean, which is not necessary in either case:

 - for absolute path, single filepath.Clean is enough (as it is
   guaranteed to remove all dot and dot-dot elements);

 - for relative path, filepath.Rel calls Clean at the end
   (which is even documented).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-12 20:17:51 -08:00
Kir Kolyshkin
055041e874 libct: use strings.CutPrefix where possible
Using strings.CutPrefix (available since Go 1.20) instead of
strings.HasPrefix and/or strings.TrimPrefix makes the code
a tad more straightforward.

No functional change.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-06 19:42:35 -08:00
Kir Kolyshkin
259b71c042 libct/utils: stripRoot: rm useless HasPrefix
Using strings.HasPrefix with strings.TrimPrefix results in doing the
same thing (checking if prefix exists) twice. In this case, using
strings.TrimPrefix right away is sufficient.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-06 19:42:35 -08:00
Amir M. Ghazanfari
faffe1b9ee replace strings.SplitN with strings.Cut
Signed-off-by: Amir M. Ghazanfari <a.m.ghazanfari76@gmail.com>
2024-09-28 10:02:21 +03:30
Kir Kolyshkin
a31efe7045 libct/seccomp/patchbpf: use binary.NativeEndian
It is available since Go 1.21 and is defined during compile time
(i.e. based on GOARCH during build).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-09-11 22:06:58 -07:00
Aleksa Sarai
8e8b136c49 tree-wide: use /proc/thread-self for thread-local state
With the idmap work, we will have a tainted Go thread in our
thread-group that has a different mount namespace to the other threads.
It seems that (due to some bad luck) the Go scheduler tends to make this
thread the thread-group leader in our tests, which results in very
baffling failures where /proc/self/mountinfo produces gibberish results.

In order to avoid this, switch to using /proc/thread-self for everything
that is thread-local. This primarily includes switching all file
descriptor paths (CLONE_FS), all of the places that check the current
cgroup (technically we never will run a single runc thread in a separate
cgroup, but better to be safe than sorry), and the aforementioned
mountinfo code. We don't need to do anything for the following because
the results we need aren't thread-local:

 * Checks that certain namespaces are supported by stat(2)ing
   /proc/self/ns/...

 * /proc/self/exe and /proc/self/cmdline are not thread-local.

 * While threads can be in different cgroups, we do not do this for the
   runc binary (or libcontainer) and thus we do not need to switch to
   the thread-local version of /proc/self/cgroups.

 * All of the CLONE_NEWUSER files are not thread-local because you
   cannot set the usernamespace of a single thread (setns(CLONE_NEWUSER)
   is blocked for multi-threaded programs).

Note that we have to use runtime.LockOSThread when we have an open
handle to a tid-specific procfs file that we are operating on multiple
times. Go can reschedule us such that we are running on a different
thread and then kill the original thread (causing -ENOENT or similarly
confusing errors). This is not strictly necessary for most usages of
/proc/thread-self (such as using /proc/thread-self/fd/$n directly) since
only operating on the actual inodes associated with the tid requires
this locking, but because of the pre-3.17 fallback for CentOS, we have
to do this in most cases.

In addition, CentOS's kernel is too old for /proc/thread-self, which
requires us to emulate it -- however in rootfs_linux.go, we are in the
container pid namespace but /proc is the host's procfs. This leads to
the incredibly frustrating situation where there is no way (on pre-4.1
Linux) to figure out which /proc/self/task/... entry refers to the
current tid. We can just use /proc/self in this case.

Yes this is all pretty ugly. I also wish it wasn't necessary.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-12-14 11:36:41 +11:00
Aleksa Sarai
8da42aaec2 sync: split init config (stream) and synchronisation (seqpacket) pipes
We have different requirements for the initial configuration and
initWaiter pipe (just send netlink and JSON blobs with no complicated
handling needed for message coalescing) and the packet-based
synchronisation pipe.

Tests with switching everything to SOCK_SEQPACKET lead to endless issues
with runc hanging on start-up because random things would try to do
short reads (which SOCK_SEQPACKET will not allow and the Go stdlib
explicitly treats as a streaming source), so splitting it was the only
reasonable solution. Even doing somewhat dodgy tricks such as adding a
Read() wrapper which actually calls ReadPacket() and makes it seem like
a stream source doesn't work -- and is a bit too magical.

One upside is that doing it this way makes the difference between the
modes clearer -- INITPIPE is still used for initWaiter syncrhonisation
but aside from that all other synchronisation is done by SYNCPIPE.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-09-24 20:31:14 +08:00
Kir Kolyshkin
3d86d31b9f libct/utils: SearchLabels: optimize
Using strings.Split generates temporary strings for GC to collect.
Rewrite the function to not do that.

Also, add a second return value, so that the caller can distinguish
between an empty value found and no key found cases.

Fix the test accordingly.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2022-01-26 14:01:11 -08:00
Kir Kolyshkin
b950b778c2 libct/utils: ResolveRootfs: remove
Since commit 8850636eb3 (February 2015) this function is no longer
used (replaced by (*ConfigValidator).rootfs), so let's remove it,
together with its unit tests (which were added by commit 917c1f6d6 in
April 2016).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-11-29 19:21:26 -08:00
Kir Kolyshkin
7be93a66b9 *: fmt.Errorf: use %w when appropriate
This should result in no change when the error is printed, but make the
errors returned unwrappable, meaning errors.As and errors.Is will work.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2021-06-22 16:09:47 -07:00
Aleksa Sarai
0ca91f44f1 rootfs: add mount destination validation
Because the target of a mount is inside a container (which may be a
volume that is shared with another container), there exists a race
condition where the target of the mount may change to a path containing
a symlink after we have sanitised the path -- resulting in us
inadvertently mounting the path outside of the container.

This is not immediately useful because we are in a mount namespace with
MS_SLAVE mount propagation applied to "/", so we cannot mount on top of
host paths in the host namespace. However, if any subsequent mountpoints
in the configuration use a subdirectory of that host path as a source,
those subsequent mounts will use an attacker-controlled source path
(resolved within the host rootfs) -- allowing the bind-mounting of "/"
into the container.

While arguably configuration issues like this are not entirely within
runc's threat model, within the context of Kubernetes (and possibly
other container managers that provide semi-arbitrary container creation
privileges to untrusted users) this is a legitimate issue. Since we
cannot block mounting from the host into the container, we need to block
the first stage of this attack (mounting onto a path outside the
container).

The long-term plan to solve this would be to migrate to libpathrs, but
as a stop-gap we implement libpathrs-like path verification through
readlink(/proc/self/fd/$n) and then do mount operations through the
procfd once it's been verified to be inside the container. The target
could move after we've checked it, but if it is inside the container
then we can assume that it is safe for the same reason that libpathrs
operations would be safe.

A slight wrinkle is the "copyup" functionality we provide for tmpfs,
which is the only case where we want to do a mount on the host
filesystem. To facilitate this, I split out the copy-up functionality
entirely so that the logic isn't interspersed with the regular tmpfs
logic. In addition, all dependencies on m.Destination being overwritten
have been removed since that pattern was just begging to be a source of
more mount-target bugs (we do still have to modify m.Destination for
tmpfs-copyup but we only do it temporarily).

Fixes: CVE-2021-30465
Reported-by: Etienne Champetier <champetier.etienne@gmail.com>
Co-authored-by: Noah Meyerhans <nmeyerha@amazon.com>
Reviewed-by: Samuel Karp <skarp@amazon.com>
Reviewed-by: Kir Kolyshkin <kolyshkin@gmail.com> (@kolyshkin)
Reviewed-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2021-05-19 16:58:35 +10:00
Aleksa Sarai
7a8d7162f9 seccomp: prepend -ENOSYS stub to all filters
Having -EPERM is the default was a fairly significant mistake from a
future-proofing standpoint in that it makes any new syscall return a
non-ignorable error (from glibc's point of view). We need to correct
this now because faccessat2(2) is something glibc critically needs to
have support for, but they're blocked on container runtimes because we
return -EPERM unconditionally (leading to confusion in glibc). This is
also a problem we're probably going to keep running into in the future.

Unfortunately there are several issues which stop us from having a clean
solution to this problem:

 1. libseccomp has several limitations which require us to emulate
    behaviour we want:

    a. We cannot do logic based on syscall number, meaning we cannot
       specify a "largest known syscall number";
    b. libseccomp doesn't know in which kernel version a syscall was
       added, and has no API for "minimum kernel version" so we cannot
       simply ask libseccomp to generate sane -ENOSYS rules for us.
    c. Additional seccomp rules for the same syscall are not treated as
       distinct rules -- if rules overlap, seccomp will merge them. This
       means we cannot add per-syscall -EPERM fallbacks;
    d. There is no inverse operation for SCMP_CMP_MASKED_EQ;
    e. libseccomp does not allow you to specify multiple rules for a
       single argument, making it impossible to invert OR rules for
       arguments.

 2. The runtime-spec does not have any way of specifying:

    a. The errno for the default action;
    b. The minimum kernel version or "newest syscall at time of profile
       creation"; nor
    c. Which syscalls were intentionally excluded from the allow list
       (weird syscalls that are no longer used were excluded entirely,
       but Docker et al expect those syscalls to get EPERM not ENOSYS).

 3. Certain syscalls should not return -ENOSYS (especially only for
    certain argument combinations) because this could also trigger glibc
    confusion. This means we have to return -EPERM for certain syscalls
    but not as a global default.

 4. There is not an obvious (and reasonable) upper limit to syscall
    numbers, so we cannot create a set of rules for each syscall above
    the largest syscall number in libseccomp. This means we must handle
    inverse rules as described below.

 5. Any syscall can be specified multiple times, which can make
    generation of hotfix rules much harder.

As a result, we have to work around all of these things by coming up
with a heuristic to stop the bleeding. In the future we could hopefully
improve the situation in the runtime-spec and libseccomp.

The solution applied here is to prepend a "stub" filter which returns
-ENOSYS if the requested syscall has a larger syscall number than any
syscall mentioned in the filter. The reason for this specific rule is
that syscall numbers are (roughly) allocated sequentially and thus newer
syscalls will (usually) have a larger syscall number -- thus causing our
filters to produce -ENOSYS if the filter was written before the syscall
existed.

Sadly this is not a perfect solution because syscalls can be added
out-of-order and the syscall table can contain holes for several
releases. Unfortuntely we do not have a nicer solution at the moment
because there is no library which provides information about which Linux
version a syscall was introduced in. Until that exists, this workaround
will have to be good enough.

The above behaviour only happens if the default action is a blocking
action (in other words it is not SCMP_ACT_LOG or SCMP_ACT_ALLOW). If the
default action is permissive then we don't do any patching.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2021-01-28 23:11:22 +11:00
Shengjing Zhu
f4d153b086 Fix int overflow in test on 32 bit system
Signed-off-by: Shengjing Zhu <zhsj@debian.org>
2021-01-24 16:37:32 +08:00
Mrunal Patel
fe3d5c4c6e Remove unused veth setup code
Networking is setup by plugins for users of runc so it makes sense
to get rid of the veth strategy.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2018-08-24 15:41:52 -07:00
Christy Perez
3d7cb4293c Move libcontainer to x/sys/unix
Since syscall is outdated and broken for some architectures,
use x/sys/unix instead.

There are still some dependencies on the syscall package that will
remain in syscall for the forseeable future:

Errno
Signal
SysProcAttr

Additionally:
- os still uses syscall, so it needs to be kept for anything
returning *os.ProcessState, such as process.Wait.

Signed-off-by: Christy Perez <christy@linux.vnet.ibm.com>
2017-05-22 17:35:20 -05:00
Xianglin Gao
9df4847a23 tiny fix
Signed-off-by: Xianglin Gao <xlgao@zju.edu.cn>
2016-10-11 16:32:56 +08:00
Qiang Huang
dc0a4cf488 Fix TestGetAdditionalGroups on i686
Fixes: #941

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2016-09-27 18:25:53 +08:00
Michael Crosby
5abffd3100 Add annotations to list and state output
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-06-02 12:44:43 -07:00
George Lestaris
f7ae27bfb7 HookState adhears to OCI
Signed-off-by: George Lestaris <glestaris@pivotal.io>
Signed-off-by: Ed King <eking@pivotal.io>
2016-04-06 16:57:59 +01:00
Aleksa Sarai
b8dc5213e8 libcontainer: cgroups: fs: fix path safety
Ensure that path safety is maintained, this essentially reapplies
c0cad6aa5e ("cgroups: fs: fix cgroup.Parent path sanitisation"), which
was accidentally removed in 256f3a8ebc ("Add support for CgroupsPath
field").

Signed-off-by: Aleksa Sarai <asarai@suse.com>
2016-02-14 00:37:21 +11:00
Kenfe-Mickael Laventure
dceeb0d0df Move pathClean to libcontainer/utils.CleanPath
Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>
2016-02-09 16:21:58 -08:00
Michael Crosby
ddcee3cc2a Do not use stream encoders
Marshall the raw objects for the sync pipes so that no new line chars
are left behind in the pipe causing errors.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-01-26 11:22:05 -08:00
rajasec
58e3cde8f3 Fixing typo in the comment for exit
Signed-off-by: rajasec <rajasec79@gmail.com>
2015-10-22 19:08:03 +05:30
John Howard
9f80f3f181 Windows: Factor out CloseExecFrom
Signed-off-by: John Howard <jhoward@microsoft.com>
2015-06-26 20:13:17 -07:00
Michael Crosby
8f97d39dd2 Move libcontainer into subdirectory
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2015-06-21 19:29:15 -07:00