This removes libcontainer's own error wrapping system, consisting of a
few types and functions, aimed at typization, wrapping and unwrapping
of errors, as well as saving error stack traces.
Since Go 1.13 now provides its own error wrapping mechanism and a few
related functions, it makes sense to switch to it.
While doing that, improve some error messages so that they start
with "error", "unable to", or "can't".
A few things that are worth mentioning:
1. We lose stack traces (which were never shown anyway).
2. Users of libcontainer that relied on particular errors (like
ContainerNotExists) need to switch to using errors.Is with
the new errors defined in error.go.
3. encoding/json is unable to unmarshal the built-in error type,
so we have to introduce initError and wrap the errors into it
(basically passing the error as a string). This is the same
as it was before, just a tad simpler (actually the initError
is a type that got removed in commit afa844311; also suddenly
ierr variable name makes sense now).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This should result in no change when the error is printed, but make the
errors returned unwrappable, meaning errors.As and errors.Is will work.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Errors from os.Open, os.Symlink etc do not need to be wrapped, as they
are already wrapped into os.PathError.
Error from unix are bare errnos and need to be wrapped. Same
os.PathError is a good candidate.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Errors returned by unix are bare. In some cases it's impossible to find
out what went wrong because there's is not enough context.
Add a mountError type (mostly copy-pasted from github.com/moby/sys/mount),
and mount/unmount helpers. Use these where appropriate, and convert error
checks to use errors.Is.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
gofumpt (mvdan.cc/gofumpt) is a fork of gofmt with stricter rules.
Brought to you by
git ls-files \*.go | grep -v ^vendor/ | xargs gofumpt -s -w
Looking at the diff, all these changes make sense.
Also, replace gofmt with gofumpt in golangci.yml.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Because the target of a mount is inside a container (which may be a
volume that is shared with another container), there exists a race
condition where the target of the mount may change to a path containing
a symlink after we have sanitised the path -- resulting in us
inadvertently mounting the path outside of the container.
This is not immediately useful because we are in a mount namespace with
MS_SLAVE mount propagation applied to "/", so we cannot mount on top of
host paths in the host namespace. However, if any subsequent mountpoints
in the configuration use a subdirectory of that host path as a source,
those subsequent mounts will use an attacker-controlled source path
(resolved within the host rootfs) -- allowing the bind-mounting of "/"
into the container.
While arguably configuration issues like this are not entirely within
runc's threat model, within the context of Kubernetes (and possibly
other container managers that provide semi-arbitrary container creation
privileges to untrusted users) this is a legitimate issue. Since we
cannot block mounting from the host into the container, we need to block
the first stage of this attack (mounting onto a path outside the
container).
The long-term plan to solve this would be to migrate to libpathrs, but
as a stop-gap we implement libpathrs-like path verification through
readlink(/proc/self/fd/$n) and then do mount operations through the
procfd once it's been verified to be inside the container. The target
could move after we've checked it, but if it is inside the container
then we can assume that it is safe for the same reason that libpathrs
operations would be safe.
A slight wrinkle is the "copyup" functionality we provide for tmpfs,
which is the only case where we want to do a mount on the host
filesystem. To facilitate this, I split out the copy-up functionality
entirely so that the logic isn't interspersed with the regular tmpfs
logic. In addition, all dependencies on m.Destination being overwritten
have been removed since that pattern was just begging to be a source of
more mount-target bugs (we do still have to modify m.Destination for
tmpfs-copyup but we only do it temporarily).
Fixes: CVE-2021-30465
Reported-by: Etienne Champetier <champetier.etienne@gmail.com>
Co-authored-by: Noah Meyerhans <nmeyerha@amazon.com>
Reviewed-by: Samuel Karp <skarp@amazon.com>
Reviewed-by: Kir Kolyshkin <kolyshkin@gmail.com> (@kolyshkin)
Reviewed-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
In case of rootless, cgroup2 mount is not possible (see [1] for more
details), so since commit 9c81440fb5 runc bind-mounts the whole
/sys/fs/cgroup into container.
Problem is, if cgroupns is enabled, /sys/fs/cgroup inside the container
is supposed to show the cgroup files for this cgroup, not the root one.
The fix is to pass through and use the cgroup path in case cgroup2
mount failed, cgroupns is enabled, and the path is non-empty.
Surely this requires the /sys/fs/cgroup mount in the spec, so modify
runc spec --rootless to keep it.
Before:
$ ./runc run aaa
# find /sys/fs/cgroup/ -type d
/sys/fs/cgroup
/sys/fs/cgroup/user.slice
/sys/fs/cgroup/user.slice/user-1000.slice
/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service
...
# ls -l /sys/fs/cgroup/cgroup.controllers
-r--r--r-- 1 nobody nogroup 0 Feb 24 02:22 /sys/fs/cgroup/cgroup.controllers
# wc -w /sys/fs/cgroup/cgroup.procs
142 /sys/fs/cgroup/cgroup.procs
# cat /sys/fs/cgroup/memory.current
cat: can't open '/sys/fs/cgroup/memory.current': No such file or directory
After:
# find /sys/fs/cgroup/ -type d
/sys/fs/cgroup/
# ls -l /sys/fs/cgroup/cgroup.controllers
-r--r--r-- 1 root root 0 Feb 24 02:43 /sys/fs/cgroup/cgroup.controllers
# wc -w /sys/fs/cgroup/cgroup.procs
2 /sys/fs/cgroup/cgroup.procs
# cat /sys/fs/cgroup/memory.current
577536
[1] https://github.com/opencontainers/runc/issues/2158
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The code is already passing three parameters around from
mountToRootfs to mountCgroupV* to mountToRootfs again.
I am about to add another parameter, so let's introduce and
use struct mountConfig to pass around.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Currently, runc fails like this when used from rootless podman
with host PID namespace:
> $ podman --runtime=runc run --pid=host --rm -it busybox sh
> WARN[0000] additional gid=10 is not present in the user namespace, skip setting it
> Error: container_linux.go:380: starting container process caused:
> process_linux.go:545: container init caused: readonly path /proc/asound:
> operation not permitted: OCI permission denied
(Here /proc/asound is the first path from OCI spec's readonlyPaths).
The code uses MS_BIND|MS_REMOUNT flags that have a special meaning in
the kernel ("keep the flags like nodev, nosuid, noexec as is").
For some reason, this "special meaning" trick is not working for the
above use case (rootless podman + no PID namespace), and I don't know
how to reproduce this without podman.
Instead of relying on the kernel feature, let's just get the current
mount flags using fstatfs(2) and add those that needs to be preserved.
While at it, wrap errors from unix.Mount into os.PathError to make
errors a bit less cryptic.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Moving these utilities to a separate package, so that consumers of this
package don't have to pull in the whole "system" package.
Looking at uses of these utilities (outside of runc itself);
`RunningInUserNS()` is used by [various external consumers][1],
so adding a "Deprecated" alias for this.
[1]: https://grep.app/search?current=2&q=.RunningInUserNS
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Commit 9c1242ecb ("Add white list for bind mount chec", Jan 6 2016)
added a set of entries under /proc which we allow to be mounted to,
for the benefit of lxcfs-like fuse-backed hack to have container's
own version of /proc/meminfo etc.
For some reason, the allow list check is performed at the very
beginning of the function, which is not optimal.
Move the check to the end -- at this point in the code we already
know we're under /proc, so it make sense to consult the allow list.
This makes the code slightly more logical and hopefully slightly faster.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In case a tmpfs mount path contains absolute symlinks, runc errors out
because those symlinks are resolved in the host (rather than container)
filesystem scope.
The fix is similar to that for bind mounts -- resolve the destination
in container rootfs scope using securejoin, and use the resolved path.
A simple integration test case is added to prevent future regressions.
Fixes https://github.com/opencontainers/runc/issues/2683.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Move the Device-related types to libcontainer/devices, so that
the package can be used in isolation. Aliases have been created
in libcontainer/configs for backward compatibility.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
if we are remounting root read only when in a user namespace, make
sure the existing flags (e.g. MS_NOEXEC, MS_NODEV) are maintained
otherwise the mount fails with EPERM.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
In a case where the host /proc mount has already been overmounted, the
MS_MOVE handling would get ENOENT when trying to hide (for instance)
"/proc/bus" because it had already hidden away "/proc". This revealed
two issues in the previous implementation of this hardening feaure:
1. No checks were done to make sure the mount was a "full" mount (it is
a mount of the root of the filesystem), but the kernel doesn't permit
a non-full mount to be converted to a full mount (for reference, see
mnt_already_visible). This just removes extra busy-work during setup.
2. ENOENT was treated as a critical error, even though it actually
indicates the mount doesn't exist and thus isn't a problem. A more
theoretically pure solution would be to store the set of mountpoints
to be hidden and only ignore the error if an ancestor directory of
the current mountpoint was already hidden, but that would just add
complexity with little justification.
In addition, better document the reasoning behind this logic so that
folks aren't confused when looking at it.
Fixes: 28a697cce3 ("rootfs: umount all procfs and sysfs with --no-pivot")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
This looks like this is just filling logs for years, since the kernel never
added the support for automatically labeling /dev/mqueue.
Removes these dmesg lines
[ 1731.969847] SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue)
[ 1736.985146] SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue)
[ 1738.356796] SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue)
[ 1738.479952] SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue)
[ 1738.628935] SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue)
[ 1763.433276] SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue)
[ 1806.802133] SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue)
[ 1806.982003] SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue)
[ 1808.955390] SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue)
[ 1815.951076] SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue)
[ 1827.257757] SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue)
[ 1828.947888] SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue)
[ 1834.964451] SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue)
[ 1835.941465] SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue)
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
Cesar Talledo (2):
Remove runc default devices that overlap with spec devices.
Skip redundant setup for /dev/ptmx when specified explicitly in the OCI spec.
LGTMs: @AkihiroSuda @cyphar
Closes#2522
Per the OCI spec, /dev/ptmx is always a symlink to /dev/pts/ptmx. As such, if
the OCI spec has an explicit entry for /dev/ptmx, runc shall ignore it.
This change ensures this is the case. A integration test was also added
(in tests/integration/dev.bats).
Signed-off-by: Cesar Talledo <ctalledo@nestybox.com>
Making them the same type is simply confusing, but also means that you
could accidentally use one in the wrong context. This eliminates that
problem. This also includes a whole bunch of cleanups for the types
within DeviceRule, so that they can be used more ergonomically.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
These lists have been in the codebase for a very long time, and have
been unused for a large portion of that time -- specconv doesn't
generate them and the only user of these flags has been tests (which
doesn't inspire much confidence).
In addition, we had an incorrect implementation of a white-list policy.
This wasn't exploitable because all of our users explicitly specify
"deny all" as the first rule, but it was a pretty glaring issue that
came from the "feature" that users can select whether they prefer a
white- or black- list. Fix this by always writing a deny-all rule (which
is what our users were doing anyway, to work around this bug).
This is one of many changes needed to clean up the devices cgroup code.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
The error message was including both the rootfs path, and the full
mount path, which also includes the path of the rootfs.
This patch removes the rootfs path from the error message, as it
was redundant, and made the error message overly verbose
Before this patch (errors wrapped for readability):
```
container_linux.go:348: starting container process caused: process_linux.go:438:
container init caused: rootfs_linux.go:58: mounting "/foo.txt"
to rootfs "/var/lib/docker/overlay2/de506d67da606b807009e23b548fec60d72359c77eec88785d8c7ecd54a6e4b2/merged"
at "/var/lib/docker/overlay2/de506d67da606b807009e23b548fec60d72359c77eec88785d8c7ecd54a6e4b2/merged/usr/share/nginx/html"
caused: not a directory: unknown
```
With this patch applied:
```
container_linux.go:348: starting container process caused: process_linux.go:438:
container init caused: rootfs_linux.go:58: mounting "/foo.txt"
to rootfs at "/var/lib/docker/overlay2/de506d67da606b807009e23b548fec60d72359c77eec88785d8c7ecd54a6e4b2/merged/usr/share/nginx/html"
caused: not a directory: unknown
```
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
To make a bind mount read-only, it needs to be remounted. This is what
the code removed does, but it is not needed here.
We have to deal with three cases here:
1. cgroup v2 unified mode. In this case the mount is real mount with
fstype=cgroup2, and there is no need to have a bind mount on top,
as we pass readonly flag to the mount as is.
2. cgroup v1 + cgroupns (enableCgroupns == true). In this case the
"mount" is in fact a set of real mounts with fstype=cgroup, and
they are all performed in mountCgroupV1, with readonly flag
added if needed.
3. cgroup v1 as is (enableCgroupns == false). In this case
mountCgroupV1() calls mountToRootfs() again with an argument
from the list obtained from getCgroupMounts(), i.e. a bind
mount with the same flags as the original mount has (plus
unix.MS_BIND | unix.MS_REC), and mountToRootfs() does remounting
(under the case "bind":).
So, the code which this patch is removing is not needed -- it
essentially does nothing in case 3 above (since the bind mount
is already remounted readonly), and in cases 1 and 2 it
creates an unneeded extra bind mount on top of a real one (or set of
real ones).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In case of cgroupv2 unified hierarchy, the /sys/fs/cgroup mount
is the real mount with fstype of cgroup2 (rather than a set of
external bind mounts like for cgroupv1).
So, we should not add it to the list of "external bind mounts"
on both checkpoint and restore.
Without this fix, checkpoint integration tests fail on cgroup v2.
Also, same is true for cgroup v1 + cgroupns.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. rootfs is already validated to be kosher by (*ConfigValidator).rootfs()
2. mount points from /proc/self/mountinfo are absolute and clean, too
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Delete libcontainer/mount in favor of github.com/moby/sys/mountinfo,
which is fast mountinfo parser.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
mount(2) will blindly follow symlinks, which is a problem because it
allows a malicious container to trick runc into mounting /proc to an
entirely different location (and thus within the attacker's control for
a rename-exchange attack).
This is just a hotfix (to "stop the bleeding"), and the more complete
fix would be finish libpathrs and port runc to it (to avoid these types
of attacks entirely, and defend against a variety of other /proc-related
attacks). It can be bypased by someone having "/" be a volume controlled
by another container.
Fixes: CVE-2019-19921
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Bind-mount /sys/fs/cgroup when we are in UserNS but CgroupNS is not unshared,
because we cannot mount cgroup2.
This behavior correspond to crun v0.10.2.
Fix#2158
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Fixes#2128
This allows proc to be bind mounted for host and rootless namespace usecases but
it removes the ability to mount over the top of proc with a directory.
```bash
> sudo docker run --rm apparmor
docker: Error response from daemon: OCI runtime create failed:
container_linux.go:346: starting container process caused "process_linux.go:449:
container init caused \"rootfs_linux.go:58: mounting
\\\"/var/lib/docker/volumes/aae28ea068c33d60e64d1a75916cf3ec2dc3634f97571854c9ed30c8401460c1/_data\\\"
to rootfs
\\\"/var/lib/docker/overlay2/a6be5ae911bf19f8eecb23a295dec85be9a8ee8da66e9fb55b47c841d1e381b7/merged\\\"
at \\\"/proc\\\" caused
\\\"\\\\\\\"/var/lib/docker/overlay2/a6be5ae911bf19f8eecb23a295dec85be9a8ee8da66e9fb55b47c841d1e381b7/merged/proc\\\\\\\"
cannot be mounted because it is not of type proc\\\"\"": unknown.
> sudo docker run --rm -v /proc:/proc apparmor
docker-default (enforce) root 18989 0.9 0.0 1288 4 ?
Ss 16:47 0:00 sleep 20
```
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
During rootfs setup all mountpoints (directory and files) are created
before bind mounting the bind mounts. This does not happen during
container restore via CRIU. If restoring in an identical but newly created
rootfs, the restore fails right now. This just factors out the code to
create the bind mount mountpoints so that it also can be used during
restore.
Signed-off-by: Adrian Reber <areber@redhat.com>
When creating a new user namespace, the kernel doesn't allow to mount
a new procfs or sysfs file system if there is not already one instance
fully visible in the current mount namespace.
When using --no-pivot we were effectively inhibiting this protection
from the kernel, as /proc and /sys from the host are still present in
the container mount namespace.
A container without full access to /proc could then create a new user
namespace, and from there able to mount a fully visible /proc, bypassing
the limitations in the container.
A simple reproducer for this issue is:
unshare -mrfp sh -c "mount -t proc none /proc && echo c > /proc/sysrq-trigger"
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Cgroup namespace can be configured in `config.json` as other
namespaces. Here is an example:
```
"namespaces": [
{
"type": "pid"
},
{
"type": "network"
},
{
"type": "ipc"
},
{
"type": "uts"
},
{
"type": "mount"
},
{
"type": "cgroup"
}
],
```
Note that if you want to run a container which has shared cgroup ns with
another container, then it's strongly recommended that you set
proper `CgroupsPath` of both containers(the second container's cgroup
path must be the subdirectory of the first one). Or there might be
some unexpected results.
Signed-off-by: Yuanhong Peng <pengyuanhong@huawei.com>
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>