Commit Graph

31 Commits

Author SHA1 Message Date
Kir Kolyshkin
ce3cd4234c criu: simplify isOnTmpfs check in prepareCriuRestoreMounts
Instead of generating a list of tmpfs mounts and having a special function
to check whether the path is in the list, let's go over the list of
mounts directly. This simplifies the code and improves readability.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-05-20 16:56:55 -07:00
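As an illustration of the simplification described in ce3cd4234c, the following Go sketch walks the mount list directly instead of first building a separate list of tmpfs mounts. The `mount` type, its fields, and the prefix-matching rule are assumptions made for this example and are not taken from runc's code.

```go
package main

import (
	"fmt"
	"strings"
)

// mount is a hypothetical stand-in for a container mount entry.
type mount struct {
	Device      string
	Destination string
}

// isOnTmpfs reports whether path lies below the destination of any
// tmpfs mount, checking the mount list directly.
func isOnTmpfs(path string, mounts []mount) bool {
	for _, m := range mounts {
		if m.Device == "tmpfs" && strings.HasPrefix(path, m.Destination+"/") {
			return true
		}
	}
	return false
}

func main() {
	mounts := []mount{
		{Device: "tmpfs", Destination: "/run"},
		{Device: "proc", Destination: "/proc"},
	}
	fmt.Println(isOnTmpfs("/run/lock", mounts)) // true
	fmt.Println(isOnTmpfs("/proc/sys", mounts)) // false
}
```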
Kir Kolyshkin
f91fbd34d9 criu: inline makeCriuRestoreMountpoints
Since its code is now trivial, and it is only called from a single
place, it does not make sense to have it as a separate function.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-05-20 16:56:55 -07:00
Kir Kolyshkin
b8aa5481db criu: ignore cgroup early in prepareCriuRestoreMounts
It makes sense to ignore cgroup mounts much earlier in the code,
saving some time on unnecessary operations.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-05-20 16:56:55 -07:00
Kir Kolyshkin
0c93d41c65 criu: improve prepareCriuRestoreMounts
1. Replace the big "if !" block with the if block and continue,
   simplifying the code flow.

2. Move comments closer to the code, improving readability.

This commit is best reviewed with --ignore-all-space or similar.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-05-20 16:56:55 -07:00
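The control-flow change in 0c93d41c65 (replacing a large negated `if` block with an `if` plus `continue`) can be sketched in Go as below. The `mount` type and the cgroup condition are hypothetical and only serve to show that both loop shapes produce the same result.

```go
package main

import "fmt"

// mount is a hypothetical stand-in for a container mount entry.
type mount struct {
	Device      string
	Destination string
}

// destinationsNested wraps the loop body in a negated condition.
func destinationsNested(mounts []mount) []string {
	var out []string
	for _, m := range mounts {
		if !(m.Device == "cgroup" || m.Device == "cgroup2") {
			out = append(out, m.Destination)
		}
	}
	return out
}

// destinationsFlat skips unwanted entries early with continue,
// keeping the main body at a single indentation level.
func destinationsFlat(mounts []mount) []string {
	var out []string
	for _, m := range mounts {
		if m.Device == "cgroup" || m.Device == "cgroup2" {
			continue
		}
		out = append(out, m.Destination)
	}
	return out
}

func main() {
	mounts := []mount{
		{"cgroup", "/sys/fs/cgroup"},
		{"tmpfs", "/run"},
		{"proc", "/proc"},
	}
	fmt.Println(destinationsNested(mounts))
	fmt.Println(destinationsFlat(mounts))
}
```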
Rodrigo Campos
c3a41d77db Merge pull request #4696 from avagin/criu-vs-exec
criu: Add time namespace to container config after checkpoint/restore
2025-04-01 14:54:33 -03:00
Kir Kolyshkin
17570625c0 Use for range over integers
Range over integers is available since Go 1.22 (see https://tip.golang.org/ref/spec#For_range).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-03-31 17:15:06 -07:00
Andrei Vagin
b68cbdff34 criu: Add time namespace to container config after checkpoint/restore
Since v3.14, CRIU always restores processes into a time namespace to
prevent backward jumps of monotonic and boottime clocks. This change
updates the container configuration to ensure that `runc exec` launches
new processes within the container's time namespace.

Fixes #2610

Signed-off-by: Andrei Vagin <avagin@gmail.com>
2025-03-26 15:12:01 +00:00
Kir Kolyshkin
9510ffb658 Fix a few staticcheck QF1001 warnings
Like these:

> libcontainer/criu_linux.go:959:3: QF1001: could apply De Morgan's law (staticcheck)
> 		!(req.GetType() == criurpc.CriuReqType_FEATURE_CHECK ||
> 		^
> libcontainer/rootfs_linux.go:360:19: QF1001: could apply De Morgan's law (staticcheck)
> 	if err == nil || !(errors.Is(err, unix.EPERM) || errors.Is(err, unix.EBUSY)) {
> 	                 ^

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-03-25 16:06:44 -07:00
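The De Morgan rewrite suggested by QF1001 in 9510ffb658 can be checked with a small Go sketch. It mirrors the shape of the rootfs_linux.go condition quoted in the commit, but uses the standard library's `syscall` errno values instead of `unix.EPERM` and `unix.EBUSY` so that it stays self-contained. The function names and sample list are assumptions for this example.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// samples covers the nil case, both listed errnos, and an unrelated errno.
var samples = []error{nil, syscall.EPERM, syscall.EBUSY, syscall.ENOENT}

// before negates a disjunction, as in the code flagged by staticcheck.
func before(err error) bool {
	return err == nil || !(errors.Is(err, syscall.EPERM) || errors.Is(err, syscall.EBUSY))
}

// after applies De Morgan's law: !(a || b) becomes !a && !b.
func after(err error) bool {
	return err == nil || (!errors.Is(err, syscall.EPERM) && !errors.Is(err, syscall.EBUSY))
}

func main() {
	for _, err := range samples {
		fmt.Printf("%v: before=%v after=%v\n", err, before(err), after(err))
	}
}
```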
Prajwal S N
05e83fc600 deps: bump go-criu to v7
Signed-off-by: Prajwal S N <prajwalnadig21@gmail.com>
2025-03-05 01:02:53 +05:30
Kir Kolyshkin
a75076b4a4 Switch to opencontainers/cgroups
This removes libcontainer/cgroups packages and starts
using those from github.com/opencontainers/cgroups repo.

Mostly generated by:

  git rm -f libcontainer/cgroups

  find . -type f -name "*.go" -exec sed -i \
    's|github.com/opencontainers/runc/libcontainer/cgroups|github.com/opencontainers/cgroups|g' \
    {} +

  go get github.com/opencontainers/cgroups@v0.0.1
  make vendor
  gofumpt -w .

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-28 15:20:33 -08:00
Daniel Levi-Minzi
1d047e44ed expose criu options for link remap and skip in flight
Signed-off-by: Daniel Levi-Minzi <dleviminzi@gmail.com>
2025-02-25 10:35:31 -05:00
Kir Kolyshkin
8afeb58398 libct: add/use configs.HasHook
This allows omitting a call to c.currentOCIState (which can be somewhat
costly when there are many annotations) when the hooks of a given kind
won't be run.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-12-22 17:47:09 -08:00
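The idea behind 8afeb58398 (skip an expensive state computation when no hook of the requested kind exists) might look roughly like the Go sketch below. All types and names here (`hooks`, `hasHook`, `currentState`, `stateCalls`) are hypothetical and are not runc's actual API.

```go
package main

import "fmt"

// Hypothetical hook types for illustration only.
type hookName string
type hook func(state string) error
type hooks map[hookName][]hook

// hasHook reports whether at least one hook is registered under any of names.
func (h hooks) hasHook(names ...hookName) bool {
	for _, n := range names {
		if len(h[n]) > 0 {
			return true
		}
	}
	return false
}

// stateCalls counts how often the (pretend) costly state is computed.
var stateCalls int

func currentState() string {
	stateCalls++
	return "state"
}

// runHooks computes the state only when there is something to run.
func runHooks(h hooks, name hookName) error {
	if !h.hasHook(name) {
		return nil
	}
	s := currentState()
	for _, fn := range h[name] {
		if err := fn(s); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	h := hooks{"poststart": {func(string) error { return nil }}}
	_ = runHooks(h, "prestart")  // no hooks of this kind: state is not computed
	_ = runHooks(h, "poststart") // one hook: state is computed once
	fmt.Println("state computed", stateCalls, "time(s)")
}
```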
Kir Kolyshkin
47dc185880 Add runc_nocriu build tag
This allows building a 17% smaller runc binary by not compiling in
checkpoint/restore support.

It turns out that the google.golang.org/protobuf package, used by go-criu,
is quite big, and the Go linker can't drop unused code if reflection is
used anywhere in the code.

Currently there's no alternative to using protobuf in go-criu, and since
not all users use c/r, let's provide them an option for a smaller
binary.

For reference, here are the top 10 biggest vendored packages, as reported
by gsa[1]:

$ gsa runc | grep vendor | head
│ 8.59%   │ google.golang.org/protobuf                  │ 1.3 MB │ vendor    │
│ 5.76%   │ github.com/opencontainers/runc              │ 865 kB │ vendor    │
│ 4.05%   │ github.com/cilium/ebpf                      │ 608 kB │ vendor    │
│ 2.86%   │ github.com/godbus/dbus/v5                   │ 429 kB │ vendor    │
│ 1.25%   │ github.com/urfave/cli                       │ 188 kB │ vendor    │
│ 0.90%   │ github.com/vishvananda/netlink              │ 135 kB │ vendor    │
│ 0.59%   │ github.com/sirupsen/logrus                  │ 89 kB  │ vendor    │
│ 0.56%   │ github.com/checkpoint-restore/go-criu/v6    │ 84 kB  │ vendor    │
│ 0.51%   │ golang.org/x/sys                            │ 76 kB  │ vendor    │
│ 0.47%   │ github.com/seccomp/libseccomp-golang        │ 71 kB  │ vendor    │

And here is the total binary size saving when `runc_nocriu` is used.

For non-stripped binaries:

$ gsa runc-cr runc-nocr | tail -3
│ -17.04% │ runc-cr                                  │ 15 MB    │ 12 MB    │ -2.6 MB │
│         │ runc-nocr                                │          │          │         │
└─────────┴──────────────────────────────────────────┴──────────┴──────────┴─────────┘

And for stripped binaries:

│ -17.01% │ runc-cr-stripped                         │ 11 MB    │ 8.8 MB   │ -1.8 MB │
│         │ runc-nocr-stripped                       │          │          │         │
└─────────┴──────────────────────────────────────────┴──────────┴──────────┴─────────┘

[1]: https://github.com/Zxilly/go-size-analyzer

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-12-09 11:19:23 -08:00
Kir Kolyshkin
c487840f75 Remove main package dependency on criurpc
Commit 7f64fb47 made the main package and runc/libcontainer's CriuOpts
depend on criu/rpc. This is not good; among other things, it
complicates making c/r optional.

Let's switch CriuOpts.ManageCgroupsMode to a string (yes, it's an API
breaking change) and move the cgroup mode string parsing to
libcontainer.

While at it, let's better document ManageCgroupsMode.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-12-09 11:19:23 -08:00
yangzhao.hjh
324fcea4ec Terminate execution for criu that does not meet version requirements
Signed-off-by: yangzhao.hjh <yangzhao.hjh@alibaba-inc.com>
2024-10-11 09:56:07 +08:00
Aleksa Sarai
1410a6988d rootfs: consolidate mountpoint creation logic
The logic for how we create mountpoints is spread over each mountpoint
preparation function, when in reality the behaviour is pretty uniform
with only a handful of exceptions. So just move it all to one function
that is easier to understand.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2024-07-25 14:16:05 +10:00
Kir Kolyshkin
e676dac523 libct/criu: simplify checkCriuFeatures
Since criu 2.12, rpcOpts is not needed when checking criu features.
As we require criu >= 3.0 in Checkpoint, we can remove rpcOpts.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-05-09 15:18:29 -07:00
Kir Kolyshkin
f6a8c9b816 libct: checkCriuFeatures: return underlying error
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-05-09 15:18:29 -07:00
Akihiro Suda
3f4a73d632 TestCheckpoint: skip on ErrCriuMissingFeatures
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2023-12-19 18:47:03 +09:00
Aleksa Sarai
cdff09ab87 rootfs: fix 'can we mount on top of /proc' check
Our previous test for whether we can mount on top of /proc incorrectly
assumed that it would only be called with bind-mount sources. This meant
that having a non bind-mount entry for a pseudo-filesystem (like
overlayfs) with a dummy source set to /proc on the host would let you
bypass the check, which could easily lead to security issues.

In addition, the check should be applied more uniformly to all mount
types, so fix that as well. And add some tests for some of the tricky
cases to make sure we protect against them properly.

Fixes: 331692baa7 ("Only allow proc mount if it is procfs")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-12-14 11:36:42 +11:00
Aleksa Sarai
8e8b136c49 tree-wide: use /proc/thread-self for thread-local state
With the idmap work, we will have a tainted Go thread in our
thread-group that has a different mount namespace to the other threads.
It seems that (due to some bad luck) the Go scheduler tends to make this
thread the thread-group leader in our tests, which results in very
baffling failures where /proc/self/mountinfo produces gibberish results.

In order to avoid this, switch to using /proc/thread-self for everything
that is thread-local. This primarily includes switching all file
descriptor paths (CLONE_FS), all of the places that check the current
cgroup (technically we never will run a single runc thread in a separate
cgroup, but better to be safe than sorry), and the aforementioned
mountinfo code. We don't need to do anything for the following because
the results we need aren't thread-local:

 * Checks that certain namespaces are supported by stat(2)ing
   /proc/self/ns/...

 * /proc/self/exe and /proc/self/cmdline are not thread-local.

 * While threads can be in different cgroups, we do not do this for the
   runc binary (or libcontainer) and thus we do not need to switch to
   the thread-local version of /proc/self/cgroups.

 * All of the CLONE_NEWUSER files are not thread-local because you
   cannot set the usernamespace of a single thread (setns(CLONE_NEWUSER)
   is blocked for multi-threaded programs).

Note that we have to use runtime.LockOSThread when we have an open
handle to a tid-specific procfs file that we are operating on multiple
times. Go can reschedule us such that we are running on a different
thread and then kill the original thread (causing -ENOENT or similarly
confusing errors). This is not strictly necessary for most usages of
/proc/thread-self (such as using /proc/thread-self/fd/$n directly) since
only operating on the actual inodes associated with the tid requires
this locking, but because of the pre-3.17 fallback for CentOS, we have
to do this in most cases.

In addition, CentOS's kernel is too old for /proc/thread-self, which
requires us to emulate it -- however in rootfs_linux.go, we are in the
container pid namespace but /proc is the host's procfs. This leads to
the incredibly frustrating situation where there is no way (on pre-4.1
Linux) to figure out which /proc/self/task/... entry refers to the
current tid. We can just use /proc/self in this case.

Yes this is all pretty ugly. I also wish it wasn't necessary.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-12-14 11:36:41 +11:00
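The combination of runtime.LockOSThread and /proc/thread-self described in 8e8b136c49 can be sketched in Go as follows. This is a simplified illustration: the helper name `withThreadSelf` is hypothetical, and it omits the fallback for kernels older than 3.17 that the commit message mentions.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
)

// withThreadSelf pins the goroutine to its OS thread while fn works on a
// thread-local procfs path, so the Go scheduler cannot move the goroutine
// to another thread in the middle of the operation.
func withThreadSelf(subpath string, fn func(path string) error) error {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()
	return fn("/proc/thread-self/" + subpath)
}

func main() {
	err := withThreadSelf("status", func(path string) error {
		data, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		fmt.Printf("read %d bytes from %s\n", len(data), path)
		return nil
	})
	if err != nil {
		// Expected on non-Linux systems or kernels without /proc/thread-self.
		fmt.Println("could not read thread-local procfs entry:", err)
	}
}
```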
Aleksa Sarai
ba0b5e2698 libcontainer: remove all mount logic from nsexec
With open_tree(OPEN_TREE_CLONE), it is possible to implement both the
id-mapped mounts and bind-mount source file descriptor logic entirely in
Go without requiring any complicated handling from nsexec.

However, implementing it the naive way (do the OPEN_TREE_CLONE in the
host namespace before the rootfs is set up -- which is what the existing
implementation did) exposes issues in how mount ordering (in particular
when handling mount sources from inside the container rootfs, but also
in relation to mount propagation) was handled for idmapped mounts and
bind-mount sources. In order to solve this problem completely, it is
necessary to spawn a thread which joins the container mount namespace
and provides mountfds when requested by the rootfs setup code (ensuring
that the mount order and mount propagation of the source of the
bind-mount are handled correctly). While the need to join the mount
namespace leads to other complications (such as with the usage of
/proc/self -- fixed in a later patch) the resulting code is still
reasonable and is the only real way to solve the issue.

This allows us to reduce the amount of C code we have in nsexec, as well
as simplifying a whole host of places that were made more complicated
with the addition of id-mapped mounts and the bind sourcefd logic.
Because we join the container namespace, we can continue to use regular
O_PATH file descriptors for non-id-mapped bind-mount sources (which
means we don't have to raise the kernel requirement for that case).

In addition, we can easily add support for id-mappings that don't match
the container's user namespace. The approach taken here is to use Go's
officially supported mechanism for spawning a process in a user
namespace, but (ab)use PTRACE_TRACEME to avoid actually having to exec a
different process. The most efficient way to implement this would be to
do clone() in cgo directly to run a function that just does
kill(getpid(), SIGSTOP) -- we can always switch to that if it turns out
this approach is too slow. It should be noted that the included
micro-benchmark seems to indicate this is Fast Enough(TM):

  goos: linux
  goarch: amd64
  pkg: github.com/opencontainers/runc/libcontainer/userns
  cpu: Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz
  BenchmarkSpawnProc
  BenchmarkSpawnProc-8        1670            770065 ns/op

Fixes: fda12ab101 ("Support idmap mounts on volumes")
Fixes: 9c444070ec ("Open bind mount sources from the host userns")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2023-12-14 11:36:40 +11:00
Kir Kolyshkin
efbebb39b5 libct: rename root to stateDir in struct Container
The name "root" (or "containerRoot") is confusing; one might think it is
the root of container's file system (the directory we chroot into).

Rename to stateDir for clarity.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-10-04 14:57:10 +11:00
Akihiro Suda
0a5cd69462 Merge pull request #3995 from kolyshkin/rm-unix-nolint
bump golangci-lint; remove nolint annotations for unix errno comparisons
2023-08-25 16:54:57 +09:00
Kir Kolyshkin
6a4870e4ac libct: better errors for hooks
When a hook has failed, the error message looks like this:

> error running hook: error running hook #1: exit status 1, stdout: ...

The two problems here are:
1. it is impossible to know what kind of hook it was;
2. the "error running hook" prefix is repeated.

Change that to:

> error running createContainer hook #1: exit status 1, stdout: ...

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-08-24 19:44:05 -07:00
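A message of the shape proposed in 6a4870e4ac can be produced with a single fmt.Errorf call that names the hook kind and wraps the underlying error. The sketch below is an assumption about how such code could look. The helper names and the zero-based hook numbering are not taken from the commit.

```go
package main

import (
	"errors"
	"fmt"
)

// succeed and fail are hypothetical hooks used for the demonstration.
func succeed() error { return nil }
func fail() error    { return errors.New("exit status 1") }

// runHooks runs hooks in order and reports the kind and index of the
// first one that fails, wrapping its error only once.
func runHooks(kind string, hooks []func() error) error {
	for i, h := range hooks {
		if err := h(); err != nil {
			return fmt.Errorf("error running %s hook #%d: %w", kind, i, err)
		}
	}
	return nil
}

func main() {
	err := runHooks("createContainer", []func() error{succeed, fail})
	fmt.Println(err) // error running createContainer hook #1: exit status 1
}
```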
Kir Kolyshkin
f62f0bdfbf Remove nolint annotations for unix errno comparisons
golangci-lint v1.54.2 comes with errorlint v1.4.4, which contains
the fix [1] whitelisting all errno comparisons for errors coming from
x/sys/unix.

Thus, these annotations are no longer necessary. Hooray!

[1] https://github.com/polyfloyd/go-errorlint/pull/47
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-08-24 17:28:10 -07:00
Aleksa Sarai
f81ef1493d libcontainer: sync: cleanup synchronisation code
This includes quite a few cleanups and improvements to the way we do
synchronisation. The core behaviour is unchanged, but switching to
embedding json.RawMessage into the synchronisation structure will allow
us to do more complicated synchronisation operations in future patches.

The ability to pass file descriptors through the synchronisation system
will be used as part of the idmapped-mount and bind-mount-source
features when switching that code to use the new mount API outside of
nsexec.c.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-08-15 19:54:24 -07:00
Kir Kolyshkin
38676931ed criu: do not add log file into error message
The log file name is now logged by logCriuErrors, so the error message no longer needs it.

While at it, there is no need to use var.String() with %s, as the fmt
package calls String() itself.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-08-03 10:39:33 -07:00
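The remark in 38676931ed about %s and String() can be demonstrated with a small Go example: for a value that implements fmt.Stringer, the fmt package calls String() on its own. The `reqType` type and its values are hypothetical.

```go
package main

import "fmt"

// reqType is a hypothetical request type that implements fmt.Stringer.
type reqType int

func (t reqType) String() string {
	switch t {
	case 0:
		return "DUMP"
	case 1:
		return "RESTORE"
	default:
		return "UNKNOWN"
	}
}

// implicit lets fmt call String() through the %s verb.
func implicit(t reqType) string {
	return fmt.Sprintf("unexpected response type %s", t)
}

// explicit calls String() by hand, which is redundant.
func explicit(t reqType) string {
	return fmt.Sprintf("unexpected response type %s", t.String())
}

func main() {
	fmt.Println(implicit(1))
	fmt.Println(explicit(1))
}
```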
Kir Kolyshkin
c77aaa3f95 criu checkpoint/restore: print errors from criu log
When criu fails, it does not give us much context to understand what
caused the error -- for that, we need to take a look into its
log file.

This is somewhat complicated to do (as you can see in parts of
checkpoint.bats removed by this commit), and not very user-friendly.

Add a function to find and log errors from criu logs, together with some
preceding context, in case either checkpoint or restore has failed.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-08-03 10:33:20 -07:00
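One way to picture what c77aaa3f95 describes (finding errors in a criu log and reporting them with some preceding context) is the Go sketch below. It is only an approximation under stated assumptions: the function name, the "Error" marker string, and the sample log lines are made up for this example and do not come from runc or criu.

```go
package main

import (
	"fmt"
	"strings"
)

// errorsWithContext returns every line containing "Error" together with
// up to n non-error lines that immediately precede it.
func errorsWithContext(log string, n int) []string {
	var out, ctx []string
	for _, line := range strings.Split(log, "\n") {
		if strings.Contains(line, "Error") {
			out = append(out, ctx...) // flush buffered context first
			out = append(out, line)
			ctx = ctx[:0]
			continue
		}
		ctx = append(ctx, line)
		if len(ctx) > n {
			ctx = ctx[1:] // keep only the most recent n lines
		}
	}
	return out
}

func main() {
	log := "step one\nstep two\nstep three\nError: dump failed\ncleanup\nError: exit\n"
	for _, line := range errorsWithContext(log, 2) {
		fmt.Println(line)
	}
}
```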
Kir Kolyshkin
e4478e9fff criuSwrk: simplify switch
1. Use "switch t" since we only check t.

2. Remove unneeded t assignment.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-08-03 10:33:20 -07:00
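The first point of e4478e9fff, using a tagged switch when every case tests the same variable, looks like this in Go. The `respType` type and its constants are hypothetical.

```go
package main

import "fmt"

// respType is a hypothetical response type used for illustration.
type respType int

const (
	typeDump respType = iota
	typeRestore
	typeNotify
)

// describe uses "switch t" because every case compares t against a
// constant, so the tagless "switch { case t == ...: }" form is not needed.
func describe(t respType) string {
	switch t {
	case typeDump:
		return "dump"
	case typeRestore:
		return "restore"
	case typeNotify:
		return "notify"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(describe(typeRestore)) // restore
}
```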
Kir Kolyshkin
cb981e510b libct: move criu-related stuff to separate file
No code change, only added periods to some comments to make godot happy.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2023-08-03 10:16:01 -07:00