Commit Graph

2844 Commits

Author SHA1 Message Date
Evan Phoenix
28475f12e3 Retry direct unix package calls if observing EINTR
Retry Recvfrom, Sendmsg, Readmsg, and Read as they can return EINTR.

Signed-off-by: Evan Phoenix <evan@phx.io>
2025-02-21 15:19:54 -08:00
Rodrigo Campos
b0b186e64d Merge pull request #4630 from kolyshkin/clean-path
libc/utils: simplify CleanPath
2025-02-13 13:59:23 -03:00
lfbzhm
c3c111d2a6 Merge pull request #4585 from kolyshkin/per-process-properties
Fix process/config properties merging
2025-02-13 18:47:32 +08:00
Rodrigo Campos
20727c62d5 Merge pull request #4598 from kolyshkin/go124
Add Go 1.24, drop Go 1.22
2025-02-13 07:46:47 -03:00
Kir Kolyshkin
0f88286077 Merge pull request #4470 from kolyshkin/strings-cut
Use strings.Cut and strings.CutPrefix where possible
2025-02-12 23:35:20 -08:00
Kir Kolyshkin
8db6ffbeef libc/utils: simplify CleanPath
This simplifies the code flow and basically removes the last
filepath.Clean, which is not necessary in either case:

 - for absolute path, single filepath.Clean is enough (as it is
   guaranteed to remove all dot and dot-dot elements);

 - for relative path, filepath.Rel calls Clean at the end
   (which is even documented).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-12 20:17:51 -08:00
Kir Kolyshkin
35a28ad0a4 Merge pull request #4596 from evanphx/evanphx/b-close-range
utils: Handle close_range more gracefully
2025-02-11 18:04:07 -08:00
Kir Kolyshkin
16d7336791 Require Go 1.23.x, drop Go 1.22 support
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-11 18:03:06 -08:00
Kir Kolyshkin
99f9ed94dc runc exec: fix setting process.Scheduler
Commit 770728e1 added Scheduler field into both Config and Process,
but forgot to add a mechanism to actually use Process.Scheduler.
As a result, runc exec does not set Process.Scheduler ever.

Fix it, and a test case (which fails before the fix).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-11 18:01:30 -08:00
Kir Kolyshkin
b9114d91e2 runc exec: fix setting process.ioPriority
Commit bfbd0305b added IOPriority field into both Config and Process,
but forgot to add a mechanism to actually use Process.IOPriority.
As a result, runc exec does not set Process.IOPriority ever.

Fix it, and a test case (which fails before the fix).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-11 18:01:30 -08:00
Kir Kolyshkin
73849e797f libct: simplify Caps inheritance
For all other properties that are available in both Config and Process,
the merging is performed by newInitConfig.

Let's do the same for Capabilities for the sake of code uniformity.

Also, thanks to the previous commit, we no longer have to make sure we
do not call capabilities.New(nil).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-11 18:01:30 -08:00
Kir Kolyshkin
049a5f76cf libct/cap: allow New(nil)
In runtime-spec, capabilities property is optional, but
libcontainer/capabilities panics when New(nil) is called.

Because of this, there's a kludge in finalizeNamespace to ensure
capabilities.New is not called with nil argument, and there's a
TestProcessEmptyCaps to ensure runc won't panic.

Let's fix this at the source, allowing libct/cap to work with nil
capabilities.

(The caller is fixed by the next commit.)

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-11 18:01:30 -08:00
Kir Kolyshkin
f26ec92221 libct: rm Rootless* properties from initConfig
They are passed in initConfig twice, so it does not make sense.

NB: the alternative to that would be to remove Config field from
initConfig, but it results in a much bigger patch and more maintenance
down the road.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-11 18:01:30 -08:00
Kir Kolyshkin
2a86c35768 libct: document initConfig and friends
This is one of the dark corners of runc / libcontainer, so let's shed
some light on it.

initConfig is a structure which is filled in [mostly] by newInitConfig,
and one of its hidden aspects is it contains a process config which is
the result of merge between the container and the process configs.

Let's document how all this happens, where the fields are coming from,
which one has a preference, and how it all works.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-11 18:01:30 -08:00
lfbzhm
74b35d8927 Merge pull request #4592 from kolyshkin/exec-nits
Improvements to how `runc exec` is handled
2025-02-10 18:32:33 +08:00
lfbzhm
bf0f67f7f2 Merge pull request #4597 from evanphx/evanphx/b-graceful-ambient
capabilities: be more graceful in resetting ambient
2025-02-10 18:28:52 +08:00
Kir Kolyshkin
c4eb0c61e1 libct: createExecFifo: optimize
Every time we call container.Config(), a new copy of
struct Config is created and returned, and we do it twice here.

Accessing container.config directly fixes this.

Fixes: 805b8c73d ("Do not create exec fifo in factory.Create")
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-06 22:56:11 -08:00
Kir Kolyshkin
746a5c23c9 libcontainer/configs/validate: improve rootlessEUIDMount
1. Avoid splitting mount data into []string if it does not contain
   options we're interested in. This should result in slightly less
   garbage to collect.

2. Use if / else if instead of continue, to make it clearer that
   we're processing one option at a time.

3. Print the whole option as a sting in an error message; practically
   this should not have any effect, it's just simpler.

4. Improve some comments.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-06 19:47:23 -08:00
Kir Kolyshkin
055041e874 libct: use strings.CutPrefix where possible
Using strings.CutPrefix (available since Go 1.20) instead of
strings.HasPrefix and/or strings.TrimPrefix makes the code
a tad more straightforward.

No functional change.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-06 19:42:35 -08:00
Kir Kolyshkin
259b71c042 libct/utils: stripRoot: rm useless HasPrefix
Using strings.HasPrefix with strings.TrimPrefix results in doing the
same thing (checking if prefix exists) twice. In this case, using
strings.TrimPrefix right away is sufficient.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-06 19:42:35 -08:00
Kir Kolyshkin
ecf74300c0 libct/cg/fscommon: GetCgroupParam*: unify
1. GetCgroupParamUint: drop strings.TrimSpace since it was already
   done by GetCgroupParamString.

2. GetCgroupParamInt: use GetCgroupParamString, drop strings.TrimSpace.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-06 19:42:35 -08:00
Kir Kolyshkin
ef983f5180 libct/cg/fscommon: ParseKeyValue: stricter check
It makes sense to report an error if a key or a value is empty,
as we don't expect anything like this.

Reported-by: Sebastiaan van Stijn <github@gone.nl>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-06 19:42:34 -08:00
Kir Kolyshkin
d83d533bba libct/cg/fscommon: GetValueByKey: use strings.CutPrefix
Using strings.CutPrefix (added in Go 1.20, see [1]) results in faster and
cleaner code with less allocations (as the code only allocates memory
for the value, and does it once).

While at it, improve the function documentation.

[1]: https://github.com/golang/go/issues/42537

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-06 19:42:34 -08:00
Kir Kolyshkin
f134871206 libct/cg/fscommon: ParseKeyValue: use strings.Cut
Using strings.Cut (added in Go 1.18, see [1]) results in faster and
cleaner code with less allocations (as we're not using a slice).

[1]: https://github.com/golang/go/issues/46336

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-06 19:42:02 -08:00
Kir Kolyshkin
e9855bdae9 libct/cg/fscommon: use strings.Cut in RDMA parser
Using strings.Cut (added in Go 1.18, see [1]) results in faster and
cleaner code with less allocations (as we're not using a slice).

Also, use switch in parseRdmaKV.

[1]: https://github.com/golang/go/issues/46336

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-06 19:42:02 -08:00
Kir Kolyshkin
930cd4944a libct/cg/fs2: use strings.Cut in parsePSIData
Using strings.Cut (added in Go 1.18, see [1]) results in faster and
cleaner code with less allocations (as we're not using a slice).

This code is tested by TestStatCPUPSI.

[1]: https://github.com/golang/go/issues/46336

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-06 19:42:02 -08:00
Kir Kolyshkin
40ce69cc9e libct/cg/fs2: use strings.Cut in setUnified
Using strings.Cut (added in Go 1.18, see [1]) results in faster and
cleaner code with less allocations (as we're not using a slice).

The code is tested by testCgroupResourcesUnified.

[1]: https://github.com/golang/go/issues/46336

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-06 19:42:02 -08:00
Kir Kolyshkin
037668e501 libct/cg/fs2: simplify parseCgroupFromReader
For cgroup v2, we always expect /proc/$PID/cgroup contents like this:

> 0::/user.slice/user-1000.slice/user@1000.service/app.slice/vte-spawn-f71c3fb8-519d-4e2d-b13e-9252594b1e05.scope

So, it does not make sense to parse it using strings.Split, we can just
cut the prefix and return the rest.

Code tested by TestParseCgroupFromReader.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-06 19:42:02 -08:00
Kir Kolyshkin
075cea3a45 libcontainer/cgroups/fs: some refactoring
Remove extra global constants that are only used in a single place and
make it harder to read the code.

Rename nanosecondsInSecond -> nsInSec.

This code is tested by unit tests.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-06 19:42:02 -08:00
Kir Kolyshkin
4271ecf73f libct/cg/fs: refactor getCpusetStat
Using strings.Cut (added in Go 1.18, see [1]) results in faster and
cleaner code with less allocations (as we're not using a slice). This
also drops the check for extra dash (we're unlikely to get it from the
kernel anyway).

While at it, rename min/max -> from/to to avoid collision with Go
min/max builtins.

This code is tested by TestCPUSetStats* tests.

[1]: https://github.com/golang/go/issues/46336

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-06 19:42:02 -08:00
Kir Kolyshkin
bfcd479c5d libct/cg/fs: getPercpuUsage: rm TODO
Nowadays strings.Fields are as fast as strings.SplitN so remove TODO.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-06 19:42:02 -08:00
Akihiro Suda
e6c4c9c680 Merge pull request #4616 from kolyshkin/fix-cross-build
libc/int/userns: add build tag to C file
2025-02-07 11:37:34 +09:00
Akihiro Suda
dadea505df Merge pull request #4612 from kolyshkin/fix-systemd-reload
libct/cg/sd: set the DeviceAllow property before DevicePolicy
2025-02-07 11:37:10 +09:00
Kir Kolyshkin
52f702af56 libct: earlier Rootless vs AdditionalGroups check
Since the UID/GID/AdditonalGroups fields are now numeric,
we can address the following TODO item in the code (added
by commit d2f49696 back in 2016):

> TODO: We currently can't do
> this check earlier, but if libcontainer.Process.User was typesafe
> this might work.

Move the check to much earlier phase, when we're preparing
to start a process in a container.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-06 17:49:17 -08:00
Kir Kolyshkin
7dc2486889 libct: switch to numeric UID/GID/groups
This addresses the following TODO in the code (added back in 2015
by commit 845fc65e5):

> // TODO: fix libcontainer's API to better support uid/gid in a typesafe way.

Historically, libcontainer internally uses strings for user, group, and
additional (aka supplementary) groups.

Yet, runc receives those credentials as part of runtime-spec's process,
which uses integers for all of them (see [1], [2]).

What happens next is:

1. runc start/run/exec converts those credentials to strings (a User
   string containing "UID:GID", and a []string for additional GIDs) and
   passes those onto runc init.
2. runc init converts them back to int, in the most complicated way
   possible (parsing container's /etc/passwd and /etc/group).

All this conversion and, especially, parsing is totally unnecessary,
but is performed on every container exec (and start).

The only benefit of all this is, a libcontainer user could use user and
group names instead of numeric IDs (but runc itself is not using this
feature, and we don't know if there are any other users of this).

Let's remove this back and forth translation, hopefully increasing
runc exec performance.

The only remaining need to parse /etc/passwd is to set HOME environment
variable for a specified UID, in case $HOME is not explicitly set in
process.Env. This can now be done right in prepareEnv, which simplifies
the code flow a lot. Alas, we can not use standard os/user.LookupId, as
it could cache host's /etc/passwd or the current user (even with the
osusergo tag).

PS Note that the structures being changed (initConfig and Process) are
never saved to disk as JSON by runc, so there is no compatibility issue
for runc users.

Still, this is a breaking change in libcontainer, but we never promised
that libcontainer API will be stable (and there's a special package
that can handle it -- github.com/moby/sys/user). Reflect this in
CHANGELOG.

For 3998.

[1]: https://github.com/opencontainers/runtime-spec/blob/v1.0.2/config.md#posix-platform-user
[2]: https://github.com/opencontainers/runtime-spec/blob/v1.0.2/specs-go/config.go#L86

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-06 17:49:17 -08:00
Brad Davidson
ccb589bd7d libc/int/userns: add build tag to C file
This fixes k3s cross-compilation on Windows, broken by commit
1912d5988b ("*: actually support joining a userns with a new
container").

[@kolyshkin: commit message]

Fixes: 1912d5988b
Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-05 14:09:58 -08:00
Kir Kolyshkin
d84388ae10 libct/cg/sd: set the DeviceAllow property before DevicePolicy
Every unit created by runc need daemon reload since systemd v230.
This breaks support for NVIDIA GPUs, see
https://github.com/opencontainers/runc/issues/3708#issuecomment-2216967210

A workaround is to set DeviceAllow before DevicePolicy.

Also:
 - add a test case (which fails before the fix) by @kolyshkin
 - better explain why we need empty DeviceAllow (by @cyphar)

Fixes 4568.

Reported-by: Jian Wen <wenjianhn@gmail.com>
Co-authored-by: Jian Wen <wenjianhn@gmail.com>
Co-authored-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-02-05 12:16:43 -08:00
Evan Phoenix
54fa0c5577 capabilities: be more graceful in resetting ambient
Similar to when SetAmbient() can fail, runc should be graceful about
ResetAmbient failing.

This functionality previously worked under gvisor, which doesn't
implement ambient capabilities atm. The hard error on reset broke gvisor
usage.

Signed-off-by: Evan Phoenix <evan@phx.io>
2025-02-04 16:28:06 -08:00
Kir Kolyshkin
6c9ddcc648 libct: switch from libct/devices to libct/cgroups/devices/config
Use the old package name as an alias to minimize the patch.

No functional change; this just eliminates a bunch of deprecation
warnings.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-01-31 16:51:09 -08:00
Kir Kolyshkin
200f56315e libct/devices: move config to libct/cg/devices/config
Currently, libcontainer/devices contains two things:

1. Device-related configuration data structures and accompanying
   methods. Those are used by runc itself, mostly by libct/cgroups.

2. A few functions (HostDevices, DeviceFromPath, GetDevices).
   Those are not used by runc directly, but have some external users
   (cri-o, microsoft/hcsshim), and they also have a few forks
   (containerd/pkg/oci, podman/pkg/util).

This commit moves (1) to a new separate package, config (under
libcontainer/cgroups/devices), adding a backward-compatible aliases
(marked as deprecated so we will be able to remove those later).

Alas it's not possible to move this to libcontainer/cgroups directly
because some IDs (Type, Rule, Permissions) are too generic, and renaming
them (to DeviceType, DeviceRule, DevicePermissions) will break backward
compatibility (mostly due to Rule being embedded into Device).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-01-31 16:51:09 -08:00
Aleksa Sarai
70e500e7d1 deps: update to github.com/cyphar/filepath-securejoin@v0.4.1
This release includes a minor breaking API change that requires us to
rework the types of our wrappers, but there is no practical behaviour
change.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2025-01-28 22:33:16 +11:00
Evan Phoenix
33315a0548 libcontainer: if close_range fails, fall back to the old way
Signed-off-by: Evan Phoenix <evan@phx.io>
2025-01-24 11:54:37 -08:00
Evan Phoenix
111e8dcc0d libcontainer: Use MaxInt32 as the last FD to match kernel size semantics
Signed-off-by: Evan Phoenix <evan@phx.io>
2025-01-24 11:53:36 -08:00
Evan Phoenix
7b26da9ee3 libcontainer: Prevent startup hang when CloseExecFrom errors
The previous logic caused runc to hang if CloseExecFrom returned an
error, as the defer waiting on logsDone never finished as the parent
process was never started (and it controls the closing of logsDone via
it's logsPipe).

This moves the defer to after we have started the parent, with means all
the logic related to managing the logsPipe should also be running.

Signed-off-by: Evan Phoenix <evan@phx.io>
2025-01-21 10:01:19 -08:00
Kir Kolyshkin
06f1e07655 libct: speedup process.Env handling
The current implementation sets all the environment variables passed in
Process.Env in the current process, one by one, then uses os.Environ to
read those back.

As pointed out in [1], this is slow, as runc calls os.Setenv for every
variable, and there may be a few thousands of those. Looking into how
os.Setenv is implemented, it is indeed slow, especially when cgo is
enabled.

Looking into why it was implemented the way it is, I found commit
9744d72c and traced it to [2], which discusses the actual reasons.
It boils down to these two:

 - HOME is not passed into container as it is set in setupUser by
   os.Setenv and has no effect on config.Env;
 - there is a need to deduplicate the environment variables.

Yet it was decided in [2] to not go ahead with this patch, but
later [3] was opened with the carry of this patch, and merged.

Now, from what I see:

1. Passing environment to exec is way faster than using os.Setenv and
   os.Environ (tests show ~20x speed improvement in a simple Go test,
   and ~3x improvement in real-world test, see below).
2. Setting environment variables in the runc context may result is some
   ugly side effects (think GODEBUG, LD_PRELOAD, or _LIBCONTAINER_*).
3. Nothing in runtime spec says that the environment needs to be
   deduplicated, or the order of preference (whether the first or the
   last value of a variable with the same name is to be used). We should
   stick to what we have in order to maintain backward compatibility.

So, this patch:
 - switches to passing env directly to exec;
 - adds deduplication mechanism to retain backward compatibility;
 - takes care to set PATH from process.Env in the current process
   (so that supplied PATH is used to find the binary to execute),
   also to retain backward compatibility;
 - adds HOME to process.Env if not set;
 - ensures any StartContainer CommandHook entries with no environment
   set explicitly are run with the same environment as before. Thanks
   to @lifubang who noticed that peculiarity.

The benchmark added by the previous commit shows ~3x improvement:

	                │   before    │                after                 │
	                │   sec/op    │    sec/op     vs base                │
	ExecInBigEnv-20   61.53m ± 1%   21.87m ± 16%  -64.46% (p=0.000 n=10)

[1]: https://github.com/opencontainers/runc/pull/1983
[2]: https://github.com/docker-archive/libcontainer/pull/418
[3]: https://github.com/docker-archive/libcontainer/pull/432

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-01-09 18:22:53 +08:00
Kir Kolyshkin
6171da6005 libct/configs: add HookList.SetDefaultEnv
1. Make CommandHook.Command a pointer, which reduces the amount of data
   being copied when using hooks, and allows to modify command hooks.

2. Add SetDefaultEnv, which is to be used by the next commit.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-01-09 18:22:53 +08:00
Kir Kolyshkin
390641d148 libct/int: improve TestExecInEnvironment
This is a slight refactor of TestExecInEnvironment, making it more
strict wrt checking the exec output.

1. Explain why DEBUG is added twice to the env.
2. Reuse the execEnv for the check.
3. Make the check more strict -- instead of looking for substrings,
   check line by line.
4. Add a check for extra environment variables.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-01-09 18:22:53 +08:00
Kir Kolyshkin
9a54594752 libct/int: add BenchmarkExecInBigEnv
Here's what it shows on my laptop (with -count 10 -benchtime 10s,
summarized by benchstat):

	                │   sec/op    │
	ExecTrue-20       8.477m ± 2%
	ExecInBigEnv-20   61.53m ± 1%

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-01-09 18:22:53 +08:00
Kir Kolyshkin
83350c24a9 libct/system: rm Fexecve
This helper was added for runc-dmz in commit dac417174, but runc-dmz was
later removed in commit 871057d, which forgot to remove the helper.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-01-03 13:57:05 -08:00
lfbzhm
d48d9cfefc Merge pull request #4459 from kolyshkin/prio-nits
Fixups to scheduler/priority settings
2024-12-25 23:41:27 +08:00