zishuo/runc

mirror of https://github.com/opencontainers/runc.git synced 2025-11-01 11:32:44 +08:00

Author	SHA1	Message	Date
Aleksa Sarai	cdce249635	merge branch 'pr-3057' Fraser Tweedale (1): chown cgroup to process uid in container namespace LGTMs: kolyshkin cyphar Closes #3057	2021-12-07 17:06:19 +11:00
Akihiro Suda	520702dac5	Add `runc features` command Fix issue 3274 See `types/features/features.go`. Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>	2021-11-30 16:40:39 +09:00
Fraser Tweedale	35d20c4e0b	chown cgroup to process uid in container namespace Delegating cgroups to the container enables more complex workloads, including systemd-based workloads. The OCI runtime-spec was recently updated to explicitly admit such delegation, through specification of cgroup ownership semantics: https://github.com/opencontainers/runtime-spec/pull/1123 Pursuant to the updated OCI runtime-spec, change the ownership of the container's cgroup directory and particular files therein, when using cgroups v2 and when the cgroupfs is to be mounted read/write. As a result of this change, systemd workloads can run in isolated user namespaces on OpenShift when the sandbox's cgroupfs is mounted read/write. It might be possible to implement this feature in other cgroup managers, but that work is deferred. Signed-off-by: Fraser Tweedale <ftweedal@redhat.com>	2021-11-30 08:52:59 +10:00
Aleksa Sarai	dde509df4e	specconv: do not permit null bytes in mount fields Using null bytes as control characters for sending strings via netlink opens us up to a user explicitly putting a null byte in a mount string (which JSON will happily let you do) and then causing us to open a mount path different to the one expected. In practice this is more of an issue in an environment such as Kubernetes where you may have path-based access control policies (which are more susceptible to these kinds of flaws). Found by Google Project Zero. Fixes: `9c444070ec` ("Open bind mount sources from the host userns") Reported-by: Felix Wilhelm <fwilhelm@google.com> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2021-11-19 11:41:05 +11:00
Kir Kolyshkin	643f8a2b40	libct/specconv: nits 1. Decapitalize errors. 2. Rename isValidName to checkPropertyName. 3. Make it return a specific error. Suggested-by: Sebastiaan van Stijn <github@gone.nl> Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-11-17 17:32:28 -08:00
Kir Kolyshkin	029b73c1b0	libct/spec: replace isValidName regex with a function Also, add a simple test and a benchmark (just out of sheer curiosity). Benchmark results: name old time/op new time/op delta IsValidName-4 540ns ± 3% 45ns ± 1% -91.76% (p=0.008 n=5+5) Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-11-12 20:23:49 -08:00
Kir Kolyshkin	6907becaf9	libct/specconv: remove isSecSuffix regex Commit `1cd71dfd7` added isSecSuffix, but the same thing can be done easily without a regex. This is faster and saves some init time and memory. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-11-12 20:23:47 -08:00
Kir Kolyshkin	37c5fd554e	libct/specconv: make parseMountOptions return Mount parseMountOption already returns way too many values, making the code kind of hard to read. Since all of the return values are used as is to populate the fields of configs.Mount, let's change it to return (semi-)populated *configs.Mount instead. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-11-12 20:23:30 -08:00
Kir Kolyshkin	2c3792baf9	libct/specconv: make mountFlags and extensionFlags global This makes the repeated calls to parseMountOptions faster, and decreases the amount of garbage to collect. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-11-12 20:16:04 -08:00
Kir Kolyshkin	81586e1935	libct/specconv: reuse mountPropagationMapping in parseMountOptions These two maps are the same, except that mountPropagationMapping has an extra element with key of "" and value of 0. Since the code already checks for f != 0, this extra element is not a problem. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-11-12 20:15:54 -08:00
Kir Kolyshkin	8fe1e8bf8c	libct/specconv: rm some init allocations Eliminate some of these allocations when starting runc: > init github.com/opencontainers/runc/libcontainer/specconv @10 ms, 0.11 ms clock, 5408 bytes, 70 allocs Most of this (4K) is the two regexes, which are left intact for now. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-11-12 20:15:37 -08:00
Neil Johnson	2e0ceaa935	fix createDevices when no Linux section Signed-off-by: Neil Johnson <najohnsn@us.ibm.com>	2021-10-04 17:37:19 -04:00
Kir Kolyshkin	097c6d7425	libct/cg: simplify getting cgroup manager 1. Make Rootless and Systemd flags part of config.Cgroups. 2. Make all cgroup managers (not just fs2) return error (so it can do more initialization -- added by the following commits). 3. Replace complicated cgroup manager instantiation in factory_linux by a single (and simple) libcontainer/cgroups/manager.New() function. 4. getUnifiedPath is simplified to check that only a single path is supplied (rather than checking that other paths, if supplied, are the same). [v2: can't -> cannot] Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-09-23 09:11:44 -07:00
Alban Crequy	2b025c0173	Implement Seccomp Notify This commit implements support for the SCMP_ACT_NOTIFY action. It requires libseccomp-2.5.0 to work but runc still works with older libseccomp if the seccomp policy does not use the SCMP_ACT_NOTIFY action. A new synchronization step between runc[INIT] and runc run is introduced to pass the seccomp fd. runc run fetches the seccomp fd with pidfd_get from the runc[INIT] process and sends it to the seccomp agent using SCM_RIGHTS. As suggested by @kolyshkin, we also make writeSync() a wrapper of writeSyncWithFd() and wrap the error there. To avoid pointless errors, we made some existing code paths just return the error instead of re-wrapping it. If we don't do it, error will look like: writing syncT <act>: writing syncT: <err> By adjusting the code path, now they just look like this writing syncT <act>: <err> Signed-off-by: Alban Crequy <alban@kinvolk.io> Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io> Co-authored-by: Rodrigo Campos <rodrigo@kinvolk.io>	2021-09-07 13:04:24 +02:00
Kir Kolyshkin	9ff64c3d97	*: rm redundant linux build tag For files that end with _linux.go or _linux_test.go, there is no need to specify linux build tag, as it is assumed from the file name. In addition, rename libcontainer/notify_linux_v2.go -> libcontainer/notify_v2_linux.go for the file name to make sense. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-08-30 20:15:00 -07:00
Kir Kolyshkin	3c7db3827c	Merge pull request #2883 from flouthoc/master Add support for rdma cgroup introduced in Linux Kernel 4.11	2021-08-30 20:02:04 -07:00
flouthoc	b3d14488b5	Add support for rdma cgroup introduced in Linux Kernel 4.11 Signed-off-by: Aditya Rajan <flouthoc.git@gmail.com>	2021-08-23 12:25:33 +05:30
Markus Lehtonen	17e3b41dd0	libcontainer/intelrdt: support ClosID parameter Handle ClosID parameter of IntelRdt. Makes it possible to use pre-configured classes/ClosIDs and avoid running out of available IDs which easily happens with per-container classes. Remove validator checks for empty L3CacheSchema and MemBwSchema fields in order to be able to leave them empty, and only specify ClosID for a pre-configured class. Signed-off-by: Markus Lehtonen <markus.lehtonen@intel.com>	2021-08-09 15:58:03 +03:00
Kir Kolyshkin	82498e3d77	libct/specconf: remove unneeded checks In cases we have something like if y != "" { x = y } where both x and y are strings, and x was not set before, it makes no sense to have a condition, as such code is equivalent to mere x = y Simplify such cases by removing "if". Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-06-28 12:45:28 -07:00
Kir Kolyshkin	7be93a66b9	*: fmt.Errorf: use %w when appropriate This should result in no change when the error is printed, but make the errors returned unwrappable, meaning errors.As and errors.Is will work. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-06-22 16:09:47 -07:00
Kir Kolyshkin	627a06ad92	Replace fmt.Errorf w/o %-style to errors.New Using fmt.Errorf for errors that do not have %-style formatting directives is an overkill. Switch to errors.New. Found by git grep fmt.Errorf \| grep -v ^vendor \| grep -v '%' Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-06-22 11:42:07 -07:00
Sebastiaan van Stijn	b31a9340f9	libcontainer: relax validation for absolute paths Commits `1f1e91b1a0` and `2192670a24` added validation for mountpoints to be an absolute path, to match the OCI specs. Unfortunately, the old behavior (accepting the path to be a relative path) has been around for a long time, and although "not according to the spec", various higher level runtimes rely on this behavior. While higher level runtimes have been updated to address this requirement, there will be a transition period before all runtimes are updated to carry these fixes. This patch relaxes the validation, to generate a WARNING instead of failing, allowing runtimes to update (but allowing them to update runc to the current version, which includes security fixes). We can remove this exception in a future patch release. Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2021-06-09 13:20:28 +02:00
Kir Kolyshkin	e6048715e4	Use gofumpt to format code gofumpt (mvdan.cc/gofumpt) is a fork of gofmt with stricter rules. Brought to you by git ls-files \*.go \| grep -v ^vendor/ \| xargs gofumpt -s -w Looking at the diff, all these changes make sense. Also, replace gofmt with gofumpt in golangci.yml. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-06-01 12:17:27 -07:00
Giuseppe Scrivano	c61f606254	libcontainer: honor seccomp defaultErrnoRet https://github.com/opencontainers/runtime-spec/pull/1087 added support for defaultErrnoRet to the OCI runtime specs. If a defaultErrnoRet is specified, disable patching the generated libseccomp cBPF. Closes: https://github.com/opencontainers/runc/issues/2943 Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2021-05-17 09:23:32 +02:00
Kir Kolyshkin	1f1e91b1a0	libct/specconv: check mount destination is absolute Per OCI runtime spec, mount destination MUST be absolute. Let's check that and return an error if not. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-04-20 11:26:16 -07:00
Qiang Huang	2d38476c96	Merge pull request #2840 from kolyshkin/ignore-kmem Ignore kernel memory settings	2021-04-13 09:44:14 +08:00
Kir Kolyshkin	52390d6804	Ignore kernel memory settings This is a somewhat radical approach to deal with kernel memory. Per-cgroup kernel memory limiting was always problematic. A few examples: - older kernels had bugs and were even oopsing sometimes (best example is RHEL7 kernel); - kernel is unable to reclaim the kernel memory so once the limit is hit a cgroup is toasted; - some kernel memory allocations don't allow failing. In addition to that, - users don't have a clue about how to set kernel memory limits (as the concept is much more complicated than e.g. [user] memory); - different kernels might have different kernel memory usage, which is sort of unexpected; - cgroup v2 does not have a [dedicated] kmem limit knob, and thus runc silently ignores kernel memory limits for v2; - kernel v5.4 made cgroup v1 kmem.limit obsolete (see https://github.com/torvalds/linux/commit/0158115f702b). In view of all this, and as the runtime-spec lists memory.kernel and memory.kernelTCP as OPTIONAL, let's ignore kernel memory limits (for cgroup v1, same as we're already doing for v2). This should result in fewer bugs and better user experience. The only bad side effect from it might be that stat can show kernel memory usage as 0 (since the accounting is not enabled). [v2: add a warning in specconv that limits are ignored] Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-04-12 12:18:11 -07:00
Kir Kolyshkin	27bb1bd5ea	libct/specconv/CreateCgroupConfig: don't set c.Parent default c.Parent is only used by systemd cgroup drivers, and both v1 and v2 drivers do have code to set the default if it is empty, so setting it here is redundant. In addition, in case of cgroup v2 rootless container setting it here is harmful as the default should be user.slice not system.slice. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-04-01 19:50:37 -07:00
Kir Kolyshkin	f0dec0b4bf	libct/specconv/CreateCgroupConfig: nit Do not call libcontainerUtils.CleanPath in case its result will not be used. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-04-01 19:33:01 -07:00
Iceber Gu	fa52df9493	libcontainer: fix the file mode of the device Signed-off-by: Iceber Gu <wei.cai-nat@daocloud.io>	2021-02-17 15:08:22 +08:00
Aleksa Sarai	7a8d7162f9	seccomp: prepend -ENOSYS stub to all filters Having -EPERM as the default was a fairly significant mistake from a future-proofing standpoint in that it makes any new syscall return a non-ignorable error (from glibc's point of view). We need to correct this now because faccessat2(2) is something glibc critically needs to have support for, but they're blocked on container runtimes because we return -EPERM unconditionally (leading to confusion in glibc). This is also a problem we're probably going to keep running into in the future. Unfortunately there are several issues which stop us from having a clean solution to this problem: 1. libseccomp has several limitations which require us to emulate behaviour we want: a. We cannot do logic based on syscall number, meaning we cannot specify a "largest known syscall number"; b. libseccomp doesn't know in which kernel version a syscall was added, and has no API for "minimum kernel version" so we cannot simply ask libseccomp to generate sane -ENOSYS rules for us. c. Additional seccomp rules for the same syscall are not treated as distinct rules -- if rules overlap, seccomp will merge them. This means we cannot add per-syscall -EPERM fallbacks; d. There is no inverse operation for SCMP_CMP_MASKED_EQ; e. libseccomp does not allow you to specify multiple rules for a single argument, making it impossible to invert OR rules for arguments. 2. The runtime-spec does not have any way of specifying: a. The errno for the default action; b. The minimum kernel version or "newest syscall at time of profile creation"; nor c. Which syscalls were intentionally excluded from the allow list (weird syscalls that are no longer used were excluded entirely, but Docker et al expect those syscalls to get EPERM not ENOSYS). 3. Certain syscalls should not return -ENOSYS (especially only for certain argument combinations) because this could also trigger glibc confusion. This means we have to return -EPERM for certain syscalls but not as a global default. 4. There is not an obvious (and reasonable) upper limit to syscall numbers, so we cannot create a set of rules for each syscall above the largest syscall number in libseccomp. This means we must handle inverse rules as described below. 5. Any syscall can be specified multiple times, which can make generation of hotfix rules much harder. As a result, we have to work around all of these things by coming up with a heuristic to stop the bleeding. In the future we could hopefully improve the situation in the runtime-spec and libseccomp. The solution applied here is to prepend a "stub" filter which returns -ENOSYS if the requested syscall has a larger syscall number than any syscall mentioned in the filter. The reason for this specific rule is that syscall numbers are (roughly) allocated sequentially and thus newer syscalls will (usually) have a larger syscall number -- thus causing our filters to produce -ENOSYS if the filter was written before the syscall existed. Sadly this is not a perfect solution because syscalls can be added out-of-order and the syscall table can contain holes for several releases. Unfortunately we do not have a nicer solution at the moment because there is no library which provides information about which Linux version a syscall was introduced in. Until that exists, this workaround will have to be good enough. The above behaviour only happens if the default action is a blocking action (in other words it is not SCMP_ACT_LOG or SCMP_ACT_ALLOW). If the default action is permissive then we don't do any patching. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2021-01-28 23:11:22 +11:00
Sebastiaan van Stijn	4fc2de77e9	libcontainer/devices: remove "Device" prefix from types Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2020-12-01 11:11:23 +01:00
Sebastiaan van Stijn	677baf22d2	libcontainer: isolate libcontainer/devices Move the Device-related types to libcontainer/devices, so that the package can be used in isolation. Aliases have been created in libcontainer/configs for backward compatibility. Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2020-12-01 11:11:21 +01:00
Ashok Pon Kumar	fcf210d631	Fix goreport warnings of ineffassign and misspell Signed-off-by: Ashok Pon Kumar <ashokponkumar@gmail.com>	2020-10-02 09:03:45 +05:30
Kenta Tada	3d5dec2f44	libcontainer: remove the unused variable from spec Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>	2020-10-01 12:29:48 +09:00
Sebastiaan van Stijn	8bf216728c	use string-concatenation instead of sprintf for simple cases Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2020-09-30 10:51:59 +02:00
Kir Kolyshkin	b006f4a180	libct/cgroups: support Cgroups.Resources.Unified Add support for unified resource map (as per [1]), and add some test cases for the new functionality. [1] https://github.com/opencontainers/runtime-spec/pull/1040 Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-09-24 15:29:35 -07:00
Giuseppe Scrivano	a63f99fcc5	Add support for umask Signed-off-by: Ashley Cui <acui@redhat.com>	2020-08-20 11:39:43 -04:00
Cesar Talledo	0709202da7	Remove runc default devices that overlap with spec devices. Runc has a set of default devices that it includes in Linux containers (e.g., /dev/null, /dev/random, /dev/tty, etc.). However, if the container's OCI spec includes all or a subset of those same devices, runc is currently not detecting the redundancy, causing it to create a libcontainer config that has redundant device configurations. This causes a failure in rootless mode, in particular when the /dev/tty device has a redundant config: container_linux.go:370: starting container process caused: process_linux.go:459: container init caused: rootfs_linux.go:70: creating device nodes caused: open /tmp/busyboxtest/rootfs/dev/tty: no such device or address The reason this fails in rootless mode only is that in this case runc sets up /dev/tty not by doing mknod (it's not allowed within a user-ns) but rather by creating a regular file under /dev/tty and bind-mounting the host's /dev/tty to the container's /dev/tty. When this operation is done redundantly, it fails the second time. This change fixes this problem by ensuring runc checks for redundant devices between the OCI spec it receives and the default devices it configures. If a redundant device is detected, the OCI spec takes priority. The change adds both a unit test and an integration test to verify the behavior. Without this fix, this new integration test fails as shown above. Signed-off-by: Cesar Talledo <ctalledo@nestybox.com>	2020-08-07 16:46:15 -07:00
Renaud Gaubert	ccdd75760c	Add the CreateRuntime, CreateContainer and StartContainer Hooks Signed-off-by: Renaud Gaubert <rgaubert@nvidia.com>	2020-06-17 02:10:00 +00:00
Kir Kolyshkin	4189cb65f8	cgroups: remove cgroup.Resources.CpuMax This (and the converting function) is only used by one of the four cgroup drivers. The other three do some checking and conversion in place, so let the fs2 do the same. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-06-09 17:15:38 -07:00
Giuseppe Scrivano	41aa19662b	libcontainer: honor seccomp errnoRet Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2020-05-20 09:11:55 +02:00
Aleksa Sarai	24388be71e	configs: use different types for .Devices and .Resources.Devices Making them the same type is simply confusing, but also means that you could accidentally use one in the wrong context. This eliminates that problem. This also includes a whole bunch of cleanups for the types within DeviceRule, so that they can be used more ergonomically. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2020-05-13 17:38:45 +10:00
Aleksa Sarai	60e21ec26e	specconv: remove default /dev/console access /dev/console is a host resource which gives a bunch of permissions that we really shouldn't be giving to containers, not to mention that /dev/console in containers is actually /dev/pts/$n. Drop this since arguably this is a fairly scary thing to allow... Signed-off-by: Aleksa Sarai <asarai@suse.de>	2020-05-13 17:38:45 +10:00
Aleksa Sarai	b2bec9806f	cgroup: devices: eradicate the Allow/Deny lists These lists have been in the codebase for a very long time, and have been unused for a large portion of that time -- specconv doesn't generate them and the only user of these flags has been tests (which doesn't inspire much confidence). In addition, we had an incorrect implementation of a white-list policy. This wasn't exploitable because all of our users explicitly specify "deny all" as the first rule, but it was a pretty glaring issue that came from the "feature" that users can select whether they prefer a white- or black- list. Fix this by always writing a deny-all rule (which is what our users were doing anyway, to work around this bug). This is one of many changes needed to clean up the devices cgroup code. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2020-05-13 17:38:45 +10:00
Pradyumna Agrawal	4aa9101477	Honor spec.Process.NoNewPrivileges in specconv.CreateLibcontainerConfig The change ensures that the passed in value of NoNewPrivileges under spec.Process is reflected in the container config generated by specconv.CreateLibcontainerConfig Closes #2397 Signed-off-by: Pradyumna Agrawal <pradyumnaa@vmware.com>	2020-05-11 13:38:14 -07:00
lifubang	d2a9c5da37	using default allowed devices when linux resources is null Signed-off-by: lifubang <lifubang@acmcoder.com>	2020-04-16 11:40:44 +08:00
Akihiro Suda	cc183ca662	Merge pull request #2242 from AkihiroSuda/vendor-systemd vendor: update go-systemd and godbus	2020-03-25 02:40:22 +09:00
Akihiro Suda	492d525e55	vendor: update go-systemd and godbus Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>	2020-03-16 13:26:03 +09:00
Akihiro Suda	aa269315a4	cgroup2: add CpuMax conversion Fix #2243 Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>	2020-03-13 02:58:39 +09:00

123 Commits