zishuo/runc

mirror of https://github.com/opencontainers/runc.git synced 2025-10-05 07:27:03 +08:00

Author	SHA1	Message	Date
Kir Kolyshkin	177c7d4f59	Fix codespell warnings ./features.go:30: tru ==> through, true ... ./utils_linux.go:147: infront ==> in front Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2024-05-23 22:02:45 -07:00
utam0k	bfbd0305ba	Add I/O priority Signed-off-by: utam0k <k0ma@utam0k.jp>	2024-03-30 22:31:54 +09:00
Aleksa Sarai	8e8b136c49	tree-wide: use /proc/thread-self for thread-local state With the idmap work, we will have a tainted Go thread in our thread-group that has a different mount namespace to the other threads. It seems that (due to some bad luck) the Go scheduler tends to make this thread the thread-group leader in our tests, which results in very baffling failures where /proc/self/mountinfo produces gibberish results. In order to avoid this, switch to using /proc/thread-self for everything that is thread-local. This primarily includes switching all file descriptor paths (CLONE_FS), all of the places that check the current cgroup (technically we never will run a single runc thread in a separate cgroup, but better to be safe than sorry), and the aforementioned mountinfo code. We don't need to do anything for the following because the results we need aren't thread-local: * Checks that certain namespaces are supported by stat(2)ing /proc/self/ns/... * /proc/self/exe and /proc/self/cmdline are not thread-local. * While threads can be in different cgroups, we do not do this for the runc binary (or libcontainer) and thus we do not need to switch to the thread-local version of /proc/self/cgroups. * All of the CLONE_NEWUSER files are not thread-local because you cannot set the usernamespace of a single thread (setns(CLONE_NEWUSER) is blocked for multi-threaded programs). Note that we have to use runtime.LockOSThread when we have an open handle to a tid-specific procfs file that we are operating on multiple times. Go can reschedule us such that we are running on a different thread and then kill the original thread (causing -ENOENT or similarly confusing errors). This is not strictly necessary for most usages of /proc/thread-self (such as using /proc/thread-self/fd/$n directly) since only operating on the actual inodes associated with the tid requires this locking, but because of the pre-3.17 fallback for CentOS, we have to do this in most cases. 
In addition, CentOS's kernel is too old for /proc/thread-self, which requires us to emulate it -- however in rootfs_linux.go, we are in the container pid namespace but /proc is the host's procfs. This leads to the incredibly frustrating situation where there is no way (on pre-4.1 Linux) to figure out which /proc/self/task/... entry refers to the current tid. We can just use /proc/self in this case. Yes this is all pretty ugly. I also wish it wasn't necessary. Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-12-14 11:36:41 +11:00
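The thread-local procfs access described in commit `8e8b136c49` can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not runc's actual helper: the names `threadSelfPath` and `readThreadLocal` are hypothetical, and the sketch leaves out the special case from the commit message where the container pid namespace and the host procfs disagree and `/proc/self` has to be used.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
	"syscall"
)

// threadSelfPath (hypothetical name) builds a procfs path that refers to the
// calling thread. It prefers /proc/thread-self and falls back to
// /proc/self/task/<tid> when the kernel does not provide /proc/thread-self.
func threadSelfPath(subpath string) string {
	base := "/proc/thread-self"
	if _, err := os.Stat(base); err != nil {
		base = "/proc/self/task/" + strconv.Itoa(syscall.Gettid())
	}
	return base + "/" + subpath
}

// readThreadLocal (hypothetical name) pins the goroutine to its OS thread
// while the tid-specific path is resolved and read, so that the Go scheduler
// cannot move it to another thread between the two steps.
func readThreadLocal(subpath string) ([]byte, error) {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()
	return os.ReadFile(threadSelfPath(subpath))
}

func main() {
	data, err := readThreadLocal("comm")
	if err != nil {
		fmt.Fprintln(os.Stderr, "read failed:", err)
		return
	}
	fmt.Printf("thread name: %s", data)
}
```

The lock is held only for the duration of the read; code that keeps a handle to a tid-specific file open across several operations would need to hold it for longer, as the commit message notes.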
Kir Kolyshkin	7396ca90fa	runc delete: do not ignore error from destroy If container.Destroy() has failed, runc destroy still returns 0, which is wrong and can result in other issues down the line. Let's always return error from destroy in runc delete. For runc checkpoint and runc run, we still treat it as a warning. Co-authored-by: Zhang Tianyang <burning9699@gmail.com> Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2023-11-27 09:15:39 -08:00
lfbzhm	95a93c132c	Merge pull request #4045 from fuweid/support-pidfd-socket [feature request] *: introduce pidfd-socket flag	2023-11-22 09:13:55 +08:00
Wei Fu	94505a046a	*: introduce pidfd-socket flag The container manager like containerd-shim can't use cgroup.kill feature or freeze all the processes in cgroup to terminate the exec init process. It's unsafe to call kill(2) since the pid can be recycled. It's good to provide the pidfd of init process through the pidfd-socket. It's similar to the console-socket. With the pidfd, the container manager like containerd-shim can send the signal to target process safely. And for the standard init process, we can have polling support to get exit event instead of blocking on wait4. Signed-off-by: Wei Fu <fuweid89@gmail.com>	2023-11-21 18:28:50 +08:00
Aleksa Sarai	7c71a22705	rootfs: remove --no-mount-fallback and finally fix MS_REMOUNT The original reasoning for this option was to avoid having mount options be overwritten by runc. However, adding command-line arguments has historically been a bad idea because it forces strict-runc-compatible OCI runtimes to copy out-of-spec features directly from runc and these flags are usually quite difficult to enable by users when using runc through several layers of engines and orchestrators. A far more preferable solution is to have a heuristic which detects whether copying the original mount's mount options would override an explicit mount option specified by the user. In this case, we should return an error. You only end up in this path in the userns case, if you have a bind-mount source with locked flags. During the course of writing this patch, I discovered that several aspects of our handling of flags for bind-mounts left much to be desired. We have completely botched the handling of explicitly cleared flags since commit `97f5ee4e6a` ("Only remount if requested flags differ from current"), with our behaviour only becoming increasingly more weird with `50105de1d8` ("Fix failure with rw bind mount of a ro fuse") and `da780e4d27` ("Fix bind mounts of filesystems with certain options set"). In short, we would only clear flags explicitly requested by the user purely by chance, in ways that it really should've been reported to us by now. The most egregious is that mounts explicitly marked "rw" were actually mounted "ro" if the bind-mount source was "ro" and no other special flags were included. In addition, our handling of atime was completely broken -- mostly due to how subtle the semantics of atime are on Linux. Unfortunately, while the runtime-spec requires us to implement mount(8)'s behaviour, several aspects of the util-linux mount(8)'s behaviour are broken and thus copying them makes little sense. 
Since the runtime-spec behaviour for this case (should mount options for a "bind" mount use the "mount --bind -o ..." or "mount --bind -o remount,..." semantics? Is the fallback code we have for userns actually spec-compliant?) and the mount(8) behaviour (see [1]) are not well-defined, this commit simply fixes the most obvious aspects of the behaviour that are broken while keeping the current spirit of the implementation. NOTE: The handling of atime in the base case is left for a future PR to deal with. This means that the atime of the source mount will be silently left alone unless the fallback path needs to be taken, and any flags not explicitly set will be cleared in the base case. Whether we should always be operating as "mount --bind -o remount,..." (where we default to the original mount source flags) is a topic for a separate PR and (probably) associated runtime-spec PR. So, to resolve this: * We store which flags were explicitly requested to be cleared by the user, so that we can detect whether the userns fallback path would end up setting a flag the user explicitly wished to clear. If so, we return an error because we couldn't fulfil the configuration settings. * Revert `97f5ee4e6a` ("Only remount if requested flags differ from current"), as missing flags do not mean we can skip MS_REMOUNT (in fact, missing flags are how you indicate a flag needs to be cleared with mount(2)). The original purpose of the patch was to fix the userns issue, but as mentioned above the correct mechanism is to do a fallback mount that copies the lockable flags from statfs(2). * Improve handling of atime in the fallback case by: - Correctly handling the returned flags in statfs(2). - Implement the MNT_LOCK_ATIME checks in our code to ensure we produce errors rather than silently producing incorrect atime mounts. 
* Improve the tests so we correctly detect all of these contingencies, including a general "bind-mount atime handling" test to ensure that the behaviour described here is accurate. This change also inlines the remount() function -- it was only ever used for the bind-mount remount case, and its behaviour is very bind-mount specific. [1]: https://github.com/util-linux/util-linux/issues/2433 Reverts: `97f5ee4e6a` ("Only remount if requested flags differ from current") Fixes: `50105de1d8` ("Fix failure with rw bind mount of a ro fuse") Fixes: `da780e4d27` ("Fix bind mounts of filesystems with certain options set") Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>	2023-10-24 17:28:25 +11:00
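The conflict check described in commit `7c71a22705` (error out when the userns fallback would set a flag the user explicitly cleared) can be sketched as below. This is an illustration only: `fallbackFlags` and the constant names are hypothetical, the bit values mirror the Linux `MS_*` constants, and atime handling is deliberately left out.

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative flag bits; the values mirror the Linux MS_* constants.
const (
	msRdonly uint32 = 1 << 0 // MS_RDONLY
	msNosuid uint32 = 1 << 1 // MS_NOSUID
	msNodev  uint32 = 1 << 2 // MS_NODEV
	msNoexec uint32 = 1 << 3 // MS_NOEXEC

	// Flags that may be locked on a bind-mount source in a user namespace.
	lockable = msRdonly | msNosuid | msNodev | msNoexec
)

var errLockedFlagConflict = errors.New("locked mount flag was explicitly cleared by the user")

// fallbackFlags (hypothetical helper) computes the flags for the retry remount.
//   requested: flags the user asked to set
//   cleared:   flags the user explicitly asked to clear (e.g. "rw", "exec")
//   source:    flags reported for the bind-mount source (e.g. via statfs(2))
// Locked source flags are copied into the result unless the user explicitly
// cleared one of them, in which case the configuration cannot be fulfilled.
func fallbackFlags(requested, cleared, source uint32) (uint32, error) {
	locked := source & lockable
	if conflict := locked & cleared; conflict != 0 {
		return 0, fmt.Errorf("%w: %#x", errLockedFlagConflict, conflict)
	}
	return requested | locked, nil
}

func main() {
	// Source is ro,nodev; the user asked for nosuid and cleared nothing.
	flags, err := fallbackFlags(msNosuid, 0, msRdonly|msNodev)
	fmt.Printf("flags=%#x err=%v\n", flags, err)

	// Source is ro, but the user explicitly asked for "rw".
	_, err = fallbackFlags(0, msRdonly, msRdonly)
	fmt.Println("err:", err)
}
```

Tracking the cleared set separately from the requested set is what lets the check distinguish "the user did not mention this flag" from "the user asked for this flag to be off".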
utam0k	770728e16e	Support `process.scheduler` Spec: https://github.com/opencontainers/runtime-spec/pull/1188 Fix: https://github.com/opencontainers/runc/issues/3895 Co-authored-by: lifubang <lifubang@acmcoder.com> Signed-off-by: utam0k <k0ma@utam0k.jp> Signed-off-by: lifubang <lifubang@acmcoder.com>	2023-10-04 15:53:18 +08:00
Ruediger Pluem	da780e4d27	Fix bind mounts of filesystems with certain options set Currently bind mounts of filesystems with nodev, nosuid, noexec, noatime, relatime, strictatime, nodiratime options set fail in rootless mode if the same options are not set for the bind mount. For ro filesystems this was resolved by #2570 by remounting again with ro set. Follow the same approach for nodev, nosuid, noexec, noatime, relatime, strictatime, nodiratime but allow to revert back to the old behaviour via the new `--no-mount-fallback` command line option. Add a testcase to verify that bind mounts of filesystems with nodev, nosuid, noexec, noatime options set work in rootless mode. Add a testcase that mounts a nodev, nosuid, noexec, noatime filesystem with a ro flag. Add two further testcases that ensure that the above testcases would fail if the `--no-mount-fallback` command line option is set. * contrib/completions/bash/runc: Add `--no-mount-fallback` command line option for bash completion. * create.go: Add `--no-mount-fallback` command line option. * restore.go: Add `--no-mount-fallback` command line option. * run.go: Add `--no-mount-fallback` command line option. * libcontainer/configs/config.go: Add `NoMountFallback` field to the `Config` struct to store the command line option value. * libcontainer/specconv/spec_linux.go: Add `NoMountFallback` field to the `CreateOpts` struct to store the command line option value and store it in the libcontainer config. * utils_linux.go: Store the command line option value in the `CreateOpts` struct. * libcontainer/rootfs_linux.go: In case that `--no-mount-fallback` is not set try to remount the bind filesystem again with the options nodev, nosuid, noexec, noatime, relatime, strictatime or nodiratime if they are set on the source filesystem. * tests/integration/mounts_sshfs.bats: Add testcases and rework sshfs setup to allow specifying different mount options depending on the test case. 
Signed-off-by: Ruediger Pluem <ruediger.pluem@vodafone.com>	2023-07-28 16:32:02 -07:00
Kir Kolyshkin	102b8abd26	libct: rm BaseContainer and Container interfaces The only implementation of these is linuxContainer. It does not make sense to have an interface with a single implementation, and we do not foresee other types of containers being added to runc. Remove BaseContainer and Container interfaces, moving their methods documentation to linuxContainer. Rename linuxContainer to Container. Adopt users from using interface to using struct. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-03-23 11:04:12 -07:00
Kir Kolyshkin	6a3fe1618f	libcontainer: remove LinuxFactory Since LinuxFactory has become the means to specify containers state top directory (aka --root), and is only used by two methods (Create and Load), it is easier to pass root to them directly. Modify all the users and the docs accordingly. While at it, fix Create and Load docs (those that were originally moved from the Factory interface docs). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-03-22 23:44:31 -07:00
Kir Kolyshkin	40b0088681	loadFactory: remove Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-02-18 16:05:29 -08:00
Kir Kolyshkin	36786c361a	list, utils: remove redundant code The value of root is already an absolute path since commit `ede8a86ec1`, so it does not make sense to call filepath.Abs() again. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-02-18 16:05:29 -08:00
Kir Kolyshkin	dbd990d555	libct: rm intelrdt.Manager interface, NewIntelRdtManager Remove intelrdt.Manager interface, since we only have a single implementation, and do not expect another one. Rename intelRdtManager to Manager, and modify its users accordingly. Remove NewIntelRdtManager from factory. Remove IntelRdtfs. Instead, make intelrdt.NewManager return nil if the feature is not available. Remove TestFactoryNewIntelRdt as it is now identical to TestFactoryNew. Add internal function newManager to be used for tests (to make sure some testing is done even when the feature is not available in kernel/hardware). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-02-03 17:33:03 -08:00
Kir Kolyshkin	39bd7b7217	libct: Container, Factory: rm newuidmap/newgidmap These were introduced in commit `d8b669400` back in 2017, with a TODO of "make binary names configurable". Apparently, everyone is happy with the hardcoded names. In fact, they are configurable (by prepending the PATH with a directory containing own version of newuidmap/newgidmap). Now, these binaries are only needed in a few specific cases (when rootless is set etc.), so let's look them up only when needed. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-02-03 11:40:29 -08:00
Kir Kolyshkin	6e1d476aad	runc: remove --criu option This was introduced in an initial commit, back in the day when criu was a highly experimental thing. Today it's not; most users who need it have it packaged by their distro vendor. The usual way to run a binary is to look it up in directories listed in $PATH. This is flexible enough and allows for multiple scenarios (custom binaries, extra binaries, etc.). This is the way criu should be run. Make --criu a hidden option (thus removing it from help). Remove the option from man pages, integration tests, etc. Remove all traces of CriuPath from data structures. Add a warning that --criu is ignored and will be removed. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2022-01-26 20:25:56 -08:00
Kir Kolyshkin	86733013cc	notify_socket: setupSpec: drop ctx arg and return value Those were never used (ctx was added by the initial commit, and error was added by commit `25fd4a6757`). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-11-29 20:10:22 -08:00
Kir Kolyshkin	3648346572	tty: ClosePostStart: rm return value It is not and was not ever used. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-11-29 20:10:22 -08:00
Kir Kolyshkin	f3f4b6d155	tty: recvtty: rm process arg It is not used since commit `00a0ecf554`. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-11-29 20:10:22 -08:00
Kir Kolyshkin	e63186351b	tty: rm inheritStdio return value Since commit `eebdb644f9` this function never returns any error. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-11-29 20:10:22 -08:00
Kir Kolyshkin	d23b810927	checkpoint: rm getDefaultImagePath arg It was never used. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-11-29 20:10:22 -08:00
Kir Kolyshkin	0202c398ff	runc exec: implement --cgroup In some setups, multiple cgroups are used inside a container, and sometimes there is a need to execute a process in a particular sub-cgroup (in case of cgroup v1, for a particular controller). This is what this commit implements. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-09-27 10:25:42 -07:00
Kir Kolyshkin	097c6d7425	libct/cg: simplify getting cgroup manager 1. Make Rootless and Systemd flags part of config.Cgroups. 2. Make all cgroup managers (not just fs2) return error (so it can do more initialization -- added by the following commits). 3. Replace complicated cgroup manager instantiation in factory_linux by a single (and simple) libcontainer/cgroups/manager.New() function. 4. getUnifiedPath is simplified to check that only a single path is supplied (rather than checking that other paths, if supplied, are the same). [v2: can't -> cannot] Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-09-23 09:11:44 -07:00
Kir Kolyshkin	9ba2f65d6b	startContainer: minor refactor All three callers* of startContainer call revisePidFile and createSpec before calling it, so it makes sense to move those calls to inside of the startContainer, and drop the spec argument. * -- in fact restore does not call revisePidFile, but it should. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-09-14 10:53:11 -07:00
Kir Kolyshkin	6c4a3b13d1	runc init: pass _LIBCONTAINER_LOGLEVEL as int Instead of passing _LIBCONTAINER_LOGLEVEL as a string (like "debug" or "info"), use a numeric value. Also, simplify the init log level passing code -- since we actually use the same level as the runc binary, just get it from logrus. This is a preparation for the next commit. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-09-09 14:57:20 -07:00
Kir Kolyshkin	0a3577c680	utils_linux: simplify newProcess newProcess does not need those extra arguments, they can be handled in the caller. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-09-09 14:57:20 -07:00
Akihiro Suda	5fb9b2a006	Merge pull request #3185 from kolyshkin/go117-build-tags Add go:build tags	2021-09-02 13:35:33 +09:00
Kir Kolyshkin	9ff64c3d97	*: rm redundant linux build tag For files that end with _linux.go or _linux_test.go, there is no need to specify linux build tag, as it is assumed from the file name. In addition, rename libcontainer/notify_linux_v2.go -> libcontainer/notify_v2_linux.go for the file name to make sense. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-08-30 20:15:00 -07:00
lifubang	cb824629ba	proposal: add --keep to runc run Signed-off-by: lifubang <lifubang@acmcoder.com>	2021-08-02 12:51:36 -07:00
Kir Kolyshkin	a7cfb23b88	*: stop using pkg/errors Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-06-22 16:09:47 -07:00
Kir Kolyshkin	e6048715e4	Use gofumpt to format code gofumpt (mvdan.cc/gofumpt) is a fork of gofmt with stricter rules. Brought to you by git ls-files \*.go \| grep -v ^vendor/ \| xargs gofumpt -s -w Looking at the diff, all these changes make sense. Also, replace gofmt with gofumpt in golangci.yml. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-06-01 12:17:27 -07:00
Kir Kolyshkin	719d70d2e3	setupIO: simplify code There's no need to check err for nil. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2021-01-07 13:33:41 -08:00
Xiaochen Shen	325a74ddec	libcontainer/intelrdt: rm init() from intelrdt.go Use sync.Once to init Intel RDT when needed for a small speedup to operations which do not require Intel RDT. Simplify IntelRdtManager initialization in LinuxFactory. Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2020-12-16 23:37:31 +08:00
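The switch from `init()` to lazy initialization in commit `325a74ddec` follows the usual `sync.Once` pattern, sketched below. The names `isRDTEnabled` and `probeRDT` are hypothetical, and checking for the `/sys/fs/resctrl` directory is only a stand-in for the real feature detection, which this sketch does not attempt to reproduce.

```go
package main

import (
	"fmt"
	"os"
	"sync"
)

var (
	rdtOnce    sync.Once
	rdtEnabled bool
	probeCount int // counts how often the probe actually ran
)

// probeRDT (hypothetical) stands in for the expensive detection step.
// Here it only checks whether the resctrl directory exists.
func probeRDT() bool {
	probeCount++
	_, err := os.Stat("/sys/fs/resctrl")
	return err == nil
}

// isRDTEnabled runs the probe on first use instead of at program start,
// so commands that never ask about Intel RDT do not pay for the probe.
func isRDTEnabled() bool {
	rdtOnce.Do(func() {
		rdtEnabled = probeRDT()
	})
	return rdtEnabled
}

func main() {
	fmt.Println("RDT enabled:", isRDTEnabled())
	fmt.Println("RDT enabled:", isRDTEnabled())
	fmt.Println("probe ran", probeCount, "time(s)")
}
```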
Xiaochen Shen	f62ad4a0de	libcontainer/intelrdt: rename CAT and MBA enabled flags Rename CAT and MBA enabled flags to be consistent with others. No functional change. Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2020-11-10 15:32:01 +08:00
Sebastiaan van Stijn	8bf216728c	use string-concatenation instead of sprintf for simple cases Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2020-09-30 10:51:59 +02:00
Kir Kolyshkin	b79cb04886	runc run/exec: fix terminal wrt stdin redirection This fixes the following failure: > sudo runc run -b bundle ctr </dev/null > WARN[0000] exit status 2 > ERRO[0000] container_linux.go:367: starting container process caused: process_linux.go:459: container init caused: The "exit status 2" with no error message is caused by SIGHUP which is sent to init by the kernel when we are losing the controlling terminal. If we choose to ignore that, we'll get panic in console.Current(), which is addressed by [1]. Otherwise, the issue here is simple: the code assumes stdin is opened to a terminal, and fails to work otherwise. Some standard Linux tools (e.g. stty, top) do the same (modulo panic), while some others (reset, tput) use the trick of trying all the three std streams (starting with stderr as it is least likely to be redirected), and if all three fail, open /dev/tty. This commit does a similar thing (see initHostConsole). It also replaces the call to console.Current(), which may panic (see [1]), by reusing the t.hostConsole. Finally, a simple test case is added. Fixes: https://github.com/opencontainers/runc/issues/2485 [1] https://github.com/containerd/console/pull/37 Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-08-20 07:56:20 -07:00
zvier	92e2175de1	cleancode: clean code for utils_linux.go Signed-off-by: Jeff Zvier <zvier20@gmail.com>	2020-07-23 06:12:27 +08:00
John Hwang	5aa0601a59	validateProcessSpec: prevent SEGV when config is valid json, but invalid. Signed-off-by: John Hwang <John.F.Hwang@gmail.com>	2020-05-18 09:38:22 -07:00
John Hwang	7fc291fd45	Replace formatted errors when unneeded Signed-off-by: John Hwang <John.F.Hwang@gmail.com>	2020-05-16 18:13:21 -07:00
Kir Kolyshkin	2b31437caa	Merge pull request #2281 from AkihiroSuda/rootless-systemd cgroup v2: support rootless systemd LGTMs: kolyshkin, mrunalp	2020-05-07 21:45:52 -07:00
Akihiro Suda	bf15cc99b1	cgroup v2: support rootless systemd Tested with both Podman (master) and Moby (master), on Ubuntu 19.10 . $ podman --cgroup-manager=systemd run -it --rm --runtime=runc \ --cgroupns=host --memory 42m --cpus 0.42 --pids-limit 42 alpine / # cat /proc/self/cgroup 0::/user.slice/user-1001.slice/user@1001.service/user.slice/libpod-132ff0d72245e6f13a3bbc6cdc5376886897b60ac59eaa8dea1df7ab959cbf1c.scope / # cat /sys/fs/cgroup/user.slice/user-1001.slice/user@1001.service/user.slice/libpod-132ff0d72245e6f13a3bbc6cdc5376886897b60ac59eaa8dea1df7ab959cbf1c.scope/memory.max 44040192 / # cat /sys/fs/cgroup/user.slice/user-1001.slice/user@1001.service/user.slice/libpod-132ff0d72245e6f13a3bbc6cdc5376886897b60ac59eaa8dea1df7ab959cbf1c.scope/cpu.max 42000 100000 / # cat /sys/fs/cgroup/user.slice/user-1001.slice/user@1001.service/user.slice/libpod-132ff0d72245e6f13a3bbc6cdc5376886897b60ac59eaa8dea1df7ab959cbf1c.scope/pids.max 42 Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>	2020-05-08 12:39:20 +09:00
Kir Kolyshkin	c52a598d74	Remove fatalf() It was only used in one place, all others are happy with `fatal(fmt.Errorf())`. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2020-05-02 16:19:14 -07:00
Mrunal Patel	33c6125da6	systemd: Export IsSystemdRunning() function Signed-off-by: Mrunal Patel <mrunalp@gmail.com>	2020-03-30 15:24:06 -07:00
Akihiro Suda	cc183ca662	Merge pull request #2242 from AkihiroSuda/vendor-systemd vendor: update go-systemd and godbus	2020-03-25 02:40:22 +09:00
Ted Yu	0a7762c664	Avoid duplicate calls to runner#destroy Signed-off-by: Ted Yu <yuzhihong@gmail.com>	2020-03-23 09:04:38 -07:00
Akihiro Suda	492d525e55	vendor: update go-systemd and godbus Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>	2020-03-16 13:26:03 +09:00
Giuseppe Scrivano	25fd4a6757	sd-notify: do not hang when NOTIFY_SOCKET is used with create if NOTIFY_SOCKET is used, do not block the main runc process waiting for events on the notify socket. Bind mount the parent directory of the notify socket, so that "start" can create the socket and it is still accessible from the container. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2020-03-12 21:21:05 +01:00
Mrunal Patel	eb4aeed24f	Merge pull request #2038 from imxyb/defer-destroy `r.destroy` can defer exec in `runner.run` method.	2019-05-07 15:48:14 -07:00
Georgi Sabev	ba3cabf932	Improve nsexec logging * Simplify logging function * Logs contain __FUNCTION__:__LINE__ * Bail uses write_log Co-authored-by: Julia Nedialkova <julianedialkova@hotmail.com> Co-authored-by: Danail Branekov <danailster@gmail.com> Signed-off-by: Georgi Sabev <georgethebeatle@gmail.com>	2019-04-22 17:53:52 +03:00
Xiao YongBiao	da5a2dd456	`r.destroy` can defer exec in `runner.run` method. Signed-off-by: Xiao YongBiao <xyb4638@gmail.com>	2019-04-10 23:25:03 +08:00


100 Commits