mirror of https://github.com/opencontainers/runc.git synced 2025-09-26 19:41:35 +08:00

Files

Aleksa Sarai 121192ade6 libct: reset CPU affinity by default

In certain deployments, it's possible for runc to be spawned by a
process with a restrictive cpumask (such as from a systemd unit with
CPUAffinity=... configured) which will be inherited by runc and thus the
container process by default.

The cpuset cgroup used to reconfigure the cpumask automatically for
joining processes, but kcommit da019032819a ("sched: Enforce user
requested affinity") changed this behaviour in Linux 6.2.

The solution is to try to emulate the expected behaviour by resetting
our cpumask to correspond with the configured cpuset (in the case of
"runc exec", if the user did not configure an alternative one). Normally
we would have to parse /proc/stat and /sys/fs/cgroup, but luckily
sched_setaffinity(2) will transparently convert an all-set cpumask (even
if it has more entries than the number of CPUs on the system) to the
correct value for our usecase.

For some reason, in our CI it seems that rootless --systemd-cgroup
results in the cpuset (presumably temporarily?) being configured such
that sched_setaffinity(2) will allow the full set of CPUs. For this
particular case, all we care about is that it is different to the
original set, so include some special-casing (but we should probably
investigate this further...).

Reported-by: ningmingxiao <ning.mingxiao@zte.com.cn>
Reported-by: Martin Sivak <msivak@redhat.com>
Reported-by: Peter Hunt <pehunt@redhat.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

2025-08-28 08:25:46 +10:00

apparmor

libct/apparmor: don't use vars for public functions

2025-04-14 13:59:39 -07:00

capabilities

Fix staticcheck ST1020/ST1021 warnings

2025-03-25 16:06:44 -07:00

configs

libcontainer/configs/validate: check that intelrdt is enabled

2025-08-01 10:03:54 +03:00

devices

Switch to opencontainers/cgroups

2025-02-28 15:20:33 -08:00

exeseal

ci: bump codespell to v2.4.1, fix some typos

2025-03-24 10:05:22 -07:00

integration

Use for range over integers

2025-03-31 17:15:06 -07:00

intelrdt

Merge pull request #4831 from marquiz/devel/rdt-root

2025-08-24 02:15:54 -03:00

internal/userns

ci: bump codespell to v2.4.1, fix some typos

2025-03-24 10:05:22 -07:00

keys

libct/*: remove linux build tag from some pkgs

2021-08-30 20:52:07 -07:00

logs

init: don't special-case logrus fds

2024-01-24 00:20:59 +11:00

nsenter

runc exec: implement CPU affinity

2025-03-02 19:17:41 -08:00

seccomp

build(seccomp): Add audit support for loong64

2025-07-16 09:39:11 +08:00

specconv

Add support for Linux Network Devices

2025-06-18 15:52:30 +01:00

system

int/linux: add/use Exec

2025-03-26 14:16:53 -07:00

user

deprecate libcontainer/user

2023-09-19 10:22:29 +02:00

userns

libcontainer/userns: migrate to github.com/moby/sys/userns

2024-10-09 22:20:25 +08:00

utils

Use any instead of interface{}

2025-03-31 17:15:06 -07:00

console_linux.go

int/linux: add/use Dup3, Open, Openat

2025-03-26 14:16:53 -07:00

container_linux_test.go

Switch to opencontainers/cgroups

2025-02-28 15:20:33 -08:00

container_linux.go

Make state.json 25% smaller

2025-03-19 15:51:52 -07:00

container.go

libct: rm BaseContainer and Container interfaces

2022-03-23 11:04:12 -07:00

criu_disabled_linux.go

Add runc_nocriu build tag

2024-12-09 11:19:23 -08:00

criu_linux.go

criu: simplify isOnTmpfs check in prepareCriuRestoreMounts

2025-05-20 16:56:55 -07:00

criu_opts_linux.go

expose criu options for link remap and skip in flight

2025-02-25 10:35:31 -05:00

env_test.go

libct: Override HOME if its set to the empty string

2025-04-04 15:37:22 +02:00

env.go

libct: Override HOME if its set to the empty string

2025-04-04 15:37:22 +02:00

error.go

add ErrCgroupNotExist

2024-09-23 23:27:35 +00:00

factory_linux_test.go

Use any instead of interface{}

2025-03-31 17:15:06 -07:00

factory_linux.go

libct: State: ensure Resources is not nil

2025-06-19 10:24:16 -07:00

init_linux.go

libct: we should set envs after we are in the jail of the container

2025-04-01 15:22:29 +00:00

message_linux.go

libcontainer: remove all mount logic from nsexec

2023-12-14 11:36:40 +11:00

mount_linux_test.go

mount: add string representation of mount flags

2025-04-21 13:00:59 +10:00

mount_linux.go

mount: add string representation of mount flags

2025-04-21 13:00:59 +10:00

network_linux.go

Add support for Linux Network Devices

2025-06-18 15:52:30 +01:00

notify_linux_test.go

Remove io/ioutil use

2021-10-14 13:46:02 -07:00

notify_linux.go

Remove io/ioutil use

2021-10-14 13:46:02 -07:00

notify_v2_linux.go

Switch to opencontainers/cgroups

2025-02-28 15:20:33 -08:00

process_linux.go

libct: reset CPU affinity by default

2025-08-28 08:25:46 +10:00

process.go

runc exec: implement CPU affinity

2025-03-02 19:17:41 -08:00

README.md

Switch to opencontainers/cgroups

2025-02-28 15:20:33 -08:00

restored_process.go

libcontainer: remove LinuxFactory

2022-03-22 23:44:31 -07:00

rootfs_linux_test.go

libcontainer: force apps to think fips is enabled/disabled for testing

2024-04-10 18:58:34 -04:00

rootfs_linux.go

rootfs: remove /proc/net/dev from allowed overmount list

2025-07-20 15:40:37 +10:00

setns_init_linux.go

int/linux: add/use Exec

2025-03-26 14:16:53 -07:00

SPEC.md

ci/gha: add space-at-eol check, fix existing issues

2023-06-07 11:27:27 -07:00

standard_init_linux.go

int/linux: add/use Dup3, Open, Openat

2025-03-26 14:16:53 -07:00

state_linux_test.go

Use any instead of interface{}

2025-03-31 17:15:06 -07:00

state_linux.go

Switch to opencontainers/cgroups

2025-02-28 15:20:33 -08:00

stats_linux.go

Switch to opencontainers/cgroups

2025-02-28 15:20:33 -08:00

sync_unix.go

int/linux: add/use Recvfrom

2025-03-26 14:16:53 -07:00

sync.go

Use any instead of interface{}

2025-03-31 17:15:06 -07:00

README.md

libcontainer

Libcontainer provides a native Go implementation for creating containers with namespaces, cgroups, capabilities, and filesystem access controls. It lets you manage the lifecycle of a container and perform additional operations after the container is created.

Container

A container is a self-contained execution environment that shares the kernel of the host system and is (optionally) isolated from other containers in the system.

Using libcontainer

Container init

Because containers are spawned in a two-step process, you need a binary that will be executed as the init process for the container. In libcontainer, the current binary (/proc/self/exe) is executed as the init process with the argument "init". We call this first step "bootstrap", so you always need an "init" function as the entry point of "bootstrap".

In addition to the Go init function, the early-stage bootstrap is handled by importing nsenter.

For details on how runc implements such "init", see init.go and libcontainer/init_linux.go.
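The re-exec pattern described above can be sketched with the standard library alone. This is a minimal illustration, not runc's implementation: the helper names isInitInvocation and bootstrapCommand are hypothetical, and the body of init() is a placeholder where a real program would hand control to libcontainer's init code (see the files referenced above).

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// selfExe is the path the text above names for re-executing the current binary.
const selfExe = "/proc/self/exe"

// isInitInvocation reports whether the process was started as the
// container init, i.e. with "init" as its first argument.
// (Hypothetical helper, introduced only for this sketch.)
func isInitInvocation(args []string) bool {
	return len(args) > 1 && args[1] == "init"
}

// bootstrapCommand builds, but does not start, the command a parent
// process would use to re-execute itself as the init process.
// (Hypothetical helper, introduced only for this sketch.)
func bootstrapCommand() *exec.Cmd {
	cmd := exec.Command(selfExe, "init")
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	return cmd
}

// init runs before main. When the binary was re-executed with "init",
// this is where the container bootstrap would take over.
func init() {
	if isInitInvocation(os.Args) {
		// Placeholder: a real program would call into libcontainer here
		// and would not return to main.
		fmt.Println("running as container init (placeholder)")
		os.Exit(0)
	}
}

func main() {
	cmd := bootstrapCommand()
	fmt.Println("parent would re-exec:", cmd.Path, cmd.Args[1:])
}
```

Keeping the check in a Go init function means it runs before main, so the same binary can act both as the container manager and as the container's init process depending only on its first argument.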

Device management

If you want containers that have access to some devices, you need to import this package into your code:

    import (
        _ "github.com/opencontainers/cgroups/devices"
    )

Without this import, the libcontainer cgroup manager won't be able to set up device access rules, and will fail if devices are specified in the container configuration.

Container creation

To create a container you first have to create a configuration struct describing how the container is to be created. A sample would look similar to this:

defaultMountFlags := unix.MS_NOEXEC | unix.MS_NOSUID | unix.MS_NODEV
var devices []*devices.Rule
for _, device := range specconv.AllowedDevices {
	devices = append(devices, &device.Rule)
}
config := &configs.Config{
	Rootfs: "/your/path/to/rootfs",
	Capabilities: &configs.Capabilities{
		Bounding: []string{
			"CAP_KILL",
			"CAP_AUDIT_WRITE",
		},
		Effective: []string{
			"CAP_KILL",
			"CAP_AUDIT_WRITE",
		},
		Permitted: []string{
			"CAP_KILL",
			"CAP_AUDIT_WRITE",
		},
	},
	Namespaces: configs.Namespaces([]configs.Namespace{
		{Type: configs.NEWNS},
		{Type: configs.NEWUTS},
		{Type: configs.NEWIPC},
		{Type: configs.NEWPID},
		{Type: configs.NEWUSER},
		{Type: configs.NEWNET},
		{Type: configs.NEWCGROUP},
	}),
	Cgroups: &configs.Cgroup{
		Name:   "test-container",
		Parent: "system",
		Resources: &configs.Resources{
			MemorySwappiness: nil,
			Devices:          devices,
		},
	},
	MaskPaths: []string{
		"/proc/kcore",
		"/sys/firmware",
	},
	ReadonlyPaths: []string{
		"/proc/sys", "/proc/sysrq-trigger", "/proc/irq", "/proc/bus",
	},
	Devices:  specconv.AllowedDevices,
	Hostname: "testing",
	Mounts: []*configs.Mount{
		{
			Source:      "proc",
			Destination: "/proc",
			Device:      "proc",
			Flags:       defaultMountFlags,
		},
		{
			Source:      "tmpfs",
			Destination: "/dev",
			Device:      "tmpfs",
			Flags:       unix.MS_NOSUID | unix.MS_STRICTATIME,
			Data:        "mode=755",
		},
		{
			Source:      "devpts",
			Destination: "/dev/pts",
			Device:      "devpts",
			Flags:       unix.MS_NOSUID | unix.MS_NOEXEC,
			Data:        "newinstance,ptmxmode=0666,mode=0620,gid=5",
		},
		{
			Device:      "tmpfs",
			Source:      "shm",
			Destination: "/dev/shm",
			Data:        "mode=1777,size=65536k",
			Flags:       defaultMountFlags,
		},
		{
			Source:      "mqueue",
			Destination: "/dev/mqueue",
			Device:      "mqueue",
			Flags:       defaultMountFlags,
		},
		{
			Source:      "sysfs",
			Destination: "/sys",
			Device:      "sysfs",
			Flags:       defaultMountFlags | unix.MS_RDONLY,
		},
	},
	UIDMappings: []configs.IDMap{
		{
			ContainerID: 0,
			HostID: 1000,
			Size: 65536,
		},
	},
	GIDMappings: []configs.IDMap{
		{
			ContainerID: 0,
			HostID: 1000,
			Size: 65536,
		},
	},
	Networks: []*configs.Network{
		{
			Type:    "loopback",
			Address: "127.0.0.1/0",
			Gateway: "localhost",
		},
	},
	Rlimits: []configs.Rlimit{
		{
			Type: unix.RLIMIT_NOFILE,
			Hard: uint64(1025),
			Soft: uint64(1025),
		},
	},
}
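To make the UIDMappings and GIDMappings entries in the sample concrete, the following sketch shows how such a mapping translates an ID inside the container to an ID on the host. It assumes the usual user-namespace semantics (a contiguous range of Size IDs starting at ContainerID maps onto the range starting at HostID). The idMap type and toHostID function are local stand-ins written for this illustration, not part of libcontainer.

```go
package main

import "fmt"

// idMap mirrors the three fields used in the sample configuration
// (ContainerID, HostID, Size). It is a local stand-in for illustration,
// not the configs.IDMap type.
type idMap struct {
	ContainerID int64
	HostID      int64
	Size        int64
}

// toHostID translates an ID inside the container to the host ID it maps to.
// The second result is false when no mapping covers the ID.
func toHostID(maps []idMap, id int64) (int64, bool) {
	for _, m := range maps {
		if id >= m.ContainerID && id < m.ContainerID+m.Size {
			return m.HostID + (id - m.ContainerID), true
		}
	}
	return 0, false
}

func main() {
	// Same values as the sample configuration above.
	maps := []idMap{{ContainerID: 0, HostID: 1000, Size: 65536}}
	for _, id := range []int64{0, 33, 65535, 65536} {
		host, ok := toHostID(maps, id)
		fmt.Printf("container id %d -> host id %d (mapped: %v)\n", id, host, ok)
	}
}
```

With the sample values, root (ID 0) inside the container corresponds to host ID 1000, and any container ID of 65536 or above has no host counterpart.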

Once you have the configuration populated you can create a container with a specified ID under a specified state directory:

container, err := libcontainer.Create("/run/containers", "container-id", config)
if err != nil {
	logrus.Fatal(err)
	return
}

To spawn bash as the initial process inside the container and have the process's pid available in order to wait on, signal, or kill the process:

process := &libcontainer.Process{
	Args:   []string{"/bin/bash"},
	Env:    []string{"PATH=/bin"},
	User:   "daemon",
	Stdin:  os.Stdin,
	Stdout: os.Stdout,
	Stderr: os.Stderr,
	Init:   true,
}

// scope err to each check so it is not redeclared with := later on.
if err := container.Run(process); err != nil {
	container.Destroy()
	logrus.Fatal(err)
}

// wait for the process to finish.
if _, err := process.Wait(); err != nil {
	logrus.Fatal(err)
}

// destroy the container.
container.Destroy()

Additional ways to interact with a running container are:

// return all the pids for all processes running inside the container.
processes, err := container.Processes()

// get detailed cpu, memory, io, and network statistics for the container and
// its processes.
stats, err := container.Stats()

// pause all processes inside the container.
container.Pause()

// resume all paused processes.
container.Resume()

// send signal to container's init process.
container.Signal(signal)

// update container resource constraints.
container.Set(config)

// get current status of the container.
status, err := container.Status()

// get current container's state information.
state, err := container.State()

Checkpoint & Restore

libcontainer now integrates CRIU for checkpointing and restoring containers. This lets you save the state of a process running inside a container to disk, and then restore that state into a new process, on the same machine or on another machine.

criu version 1.5.2 or higher is required to use checkpoint and restore. If you don't already have criu installed, you can build it from source, following the online instructions. criu is also installed in the docker image generated when building libcontainer with docker.
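A program that depends on checkpoint and restore may want to verify the minimum version stated above before using the feature. The sketch below only compares dotted version strings; how the installed criu version is obtained is left out, and parseVersion and atLeast are hypothetical helpers written for this illustration rather than libcontainer functions.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion turns "major.minor" or "major.minor.sublevel" into three
// integers; a missing sublevel is treated as 0.
func parseVersion(s string) ([3]int, error) {
	var v [3]int
	parts := strings.Split(strings.TrimSpace(s), ".")
	if len(parts) < 2 || len(parts) > 3 {
		return v, fmt.Errorf("unexpected version format %q", s)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return v, fmt.Errorf("invalid version component %q in %q", p, s)
		}
		v[i] = n
	}
	return v, nil
}

// atLeast reports whether have >= want, comparing component by component.
func atLeast(have, want [3]int) bool {
	for i := range have {
		if have[i] != want[i] {
			return have[i] > want[i]
		}
	}
	return true
}

func main() {
	// Minimum version taken from the requirement stated above.
	minimum := [3]int{1, 5, 2}
	for _, s := range []string{"1.5.1", "1.5.2", "3.17"} {
		v, err := parseVersion(s)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("version %s meets minimum: %v\n", s, atLeast(v, minimum))
	}
}
```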

Copyright and license

Code and documentation copyright 2014 Docker, Inc. The code and documentation are released under the Apache 2.0 license. The documentation is also released under the Creative Commons Attribution 4.0 International License. You may obtain a copy of the license, titled CC-BY-4.0, at http://creativecommons.org/licenses/by/4.0/.