This was added in2ee9cbbd12("It's /proc/stat, not /proc/stats") with no actual justification, and doesn't really make much sense on further inspection: * /proc/net is a symlink to "self/net", which means that /proc/net/dev is a per-process file, and so overmounting it would only affect pid1. Any other program that cares about /proc/net/dev would see their own process's configuration, and unprivileged processes wouldn't be able to see /proc/1/... data anyway. In addition, the fact that this is a symlink means that runc will deny the overmount because /proc/1/net/dev is not in the proc overmount allowlist. This means that this has not worked for many years, and probably never worked in the first place. * /proc/self/net is already namespaced with network namespaces, so the primary argument for allowing /proc overmounts (lxcfs-like masking of procfs files to emulate namespacing for files that are not properly namespaced for containers -- such as /proc/cpuinfo) is moot. It goes without saying that lxcfs has never overmounted /proc/self/net/... files, so the general "because lxcfs" justification doesn't hold water either. * The kernel has slowly been moving towards blocking overmounts in /proc/self/. Linux 6.12 blocked overmounts for fd, fdinfo, and map_files; future Linux versions will probably end up blocking everything under /proc/self/. Fixes:2ee9cbbd12("It's /proc/stat, not /proc/stats") Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
libcontainer
Libcontainer provides a native Go implementation for creating containers with namespaces, cgroups, capabilities, and filesystem access controls. It allows you to manage the lifecycle of the container performing additional operations after the container is created.
Container
A container is a self contained execution environment that shares the kernel of the host system and which is (optionally) isolated from other containers in the system.
Using libcontainer
Container init
Because containers are spawned in a two step process you will need a binary that will be executed as the init process for the container. In libcontainer, we use the current binary (/proc/self/exe) to be executed as the init process, and use arg "init", we call the first step process "bootstrap", so you always need a "init" function as the entry of "bootstrap".
In addition to the go init function the early stage bootstrap is handled by importing nsenter.
For details on how runc implements such "init", see init.go and libcontainer/init_linux.go.
Device management
If you want containers that have access to some devices, you need to import this package into your code:
import (
_ "github.com/opencontainers/cgroups/devices"
)
Without doing this, libcontainer cgroup manager won't be able to set up device access rules, and will fail if devices are specified in the container configuration.
Container creation
To create a container you first have to create a configuration struct describing how the container is to be created. A sample would look similar to this:
defaultMountFlags := unix.MS_NOEXEC | unix.MS_NOSUID | unix.MS_NODEV
var devices []*devices.Rule
for _, device := range specconv.AllowedDevices {
devices = append(devices, &device.Rule)
}
config := &configs.Config{
Rootfs: "/your/path/to/rootfs",
Capabilities: &configs.Capabilities{
Bounding: []string{
"CAP_KILL",
"CAP_AUDIT_WRITE",
},
Effective: []string{
"CAP_KILL",
"CAP_AUDIT_WRITE",
},
Permitted: []string{
"CAP_KILL",
"CAP_AUDIT_WRITE",
},
},
Namespaces: configs.Namespaces([]configs.Namespace{
{Type: configs.NEWNS},
{Type: configs.NEWUTS},
{Type: configs.NEWIPC},
{Type: configs.NEWPID},
{Type: configs.NEWUSER},
{Type: configs.NEWNET},
{Type: configs.NEWCGROUP},
}),
Cgroups: &configs.Cgroup{
Name: "test-container",
Parent: "system",
Resources: &configs.Resources{
MemorySwappiness: nil,
Devices: devices,
},
},
MaskPaths: []string{
"/proc/kcore",
"/sys/firmware",
},
ReadonlyPaths: []string{
"/proc/sys", "/proc/sysrq-trigger", "/proc/irq", "/proc/bus",
},
Devices: specconv.AllowedDevices,
Hostname: "testing",
Mounts: []*configs.Mount{
{
Source: "proc",
Destination: "/proc",
Device: "proc",
Flags: defaultMountFlags,
},
{
Source: "tmpfs",
Destination: "/dev",
Device: "tmpfs",
Flags: unix.MS_NOSUID | unix.MS_STRICTATIME,
Data: "mode=755",
},
{
Source: "devpts",
Destination: "/dev/pts",
Device: "devpts",
Flags: unix.MS_NOSUID | unix.MS_NOEXEC,
Data: "newinstance,ptmxmode=0666,mode=0620,gid=5",
},
{
Device: "tmpfs",
Source: "shm",
Destination: "/dev/shm",
Data: "mode=1777,size=65536k",
Flags: defaultMountFlags,
},
{
Source: "mqueue",
Destination: "/dev/mqueue",
Device: "mqueue",
Flags: defaultMountFlags,
},
{
Source: "sysfs",
Destination: "/sys",
Device: "sysfs",
Flags: defaultMountFlags | unix.MS_RDONLY,
},
},
UIDMappings: []configs.IDMap{
{
ContainerID: 0,
HostID: 1000,
Size: 65536,
},
},
GIDMappings: []configs.IDMap{
{
ContainerID: 0,
HostID: 1000,
Size: 65536,
},
},
Networks: []*configs.Network{
{
Type: "loopback",
Address: "127.0.0.1/0",
Gateway: "localhost",
},
},
Rlimits: []configs.Rlimit{
{
Type: unix.RLIMIT_NOFILE,
Hard: uint64(1025),
Soft: uint64(1025),
},
},
}
Once you have the configuration populated you can create a container with a specified ID under a specified state directory:
container, err := libcontainer.Create("/run/containers", "container-id", config)
if err != nil {
logrus.Fatal(err)
return
}
To spawn bash as the initial process inside the container and have the processes pid returned in order to wait, signal, or kill the process:
process := &libcontainer.Process{
Args: []string{"/bin/bash"},
Env: []string{"PATH=/bin"},
User: "daemon",
Stdin: os.Stdin,
Stdout: os.Stdout,
Stderr: os.Stderr,
Init: true,
}
err := container.Run(process)
if err != nil {
container.Destroy()
logrus.Fatal(err)
return
}
// wait for the process to finish.
_, err := process.Wait()
if err != nil {
logrus.Fatal(err)
}
// destroy the container.
container.Destroy()
Additional ways to interact with a running container are:
// return all the pids for all processes running inside the container.
processes, err := container.Processes()
// get detailed cpu, memory, io, and network statistics for the container and
// it's processes.
stats, err := container.Stats()
// pause all processes inside the container.
container.Pause()
// resume all paused processes.
container.Resume()
// send signal to container's init process.
container.Signal(signal)
// update container resource constraints.
container.Set(config)
// get current status of the container.
status, err := container.Status()
// get current container's state information.
state, err := container.State()
Checkpoint & Restore
libcontainer now integrates CRIU for checkpointing and restoring containers. This lets you save the state of a process running inside a container to disk, and then restore that state into a new process, on the same machine or on another machine.
criu version 1.5.2 or higher is required to use checkpoint and restore.
If you don't already have criu installed, you can build it from source, following the
online instructions. criu is also installed in the docker image
generated when building libcontainer with docker.
Copyright and license
Code and documentation copyright 2014 Docker, inc. The code and documentation are released under the Apache 2.0 license. The documentation is also released under Creative Commons Attribution 4.0 International License. You may obtain a copy of the license, titled CC-BY-4.0, at http://creativecommons.org/licenses/by/4.0/.