This updates the handling of capabilities to match the runtime specification,
as changed in https://github.com/opencontainers/runtime-spec/pull/1094.
Prior to that change, the specification required runtimes to produce a (fatal)
error if a container configuration requested capabilities that could not be
granted (either the capability is "unknown" to the runtime, not supported by the
kernel version in use, or not available in the environment that the runtime
operates in).
This caused problems in situations where the runtime was running in a restricted
environment (for example, docker-in-docker), or where there was a mismatch
between the list of capabilities known to higher-level runtimes and the list
known to the OCI runtime.
Some examples:
- Kernel 5.8 introduced CAP_PERFMON, CAP_BPF, and CAP_CHECKPOINT_RESTORE
capabilities. Docker 20.10.0 (the "higher-level runtime") shipped with
an updated list of capabilities, and when creating a "privileged" container,
would determine what capabilities are known by the kernel in use, and request
all those capabilities (by including them in the container config).
However, runc did not yet have an updated list of capabilities, and therefore
rejected the container specification, producing an error because the new
capabilities were "unknown".
- When running nested containers, for example, when running docker-in-docker,
the "inner" container may be using a more recent version of docker than the
"outer" container. In this situation, the "outer" container may be missing
capabilities that the inner container expects to be supported (based on
kernel version). Starting the inner container would then fail, because the OCI
runtime could not grant those capabilities (they are not available in the
environment it is running in).
WARN (but otherwise ignore) capabilities that cannot be granted
--------------------------------------------------------------------------------
This patch changes the handling to WARN (but otherwise ignore) capabilities that
are requested in the container config but cannot be granted. This relieves
higher-level runtimes of the need to detect which capabilities are supported (by
the kernel, and in the current environment), and avoids failures in situations
where the higher-level runtime is aware of capabilities that are not (yet)
supported by runc.
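The warn-and-ignore behavior can be illustrated with a minimal Go sketch. This
is not the code in this patch: `filterCapabilities` and the `available` set are
hypothetical names chosen for illustration, and the sketch assumes the runtime
already knows which capabilities it can grant in the current environment.

```go
package main

import (
	"fmt"
	"os"
)

// filterCapabilities splits the requested capability names into those that
// are present in the available set and those that are not.
// Hypothetical helper for illustration; runc's actual implementation differs.
func filterCapabilities(requested []string, available map[string]bool) (granted, ignored []string) {
	for _, c := range requested {
		if available[c] {
			granted = append(granted, c)
		} else {
			ignored = append(ignored, c)
		}
	}
	return granted, ignored
}

func main() {
	// Assumed example data: the set of capabilities the runtime can grant.
	available := map[string]bool{"CAP_CHOWN": true, "CAP_NET_BIND_SERVICE": true}
	requested := []string{"CAP_CHOWN", "CAP_BPF", "CAP_PERFMON"}

	granted, ignored := filterCapabilities(requested, available)
	if len(ignored) > 0 {
		// Warn instead of returning a fatal error.
		fmt.Fprintf(os.Stderr, "warning: ignoring unknown or unavailable capabilities: %v\n", ignored)
	}
	fmt.Println("granted:", granted)
}
```

The point of the sketch is that an ungrantable capability results in a warning
on stderr and a shorter granted list, rather than a failure to start.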
Impact on security
--------------------------------------------------------------------------------
Given that `capabilities` is an "allow-list", ignoring unknown capabilities does
not pose a security risk; in the worst case, a container does not get all
requested capabilities granted and, as a result, some actions may fail.
Backward-compatibility
--------------------------------------------------------------------------------
This change should be fully backward compatible. Higher-level runtimes that
already dynamically adjust the list of requested capabilities can continue to do
so. Runtimes that do not adjust will see an improvement (containers can start
even if some of the requested capabilities are not granted). Container processes
MAY fail (as described in "Impact on security"), but users can debug this
situation either by looking at the warnings produced by the OCI runtime, or by
using tools such as `capsh` / `libcap` to get the list of actual capabilities in
the container.
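As an alternative to `capsh`, the effective capability set can also be checked
from inside the container by reading the `CapEff` field of `/proc/self/status`.
The following Go sketch is only an illustration; `parseCapEff` and `hasCap` are
hypothetical helper names, and the bit numbers are those of the capabilities
added in kernel 5.8.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// Bit positions of the capabilities introduced in kernel 5.8.
const (
	capPerfmon           = 38
	capBPF               = 39
	capCheckpointRestore = 40
)

// parseCapEff extracts the effective capability bitmask from the contents of
// a /proc/<pid>/status file. Hypothetical helper for illustration.
func parseCapEff(status string) (uint64, error) {
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 2 && fields[0] == "CapEff:" {
			// The mask is printed as a hexadecimal number.
			return strconv.ParseUint(fields[1], 16, 64)
		}
	}
	return 0, fmt.Errorf("CapEff field not found")
}

// hasCap reports whether the given capability bit is set in the mask.
func hasCap(mask uint64, bit uint) bool {
	return mask&(1<<bit) != 0
}

func main() {
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	mask, err := parseCapEff(string(data))
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Printf("CAP_PERFMON effective: %v\n", hasCap(mask, capPerfmon))
	fmt.Printf("CAP_BPF effective: %v\n", hasCap(mask, capBPF))
	fmt.Printf("CAP_CHECKPOINT_RESTORE effective: %v\n", hasCap(mask, capCheckpointRestore))
}
```

Running this inside the container shows whether a capability that was requested
in the config actually ended up in the process's effective set.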
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>