Skip to content
This repository has been archived by the owner on Feb 8, 2023. It is now read-only.

Update runc 2019 0213 #19

Merged
merged 14 commits into from
Feb 13, 2019
Merged

Update runc 2019 0213 #19

merged 14 commits into from
Feb 13, 2019

Conversation

Ace-Tang
Copy link
Collaborator

update 14 commit from opencontainer/runc.
diff is opencontainers/runc@f5b9991...0a012df

The important patch is fix cve-2019-5736
NOTE: after this merged, runc can not run on linux kernel version < 3.11, see opencontainers#1979

Tom Godkin and others added 14 commits February 13, 2019 14:22
The kernel will sometimes return EINVAL when writing a pid to a
cgroup.procs file. It does so when the task being added still has the
state TASK_NEW.

See: https://elixir.bootlin.com/linux/v4.8/source/kernel/sched/core.c#L8286

Co-authored-by: Danail Branekov <[email protected]>

Signed-off-by: Tom Godkin <[email protected]>
Signed-off-by: Danail Branekov <[email protected]>
CRIU 3.11 introduces configuration files:

https://criu.org/Configuration_files
https://lisas.de/~adrian/posts/2018-Nov-08-criu-configuration-files.html

This enables the user to influence CRIU's behaviour without code changes
if using new CRIU features or if the user wants to enable certain CRIU
behaviour without always specifying certain options.

With this it is possible to write 'tcp-established' to the configuration
file:

$ echo tcp-established > /etc/criu/runc.conf

and from now on all checkpoints will preserve the state of established
TCP connections. This removes the need to always use

$ runc checkpoint --tcp-stablished

If the goal is to always checkpoint with '--tcp-established'

It also adds the possibility for unexpected CRIU behaviour if the user
created a configuration file at some point in time and forgets about it.

As a result of the discussion in opencontainers#1933
it is now also possible to define a CRIU configuration file for each
container with the annotation 'org.criu.config'.

If 'org.criu.config' does not exist, runc will tell CRIU to use
'/etc/criu/runc.conf' if it exists.

If 'org.criu.config' is set to an empty string (''), runc will tell CRIU
to not use any runc specific configuration file at all.

If 'org.criu.config' is set to a non-empty string, runc will use that
value as an additional configuration file for CRIU.

With the annotation the user can decide to use the default configuration
file ('/etc/criu/runc.conf'), none or a container specific configuration
file.

Signed-off-by: Adrian Reber <[email protected]>
For the newly integrated feature to use CRIU configuration files the
test is broken without an additional CRIU patch.

The test changes CRIU's log file. Changing the log file is unfortunately
the only thing which is in broken in CRIU 3.11. But it is the easiest
option for testing. With CRIU 3.12 this will be fixed. All other CRIU
options can be changed with a CRIU configuration file.

With this change the CRIU 3.11 feature can be merged into runc with a
test and for the user it should just work, if they are not trying to
change CRIU's log file.

Signed-off-by: Adrian Reber <[email protected]>
This patch fixes a corner case when destroy a container:

If we start a container without 'intelRdt' config set, and then we run
“runc update --l3-cache-schema/--mem-bw-schema” to add 'intelRdt' config
implicitly.

Now if we enter "exit" from the container inside, we will pass through
linuxContainer.Destroy() -> state.destroy() -> intelRdtManager.Destroy().
But in IntelRdtManager.Destroy(), IntelRdtManager.Path is still null
string, it hasn’t been initialized yet. As a result, the created rdt
group directory during "runc update" will not be removed as expected.

Signed-off-by: Xiaochen Shen <[email protected]>
since commit df3fa11 it is not
possible to set a kernel memory limit when using the systemd cgroups
backend as we use cgroup.Apply twice.

Skip enabling kernel memory if there are already tasks in the cgroup.

Without this patch, runc fails with:

container_linux.go:344: starting container process caused
"process_linux.go:311: applying cgroup configuration for process
caused \"failed to set memory.kmem.limit_in_bytes, because either
tasks have already joined this cgroup or it has children\""

Signed-off-by: Giuseppe Scrivano <[email protected]>
When creating a new user namespace, the kernel doesn't allow to mount
a new procfs or sysfs file system if there is not already one instance
fully visible in the current mount namespace.

When using --no-pivot we were effectively inhibiting this protection
from the kernel, as /proc and /sys from the host are still present in
the container mount namespace.

A container without full access to /proc could then create a new user
namespace, and from there able to mount a fully visible /proc, bypassing
the limitations in the container.

A simple reproducer for this issue is:

unshare -mrfp sh -c "mount -t proc none /proc && echo c > /proc/sysrq-trigger"

Signed-off-by: Giuseppe Scrivano <[email protected]>
This just copies the latest output from 'runc checkpoint --help' to the
man page.

Signed-off-by: Adrian Reber <[email protected]>
For some reason, libcontainer/integration has a whole bunch of incorrect
usages of libcontainer.Factory -- causing test failures with a set of
security patches that will be published soon. Fixing ths is fairly
trivial (switch to creating a new libcontainer.Factory once in each
process, rather than creating one in TestMain globally).

Signed-off-by: Aleksa Sarai <[email protected]>
There are quite a few circumstances where /proc/self/exe pointing to a
pretty important container binary is a _bad_ thing, so to avoid this we
have to make a copy (preferably doing self-clean-up and not being
writeable).

We require memfd_create(2) -- though there is an O_TMPFILE fallback --
but we can always extend this to use a scratch MNT_DETACH overlayfs or
tmpfs. The main downside to this approach is no page-cache sharing for
the runc binary (which overlayfs would give us) but this is far less
complicated.

This is only done during nsenter so that it happens transparently to the
Go code, and any libcontainer users benefit from it. This also makes
ExtraFiles and --preserve-fds handling trivial (because we don't need to
worry about it).

Fixes: CVE-2019-5736
Co-developed-by: Christian Brauner <[email protected]>
Signed-off-by: Aleksa Sarai <[email protected]>
@CLAassistant
Copy link

CLA assistant check
Thank you for your submission, we really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
0 out of 7 committers have signed the CLA.

❌ BooleanCat
❌ adrianreber
❌ xiaochenshen
❌ giuseppe
❌ cyphar
❌ filbranden
❌ jhowardmsft
You have signed the CLA already but the status is still pending? Let us recheck it.

@allencloud
Copy link

@fuweid
Please take a look at of this pr. And try to cover the CVE fix.

@fuweid
Copy link

fuweid commented Feb 13, 2019

thanks @allencloud @Ace-Tang . wait for CI.

@Ace-Tang
Copy link
Collaborator Author

alibaba/runc project should trigger alibaba/pouch ci test

@fuweid
Copy link

fuweid commented Feb 13, 2019

LGTM

@fuweid fuweid merged commit 4df7cda into develop Feb 13, 2019
@fuweid fuweid deleted the update_2019_0213 branch February 13, 2019 07:21
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants