cgroup namespaces: ignore the mount.Root if we have cgroup namespaces #617

hallyn · 2016-03-03T05:11:05Z

In a cgroup namespace, you can mount cgroupfs, and your namespace
root (say /docker1) becomes the root of the cgroup filesystem. This
shows up as field 3 in the mountinfo. This is unfortunately
ambiguous with a cgroupfs bind mount, and in this case we cannot use
that root as a prefix for our cgroup, since as far as we are concerned
our cgroup is '/', not '/docker1'.

So if cgroup namespaces are enabled (/proc/$$/ns/cgroup exists), then
assume that we haven't done any silly cgroupfs bind mount trickery,
and ignore the fs root field.

Signed-off-by: Serge Hallyn [email protected]
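
A minimal sketch of the heuristic described above, not the actual patch: it assumes the check is a simple existence test on the `ns/cgroup` entry under /proc, and the function names `cgroupNamespacesSupported` and `effectiveRoot` are hypothetical.

```go
package main

import (
	"fmt"
	"os"
)

// cgroupNamespacesSupported reports whether <procRoot>/self/ns/cgroup exists.
// Hypothetical helper: procRoot is a parameter only so the check can be
// pointed at a different directory when experimenting.
func cgroupNamespacesSupported(procRoot string) bool {
	_, err := os.Lstat(procRoot + "/self/ns/cgroup")
	return err == nil
}

// effectiveRoot returns the prefix to strip from our cgroup path.
// When cgroup namespaces are assumed to be in use, the mountinfo root
// (field 3) is ignored and "/" is used instead.
func effectiveRoot(mountRoot string, haveCgroupNS bool) string {
	if haveCgroupNS {
		return "/"
	}
	return mountRoot
}

func main() {
	have := cgroupNamespacesSupported("/proc")
	fmt.Println("cgroup namespaces available:", have)
	fmt.Println("effective root:", effectiveRoot("/docker1", have))
}
```

The sketch makes the trade-off visible: the decision depends only on what the kernel offers, not on whether this particular container actually uses a cgroup namespace.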

mrunalp · 2016-03-03T06:30:23Z

LGTM

mrunalp · 2016-03-03T06:37:22Z

@hallyn Are cgroup namespaces expected to be in 4.6? (BTW, thanks for your upstream work in reviving the patchset.)

hallyn · 2016-03-03T07:05:14Z

@mrunalp That's the hope :) They're in linux-next at the moment.

mrunalp · 2016-03-03T07:12:44Z

Awesome :) Looking forward to it!


mlaventure · 2016-03-03T15:57:11Z

Shouldn't this also check that cgroups ns was enabled for that container? Or does this work even in the initial cgroup namespace?

hallyn · 2016-03-03T20:51:41Z

As I understand it, the existing code handles the case where the cgroupfs mount in a container is a bind mount. Say the host's /sys/fs/cgroup/freezer/lxc/c1 was bind mounted into the container such that /sys/fs/cgroup/freezer in the container is a tmpfs, 'lxc/c1' are directories in the tmpfs, and the bind mount is onto /sys/fs/cgroup/freezer/lxc/c1. The container will then find that it is in cgroup '/lxc/c1', so to create /lxc/c1/x1, it must create 'x1' under the mountpoint. So it concatenates the mountpoint, /sys/fs/cgroup/freezer/lxc/c1, with the wanted path after the mount prefix has been removed (/lxc/c1/x1 - /lxc/c1 = x1).

Simply put, when cgroup namespaces are enabled, this gimmick is not needed at all because the container was able to mount a real cgroupfs.

Now unfortunately there is no way to tell with certainty which case we have. In both cases the fs shows up as 'cgroup'. In both cases it remains possible that current's cgroup starts with the mount prefix: with cgroup namespaces, I could be in absolute cgroup /lxc/c1/lxc/c1, namespaced /lxc/c1. So in this patch I simply assume that if cgroup namespaces are available, then they will be used.
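
The path arithmetic in this comment (mountpoint plus the wanted path with the mount root stripped) can be sketched roughly as follows. This is an illustration only, not runc's actual code; the function name `bindMountCgroupPath` is hypothetical.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// bindMountCgroupPath maps a wanted cgroup onto a directory below the
// mountpoint by stripping the mount root (mountinfo field 3) from it.
// It returns an error when the cgroup does not live under that root.
func bindMountCgroupPath(mountpoint, root, cgroup string) (string, error) {
	rel, err := filepath.Rel(root, cgroup)
	if err != nil {
		return "", err
	}
	if rel == ".." || strings.HasPrefix(rel, "../") {
		return "", fmt.Errorf("cgroup %q is not under mount root %q", cgroup, root)
	}
	return filepath.Join(mountpoint, rel), nil
}

func main() {
	// Bind-mount case: /lxc/c1/x1 - /lxc/c1 = x1, created under the mountpoint.
	p, err := bindMountCgroupPath("/sys/fs/cgroup/freezer/lxc/c1", "/lxc/c1", "/lxc/c1/x1")
	fmt.Println(p, err)

	// Namespaced case: we see ourselves in "/", so stripping a root of
	// /lxc/c1 from /x1 cannot work and the prefix has to be ignored.
	p, err = bindMountCgroupPath("/sys/fs/cgroup/freezer", "/lxc/c1", "/x1")
	fmt.Println(p, err)
}
```

The second call shows why the root field gets in the way once a real cgroupfs is mounted inside a cgroup namespace: the namespaced cgroup path no longer carries the prefix that the mount root advertises.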

hqhq · 2016-03-05T07:06:53Z

> I simply assume that if cgroup namespaces are available, then they will be used

I'm afraid we can't make this assumption now: namespaces are configurable, so this assumption would probably break backward compatibility on new kernels, right?

cyphar · 2016-03-07T12:47:15Z

@hqhq I'm fairly sure namespace joining used to be kernel-version backwards compatible (we don't join namespaces which don't exist and only emit warnings). I'm not sure now (reading over the code, it doesn't look like that's the case anymore).

hqhq · 2016-03-07T13:56:50Z

@cyphar I'm not worried about the kernel. My concern is that we can't use old config.json files on kernels which support cgroup namespaces, because an old JSON file has no cgroup ns configured, but runC will just join it and change the cgroup root.
Maybe it's not a big deal because we are not 1.0 yet; I just think we can find a better way to check whether we should ignore mount.Root.

If we're going to do a follow-up PR to fix that, this PR LGTM.

mlaventure · 2016-03-07T17:52:48Z

I don't think runC should automatically join namespaces that are not listed in the config. That would be quite un-intuitive and limit some valid use cases.

If I'm not mistaken, atm we only join the namespaces that are actually listed within the config file. Which means that this PR should probably be updated to also check that a cgroup ns was joined before changing the value of m.Root.
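
A rough sketch of the stricter check suggested here, assuming the container config carries a list of namespaces with a type string. The `Namespace` type, the `"cgroup"` type string, and both function names are hypothetical and do not reflect runc's real configuration structs.

```go
package main

import "fmt"

// Namespace is a hypothetical stand-in for a namespace entry in the
// container config.
type Namespace struct {
	Type string
	Path string
}

// hasNamespace reports whether the config lists a namespace of nsType.
func hasNamespace(namespaces []Namespace, nsType string) bool {
	for _, ns := range namespaces {
		if ns.Type == nsType {
			return true
		}
	}
	return false
}

// shouldIgnoreMountRoot only ignores the mountinfo root when the kernel
// supports cgroup namespaces AND the config actually asks for one.
func shouldIgnoreMountRoot(namespaces []Namespace, kernelHasCgroupNS bool) bool {
	return kernelHasCgroupNS && hasNamespace(namespaces, "cgroup")
}

func main() {
	oldConfig := []Namespace{{Type: "pid"}, {Type: "mount"}}
	newConfig := append([]Namespace{{Type: "cgroup"}}, oldConfig...)

	fmt.Println("old config:", shouldIgnoreMountRoot(oldConfig, true))
	fmt.Println("new config:", shouldIgnoreMountRoot(newConfig, true))
}
```

With a check of this shape, an old config without a cgroup namespace entry keeps the existing bind-mount behaviour even on a kernel that supports cgroup namespaces.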

mrunalp · 2016-03-07T17:56:52Z

I think we should probably wait for cgroups namespaces to be available before we merge this and other required support to runc.

cyphar · 2016-03-08T09:02:49Z

@hqhq Sorry, when I said "kernel-version backwards compatible" I was referring to runc being compatible with older kernels. I'm not sure that this is the case anymore (reading the code for joining namespaces, it looks like we error out if we can't join any one of them). But yeah, I understand what you meant.

If our cgroups were mounted in a cgroup namespace rooted at /a/b, then a task in namespaced cgroup / will see / in /proc/self/mountinfo and /sys/fs/cgroup/freezer/x will actually point to /a/b/x - but the 'root' field (field 3) in mountinfo will show /a/b.

So as to not confuse the cgroup calculation, check for nsroot=/a/b in the last field, which will allow us to disambiguate between a mount like above, and a bind mount of the /a/b cgroup directory.

Signed-off-by: Serge Hallyn <[email protected]>
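
The disambiguation described in this commit message could look roughly like the sketch below. It assumes the kernel reports an `nsroot=` entry in the super options (the last mountinfo field), as the proposed patch set did at the time. The sample mountinfo line is made up for illustration, and the names `cgroupMount`, `parseCgroupMount`, and `effectiveRoot` are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// cgroupMount holds the mountinfo fields relevant to the root calculation.
type cgroupMount struct {
	Root       string // field 3 in the thread's numbering (fs root)
	Mountpoint string
	NSRoot     string // value of nsroot= in the super options, if any
}

// parseCgroupMount parses one mountinfo line and reports whether it is a
// cgroup mount. Octal escapes in paths are not handled in this sketch.
func parseCgroupMount(line string) (cgroupMount, bool) {
	fields := strings.Fields(line)
	sep := -1
	// Optional fields start at index 6 and end at the "-" separator.
	for i := 6; i < len(fields); i++ {
		if fields[i] == "-" {
			sep = i
			break
		}
	}
	if sep < 0 || len(fields) < sep+4 || fields[sep+1] != "cgroup" {
		return cgroupMount{}, false
	}
	m := cgroupMount{Root: fields[3], Mountpoint: fields[4]}
	for _, opt := range strings.Split(fields[sep+3], ",") {
		if strings.HasPrefix(opt, "nsroot=") {
			m.NSRoot = strings.TrimPrefix(opt, "nsroot=")
		}
	}
	return m, true
}

// effectiveRoot returns "/" when the root field merely reflects the cgroup
// namespace root, and the root field itself for a plain bind mount.
func (m cgroupMount) effectiveRoot() string {
	if m.NSRoot != "" && m.NSRoot == m.Root {
		return "/"
	}
	return m.Root
}

func main() {
	// Illustrative line only, not captured from a real system.
	line := "30 25 0:26 /a/b /sys/fs/cgroup/freezer rw,relatime - cgroup cgroup rw,freezer,nsroot=/a/b"
	if m, ok := parseCgroupMount(line); ok {
		fmt.Println(m.Mountpoint, "effective root:", m.effectiveRoot())
	}
}
```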

hallyn · 2016-04-20T14:59:54Z

(closing this until upstream churn settles down)

cyphar · 2016-06-03T12:54:22Z

@hallyn So, cgroup namespaces have been merged (in a very different state to the current proposal here). We can reopen this once #781 and the dependent PRs all get merged.

hallyn · 2016-06-03T18:18:58Z

This PR should not be needed as we now have /proc/self/mountinfo's cgroup entries virtualized wrt the cgns the same as the /proc/self/cgroup entries. So while I've not tested it, I expect it to "just work" now.

SpComb · 2016-06-16T16:20:23Z

It seems that this patch is still required when running docker-in-lxd on Ubuntu xenial's linux 4.4 kernel, with the parent nsroot in /proc/self/mountinfo. Ubuntu's own docker.io packages carry this patch and work, but the upstream docker-engine package fails, note the broken /sys/fs/docker/... cgroup path:

level=error msg="containerd: start container" error="oci runtime error: could not synchronise with container process: stat /sys/fs/docker/13bd9f844f91631c0459d5fabdad8b2e555d7765eefbc7158b4db428b118b008: no such file or directory

This (or some other related issue) causes docker run debian:jessie bash to fail with errors like: docker: Error response from daemon: Container command 'bash' not found or does not exist..

hallyn · 2016-06-16T16:31:36Z

The kernel that fixes that should have just cleared xenial-proposed.

SpComb · 2016-06-16T16:44:30Z

Cool, do you have a linux-image-generic version number for the package with the fix? I've been testing with 4.4.0-22-generic. I'd be happy to test against the upstream docker-engine package with the fixed kernel.

hallyn · 2016-06-16T17:07:49Z

I guess I was wrong about it clearing -proposed. The fix is in 4.4.0-25.44 https://launchpad.net/ubuntu/+source/linux/4.4.0-25.44

SpComb · 2016-06-17T07:01:42Z

I can confirm that with Ubuntu xenial-proposed linux 4.4.0-25-generic including the cgroup namespace updates, /proc/self/mountinfo now shows a path of / within the lxd cgroup namespace, and the unpatched upstream docker-engine 1.11.2 package works correctly within a privileged lxd container.

So assuming it's the same behaviour as on linux 4.6, this should be fine now.

hallyn · 2016-06-17T07:31:56Z

Great - thx for testing.


GordonTheTurtle added the status/0-triage label Mar 3, 2016

tianon added a commit to tianon/debian-docker that referenced this pull request Mar 9, 2016

Add opencontainers/runc#617 (for more LXD nesting support)

3bc82b5

hallyn force-pushed the 2016-03-02/userns branch 2 times, most recently from fc26d19 to 9285fbd Compare March 26, 2016 07:03

hallyn force-pushed the 2016-03-02/userns branch from 9285fbd to 76935ff Compare March 29, 2016 03:24

hallyn closed this Apr 20, 2016

hallyn mentioned this pull request Apr 28, 2016

Support running the Engine daemon inside a user namespace moby/moby#20902

Closed

stefanberger pushed a commit to stefanberger/runc that referenced this pull request Sep 8, 2017

Merge pull request opencontainers#617 from wking/process.terminal-opt…

b69dcba

…ional-case config: Fix 'optional' -> 'OPTIONAL' for process.terminal
