cgroup namespaces: ignore the mount.Root if we have cgroup namespaces #617
LGTM |
@hallyn Are cgroups namespaces expected to be in 4.6? (BTW, thanks for your upstream work in reviving the patchset). |
@mrunalp That's the hope :) They're in linux-next at the moment. |
Awesome :) Looking forward to it!
|
Shouldn't this also check that cgroups ns was enabled for that container? Or does this work even in the initial cgroup namespace? |
As I understand it, this code handles the case where the cgroupfs mount in the
container is a bind mount. Suppose the host's /sys/fs/cgroup/freezer/lxc/c1 was
bind mounted into the container such that /sys/fs/cgroup/freezer in the
container is a tmpfs, 'lxc' and 'c1' are directories in that tmpfs, and the bind
mount is onto /sys/fs/cgroup/freezer/lxc/c1. The container will then find that
it is in cgroup '/lxc/c1', so to create /lxc/c1/x1 it must create 'x1' under
the mountpoint. So the code concatenates the mountpoint,
/sys/fs/cgroup/freezer/lxc/c1, with the wanted path after the mount prefix has
been removed from it (/lxc/c1/x1 - /lxc/c1 = x1).
Simply put, when cgroup namespaces are enabled, this gimmick is not needed at
all because the container was able to mount a real cgroupfs.
Now unfortunately there is no way to absolutely tell which case we have going.
In both cases the fs shows up as 'cgroup'. In both cases it remains possible
that the current task's cgroup starts with the mount prefix: with cgroup namespaces,
I could be in absolute cgroup /lxc/c1/lxc/c1, namespaced /lxc/c1. So in
this patch I simply assume that if cgroup namespaces are available, then they
will be used.
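
The bind-mount path calculation described in this comment could be sketched roughly as follows. This is only an illustration, not runc's actual code: the helper name cgroupDir and its parameters are hypothetical, and it assumes the mount root and the wanted cgroup are both absolute paths.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// cgroupDir is a hypothetical helper (not runc's real function). It maps a
// wanted cgroup path onto a directory under a bind-mounted cgroup directory.
//
//	mountpoint: where the cgroup directory is mounted inside the container
//	root:       the mount prefix, i.e. the cgroup that was bind mounted
//	wanted:     the absolute cgroup path we want to create
func cgroupDir(mountpoint, root, wanted string) (string, error) {
	// Remove the mount prefix from the wanted path:
	// /lxc/c1/x1 - /lxc/c1 = x1
	rel, err := filepath.Rel(root, wanted)
	if err != nil {
		return "", err
	}
	// A result that climbs upwards means wanted is not below the mount prefix.
	if rel == ".." || strings.HasPrefix(rel, "../") {
		return "", fmt.Errorf("cgroup %q is not under mount root %q", wanted, root)
	}
	// Concatenate the mountpoint with what is left.
	return filepath.Join(mountpoint, rel), nil
}
```

With the paths from the example, cgroupDir("/sys/fs/cgroup/freezer/lxc/c1", "/lxc/c1", "/lxc/c1/x1") yields the 'x1' directory under the mountpoint. With cgroup namespaces and a real cgroupfs mount this subtraction is the step that goes wrong, because the mount root no longer describes a prefix of the cgroup path the task sees.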
|
I'm afraid we can't make this assumption now: namespaces are configurable, so this assumption would probably break backward compatibility on new kernels, right? |
@hqhq I'm fairly sure namespace joining used to be kernel-version backwards compatible (we don't join namespaces which don't exist and only emit warnings). I'm not sure now (reading over the code, it doesn't look like that's the case anymore). |
@cyphar I'm not worried about the kernel, but I'm concerned that we can't use old If we're going to do a follow-up PR to fix that, this PR LGTM. |
I don't think runC should automatically join namespaces that are not listed in the config. That would be quite un-intuitive and limit some valid use cases. If I'm not mistaken, atm we only join the namespaces that are actually listed within the config file. Which means that this PR should probably be updated to also check that a cgroup ns was joined before changing the value of |
I think we should probably wait for cgroups namespaces to be available before we merge this and other required support to runc. |
@hqhq Sorry, when I said "kernel-version backwards compatible" I was referring to runc being compatible with older kernels. I'm not sure that this is the case anymore (reading the code for joining namespaces, it looks like we error out if we can't join any one of them). But yeah, I understand what you meant. |
fc26d19 to 9285fbd
If our cgroups were mounted in a cgroup namespace rooted at /a/b, then a task in namespaced cgroup / will see / in /proc/self/mountinfo and /sys/fs/cgroup/freezer/x will actually point to /a/b/x - but the 'root' field (field 3) in mountinfo will show /a/b.
So as to not confuse the cgroup calculation, check for nsroot=/a/b in the last field, which will allow us to disambiguate between a mount like above, and a bind mount of the /a/b cgroup directory.
Signed-off-by: Serge Hallyn <[email protected]>
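
To make the check in this commit message concrete, here is a minimal sketch of pulling the root and the last field out of one mountinfo line. It is not the patch itself: the type and function names are hypothetical, and the nsroot= option is taken from the commit message above, so whether a given kernel actually emits it is an assumption. The root is the fourth whitespace-separated field of a mountinfo line, i.e. index 3 when counting from zero.

```go
package main

import (
	"fmt"
	"strings"
)

// mountEntry holds the parts of one /proc/self/mountinfo line used here.
type mountEntry struct {
	Root       string // root of the mount within the filesystem (index 3)
	Mountpoint string // mount point as seen by the process (index 4)
	FSType     string // filesystem type, e.g. "cgroup"
	SuperOpts  string // per-superblock options (the last field)
}

// parseMountinfoLine splits a mountinfo line at the " - " separator that
// ends the optional fields, then picks fields out of each half.
func parseMountinfoLine(line string) (mountEntry, error) {
	parts := strings.SplitN(line, " - ", 2)
	if len(parts) != 2 {
		return mountEntry{}, fmt.Errorf("no separator in %q", line)
	}
	pre := strings.Fields(parts[0])
	post := strings.Fields(parts[1])
	if len(pre) < 6 || len(post) < 3 {
		return mountEntry{}, fmt.Errorf("too few fields in %q", line)
	}
	return mountEntry{
		Root:       pre[3],
		Mountpoint: pre[4],
		FSType:     post[0],
		SuperOpts:  post[2],
	}, nil
}

// rootIsNamespaceRoot reports whether the super options carry an
// "nsroot=" value (assumed format, per the commit message) equal to the
// mount's root. If so, the root is the cgroup namespace root and should
// not be treated as a bind-mount prefix.
func rootIsNamespaceRoot(e mountEntry) bool {
	for _, opt := range strings.Split(e.SuperOpts, ",") {
		if strings.HasPrefix(opt, "nsroot=") {
			return strings.TrimPrefix(opt, "nsroot=") == e.Root
		}
	}
	return false
}
```

A caller would run rootIsNamespaceRoot on each cgroup entry and, when it returns true, use "/" rather than the root field when computing cgroup paths.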
9285fbd to 76935ff
(closing this until upstream churn settles down) |
This PR should not be needed as we now have /proc/self/mountinfo's cgroup entries virtualized wrt the cgns the same as the /proc/self/cgroup entries.
So while I've not tested it, I expect it to "just work" now.
|
It seems that this patch is still required when running docker-in-lxd on Ubuntu xenial's linux 4.4 kernel, with the parent nsroot in
This (or some other related issue) causes |
The kernel to fix that should have just cleared xenial-proposed.
|
Cool, do you have a linux-image-generic version number for the package with the fix? I've been testing with 4.4.0-22-generic. I'd be happy to test against the upstream docker-engine package with the fixed kernel. |
I guess I was wrong about it clearing -proposed. The fix is in
4.4.0-25.44
https://launchpad.net/ubuntu/+source/linux/4.4.0-25.44
|
I can confirm that with Ubuntu xenial-proposed linux 4.4.0-25-generic including the cgroup namespace updates, So assuming it's the same behaviour as on linux 4.6, this should be fine now. |
Great - thx for testing.
|
In a cgroup namespace, you can mount cgroupfs, and your namespace
root (say /docker1) becomes the root of the cgroup filesystem. This
shows up as field 3 in the mountinfo. This is unfortunately
ambiguous with a cgroupfs bind mount, and in this case we cannot use
that root as a prefix for our cgroup, since as far as we are concerned
our cgroup is '/', not '/docker1'.
So if cgroup namespaces are enabled (/proc/$$/ns/cgroup exists), then
assume that we haven't done any silly cgroupfs bind mount trickery,
and ignore the fs root field.
Signed-off-by: Serge Hallyn [email protected]