cgroup2: how can we support nested containers with domain controllers? #2356

AkihiroSuda · 2020-04-27T17:37:12Z

https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#no-internal-process-constraint

No Internal Process Constraint

Non-root cgroups can distribute domain resources to their children only when they don’t have any processes of their own. In other words, only domain cgroups which don’t contain any processes can have domain controllers enabled in their “cgroup.subtree_control” files.

http://man7.org/linux/man-pages/man7/cgroups.7.html

As at Linux 4.19, the following controllers are threaded: cpu, perf_event, and pids.

This constraint seems to be a blocker for supporting nested containers such as dind and kind.

$ sudo podman run -it --rm --runtime=runc --privileged --cgroupns=private alpine
/ # cd /sys/fs/cgroup/
/sys/fs/cgroup # cat cgroup.controllers 
cpu io memory pids
/sys/fs/cgroup # echo +cpu > cgroup.subtree_control 
/sys/fs/cgroup # echo +io > cgroup.subtree_control 
sh: write error: Not supported
/sys/fs/cgroup # echo +memory > cgroup.subtree_control 
sh: write error: Not supported
/sys/fs/cgroup # echo +pids > cgroup.subtree_control

The situation is the same with crun.

@giuseppe @kolyshkin @vbatts Thoughts?
A workaround is to specify an entrypoint script that moves the processes in the namespaced-root cgroup to another cgroup.
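
A minimal sketch of such an entrypoint helper, assuming a cgroup v2 hierarchy mounted at /sys/fs/cgroup and a container that is allowed to write to it (privileged or delegated). The function name `evacuate_root_cgroup` and the child cgroup name `init` are illustrative choices, not something runc or crun provides:

```shell
#!/bin/sh
# Sketch only: function and cgroup names are illustrative assumptions.

# evacuate_root_cgroup [ROOT] [CHILD]
# Moves every process out of the cgroup at ROOT into ROOT/CHILD, then
# enables all available controllers for ROOT's children.
evacuate_root_cgroup() {
    cg_root="${1:-/sys/fs/cgroup}"
    cg_child="${2:-init}"

    # cgroup.controllers only exists on cgroup v2; do nothing otherwise.
    [ -f "$cg_root/cgroup.controllers" ] || return 0

    mkdir -p "$cg_root/$cg_child"

    # cgroup.procs accepts one PID per write, so move processes one at a time.
    # A process may exit before it is moved; ignore that failure.
    for pid in $(cat "$cg_root/cgroup.procs"); do
        echo "$pid" > "$cg_root/$cg_child/cgroup.procs" 2>/dev/null || :
    done

    # Turn "cpu io memory pids" into "+cpu +io +memory +pids" and enable them.
    sed -e 's/ / +/g' -e 's/^/+/' \
        < "$cg_root/cgroup.controllers" \
        > "$cg_root/cgroup.subtree_control"
}
```

An entrypoint script would call `evacuate_root_cgroup` and then `exec "$@"` to start the real container command. Once the namespaced-root cgroup holds no processes of its own, the writes to cgroup.subtree_control for io and memory should no longer be rejected.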

giuseppe · 2020-04-29T11:30:02Z

A workaround is to specify an entrypoint script that moves the processes in the namespaced-root cgroup to another cgroup.

I think that is the correct solution and what systemd does. Do you think it should be the OCI runtime's responsibility to set it up?

cyphar · 2020-04-29T12:46:09Z

@brauner flagged this issue several years ago. Christian, what does LXC/LXD do in this case? I know we discussed a whole host of possible cgroup trees and ways to resolve them -- but I don't remember what the conclusion was.

Do you think it should be the OCI runtime's responsibility to set it up?

This is definitely a philosophical question. In theory I would say no, because it affects other processes on the system in potentially bad ways (first of all, systemd -- but even if the cgroup has been delegated the other processes may be managing it and you'll hit a whole bunch of race conditions). But if there's no way to reasonably support this without doing it, then we don't really have much of a choice...

giuseppe · 2020-04-29T15:14:47Z

A related issue is how to exec into a container that creates a sub-cgroup. I've seen the issue with systemd containers, as systemd automatically creates /init.scope and moves itself there.

What cgroup should we join on runc exec $CTR? In crun I am using a hack to automatically create a sub-cgroup on EBUSY, but it doesn't feel right. Should it be the cgroup for the first process in the cgroup?

AkihiroSuda · 2020-04-29T15:17:51Z

Should it be the cgroup for the first process in the cgroup?

👍

cyphar · 2020-04-30T01:37:30Z

For the moment (given runc's architecture) it should be the cgroup of pid1, which we generally treat as "the source of truth" for the container state. This does mean that the information we currently store in /run/runc/... will need to be removed (and boy is that going to be fun).

But for the sake of posterity (and I have mentioned this a few times over the years), using pid1 as the source of truth directly does have quite a few disadvantages. Most notably, the container process can try to trick you into (for instance) joining a user namespace it created where it has more privileges than you expected. A nicer solution would be to stash information in a temporary mount namespace, but this is currently non-trivial on modern Linux. I'm mostly just mentioning this because it's something I'd like to change, even though it doesn't really affect this particular issue.
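
To illustrate the "join the cgroup of pid1" idea, here is a rough shell sketch that reads the cgroup v2 entry (the `0::` line) from /proc/<pid>/cgroup and moves the current shell into that cgroup. This is not runc's implementation. The function names and the optional path arguments are hypothetical, and the sketch assumes the path reported in /proc/<pid>/cgroup is valid relative to the caller's cgroup mount, which holds when the caller sits in the host (or parent) cgroup namespace:

```shell
#!/bin/sh
# Sketch only: helper names and argument layout are illustrative assumptions.

# cgroup2_path_of PID [PROC_ROOT]
# Prints the cgroup v2 path of PID. The v2 entry in /proc/<pid>/cgroup
# has the form "0::<path>"; v1 entries (if any) are skipped.
cgroup2_path_of() {
    sed -n 's/^0:://p' "${2:-/proc}/$1/cgroup"
}

# join_cgroup_of INIT_PID [CGROUP_MOUNT] [PROC_ROOT]
# Moves the calling shell into the cgroup that INIT_PID currently lives in.
join_cgroup_of() {
    cg_mount="${2:-/sys/fs/cgroup}"
    cg_path=$(cgroup2_path_of "$1" "${3:-/proc}") || return 1

    # No "0::" line means the process is not in a cgroup v2 hierarchy.
    [ -n "$cg_path" ] || return 1

    # Writing a PID to cgroup.procs migrates that process into the cgroup.
    echo "$$" > "$cg_mount$cg_path/cgroup.procs"
}
```

With a systemd container, `cgroup2_path_of` on the container's init would print a path ending in /init.scope rather than the container's top-level cgroup, so a process joining it would not trip over the no-internal-process constraint.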

AkihiroSuda · 2020-05-11T12:16:00Z

Most notably, the container process can try to trick you into (for instance) joining a user namespace it created where it has more privileges than you expected.

How is a cgroup associated with a user namespace?

cyphar · 2020-05-11T12:49:09Z

It looks like this change requires modifying the state file for containers and switching to trusting the pid1 of the container. My comment was more generally about why it can be a little hairy to trust pid1 in that manner, and that we should work on improving that at some point. I wasn't saying that there's a particular issue in this case.

AkihiroSuda · 2020-05-19T13:03:00Z

PR: #2416

AkihiroSuda added area/cgroupv2 enhancement kind/question and removed enhancement labels Apr 27, 2020

AkihiroSuda mentioned this issue Apr 30, 2020: cgroup2: Revert "CreateCgroupPath: only enable needed controllers" #2367 (Closed)

kolyshkin mentioned this issue Apr 30, 2020: cgroupv2 support meta issue #2315 (Closed)

AkihiroSuda mentioned this issue May 19, 2020: cgroup2: exec: join the cgroup of the init process on EBUSY #2416 (Merged)

mrunalp closed this as completed in #2416 May 31, 2020
