cgroup2: how can we support nested containers with domain controllers? #2356

Closed
AkihiroSuda opened this issue Apr 27, 2020 · 8 comments · Fixed by #2416
Comments

@AkihiroSuda (Member) commented Apr 27, 2020

https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#no-internal-process-constraint

No Internal Process Constraint

Non-root cgroups can distribute domain resources to their children only when they don’t have any processes of their own. In other words, only domain cgroups which don’t contain any processes can have domain controllers enabled in their “cgroup.subtree_control” files.

http://man7.org/linux/man-pages/man7/cgroups.7.html

As at Linux 4.19, the following controllers are threaded: cpu, perf_event, and pids.

This constraint seems to be a blocker for supporting nested containers like dind and kind.

$ sudo podman run -it --rm --runtime=runc --privileged --cgroupns=private alpine
/ # cd /sys/fs/cgroup/
/sys/fs/cgroup # cat cgroup.controllers 
cpu io memory pids
/sys/fs/cgroup # echo +cpu > cgroup.subtree_control 
/sys/fs/cgroup # echo +io > cgroup.subtree_control 
sh: write error: Not supported
/sys/fs/cgroup # echo +memory > cgroup.subtree_control 
sh: write error: Not supported
/sys/fs/cgroup # echo +pids > cgroup.subtree_control 

The situation is the same with crun.

@giuseppe @kolyshkin @vbatts Thoughts?
A workaround is to specify an entrypoint script that moves the processes in the namespaced-root cgroup to another cgroup.
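
A minimal sketch of such an entrypoint script (the /sys/fs/cgroup/init leaf name is just an illustration, not something runc or podman creates):

#!/bin/sh
# Sketch only: empty the namespaced root cgroup so that domain controllers
# (io, memory, ...) can be enabled in its cgroup.subtree_control.
set -e
mkdir -p /sys/fs/cgroup/init                         # hypothetical leaf cgroup
for pid in $(cat /sys/fs/cgroup/cgroup.procs); do
    echo "$pid" > /sys/fs/cgroup/init/cgroup.procs || true
done
# The root cgroup now has no processes of its own, so this no longer fails:
echo "+cpu +io +memory +pids" > /sys/fs/cgroup/cgroup.subtree_control
exec "$@"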

@giuseppe (Member)

A workaround is to specify an entrypoint script that moves the processes in the namespaced-root cgroup to another cgroup.

I think that is the correct solution, and it is what systemd does. Do you think it should be the OCI runtime's responsibility to set it up?

@cyphar (Member) commented Apr 29, 2020

@brauner flagged this issue several years ago. Christian, what does LXC/LXD do in this case? I know we discussed a whole host of possible cgroup trees and ways to resolve them -- but I don't remember what the conclusion was.

Do you think it should be the OCI runtime's responsibility to set it up?

This is definitely a philosophical question. In theory I would say no, because it affects other processes on the system in potentially bad ways (first of all, systemd -- but even if the cgroup has been delegated the other processes may be managing it and you'll hit a whole bunch of race conditions). But if there's no way to reasonably support this without doing it, then we don't really have much of a choice...

@giuseppe (Member)

A related issue is how to exec into a container that creates a sub-cgroup. I've seen this with systemd containers, as systemd automatically creates /init.scope and moves itself there.

What cgroup should we join on runc exec $CTR? In crun I am using a hack to automatically create a sub-cgroup on EBUSY, but it doesn't feel right. Should it be the cgroup for the first process in the cgroup?
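
To make the exec situation concrete, a rough sketch (paths and the sub-cgroup name are hypothetical; $CTR stands for the container's cgroup and $PID for the process being attached):

# Inside a systemd container, pid 1 moves itself into a leaf cgroup:
#   /sys/fs/cgroup/$CTR/init.scope/cgroup.procs   <- systemd lives here
#   /sys/fs/cgroup/$CTR/cgroup.subtree_control    <- domain controllers enabled
# The container's top-level cgroup may then no longer hold processes itself,
# so attaching an exec'ed process directly to it fails:
echo $PID > /sys/fs/cgroup/$CTR/cgroup.procs      # fails with EBUSY
# The crun hack described above: on EBUSY, create a fresh sub-cgroup and attach there.
mkdir -p /sys/fs/cgroup/$CTR/exec
echo $PID > /sys/fs/cgroup/$CTR/exec/cgroup.procs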

@AkihiroSuda (Member, Author)

Should it be the cgroup for the first process in the cgroup?

👍

@cyphar (Member) commented Apr 30, 2020

For the moment (given runc's architecture) it should be the pid1, which we generally treat as "the source of truth" for the container state. This does mean that the information we currently store in /run/runc/... will need to be removed (and boy is that going to be fun).
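
As a rough illustration of what treating pid1 as the source of truth could look like for exec (purely a sketch; $PID1 is the container's init and $PID the process being attached):

# cgroup2 is a single unified hierarchy, so /proc/$PID1/cgroup has one "0::" line.
CGROUP_PATH=$(sed -n 's/^0:://p' /proc/$PID1/cgroup)
# Attach the exec'ed process to pid1's cgroup (falling back to a sub-cgroup
# if that cgroup can no longer hold processes directly, as discussed above).
echo $PID > "/sys/fs/cgroup$CGROUP_PATH/cgroup.procs"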


But for the sake of posterity (and I have mentioned this a few times over the years), using pid1 as the source of truth directly does have quite a few disadvantages. Most notably, the container process can try to trick you into (for instance) joining a user namespace it created where it has more privileges than you expected. A nicer solution would be to stash information in a temporary mount namespace, but this is currently non-trivial on modern Linux. I'm mostly just mentioning this because it's something I'd like to change, even though it doesn't really affect this particular issue.

@AkihiroSuda (Member, Author)

Most notably, the container process can try to trick you into (for instance) joining a user namespace it created where it has more privileges than you expected.

How is a cgroup associated with a user namespace?

@cyphar (Member) commented May 11, 2020

It looks like this change requires modifying the state file for containers and switching to trusting the pid1 of the container. My comment was more generally about why it can be a little hairy to trust pid1 in that manner, and that we should work on improving that at some point. I wasn't saying that there's a particular issue in this case.

@AkihiroSuda (Member, Author)

PR: #2416
