Fix discrepancy between systemd and fs cgroup managers #2813

Closed
kolyshkin opened this issue Feb 20, 2021 · 4 comments · Fixed by #2814
Comments

@kolyshkin
Contributor

kolyshkin commented Feb 20, 2021

All cgroup managers implement Apply() and Set() methods:

  • Apply is used to create a cgroup (and, in case of systemd, a systemd unit) and/or put a PID into the cgroup (and unit);
  • Set is used to set various cgroup resources and limits;

fs/fs2 cgroup managers implement the functionality as per the description above.
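
For reference, a simplified sketch of the two methods in question (loosely modeled on the Manager interface in libcontainer/cgroups; names and signatures are abridged here, not the exact runc API):

    package cgroups

    // Resources stands in for the container's cgroup limits (memory, cpu, ...);
    // the real type carries many more knobs.
    type Resources struct {
            MemoryLimit int64
            CpuShares   uint64
    }

    // Manager abstracts the fs/fs2 and systemd cgroup drivers.
    type Manager interface {
            // Apply creates the cgroup (and, for the systemd drivers, the unit)
            // and places the given pid into it.
            Apply(pid int) error
            // Set applies resource limits to the cgroup.
            Set(r *Resources) error
    }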

The systemd v1/v2 managers are somewhat odd. They set most cgroup limits (those that can be mapped to systemd unit properties) in Apply(), and then set all cgroup limits again in Set() -- first indirectly via systemd unit properties (same as in Apply), then via cgroupfs (actually backed by the fs manager).

A bit of recent history: before #2287/#2343, the systemd managers used to set unit properties (incl. resources) in Apply, and "raw" cgroup properties in Set. After those PRs, they set properties in both Apply and Set.

To reiterate, systemd managers are peculiar since:

  • they set some properties in Apply() (fs managers do not);
  • they set all properties again in Set().

Since runc calls both Apply() and Set(), this is not really a problem, except for some curious side effects (such as #2812 (comment)). It might be worse for other users of the cgroup managers (this needs to be looked into).

The proposed solution is to not set any resources in Apply (the actual fix is surprisingly tidy -- see PR #2814).
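
To illustrate the direction (hypothetical helper names, not the actual patch): the systemd manager's Apply would only create the unit/cgroup and add the pid, while Set would keep doing both the unit properties and the cgroupfs writes.

    // Continuing the simplified sketch above; startUnit, setUnitProperties and
    // fsSet are illustrative stand-ins for the D-Bus calls and the fs manager.
    type systemdManager struct {
            unitName string
    }

    func (m *systemdManager) startUnit(pid int) error              { return nil }
    func (m *systemdManager) setUnitProperties(r *Resources) error { return nil }
    func (m *systemdManager) fsSet(r *Resources) error             { return nil }

    // Apply: create the unit/cgroup and move the pid into it; no resources here.
    func (m *systemdManager) Apply(pid int) error {
            return m.startUnit(pid)
    }

    // Set: translate resources to unit properties, then write the remaining
    // knobs via cgroupfs, as before.
    func (m *systemdManager) Set(r *Resources) error {
            if err := m.setUnitProperties(r); err != nil {
                    return err
            }
            return m.fsSet(r)
    }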

TODO:

  • look at how crun does it (maybe @giuseppe can shed some light)
  • look at how kubernetes uses cgroup manager's Set and Apply.
@odinuge
Contributor

odinuge commented Feb 22, 2021

+1 on this one!

I have been working with this from the k8s side, so I can help with insight if needed. Anything special you need?

I also think we should strive to adhere to the systemd delegation model a bit better (e.g. we don't need to write to cpu.shares/cpu.weight when systemd will already do that): https://systemd.io/CGROUP_DELEGATION/
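
To make the duplication concrete: on v2 the same value can end up both in the CPUWeight unit property (set by systemd over D-Bus) and in the cpu.weight file (written directly via cgroupfs). A minimal sketch of the kind of shares-to-weight conversion involved (illustrative only; not necessarily the exact formula runc or systemd use):

    package main

    import "fmt"

    // convertSharesToWeight maps a cgroup v1 cpu.shares value ([2..262144])
    // onto the cgroup v2 cpu.weight range ([1..10000]).
    func convertSharesToWeight(shares uint64) uint64 {
            if shares == 0 {
                    return 0 // 0 means "not set"
            }
            return 1 + ((shares-2)*9999)/262142
    }

    func main() {
            // With proper delegation, writing cpu.weight directly just repeats what
            // systemd already did when CPUWeight was set from the same shares value.
            fmt.Println(convertSharesToWeight(1024)) // 39
    }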

I think crun only creates the slice/scope, and has a separate sub-cgroup (for v2 at least) where it enforces limits. I do however think the runc approach is better, since we actually use the systemd API.

@odinuge
Contributor

odinuge commented Feb 22, 2021

look at how kubernetes uses cgroup manager's Set and Apply.

The changes you are talking about here make sense for k8s, since we always call Set after Apply in order to enforce limits. The systemd part, however, is currently broken / has never really made sense, but that should be fixed with kubernetes/kubernetes#98374 (feel free to look at that PR and/or review it if you have any thoughts; it will not be part of 1.21, so it is targeting 1.22).
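
For context, the ordering in question looks roughly like this from the caller's side (reusing the simplified Manager/Resources sketch from the issue description; not actual kubelet code):

    // Apply first to create the cgroup/unit and move the pid into it,
    // then Set to enforce the limits.
    func startInCgroup(m Manager, pid int, r *Resources) error {
            if err := m.Apply(pid); err != nil {
                    return err
            }
            return m.Set(r)
    }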

@kolyshkin
Contributor Author

look at how kubernetes uses cgroup manager's Set and Apply.

  • kubelet does NOT use systemd's Set() method
  • cri-o does not use it either

So we're good

@kolyshkin
Contributor Author

kolyshkin commented Feb 23, 2021

One particular thing this fixes: with #2812 applied, trying to start a container with a very low memory limit on cgroup v1 gives the following errors.

fs driver:

$ sudo ./runc run --bundle ./tst -d xe3
ERRO[0000] container_linux.go:367: starting container process caused: process_linux.go:505: container init caused: process_linux.go:468: setting cgroup config for procHooks process caused: unable to set memory limit to 140424 (current usage: 2879488, peak usage: 3059712)

systemd driver:

$ sudo ./runc --systemd-cgroup run --bundle ./tst -d xe3
ERRO[0000] container_linux.go:367: starting container process caused: process_linux.go:328: container init was OOM-killed (memory limit too low?) caused: process_linux.go:360: getting the final child's pid from pipe caused: EOF

(this is actually how I found this)
