Mark CGroups as off when missing essential controllers #19176
Hi @DavidVentura! The thing that jumps out at me with this PR is that we're now potentially making a partially-applied update to the cgroups, because we're no longer doing the update in a single write.
I'd like to pull in @shoenig on this, because it's an area he's been working in a lot recently. But my gut tells me that having the check for the controllers by reading them is a nice UX improvement but we should maybe leave the writes alone. But I think I might be missing something about the setup.
A few housekeeping items aside:
- You'll need to take the PR out of draft for CI to run.
- Can you run `make cl` to add a changelog entry?
Co-authored-by: Tim Gross <[email protected]>
I wasn't entirely convinced on whether the writes should be partially applied or not; the behavior for systems with missing support for a controller changes: When applying a single write for all controllers, ( I can also attempt to apply them all, but I'm not sure what that'd achieve, as anything dependent on the actual controller is bound to fail regardless.
@DavidVentura can you describe more about your environment? I would expect pretty much any distro with cgroups v2 enabled at all to also have the basic controllers available. I'm hesitant to be supportive of an environment where this isn't true - Nomad relies a great deal on these controllers being enabled to provide the resource isolation guarantees that it does.
I am experimenting with Nomad on RISC-V; the default kernel shipped in most VisionFive2 distros comes with the cgroup controllers disabled (see PR to enable it). It took me a while to figure out that the controllers being disabled was causing the I couldn't figure out how to completely disable
Ahh, I have been blissfully unaware of RISC-V development so far, kinda neat if you can already get Nomad running on it! Thinking about this a little more, it might make sense to simply check for the enabled controllers at the point in time we switch on cgroups v1 or v2 (or off - which is already a thing). If we can detect the necessary controllers are not enabled we can return https://github.com/hashicorp/nomad/blob/main/client/lib/cgroupslib/mount.go#L41-L43
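A minimal sketch of the check suggested above, not the code that was merged. It assumes cgroups v2 is mounted at the usual `/sys/fs/cgroup` and reads the root `cgroup.controllers` file. The names `requiredControllers` and `missingControllers` are hypothetical, and the list of required controllers is an assumption, since the thread does not enumerate them.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// requiredControllers is an assumed list for illustration only; the thread
// does not spell out which controllers are treated as essential.
var requiredControllers = []string{"cpuset", "cpu", "io", "memory", "pids"}

// missingControllers returns the required controllers that do not appear in
// the whitespace-separated contents of a cgroup.controllers file.
func missingControllers(available string, required []string) []string {
	have := make(map[string]bool)
	for _, name := range strings.Fields(available) {
		have[name] = true
	}
	var missing []string
	for _, name := range required {
		if !have[name] {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// Usual cgroups v2 mount point; adjust if it is mounted elsewhere.
	data, err := os.ReadFile("/sys/fs/cgroup/cgroup.controllers")
	if err != nil {
		fmt.Println("cgroups v2 not readable, treating cgroups as off:", err)
		return
	}
	if missing := missingControllers(string(data), requiredControllers); len(missing) > 0 {
		fmt.Println("treating cgroups as off, missing controllers:", strings.Join(missing, ", "))
		return
	}
	fmt.Println("all required controllers are available")
}
```

Doing the check once, at the point where the cgroups mode is chosen, means a missing controller produces one clear message instead of an opaque failure in every later hook.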
I think your idea is a better way to implement this functionality. Making this change would require propagating errors back through quite a few places; I'm happy to do it, but want to double check with you before starting
In #19481 we've got a report with the identical symptoms but in this case the cpuset controller is activated. The result is that they've got I'm following up on this report and hopefully I'll be able to come back here with some insights.
I've updated this per @shoenig's recommendation; in the end, I didn't need to update all the paths which behave incorrectly on the
Thanks @DavidVentura! I'll review this tomorrow.
Follow-up on my comment in #19176 (comment) is that the
LGTM! Thanks @DavidVentura!
Once CI is complete, I'll merge this and get it backported so it can ship in the next release of Nomad 1.7.x.
On Linux systems that have been built without all necessary CGroup controllers enabled, you may get a non-descriptive error when initializing `cgroupslib`.
A similarly opaque error is logged on every run of `cpuparts_hook`, but I could not find a simple way of clarifying the error logs.
This PR checks whether all expected controllers are available, and logs an error if they are not.
On top of that, the controllers are enabled one at a time, giving a clearer error when one of them is not supported.
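A rough sketch of the one-at-a-time approach, to show why it gives a clearer error; this is an illustration under stated assumptions, not the PR's actual diff. The helper names (`enableEach`, `subtreeWriter`, `errUnsupported`) are hypothetical. On a real host each write would go to the cgroup's `cgroup.subtree_control` file; the demo in `main` uses a fake writer so it can be run safely anywhere.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// errUnsupported stands in for the error the kernel would return when a
// controller is not available (hypothetical, used only by the demo).
var errUnsupported = errors.New("controller not available")

// enableEach activates controllers one at a time, so that a failure can be
// attributed to a specific controller instead of to the whole batch.
func enableEach(controllers []string, write func(token string) error) error {
	for _, c := range controllers {
		if err := write("+" + c); err != nil {
			return fmt.Errorf("failed to enable cgroup controller %q: %w", c, err)
		}
	}
	return nil
}

// subtreeWriter returns a writer targeting cgroup.subtree_control in the
// given cgroup directory. It is not called by the demo below.
func subtreeWriter(cgroupDir string) func(string) error {
	path := filepath.Join(cgroupDir, "cgroup.subtree_control")
	return func(token string) error {
		return os.WriteFile(path, []byte(token), 0o644)
	}
}

func main() {
	// Fake writer simulating a kernel booted without the cpuset controller.
	fake := func(token string) error {
		if token == "+cpuset" {
			return errUnsupported
		}
		return nil
	}
	controllers := []string{"cpu", "cpuset", "io", "memory", "pids"}
	if err := enableEach(controllers, fake); err != nil {
		fmt.Println(err)
	}
}
```

The trade-off raised in the review still applies: controllers written before the failing one stay enabled, so the update is partially applied.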
You can validate this PR by appending `cgroup_disable=cpuset` to your kernel's command line arguments (`/etc/default/grub` or similar).