client: enable cpuset support for cgroups.v2 #12274
This PR introduces support for using Nomad on systems with cgroups v2 [1] enabled as the cgroups controller mounted on `/sys/fs/cgroup`. Newer Linux distros like Ubuntu 21.10 are shipping with cgroups v2 only, causing problems for Nomad users.

Nomad mostly "just works" with cgroups v2 due to the indirection via libcontainer, but not so for managing `cpuset` cgroups. Before, Nomad made use of a feature in v1 where a PID could be a member of more than one cgroup. In v2 this is no longer possible, so the logic around computing cpuset values must be modified. When Nomad detects v2, it manages cpuset values in-process, rather than making use of cgroup hierarchy inheritance via shared/reserved parents.

Nomad will only activate the v2 logic when it detects cgroups2 is mounted at `/sys/fs/cgroup`. This means that on systems running in hybrid mode with cgroups2 mounted at `/sys/fs/cgroup/unified` (as is typical), Nomad will continue to use the v1 logic and should operate as before. Systems that do not support cgroups v2 are also not affected.

When v2 is activated, Nomad will create a parent called `nomad.slice` (unless otherwise configured in Client config), and will create cgroups for tasks using the naming convention `<allocID>.<task>.scope`. These follow the naming convention set by systemd and also used by Docker when cgroups v2 is detected.

Client nodes now export a new fingerprint attribute, `unique.cgroup.version`, which will be set to `"v1"` or `"v2"` to indicate the cgroups regime in use by Nomad.

The new cpuset management strategy fixes #11705, where docker tasks that spawned processes on startup would "leak" and make use of forbidden cpu cores. With the v2 manager, PIDs are started in the cgroup they will always live in, and thus the source of the leak is eliminated.

[1] https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html

Closes #11289
Fixes #11705 #11773 #11933

Review Notes:

* `libcontainer` isn't going to allow that. Presumably we'll be able to use GHA `ubuntu-22.04` runners in April. See manual testing in comments.
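For readers who want to see what the detection step amounts to: it is essentially a statfs of the cgroup mount point, checking for the cgroup2 filesystem magic, which is the approach libcontainer's unified-mode detection takes. A minimal sketch, not the PR's exact code:

```go
// Minimal sketch: detect whether cgroups v2 is the primary controller by
// checking the filesystem type mounted at /sys/fs/cgroup. Linux-only.
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func cgroupsV2Mounted() bool {
	var st unix.Statfs_t
	if err := unix.Statfs("/sys/fs/cgroup", &st); err != nil {
		return false
	}
	// On hybrid systems /sys/fs/cgroup is a tmpfs holding the v1 hierarchies,
	// so this check fails there and the v1 logic remains in effect.
	return st.Type == unix.CGROUP2_SUPER_MAGIC
}

func main() {
	fmt.Println("cgroups v2:", cgroupsV2Mounted())
}
```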
This is looking great @shoenig! I've reviewed the driver and fingerprinter changes at this point and have started working thru the cgutil package changes. I'll have to finish my review tomorrow but I wanted to save my review in the meantime so I'll mark it as a comment.
> Starting with Nomad 1.3.0, Linux systems configured to use [cgroups v2][cgroups2] are now supported. A Nomad client will only activate its v2 control groups manager if the system is configured with the cgroups2 controller mounted at `/sys/fs/cgroup`. This implies Nomad will continue to fall back to the v1 control groups manager on systems configured to run in hybrid mode, where the cgroups2 controller is typically mounted at `/sys/fs/cgroup/unified`. Systems that do not support cgroups v2 are not affected. A new client attribute `unique.cgroup.version` indicates which version of control groups Nomad is using.
This is 100% accurate but I think it might be missing whether the user needs to do anything in particular here, especially for users who aren't really solid on just what cgroup v1 vs v2 is. I'd even consider breaking it down into bullet points:
Suggested change:

> Starting with Nomad 1.3.0, Linux systems configured to use [cgroups v2][cgroups2] are now supported. A Nomad client will only activate its v2 control groups manager if the system is configured with the cgroups2 controller mounted at `/sys/fs/cgroup`.
>
> * Systems that do not support cgroups v2 are not affected.
> * Hosts configured in hybrid mode typically mount the cgroups2 controller at `/sys/fs/cgroup/unified`, so Nomad will continue to use cgroups v1 for these hosts.
> * Hosts configured for only cgroups v2 will now correctly support `cpuset`.
>
> Nomad will preserve the existing cgroup for tasks when a client is upgraded, so there will be no disruption to tasks. A new client attribute `unique.cgroup.version` indicates which version of control groups Nomad is using.
> These cgroups are created by Nomad before a task starts. External task drivers that support containerization should be updated to make use of the new cgroup locations.
Suggested change: introduce the code block with the sentence "The new cgroup file system layout will look like the following:" and use a `shell-session` fence.
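To make the new locations concrete for external driver authors, here is a hypothetical sketch of resolving a task's cgroup directory under the layout this PR describes (`nomad.slice` parent, `<allocID>.<task>.scope` leaf). The helper name and the hard-coded mount point are assumptions for illustration, not an API from this PR.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// taskCgroupPath builds the cgroups v2 directory for a task, following the
// nomad.slice/<allocID>.<task>.scope convention described in this PR.
func taskCgroupPath(allocID, task string) string {
	return filepath.Join("/sys/fs/cgroup", "nomad.slice",
		fmt.Sprintf("%s.%s.scope", allocID, task))
}

func main() {
	fmt.Println(taskCgroupPath("f2c8b9c0", "redis"))
	// Output: /sys/fs/cgroup/nomad.slice/f2c8b9c0.redis.scope
}
```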
|
```go
func cgroupsCompatibleV1(t *testing.T) bool {
	if runtime.GOOS != "linux" {
		return false
```
Unreachable given we have the build flag on this file, but maybe ok to leave in anyways if you think it makes the code clearer?
removed in favor of a comment reminding of the build tag
```go
// GetCgroupParent returns the mount point under the root cgroup in which Nomad
// will create cgroups. If parent is not set, an appropriate name for the version
// of cgroups will be used.
func GetCgroupParent(parent string) string {
```
Is this ever not `""`?
Good catch! May as well make use of it and clean up direct calls to getParentV1/V2.
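For context, the shape implied by the doc comment is roughly the following. This is a sketch rather than the PR's code; the version flag and the default names are assumed for illustration (only `nomad.slice` is named in this PR).

```go
package main

import "fmt"

// All identifiers below are assumed for illustration; only the "nomad.slice"
// default is named in this PR.
var useV2 = false // stand-in for the client's cgroups v2 detection

const (
	defaultParentV1 = "/nomad"      // assumed v1 default
	defaultParentV2 = "nomad.slice" // v2 default named in this PR
)

// getCgroupParent falls back to a per-version default when no parent is
// configured, matching the behavior the doc comment describes.
func getCgroupParent(parent string) string {
	if parent != "" {
		return parent
	}
	if useV2 {
		return defaultParentV2
	}
	return defaultParentV1
}

func main() {
	fmt.Println(getCgroupParent("")) // "/nomad" (or "nomad.slice" when useV2)
}
```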
```diff
 if f.lastState == cgroupAvailable {
-	f.logger.Info("cgroups are unavailable")
+	f.logger.Warn("cgroups are now unavailable")
```
That seems like a bad situation to be in! 😀 But it looks like you've got the cpuset fixer handling that as gracefully as we can, so 👍
```go
// Due to Docker not allowing the configuration of the full cgroup path, we must
// manually fix the cpuset values for all docker containers continuously, as the
// values will change as tasks of any driver using reserved cores are started and
// stopped, changing the size of the remaining shared cpu pool.
```
Just for my clarity, in the case of #11705 with cgroups v2: any child processes will be in the same cgroup as their parent that Docker starts. So although they won't be pinned to the right CPU for a very brief window at startup, they'll all get moved together to the correct `cpuset`, because we're not moving processes between cgroups, just changing the cgroup in place. Is my understanding right?
Yup! And actually this is probably worth amending the RFC and reconsidering. Although this behavior is most similar to the way v1 works, I don't see why it wouldn't be worth just setting the cpuset on the docker task config on initial startup now. The drawback is I don't think we have the plumbing to get the `shared` cores - the initial value would only be the `reserved` cores requested by the task resources.
It seems like it could be easier for us to understand that way, for sure. I was going to say that some applications fingerprint cores at startup and not again afterwards (ex. `GOMAXPROCS`). But they could potentially take a hit either way: from not having enough threads if we have the initial value only be `reserved`, or from contention if we don't. I suspect if you care about this, you probably also need to care enough to set the values right for the application as well.
Our trading infrastructure reads the cpus in the cgroup on startup and indexes into the cpuset to pin threads to specific cores. If the cgroup gets modified after we've done this thread <-> core pinning, I'm not entirely sure what happens, but it's bad: either threads will still be pinned to CPUs outside of the allocated cgroup, or it just breaks altogether.

cf: #12374
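For reference, applications that do this kind of pinning typically parse the kernel's cpuset list format (e.g. `0-3,7`) from `cpuset.cpus` at startup. A minimal sketch of that parsing; this is a hypothetical helper, not code from this PR:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCpuset expands the kernel cpuset list format, e.g. "0-3,7", into
// individual CPU ids. Minimal sketch; real parsers handle more edge cases.
func parseCpuset(s string) ([]int, error) {
	var cpus []int
	for _, part := range strings.Split(strings.TrimSpace(s), ",") {
		if part == "" {
			continue
		}
		lo, hi, isRange := strings.Cut(part, "-")
		start, err := strconv.Atoi(lo)
		if err != nil {
			return nil, err
		}
		end := start
		if isRange {
			if end, err = strconv.Atoi(hi); err != nil {
				return nil, err
			}
		}
		for cpu := start; cpu <= end; cpu++ {
			cpus = append(cpus, cpu)
		}
	}
	return cpus, nil
}

func main() {
	fmt.Println(parseCpuset("0-3,7")) // [0 1 2 3 7] <nil>
}
```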
```go
func (cf *cpusetFixer) fix(c coordinate) {
	source := c.NomadCgroup()
	destination := c.DockerCgroup()
	if err := cgutil.CopyCpuset(source, destination); err != nil {
		cf.logger.Debug("failed to copy cpuset", "err", err)
	}
}
```
I really like the design here where we didn't end up plumbing any management code into the driver, but are just copying the cpu sets between the Nomad-managed cgroup and the Docker-managed one.
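For readers curious what the copy amounts to: conceptually it is a read of `cpuset.cpus` from the Nomad-managed cgroup and a write into the Docker-managed one. A hedged sketch follows; this is not necessarily `cgutil.CopyCpuset`'s actual implementation, and the Docker cgroup path in the usage example is assumed.

```go
package main

import (
	"os"
	"path/filepath"
)

// copyCpuset mirrors the cpuset.cpus value from one cgroup directory to
// another. Sketch only; the real helper may also handle mems, errors, etc.
func copyCpuset(source, destination string) error {
	data, err := os.ReadFile(filepath.Join(source, "cpuset.cpus"))
	if err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(destination, "cpuset.cpus"), data, 0o644)
}

func main() {
	_ = copyCpuset(
		"/sys/fs/cgroup/nomad.slice/f2c8b9c0.redis.scope",
		"/sys/fs/cgroup/system.slice/docker-abc123.scope", // assumed path
	)
}
```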
Still waiting to hear back on moby/moby#43363. If that works, all this goes away!
```go
return false, fmt.Errorf("Not a member of the alloc's cgroup: expected=...:/nomad/... -- found=%q", line)

// Skip rdma subsystem; rdma was added in most recent kernels and libcontainer/docker
// don't isolate it by default.
if strings.Contains(line, ":rdma:") || strings.Contains(line, "::") {
```
Should we skip `:misc:` here too?
Yup
```go
// github actions freezer cgroup
acceptable = append(acceptable, ":freezer:/actions_job")
```
Oof, I imagine you discovered this experimentally and it wasn't documented? Probably nothing we can do about it but some day that's going to change out from under us and break a bunch of tests. So at least you've documented it! 😀
Heh yeah, it's really got me ~~concerned~~ curious how GHA works under the hood ...
This looks great @shoenig, nice work! I've left a few small comments and questions.
```go
create(t, source)
defer cleanup(t, source)
```
Nitpicky: the `mgr.Apply` call isn't an atomic change. It can create the paths before the leaf successfully and then return an error for its final step. So if we call `require.NoError` in `create`, we may never call the cleanup function. Maybe `create` can clean itself up on failure, and then do a `require.NoError` to fail the test?
good catch, fixed

```go
	"github.com/stretchr/testify/require"
)

// Note: these tests need to run on GitHub Actions runners with only 2 cores.
```
Is it "at least 2 cores"? It'd be good to have a `testutil` for skipping this based on the number of cores, because otherwise folks might end up running on single-core Vagrant boxes and failing unexpectedly.
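Something along these lines would do it; a sketch of the kind of `testutil` helper being suggested (the name is assumed, not necessarily what landed in the PR):

```go
package testutil

import (
	"runtime"
	"testing"
)

// MinimumCores skips the test unless the host exposes at least n CPU cores.
// Hypothetical helper illustrating the suggestion above.
func MinimumCores(t *testing.T, n int) {
	t.Helper()
	if runtime.NumCPU() < n {
		t.Skipf("test requires at least %d cores, have %d", n, runtime.NumCPU())
	}
}
```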
good idea, added
In `client/lib/cgutil/cgutil_linux.go`:
```go
// CgroupScope returns the name of the scope for Nomad's managed cgroups for
// the given allocID and task.
//
// e.g. "<allocID>-<task>.scope"
```
I think we want a dot to match what we've done in the code everywhere:

```diff
-// e.g. "<allocID>-<task>.scope"
+// e.g. "<allocID>.<task>.scope"
```
fixed!
```go
// identity is the "<allocID>.<taskName>" string that uniquely identifies an
// individual instance of a task within the flat cgroup namespace
type identity string
```
I ❤️ type-aliased IDs
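For illustration, a type like this is typically composed and decomposed with small helpers. These are hypothetical (not code from the PR), and they assume alloc IDs contain no dots:

```go
package main

import (
	"fmt"
	"strings"
)

// identity mirrors the type above: "<allocID>.<taskName>".
type identity string

// makeID composes an identity from its parts; hypothetical helper.
func makeID(allocID, taskName string) identity {
	return identity(allocID + "." + taskName)
}

// split decomposes an identity, cutting at the first dot since alloc IDs
// (UUIDs) contain none.
func (id identity) split() (allocID, taskName string) {
	allocID, taskName, _ = strings.Cut(string(id), ".")
	return allocID, taskName
}

func main() {
	id := makeID("f2c8b9c0", "redis")
	a, task := id.split()
	fmt.Println(id, a, task)
}
```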
```go
// rootless is (for now) always false; Nomad clients require root, so we
// assume to not need to do the extra plumbing for rootless cgroups.
rootless = false
```
It looks like the version of libcontainer we have would return an error if we passed `rootless = true` anyways. And more recent versions of `libcontainer/cgroups/fs2` drop this parameter entirely. Not sure whether we want to update the comment to reflect we know that it's safe to remove when we bump versions of libcontainer?
opened and noted in #12372
```go
// null represents nothing
var null = nothing{}
```
I like what you're doing here but calling it `null` means we do `c.sharing[id] = null` to add an ID to the sharing set, which reads almost the opposite of what you're going for. Maybe if it were `c.sharing[id] = ok` or something like that?

(What I wouldn't do for a real set type in the stdlib!)
renamed `null` to `present` to be more clear

> (What I wouldn't do for a real set type in the stdlib!)

+10000000000000000000000
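For anyone following along, this is the empty-struct set idiom under discussion; a minimal self-contained sketch:

```go
package main

import "fmt"

// nothing is a zero-byte type; map[K]nothing stores only membership.
type nothing struct{}

// present is the value written on insert (the name this thread settled on).
var present = nothing{}

func main() {
	sharing := map[string]nothing{}
	sharing["f2c8b9c0.redis"] = present // add to the set
	_, ok := sharing["f2c8b9c0.redis"]  // membership test
	fmt.Println(ok)                     // true
	delete(sharing, "f2c8b9c0.redis")   // remove from the set
}
```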
```go
// We avoid removing a cgroup if it still contains a PID, as the cpuset manager
// may be initially empty on a Nomad client restart.
```
I don't think we need to do it for this PR, but it might be nice if we can eventually figure out a way to clean up stray cgroup entries. The allocdir hook is the only hook that comes before the cgroup hook, but if its postrun hook throws an error I think we'll never clean up the cgroup entry for that PID?
This is a good point; it's a similar problem for other resources like networks too, right?
```go
if taskInfo.Error != nil {
	break
}

// ...

timer.Reset(100 * time.Millisecond)
```
We could move this into the `timer.C` case below, I think. Or do we pretty much always need the initial 100ms?
fixed
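For clarity, the suggested shape re-arms the timer only after each tick is handled. A hedged sketch under assumed names (`pollUntilDone` and `poll` are stand-ins, not the PR's identifiers):

```go
package main

import (
	"context"
	"time"
)

// pollUntilDone resets the timer inside the timer.C case, so the 100ms delay
// applies between polls rather than unconditionally before the select.
func pollUntilDone(ctx context.Context, poll func() bool) {
	timer := time.NewTimer(100 * time.Millisecond)
	defer timer.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-timer.C:
			if poll() { // stand-in for the real per-tick work
				return
			}
			timer.Reset(100 * time.Millisecond)
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	pollUntilDone(ctx, func() bool { return false })
}
```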