client: refactor cpuset partitioning #18371
Conversation
This PR updates the way the Nomad client manages the split between tasks that use resources.cpus vs. resources.cores. Previously, each task was explicitly assigned the CPU cores it was allowed to run on, and every time a task was started or destroyed, every other task's cpuset had to be updated. This was inefficient and would crush the Linux kernel when a client tried to run ~400 or so tasks. Now we make use of the cgroup hierarchy and cpuset inheritance to manage cpusets efficiently.
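To sketch the idea (illustrative only; the names, file layout, and types below are assumptions, not the actual cgroupslib code): the client keeps one parent cpuset cgroup for the shared pool used by resources.cpus tasks and one for the cores reserved by resources.cores tasks. Reserving or releasing cores rewrites only those two parent cpuset.cpus files; every task cgroup nested underneath inherits its parent's cpuset, so no other task needs to be touched.

// Illustrative sketch only: the cgroup names, file layout, and types below are
// assumptions for explanation, not the actual client/lib/cgroupslib code.
package cpusetsketch

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strconv"
	"strings"
)

// partition tracks which CPU cores belong to the shared pool (resources.cpus
// tasks) and which are reserved for resources.cores tasks.
type partition struct {
	root    string           // hypothetical parent dir, e.g. /sys/fs/cgroup/nomad.slice
	share   map[int]struct{} // cores usable by any "cpus" task
	reserve map[int]struct{} // cores pinned to specific "cores" tasks
}

// Reserve moves cores out of the shared pool. Only the two parent cpuset.cpus
// files get rewritten; task cgroups nested under each parent inherit the new
// values, so no per-task updates are needed.
func (p *partition) Reserve(cores []int) error {
	for _, c := range cores {
		delete(p.share, c)
		p.reserve[c] = struct{}{}
	}
	return p.write()
}

// Release returns cores to the shared pool when a task is destroyed.
func (p *partition) Release(cores []int) error {
	for _, c := range cores {
		delete(p.reserve, c)
		p.share[c] = struct{}{}
	}
	return p.write()
}

// write updates the cpuset.cpus file of each parent cgroup.
func (p *partition) write() error {
	for dir, set := range map[string]map[int]struct{}{
		"share.slice":   p.share,
		"reserve.slice": p.reserve,
	} {
		file := filepath.Join(p.root, dir, "cpuset.cpus")
		if err := os.WriteFile(file, []byte(cpusetString(set)), 0o644); err != nil {
			return fmt.Errorf("writing %s: %w", file, err)
		}
	}
	return nil
}

// cpusetString renders core IDs in the kernel's list format, e.g. "0,1,2,3".
func cpusetString(set map[int]struct{}) string {
	ids := make([]int, 0, len(set))
	for id := range set {
		ids = append(ids, id)
	}
	sort.Ints(ids)
	parts := make([]string, len(ids))
	for i, id := range ids {
		parts[i] = strconv.Itoa(id)
	}
	return strings.Join(parts, ",")
}

The actual code works in terms of idset.Set[hw.CoreID] values (as seen in the mock partition further down), but the inheritance idea is the same.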
Looks great @shoenig!
In addition to my review, I checked the branch out locally and ran some jobs with and without cores on a cg2 machine. It looks like all the cpuset cgroups are getting configured successfully as I'd expect, including for the docker driver, which was kind of tricky to figure out -- we should probably document where folks should expect those cgroups to be updated.
Nice work!
client/lib/cgroupslib/testing.go
Outdated
// MockPartition creates an in-memory Partition manager backed by 8 fake cpu cores.
func MockPartition() Partition {
	return &mock{
		share:   idset.From[hw.CoreID]([]hw.CoreID{0, 1, 2, 3, 4, 5, 6, 7}),
		reserve: idset.Empty[hw.CoreID](),
	}
}
Nitpicky: given there's only one caller of this (in client/state/upgrade_int_test.go), and that caller doesn't seem to care about the results, could we give it a NoopPartition instead? For most testing, it seems like being able to use either the noop partition or the real one with paths in the test tempdir would be sufficient.
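For illustration, a no-op implementation could be roughly this small (the Partition method set shown here is an assumption based on the mock above, not necessarily the real interface):

// noop satisfies Partition but never touches any cgroup files, which is all
// a state-upgrade test needs. Method signatures are assumed for illustration.
type noop struct{}

func NoopPartition() Partition { return new(noop) }

func (noop) Restore(*idset.Set[hw.CoreID])       {}
func (noop) Reserve(*idset.Set[hw.CoreID]) error { return nil }
func (noop) Release(*idset.Set[hw.CoreID]) error { return nil }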
Good callout, fixed!
client/allocrunner/cpuparts_hook.go
Outdated
) *cpuPartsHook {

	return &cpuPartsHook{
		logger: logger,
		logger: logger,
		logger: logger.Named("cpuPartsHookName"),
@@ -0,0 +1,56 @@
// Copyright (c) HashiCorp, Inc.
I love how clean the hook code is here because you've isolated the cpuset management really well.
client/lib/cgroupslib/editor.go
Outdated
@@ -118,6 +117,7 @@ type Lifecycle interface {
type lifeCG1 struct {
	allocID string
	task    string
	cores   bool // uses core reservation
	cores bool // uses core reservation
	reserveCores bool // uses core reservation
This might make some of the downstream code a little more legible. But it's also package-internal so 🤷
// the name of the cpuset mems interface file
const memsFile = "cpuset.mems"

const memsSet = "0" // TODO(shoenig) get from topology
Is this TODO for a follow-up PR or was it missed here?
It's intentional; the memset plumbing will be part of the NUMA implementation PR.
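In the meantime the constant just gets written as-is; conceptually something like the following (a hypothetical helper for illustration, not the actual editor code):

// writeMems pins a cgroup to NUMA node 0 by writing the placeholder memsSet
// value into its cpuset.mems file. Path handling here is illustrative only;
// the real editor builds its own paths.
func writeMems(cgroupDir string) error {
	return os.WriteFile(filepath.Join(cgroupDir, memsFile), []byte(memsSet), 0o644)
}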
drivers/docker/cpuset.go
Outdated
// cpuset is used to manage the cpuset.cpus interface file in the cgroup that
// docker daemon creates for the container being run by the task driver. we
// must do this hack because docker does not allow
does not allow...?
heh, fixed with "running in a pre-existing cgroup that we control".
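In other words, because dockerd creates the container's cgroup itself, the driver has to reach into that cgroup after the container starts and overwrite cpuset.cpus with the cores Nomad allocated. A minimal sketch of that approach, assuming the cgroups v1 layout shown further down (function name and error handling are illustrative, not the real drivers/docker code):

import (
	"os"
	"path/filepath"
)

// setCPUSet overwrites cpuset.cpus in the cgroup dockerd created for the
// container, pinning it to the cores Nomad allocated. The path shown is the
// cgroups v1 layout discussed below; v2 layouts differ.
func setCPUSet(containerID, cpus string) error {
	path := filepath.Join("/sys/fs/cgroup/cpuset/docker", containerID, "cpuset.cpus")
	return os.WriteFile(path, []byte(cpus), 0o644)
}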
drivers/docker/driver_test.go
Outdated
@@ -1560,6 +1560,8 @@ func TestDockerDriver_Init(t *testing.T) {
}

func TestDockerDriver_CPUSetCPUs(t *testing.T) {
	// The cpuset_cpus config option is ignored starting in Nomad 1.6
	// The cpuset_cpus config option is ignored starting in Nomad 1.6
	// The cpuset_cpus config option is ignored starting in Nomad 1.7
I think? Or are we ignoring it already?
ah good catch, 1.7 is the correct version
case cgroupslib.CG1:
	cgroup = "/sys/fs/cgroup/cpuset/docker/" + h.containerID
default:
	// systemd driver; not sure if we need to consider cgroupfs driver
This sure feels like the sort of thing few people are going to fiddle with, but if you're running Docker on a non-systemd Linux (like Alpine or Void, which we know a few users are doing), I bet you end up with the cgroupfs driver. We might want to consider trying a fallback path here.
Created #18461 to make sure these cases get covered.
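Until that lands, a fallback could be as simple as probing the likely locations and using whichever exists. A sketch only; the candidate layouts are assumptions about common Docker configurations, not an exhaustive list:

import (
	"fmt"
	"os"
	"path/filepath"
)

// containerCpusetDir guesses where dockerd put the container's cpuset cgroup
// by trying the usual cgroupfs and systemd driver layouts on cgroups v1.
func containerCpusetDir(containerID string) (string, error) {
	candidates := []string{
		filepath.Join("/sys/fs/cgroup/cpuset/docker", containerID),                          // cgroupfs driver
		filepath.Join("/sys/fs/cgroup/cpuset/system.slice", "docker-"+containerID+".scope"), // systemd driver
	}
	for _, dir := range candidates {
		if _, err := os.Stat(dir); err == nil {
			return dir, nil
		}
	}
	return "", fmt.Errorf("no cpuset cgroup found for container %s", containerID)
}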
// // set the libcontainer hook for writing the PID to cgroup.procs file
// TODO: this can be cg1 only, right?
// l.configureCgroupHook(cfg, command)
// TODO: this can be cg1 only, right?
I think so, but would be worth checking via a smoke test on cg2
Documentation coming in another PR.
Reviewers: see NMD-186 (internal)