
client: refactor cpuset partitioning #18371

Merged (2 commits) on Sep 12, 2023

Conversation

@shoenig (Member) commented on Aug 31, 2023

This PR updates the way the Nomad client manages the split between tasks
that make use of resources.cpus vs. resources.cores.

Previously, each task was explicitly assigned the CPU cores it was allowed
to run on. Every time a task was started or destroyed, every other task's
cpuset had to be updated. This was inefficient and would crush the Linux
kernel when a client tried to run ~400 or so tasks.

Now, we make use of the cgroup hierarchy and cpuset inheritance to
efficiently manage cpusets.

Documentation coming in another PR.

Reviewers: see NMD-186 (internal)
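
For illustration only, here is a minimal sketch of the inheritance idea, assuming a hypothetical cgroup v2 layout with share/reserve parent slices (paths and names are placeholders, not the exact ones the client uses):

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// Hypothetical layout: tasks using resources.cpus run under a shared parent
// cgroup and simply inherit its cpuset; tasks using resources.cores each get
// a child of a reserved parent. Paths are illustrative only.
const (
	shareParent   = "/sys/fs/cgroup/nomad.slice/share.slice"
	reserveParent = "/sys/fs/cgroup/nomad.slice/reserve.slice"
)

func writeCpuset(cgroupDir, cpus string) error {
	return os.WriteFile(filepath.Join(cgroupDir, "cpuset.cpus"), []byte(cpus), 0o644)
}

func main() {
	// Moving cores 2-3 into the reserved pool touches only the two parents.
	if err := writeCpuset(shareParent, "0-1,4-7"); err != nil {
		fmt.Println("write share cpuset:", err)
		return
	}
	if err := writeCpuset(reserveParent, "2-3"); err != nil {
		fmt.Println("write reserve cpuset:", err)
		return
	}

	// A cores task gets its own child cgroup with just its cores; shared
	// tasks need no per-task cpuset writes at all, because they inherit
	// the shared parent's cpuset.
	taskDir := filepath.Join(reserveParent, "example-task.scope")
	if err := os.MkdirAll(taskDir, 0o755); err != nil {
		fmt.Println("mkdir task cgroup:", err)
		return
	}
	if err := writeCpuset(taskDir, "2-3"); err != nil {
		fmt.Println("write task cpuset:", err)
		return
	}
}

The key property is that starting or stopping a task only writes to that task's own cgroup (plus the two parents when cores are reserved), instead of rewriting every other task's cpuset.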

@tgross (Member) left a comment:

Looks great @shoenig!

In addition to my review, I checked the branch out locally and ran some jobs with and without cores on a cg2 machine. All the cpuset cgroups are getting configured as I'd expect, including for the docker driver, which was kind of tricky to figure out; we should probably document where folks should expect those cgroups to be updated.

Nice work!

Comment on lines 13 to 19
// MockPartition creates an in-memory Partition manager backed by 8 fake cpu cores.
func MockPartition() Partition {
	return &mock{
		share:   idset.From[hw.CoreID]([]hw.CoreID{0, 1, 2, 3, 4, 5, 6, 7}),
		reserve: idset.Empty[hw.CoreID](),
	}
}
Member:

Nitpicky: given there's only one caller of this (in client/state/upgrade_int_test.go), and that caller doesn't seem to care about the results, could we give it a NoopPartition instead? For most testing it seems like being able to use either the noop or the real partitions with paths in the test tempdir would be sufficient.
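
For reference, a no-op implementation could be as small as the sketch below; the package placement, import paths, and the Restore/Reserve/Release method set are assumptions inferred from the mock above, not necessarily the real Partition interface:

package cgroupslib // placement is an assumption

import (
	// import paths assumed from the identifiers used in the mock above
	"github.com/hashicorp/nomad/client/lib/idset"
	"github.com/hashicorp/nomad/client/lib/numalib/hw"
)

// NoopPartition returns a Partition that performs no cpuset management,
// for tests that do not care about core reservations. Method signatures
// here are assumed and may not match the interface in the tree.
func NoopPartition() Partition { return &noop{} }

type noop struct{}

func (n *noop) Restore(*idset.Set[hw.CoreID])       {}
func (n *noop) Reserve(*idset.Set[hw.CoreID]) error { return nil }
func (n *noop) Release(*idset.Set[hw.CoreID]) error { return nil }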

Member Author (@shoenig):

Good callout, fixed!

) *cpuPartsHook {
	return &cpuPartsHook{
		logger: logger,
Member:

Suggested change:
-	logger: logger,
+	logger: logger.Named("cpuPartsHookName"),

@@ -0,0 +1,56 @@
// Copyright (c) HashiCorp, Inc.
Member:

I love how clean the hook code is here because you've isolated the cpuset management really well.

@@ -118,6 +117,7 @@ type Lifecycle interface {
type lifeCG1 struct {
	allocID string
	task    string
	cores   bool // uses core reservation
Member:

Suggested change:
-	cores bool // uses core reservation
+	reserveCores bool // uses core reservation

This might make some of the downstream code a little more legible. But it's also package-internal so 🤷

// the name of the cpuset mems interface file
const memsFile = "cpuset.mems"

const memsSet = "0" // TODO(shoenig) get from topology
Member:

Is this TODO for a follow-up PR or was it missed here?

Member Author (@shoenig):

It's intentional; the cpuset.mems plumbing will be part of the NUMA implementation PR.
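
Purely as an illustration of what "get from topology" could mean here (the input is a stand-in; the real topology types belong to the NUMA PR):

import (
	"strconv"
	"strings"
)

// memsFromNodes renders a cpuset.mems value such as "0" or "0,1" from
// detected NUMA node IDs, replacing the hard-coded memsSet placeholder.
// The []int input is a stand-in for whatever the topology code exposes.
func memsFromNodes(nodeIDs []int) string {
	parts := make([]string, len(nodeIDs))
	for i, id := range nodeIDs {
		parts[i] = strconv.Itoa(id)
	}
	return strings.Join(parts, ",")
}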


// cpuset is used to manage the cpuset.cpus interface file in the cgroup that
// docker daemon creates for the container being run by the task driver. we
// must do this hack because docker does not allow
Member:

does not allow...?

Member Author (@shoenig):

heh, fixed with "running in a pre-existing cgroup that we control".
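
In practice the workaround boils down to rewriting cpuset.cpus inside the cgroup dockerd created for the container; a rough sketch (the helper name is hypothetical, and the path follows the CG1 branch quoted further down):

import (
	"os"
	"path/filepath"
)

// setDockerCpuset overwrites cpuset.cpus in the container's docker-created
// cgroup, since docker will not start the container inside a pre-existing
// cgroup that we control. Path assumes cgroups v1 with the default "docker"
// cgroup parent.
func setDockerCpuset(containerID, cpus string) error {
	path := filepath.Join("/sys/fs/cgroup/cpuset/docker", containerID, "cpuset.cpus")
	return os.WriteFile(path, []byte(cpus), 0o644)
}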

@@ -1560,6 +1560,8 @@ func TestDockerDriver_Init(t *testing.T) {
}

func TestDockerDriver_CPUSetCPUs(t *testing.T) {
// The cpuset_cpus config option is ignored starting in Nomad 1.6
Member:

Suggested change:
-	// The cpuset_cpus config option is ignored starting in Nomad 1.6
+	// The cpuset_cpus config option is ignored starting in Nomad 1.7

I think? Or are we ignoring it already?

Member Author (@shoenig):

ah good catch, 1.7 is the correct version

case cgroupslib.CG1:
	cgroup = "/sys/fs/cgroup/cpuset/docker/" + h.containerID
default:
	// systemd driver; not sure if we need to consider cgroupfs driver
Member:

This sure feels like the sort of thing few people are going to fiddle with, but if you're running Docker on a non-systemd Linux (like Alpine or Void, which we know a few users are doing), I bet you end up with the cgroupfs driver. We might want to consider trying a fallback path here.

Member Author (@shoenig):

Created #18461 to make sure these cases get covered.
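
For completeness, the fallback idea might look roughly like this (illustrative only; the real handling is tracked in #18461, and both paths are assumptions about common docker defaults rather than Nomad's shipped behavior):

import "os"

// dockerCgroupV2Path guesses the container's cgroup directory under cgroups
// v2: try the systemd driver layout first, then fall back to the cgroupfs
// driver layout seen on non-systemd distros such as Alpine.
func dockerCgroupV2Path(containerID string) string {
	systemd := "/sys/fs/cgroup/system.slice/docker-" + containerID + ".scope"
	if _, err := os.Stat(systemd); err == nil {
		return systemd
	}
	return "/sys/fs/cgroup/docker/" + containerID
}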

Comment on lines +661 to +663
// // set the libcontainer hook for writing the PID to cgroup.procs file
// TODO: this can be cg1 only, right?
// l.configureCgroupHook(cfg, command)
Member:

// TODO: this can be cg1 only, right?

I think so, but would be worth checking via a smoke test on cg2
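
If the smoke test confirms it, the guard presumably ends up looking something like the sketch below (GetMode is an assumed helper name, based on how cgroupslib.CG1 is used elsewhere in the diff):

// Only install the libcontainer cgroup.procs hook on cgroups v1; a sketch,
// not the shipped code.
if cgroupslib.GetMode() == cgroupslib.CG1 {
	l.configureCgroupHook(cfg, command)
}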
