client: refactor cpuset partitioning #18371
Conversation
This PR updates the way the Nomad client manages the split between tasks that use resources.cpus vs. resources.cores. Previously, each task was explicitly assigned the CPU cores it was allowed to run on, and every time a task was started or destroyed, every other task's cpuset had to be updated. This was inefficient and would crush the Linux kernel when a client tried to run ~400 or so tasks. Now we make use of the cgroup hierarchy and cpuset inheritance to manage cpusets efficiently.
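To sketch the idea (illustrative only; the names, file layout, and types below are assumptions, not the actual cgroupslib code): the client keeps one parent cpuset cgroup for the shared pool used by resources.cpus tasks and one for the cores reserved by resources.cores tasks. Reserving or releasing cores rewrites only those two parent cpuset.cpus files; every task cgroup nested underneath inherits its parent's cpuset, so no other task needs to be touched.

// Illustrative sketch only: the cgroup names, file layout, and types below are
// assumptions for explanation, not the actual client/lib/cgroupslib code.
package cpusetsketch

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strconv"
	"strings"
)

// partition tracks which CPU cores belong to the shared pool (resources.cpus
// tasks) and which are reserved for resources.cores tasks.
type partition struct {
	root    string           // hypothetical parent dir, e.g. /sys/fs/cgroup/nomad.slice
	share   map[int]struct{} // cores usable by any "cpus" task
	reserve map[int]struct{} // cores pinned to specific "cores" tasks
}

// Reserve moves cores out of the shared pool. Only the two parent cpuset.cpus
// files get rewritten; task cgroups nested under each parent inherit the new
// values, so no per-task updates are needed.
func (p *partition) Reserve(cores []int) error {
	for _, c := range cores {
		delete(p.share, c)
		p.reserve[c] = struct{}{}
	}
	return p.write()
}

// Release returns cores to the shared pool when a task is destroyed.
func (p *partition) Release(cores []int) error {
	for _, c := range cores {
		delete(p.reserve, c)
		p.share[c] = struct{}{}
	}
	return p.write()
}

// write updates the cpuset.cpus file of each parent cgroup.
func (p *partition) write() error {
	for dir, set := range map[string]map[int]struct{}{
		"share.slice":   p.share,
		"reserve.slice": p.reserve,
	} {
		file := filepath.Join(p.root, dir, "cpuset.cpus")
		if err := os.WriteFile(file, []byte(cpusetString(set)), 0o644); err != nil {
			return fmt.Errorf("writing %s: %w", file, err)
		}
	}
	return nil
}

// cpusetString renders core IDs in the kernel's list format, e.g. "0,1,2,3".
func cpusetString(set map[int]struct{}) string {
	ids := make([]int, 0, len(set))
	for id := range set {
		ids = append(ids, id)
	}
	sort.Ints(ids)
	parts := make([]string, len(ids))
	for i, id := range ids {
		parts[i] = strconv.Itoa(id)
	}
	return strings.Join(parts, ",")
}

The actual code works in terms of idset.Set[hw.CoreID] values (as seen in the mock partition further down), but the inheritance idea is the same.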
Looks great @shoenig!
In addition to my review, I checked the branch out locally and ran some jobs with and without cores on a cg2 machine. It looks like all the cpuset cgroups are getting configured successfully as I'd expect, including for the docker driver, which was kind of tricky to figure out -- we should probably document where folks should expect those cgroups to be updated.
Nice work!
client/lib/cgroupslib/testing.go
Outdated
// MockPartition creates an in-memory Partition manager backed by 8 fake cpu cores.
func MockPartition() Partition {
	return &mock{
		share:   idset.From[hw.CoreID]([]hw.CoreID{0, 1, 2, 3, 4, 5, 6, 7}),
		reserve: idset.Empty[hw.CoreID](),
	}
}
Nitpicky: given there's only one caller of this (in client/state/upgrade_int_test.go), and that caller doesn't seem to care about the results, could we give it a NoopPartition instead? For most testing, it seems like being able to use either the noop partition or the real one with paths in the test tempdir would be sufficient.
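For illustration, a no-op implementation could be roughly this small (the Partition method set shown here is an assumption based on the mock above, not necessarily the real interface):

// noop satisfies Partition but never touches any cgroup files, which is all
// a state-upgrade test needs. Method signatures are assumed for illustration.
type noop struct{}

func NoopPartition() Partition { return new(noop) }

func (noop) Restore(*idset.Set[hw.CoreID])       {}
func (noop) Reserve(*idset.Set[hw.CoreID]) error { return nil }
func (noop) Release(*idset.Set[hw.CoreID]) error { return nil }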
Good callout, fixed!
client/allocrunner/cpuparts_hook.go
Outdated
) *cpuPartsHook {

	return &cpuPartsHook{
		logger: logger,
		logger: logger,
		logger: logger.Named("cpuPartsHookName"),
@@ -0,0 +1,56 @@
// Copyright (c) HashiCorp, Inc.
I love how clean the hook code is here because you've isolated the cpuset management really well.
client/lib/cgroupslib/editor.go
Outdated
@@ -118,6 +117,7 @@ type Lifecycle interface {
type lifeCG1 struct {
	allocID string
	task    string
	cores   bool // uses core reservation
	cores bool // uses core reservation
	reserveCores bool // uses core reservation
This might make some of the downstream code a little more legible. But it's also package-internal so 🤷
// the name of the cpuset mems interface file
const memsFile = "cpuset.mems"

const memsSet = "0" // TODO(shoenig) get from topology
Is this TODO for a follow-up PR or was it missed here?
It's intentional; the memset plumbing will be part of the NUMA implementation PR.
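In the meantime the constant just gets written as-is; conceptually something like the following (a hypothetical helper for illustration, not the actual editor code):

// writeMems pins a cgroup to NUMA node 0 by writing the placeholder memsSet
// value into its cpuset.mems file. Path handling here is illustrative only;
// the real editor builds its own paths.
func writeMems(cgroupDir string) error {
	return os.WriteFile(filepath.Join(cgroupDir, memsFile), []byte(memsSet), 0o644)
}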
drivers/docker/cpuset.go
Outdated
// cpuset is used to manage the cpuset.cpus interface file in the cgroup that
// docker daemon creates for the container being run by the task driver. we
// must do this hack because docker does not allow
does not allow...?
heh, fixed with "running in a pre-existing cgroup that we control".
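In other words, because dockerd creates the container's cgroup itself, the driver has to reach into that cgroup after the container starts and overwrite cpuset.cpus with the cores Nomad allocated. A minimal sketch of that approach, assuming the cgroups v1 layout shown further down (function name and error handling are illustrative, not the real drivers/docker code):

import (
	"os"
	"path/filepath"
)

// setCPUSet overwrites cpuset.cpus in the cgroup dockerd created for the
// container, pinning it to the cores Nomad allocated. The path shown is the
// cgroups v1 layout discussed below; v2 layouts differ.
func setCPUSet(containerID, cpus string) error {
	path := filepath.Join("/sys/fs/cgroup/cpuset/docker", containerID, "cpuset.cpus")
	return os.WriteFile(path, []byte(cpus), 0o644)
}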
drivers/docker/driver_test.go
Outdated
@@ -1560,6 +1560,8 @@ func TestDockerDriver_Init(t *testing.T) {
}

func TestDockerDriver_CPUSetCPUs(t *testing.T) {
	// The cpuset_cpus config option is ignored starting in Nomad 1.6
	// The cpuset_cpus config option is ignored starting in Nomad 1.6
	// The cpuset_cpus config option is ignored starting in Nomad 1.7
I think? Or are we ignoring it already?
ah good catch, 1.7 is the correct version
case cgroupslib.CG1:
	cgroup = "/sys/fs/cgroup/cpuset/docker/" + h.containerID
default:
	// systemd driver; not sure if we need to consider cgroupfs driver
This sure feels like the sort of thing few people are going to fiddle with, but if you're running Docker on a non-systemd Linux (like Alpine or Void, which we know a few users are doing), I bet you end up with the cgroupfs driver. We might want to consider trying a fallback path here.
Created #18461 to make sure these cases get covered.
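Until that lands, a fallback could be as simple as probing the likely locations and using whichever exists. A sketch only; the candidate layouts are assumptions about common Docker configurations, not an exhaustive list:

import (
	"fmt"
	"os"
	"path/filepath"
)

// containerCpusetDir guesses where dockerd put the container's cpuset cgroup
// by trying the usual cgroupfs and systemd driver layouts on cgroups v1.
func containerCpusetDir(containerID string) (string, error) {
	candidates := []string{
		filepath.Join("/sys/fs/cgroup/cpuset/docker", containerID),                          // cgroupfs driver
		filepath.Join("/sys/fs/cgroup/cpuset/system.slice", "docker-"+containerID+".scope"), // systemd driver
	}
	for _, dir := range candidates {
		if _, err := os.Stat(dir); err == nil {
			return dir, nil
		}
	}
	return "", fmt.Errorf("no cpuset cgroup found for container %s", containerID)
}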
// // set the libcontainer hook for writing the PID to cgroup.procs file
// TODO: this can be cg1 only, right?
// l.configureCgroupHook(cfg, command)
// TODO: this can be cg1 only, right?
I think so, but would be worth checking via a smoke test on cg2
Documentation coming in another PR.
Reviewers: see NMD-186 (internal)