client: enable cpuset support for cgroups.v2 #12274
This PR introduces support for using Nomad on systems with cgroups v2 [1] enabled as the cgroups controller mounted on `/sys/fs/cgroup`. Newer Linux distros like Ubuntu 21.10 are shipping with cgroups v2 only, causing problems for Nomad users.

Nomad mostly "just works" with cgroups v2 due to the indirection via libcontainer, but not so for managing `cpuset` cgroups. Before, Nomad made use of a feature in v1 where a PID could be a member of more than one cgroup. In v2 this is no longer possible, so the logic around computing cpuset values must be modified. When Nomad detects v2, it manages cpuset values in-process, rather than making use of cgroup hierarchy inheritance via shared/reserved parents.

Nomad will only activate the v2 logic when it detects cgroups2 is mounted at `/sys/fs/cgroup`. This means that on systems running in hybrid mode with cgroups2 mounted at `/sys/fs/cgroup/unified` (as is typical), Nomad will continue to use the v1 logic and should operate as before. Systems that do not support cgroups v2 are also not affected.

When v2 is activated, Nomad will create a parent called `nomad.slice` (unless otherwise configured in Client config), and will create cgroups for tasks using the naming convention `<allocID>.<task>.scope`. These follow the naming convention set by systemd and also used by Docker when cgroups v2 is detected.

Client nodes now export a new fingerprint attribute, `unique.cgroup.version`, which will be set to `"v1"` or `"v2"` to indicate the cgroups regime in use by Nomad.

The new cpuset management strategy fixes #11705, where docker tasks that spawned processes on startup would "leak" and make use of forbidden cpu cores. With the v2 manager, PIDs are started in the cgroup they will always live in, and thus the source of the leak is eliminated.

[1] https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html

Closes #11289
Fixes #11705 #11773 #11933

Review Notes:

* `libcontainer` isn't going to allow that. Presumably we'll be able to use GHA `ubuntu-22.04` runners in April. See manual testing in comments.
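For readers who want to see what the detection step amounts to: it is essentially a statfs of the cgroup mount point, checking for the cgroup2 filesystem magic, which is the approach libcontainer's unified-mode detection takes. A minimal sketch, not the PR's exact code:

```go
// Minimal sketch: detect whether cgroups v2 is the primary controller by
// checking the filesystem type mounted at /sys/fs/cgroup. Linux-only.
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func cgroupsV2Mounted() bool {
	var st unix.Statfs_t
	if err := unix.Statfs("/sys/fs/cgroup", &st); err != nil {
		return false
	}
	// On hybrid systems /sys/fs/cgroup is a tmpfs holding the v1 hierarchies,
	// so this check fails there and the v1 logic remains in effect.
	return st.Type == unix.CGROUP2_SUPER_MAGIC
}

func main() {
	fmt.Println("cgroups v2:", cgroupsV2Mounted())
}
```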
This is looking great @shoenig! I've reviewed the driver and fingerprinter changes at this point and have started working thru the cgutil package changes. I'll have to finish my review tomorrow but I wanted to save my review in the meantime so I'll mark it as a comment.
> Starting with Nomad 1.3.0, Linux systems configured to use [cgroups v2][cgroups2] are now supported. A Nomad client will only activate its v2 control groups manager if the system is configured with the cgroups2 controller mounted at `/sys/fs/cgroup`. This implies Nomad will continue to fall back to the v1 control groups manager on systems configured to run in hybrid mode, where the cgroups2 controller is typically mounted at `/sys/fs/cgroup/unified`. Systems that do not support cgroups v2 are not affected. A new client attribute `unique.cgroup.version` indicates which version of control groups Nomad is using.
This is 100% accurate but I think it might be missing whether the user needs to do anything in particular here, especially for users who aren't really solid on just what cgroup v1 vs v2 is. I'd even consider breaking it down into bullet points:
Suggested change:

> Starting with Nomad 1.3.0, Linux systems configured to use [cgroups v2][cgroups2] are now supported. A Nomad client will only activate its v2 control groups manager if the system is configured with the cgroups2 controller mounted at `/sys/fs/cgroup`.
>
> * Systems that do not support cgroups v2 are not affected.
> * Hosts configured in hybrid mode typically mount the cgroups2 controller at `/sys/fs/cgroup/unified`, so Nomad will continue to use cgroups v1 for these hosts.
> * Hosts configured for only cgroups v2 will now correctly support `cpuset`.
>
> Nomad will preserve the existing cgroup for tasks when a client is upgraded, so there will be no disruption to tasks. A new client attribute `unique.cgroup.version` indicates which version of control groups Nomad is using.
> These cgroups are created by Nomad before a task starts. External task drivers that support containerization should be updated to make use of the new cgroup locations.
Suggested change: introduce the code block with the sentence "The new cgroup file system layout will look like the following:" and use a `shell-session` fence.
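To make the new locations concrete for external driver authors, here is a hypothetical sketch of resolving a task's cgroup directory under the layout this PR describes (`nomad.slice` parent, `<allocID>.<task>.scope` leaf). The helper name and the hard-coded mount point are assumptions for illustration, not an API from this PR.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// taskCgroupPath builds the cgroups v2 directory for a task, following the
// nomad.slice/<allocID>.<task>.scope convention described in this PR.
func taskCgroupPath(allocID, task string) string {
	return filepath.Join("/sys/fs/cgroup", "nomad.slice",
		fmt.Sprintf("%s.%s.scope", allocID, task))
}

func main() {
	fmt.Println(taskCgroupPath("f2c8b9c0", "redis"))
	// Output: /sys/fs/cgroup/nomad.slice/f2c8b9c0.redis.scope
}
```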
|
```go
func cgroupsCompatibleV1(t *testing.T) bool {
	if runtime.GOOS != "linux" {
		return false
```
Unreachable given we have the build flag on this file, but maybe ok to leave in anyways if you think it makes the code clearer?
removed in favor of a comment reminding of the build tag
```go
// GetCgroupParent returns the mount point under the root cgroup in which Nomad
// will create cgroups. If parent is not set, an appropriate name for the version
// of cgroups will be used.
func GetCgroupParent(parent string) string {
```
Is this ever not `""`?
Good catch! May as well make use of it and clean up direct calls to getParentV1/V2.
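For context, the shape implied by the doc comment is roughly the following. This is a sketch rather than the PR's code; the version flag and the default names are assumed for illustration (only `nomad.slice` is named in this PR).

```go
package main

import "fmt"

// All identifiers below are assumed for illustration; only the "nomad.slice"
// default is named in this PR.
var useV2 = false // stand-in for the client's cgroups v2 detection

const (
	defaultParentV1 = "/nomad"      // assumed v1 default
	defaultParentV2 = "nomad.slice" // v2 default named in this PR
)

// getCgroupParent falls back to a per-version default when no parent is
// configured, matching the behavior the doc comment describes.
func getCgroupParent(parent string) string {
	if parent != "" {
		return parent
	}
	if useV2 {
		return defaultParentV2
	}
	return defaultParentV1
}

func main() {
	fmt.Println(getCgroupParent("")) // "/nomad" (or "nomad.slice" when useV2)
}
```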
```diff
 if f.lastState == cgroupAvailable {
-	f.logger.Info("cgroups are unavailable")
+	f.logger.Warn("cgroups are now unavailable")
```
That seems like a bad situation to be in! 😀 But it looks like you've got the cpuset fixer handling that as gracefully as we can, so 👍
```go
// Due to Docker not allowing the configuration of the full cgroup path, we must
// manually fix the cpuset values for all docker containers continuously, as the
// values will change as tasks of any driver using reserved cores are started and
// stopped, changing the size of the remaining shared cpu pool.
```
Just for my clarity, in the case of #11705 with cgroups v2: any child processes will be in the same cgroup as their parent that Docker starts. So although they won't be pinned to the right CPU for a very brief window at startup, they'll all get moved together to the correct `cpuset`, because we're not moving processes between cgroups, just changing the cgroup in place. Is my understanding right?
Yup! And actually this is probably worth amending the RFC and reconsidering. Although this behavior is most similar to the way v1 works, I don't see why it wouldn't be worth just setting the cpuset on the docker task config on initial startup now. The drawback is I don't think we have the plumbing to get the `shared` cores - the initial value would only be the `reserved` cores requested by the task resources.
It seems like it could be easier for us to understand that way, for sure. I was going to say that some applications fingerprint cores at startup and not again afterwards (ex. `GOMAXPROCS`). But they could potentially take a hit either way: from not having enough threads if we have the initial value only be `reserved`, or from contention if we don't. I suspect if you care about this, you probably also need to care enough to set the values right for the application as well.
Our trading infrastructure reads the cpus in the cgroup on startup and indexes into the cpuset to pin threads to specific cores. If the cgroup gets modified after we've done this thread <-> core pinning, I'm not entirely sure what happens, but it's bad: either threads will still be pinned to CPUs outside of the allocated cgroup, or it just breaks altogether.

cf: #12374
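For reference, applications that do this kind of pinning typically parse the kernel's cpuset list format (e.g. `0-3,7`) from `cpuset.cpus` at startup. A minimal sketch of that parsing; this is a hypothetical helper, not code from this PR:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCpuset expands the kernel cpuset list format, e.g. "0-3,7", into
// individual CPU ids. Minimal sketch; real parsers handle more edge cases.
func parseCpuset(s string) ([]int, error) {
	var cpus []int
	for _, part := range strings.Split(strings.TrimSpace(s), ",") {
		if part == "" {
			continue
		}
		lo, hi, isRange := strings.Cut(part, "-")
		start, err := strconv.Atoi(lo)
		if err != nil {
			return nil, err
		}
		end := start
		if isRange {
			if end, err = strconv.Atoi(hi); err != nil {
				return nil, err
			}
		}
		for cpu := start; cpu <= end; cpu++ {
			cpus = append(cpus, cpu)
		}
	}
	return cpus, nil
}

func main() {
	fmt.Println(parseCpuset("0-3,7")) // [0 1 2 3 7] <nil>
}
```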
```go
func (cf *cpusetFixer) fix(c coordinate) {
	source := c.NomadCgroup()
	destination := c.DockerCgroup()
	if err := cgutil.CopyCpuset(source, destination); err != nil {
		cf.logger.Debug("failed to copy cpuset", "err", err)
	}
}
```
I really like the design here where we didn't end up plumbing any management code into the driver, but are just copying the cpu sets between the Nomad-managed cgroup and the Docker-managed one.
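For readers curious what the copy amounts to: conceptually it is a read of `cpuset.cpus` from the Nomad-managed cgroup and a write into the Docker-managed one. A hedged sketch follows; this is not necessarily `cgutil.CopyCpuset`'s actual implementation, and the Docker cgroup path in the usage example is assumed.

```go
package main

import (
	"os"
	"path/filepath"
)

// copyCpuset mirrors the cpuset.cpus value from one cgroup directory to
// another. Sketch only; the real helper may also handle mems, errors, etc.
func copyCpuset(source, destination string) error {
	data, err := os.ReadFile(filepath.Join(source, "cpuset.cpus"))
	if err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(destination, "cpuset.cpus"), data, 0o644)
}

func main() {
	_ = copyCpuset(
		"/sys/fs/cgroup/nomad.slice/f2c8b9c0.redis.scope",
		"/sys/fs/cgroup/system.slice/docker-abc123.scope", // assumed path
	)
}
```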
Still waiting to hear back on moby/moby#43363. If that works, all this goes away!
```go
return false, fmt.Errorf("Not a member of the alloc's cgroup: expected=...:/nomad/... -- found=%q", line)

// Skip rdma subsystem; rdma was added in most recent kernels and libcontainer/docker
// don't isolate it by default.
if strings.Contains(line, ":rdma:") || strings.Contains(line, "::") {
```
Should we skip `:misc:` here too?
Yup
```go
// github actions freezer cgroup
acceptable = append(acceptable, ":freezer:/actions_job")
```
Oof, I imagine you discovered this experimentally and it wasn't documented? Probably nothing we can do about it but some day that's going to change out from under us and break a bunch of tests. So at least you've documented it! 😀
Heh yeah, it's really got me ~~concerned~~ curious how GHA works under the hood ...
This looks great @shoenig, nice work! I've left a few small comments and questions.
```go
create(t, source)
defer cleanup(t, source)
```
Nitpicky: the `mgr.Apply` call isn't an atomic change. It can create the paths before the leaf successfully and then return an error for its final step. So if we call `require.NoError` in `create`, we may never call the cleanup function. Maybe `create` can clean itself up on failure, and then do a `require.NoError` to fail the test?
good catch, fixed

```go
	"github.com/stretchr/testify/require"
)

// Note: these tests need to run on GitHub Actions runners with only 2 cores.
```
Is it "at least 2 cores"? It'd be good to have a `testutil` for skipping this based on the number of cores, because otherwise folks might end up running on single-core Vagrant boxes and failing unexpectedly.
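Something along these lines would do it; a sketch of the kind of `testutil` helper being suggested (the name is assumed, not necessarily what landed in the PR):

```go
package testutil

import (
	"runtime"
	"testing"
)

// MinimumCores skips the test unless the host exposes at least n CPU cores.
// Hypothetical helper illustrating the suggestion above.
func MinimumCores(t *testing.T, n int) {
	t.Helper()
	if runtime.NumCPU() < n {
		t.Skipf("test requires at least %d cores, have %d", n, runtime.NumCPU())
	}
}
```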
good idea, added
In `client/lib/cgutil/cgutil_linux.go`:
```go
// CgroupScope returns the name of the scope for Nomad's managed cgroups for
// the given allocID and task.
//
// e.g. "<allocID>-<task>.scope"
```
I think we want a dot to match what we've done in the code everywhere:

```diff
-// e.g. "<allocID>-<task>.scope"
+// e.g. "<allocID>.<task>.scope"
```
fixed!
```go
// identity is the "<allocID>.<taskName>" string that uniquely identifies an
// individual instance of a task within the flat cgroup namespace
type identity string
```
I ❤️ type-aliased IDs
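For illustration, a type like this is typically composed and decomposed with small helpers. These are hypothetical (not code from the PR), and they assume alloc IDs contain no dots:

```go
package main

import (
	"fmt"
	"strings"
)

// identity mirrors the type above: "<allocID>.<taskName>".
type identity string

// makeID composes an identity from its parts; hypothetical helper.
func makeID(allocID, taskName string) identity {
	return identity(allocID + "." + taskName)
}

// split decomposes an identity, cutting at the first dot since alloc IDs
// (UUIDs) contain none.
func (id identity) split() (allocID, taskName string) {
	allocID, taskName, _ = strings.Cut(string(id), ".")
	return allocID, taskName
}

func main() {
	id := makeID("f2c8b9c0", "redis")
	a, task := id.split()
	fmt.Println(id, a, task)
}
```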
```go
// rootless is (for now) always false; Nomad clients require root, so we
// assume to not need to do the extra plumbing for rootless cgroups.
rootless = false
```
It looks like the version of libcontainer we have would return an error if we passed `rootless = true` anyways. And more recent versions of `libcontainer/cgroups/fs2` drop this parameter entirely. Not sure whether we want to update the comment to reflect we know that it's safe to remove when we bump versions of libcontainer?
opened and noted in #12372
```go
// null represents nothing
var null = nothing{}
```
I like what you're doing here but calling it `null` means we do `c.sharing[id] = null` to add an ID to the sharing set, which reads almost the opposite of what you're going for. Maybe if it were `c.sharing[id] = ok` or something like that?

(What I wouldn't do for a real set type in the stdlib!)
renamed `null` to `present` to be more clear

> (What I wouldn't do for a real set type in the stdlib!)

+10000000000000000000000
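For anyone following along, this is the empty-struct set idiom under discussion; a minimal self-contained sketch:

```go
package main

import "fmt"

// nothing is a zero-byte type; map[K]nothing stores only membership.
type nothing struct{}

// present is the value written on insert (the name this thread settled on).
var present = nothing{}

func main() {
	sharing := map[string]nothing{}
	sharing["f2c8b9c0.redis"] = present // add to the set
	_, ok := sharing["f2c8b9c0.redis"]  // membership test
	fmt.Println(ok)                     // true
	delete(sharing, "f2c8b9c0.redis")   // remove from the set
}
```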
```go
// We avoid removing a cgroup if it still contains a PID, as the cpuset manager
// may be initially empty on a Nomad client restart.
```
I don't think we need to do it for this PR, but it might be nice if we can eventually figure out a way to clean up stray cgroup entries. The allocdir hook is the only hook that comes before the cgroup hook, but if its postrun hook throws an error I think we'll never clean up the cgroup entry for that PID?
This is a good point; it's a similar problem for other resources like networks too, right?
```go
if taskInfo.Error != nil {
	break
}

// ...

timer.Reset(100 * time.Millisecond)
```
We could move this into the `timer.C` case below, I think. Or do we pretty much always need the initial 100ms?
fixed
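For clarity, the suggested shape re-arms the timer only after each tick is handled. A hedged sketch under assumed names (`pollUntilDone` and `poll` are stand-ins, not the PR's identifiers):

```go
package main

import (
	"context"
	"time"
)

// pollUntilDone resets the timer inside the timer.C case, so the 100ms delay
// applies between polls rather than unconditionally before the select.
func pollUntilDone(ctx context.Context, poll func() bool) {
	timer := time.NewTimer(100 * time.Millisecond)
	defer timer.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-timer.C:
			if poll() { // stand-in for the real per-tick work
				return
			}
			timer.Reset(100 * time.Millisecond)
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	pollUntilDone(ctx, func() bool { return false })
}
```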