cpuset: no space left on device
#23405
Comments
Hi @rodrigol-chan! Sorry to hear you're running into trouble. The error you're getting here is particularly weird:
We're writing to the
It happened again just now, on a different machine.
This Nomad client configuration now looks relevant:

client {
  gc_max_allocs           = 300
  gc_disk_usage_threshold = 80
}

And we currently have over 300 allocations:
Nomad seems to be keeping a lot of tmpfs mounts around even after the allocations are no longer running. I'm not sure if that's by design.
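A quick way to count how many of those are still mounted on a client is something like the following sketch (assuming a data_dir of /var/lib/nomad; adjust the path to match the client config):

# count tmpfs mounts that live under Nomad's allocation directories
mount -t tmpfs | grep -c '/var/lib/nomad/alloc'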
For extra context: the issue seems new with the 1.7.x upgrade. We ran this configuration on 1.6.x for about 8 months with no similar issues.
Thanks for that extra info @rodrigol-chan. Even with that large number of allocs, I'd think you'd be ok until you get to 65535 inodes. I'll dig into that a little further to see if there's something more going on.
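For a rough sense of how close a client is to that sort of ceiling, a sketch like this is enough (each cgroup is a directory here, and each one also carries a number of interface files, so the directory count is only a lower bound on inode usage):

# count cgroup directories under the cgroup2 mount
find /sys/fs/cgroup -type d | wc -l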
The mounts are left in place until the allocation is GC'd on the client. We do that so that you can debug failed allocations.
The issue still happens as of 1.8.3. Is there anything we can do to help troubleshoot this?
Hi @rodrigol-chan, sorry, I haven't been able to circle back to this and I'm currently swamped trying to land some work for our 1.9 beta next week. I suspect this is platform-specific. I think you'll want to look into whether there's anything in the host configuration that could be limiting the size of those virtual FS directories.
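A few host-side limits that are easy to check, as a sketch (these are just the obvious cgroup v2 knobs, not an exhaustive list):

# per-subtree limits on how many cgroups may exist below nomad.slice
cat /sys/fs/cgroup/nomad.slice/cgroup.max.descendants
cat /sys/fs/cgroup/nomad.slice/cgroup.max.depth
# how the cgroup2 filesystem is mounted
findmnt -t cgroup2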
Hi @rodrigol-chan! Just wanted to check in so you don't think I've forgotten this issue. I re-read through your initial report to see if there were any clues I missed.
Even ignoring the errors you're seeing, that's got to be a bug all by itself. These should never overlap. Even though we can't write to the two files atomically, we always remove from the source first and then write to the destination. So in that tiny race you should see a missing CPU, but not one counted twice. I'll look for any place where there's potentially another race condition that isn't handled correctly.
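To make that ordering concrete, here is a sketch (not Nomad's actual code; the paths and core numbers just mirror the reserve/share values quoted elsewhere in this thread) of moving core 31 from the share slice to the reserve slice. Because the source is shrunk before the destination grows, a concurrent reader can briefly see core 31 in neither file, but should never see it in both:

# shrink the source cpuset first...
echo 4-30 > /sys/fs/cgroup/nomad.slice/share.slice/cpuset.cpus
# ...then grow the destination cpuset
echo 0-3,31 > /sys/fs/cgroup/nomad.slice/reserve.slice/cpuset.cpus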
You have other allocations on the same host that do use core constraints though? If not, we're writing an empty value to the cgroup, in which case I found this Stack Exchange post which describes that scenario but has no answer. 🤦 I managed to dig up a few old issues that suggest that if… Also, I wanted to see if I could get this error outside of Nomad by echoing a bad input to the cgroup file, and wasn't able to get that same error.
I did get some interesting (but different) errors trying to write to the
One more thing I'd like you to try is the following, to make sure we've counted the cgroups correctly when trying to figure out whether it's the inodes issue:
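Judging from the output echoed back in the next comment, the requested commands were along these lines (an assumption, reconstructed from that reply):

# assumed; matches the output shown in the reply below
find /sys/fs/cgroup -depth -type d | wc -l
find /sys/fs/cgroup/nomad.slice -depth -type d | wc -l
head /sys/fs/cgroup/nomad.slice/cpuset.cpus /sys/fs/cgroup/nomad.slice/reserve.slice/cpuset.cpus /sys/fs/cgroup/nomad.slice/share.slice/cpuset.cpus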
That's correct.
Just happened again:

# find /sys/fs/cgroup -depth -type d | wc -l
81
# find /sys/fs/cgroup/nomad.slice -depth -type d | wc -l
29
# head /sys/fs/cgroup/nomad.slice/cpuset.cpus /sys/fs/cgroup/nomad.slice/reserve.slice/cpuset.cpus /sys/fs/cgroup/nomad.slice/share.slice/cpuset.cpus
==> /sys/fs/cgroup/nomad.slice/cpuset.cpus <==
0-31
==> /sys/fs/cgroup/nomad.slice/reserve.slice/cpuset.cpus <==
0-3
==> /sys/fs/cgroup/nomad.slice/share.slice/cpuset.cpus <==
4-31

Log output:
It doesn't look like the CPUs overlapped this time. The number of dying descendants is curious; I wonder if it's related:

# head /sys/fs/cgroup/nomad.slice/cgroup.stat /sys/fs/cgroup/nomad.slice/reserve.slice/cgroup.stat /sys/fs/cgroup/nomad.slice/share.slice/cgroup.stat
==> /sys/fs/cgroup/nomad.slice/cgroup.stat <==
nr_descendants 28
nr_dying_descendants 2356
==> /sys/fs/cgroup/nomad.slice/reserve.slice/cgroup.stat <==
nr_descendants 1
nr_dying_descendants 78
==> /sys/fs/cgroup/nomad.slice/share.slice/cgroup.stat <==
nr_descendants 25
nr_dying_descendants 2278
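If it is related, one way to see whether dying cgroups keep piling up between failures is to sample the counters periodically, roughly like this sketch (the interval and the choice of slices are arbitrary):

# print nomad.slice cgroup.stat counters once a minute, with timestamps
while sleep 60; do
  date
  grep -H . /sys/fs/cgroup/nomad.slice/cgroup.stat /sys/fs/cgroup/nomad.slice/*.slice/cgroup.stat
done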
Can you confirm whether the
I did look at that at failure time and from memory it was at
I can't find any
I'll double-check.
Just happened again. (It has been happening strangely often lately.) Here are the values requested:

$ head /sys/fs/cgroup/nomad.slice/reserve.slice/cgroup.max.descendants
max
$ head /sys/fs/cgroup/nomad.slice/reserve.slice/cgroup.stat
nr_descendants 1
nr_dying_descendants 11
$ head /sys/fs/cgroup/nomad.slice/reserve.slice/cpuset.cpus
0-3
$ head /sys/fs/cgroup/nomad.slice/reserve.slice/cpuset.mems
$

This doesn't look like it should be possible, though:
It might just be an artifact of how the data is collected since I don't think it's possible to do an atomic snapshot of cgroups. All Nomad cgroups
Hi @rodrigol-chan - just to clarify, is this only happening on this one specific node? Are there any tasks still running on this node that were originally created before the upgrade to Nomad 1.7? Has the node been rebooted since the upgrade to Nomad 1.7?
For some additional context, we've been investigating to figure out the circumstances in which the kernel can return this "no space left on device" error in the first place. That error is referred to as ENOSPC in the kernel source.
Here's the relevant section, with a helpful comment:

/*
 * Cpusets with tasks - existing or newly being attached - can't
 * be changed to have empty cpus_allowed or mems_allowed.
 */
ret = -ENOSPC;
if ((cgroup_is_populated(cur->css.cgroup) || cur->attach_in_progress)) {
    if (!cpumask_empty(cur->cpus_allowed) &&
        cpumask_empty(trial->cpus_allowed))
        goto out;
    if (!nodes_empty(cur->mems_allowed) &&
        nodes_empty(trial->mems_allowed))
        goto out;
}

So that suggests that we're somehow ending up in a state where the cpuset is being emptied of allowed cpus or mems while the task is still live. That's the source of @shoenig's follow-up questions above.
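Reading that check literally, the failure needs a populated cgroup whose cpuset.cpus (or cpuset.mems) goes from non-empty to empty. A hand-rolled way to poke at that path might look like the sketch below. This is an assumption about how to trigger the check, not something from this thread: it needs root, assumes cgroup v2 mounted at /sys/fs/cgroup, and may be rejected differently on other kernels (which would itself be informative):

# assumed reproduction attempt, not taken from this thread
# make sure the cpuset controller is available to children of the root cgroup
echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
# create a throwaway cgroup with an explicit cpuset and a live task in it
mkdir /sys/fs/cgroup/enospc-test
echo 0 > /sys/fs/cgroup/enospc-test/cpuset.cpus
sleep 300 &
echo $! > /sys/fs/cgroup/enospc-test/cgroup.procs
# now try to empty the cpuset while the task is still there;
# per the check above this should fail with "No space left on device"
echo "" > /sys/fs/cgroup/enospc-test/cpuset.cpus
# cleanup
kill $!; wait $! 2>/dev/null; rmdir /sys/fs/cgroup/enospc-test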
No, it happens on other nodes too, though I just noticed that it only happens on nodes where we allow periodic jobs to run. The nodes where we do not allow periodic jobs have the exact same configuration as the ones where we do, with the difference that they are preemptible instances, i.e. Google will arbitrarily power them off.
The oldest running allocation I see is from October 18th (7 days ago), whose job was submitted on October 14th. The oldest current/running job version is from 2024-06-25T14:38:16Z, a few days after the 1.7 upgrade, and the same job also contains the oldest job version that Nomad still remembers, dated 2024-05-02T09:22:13Z. The vast majority of jobs have been submitted this week since we do 20+ releases per day. All nodes run… I want to clarify that we're running the following kernel:

$ uname -a
Linux nomad-client-camel 6.8.0-1016-gcp #18~22.04.1-Ubuntu SMP Tue Oct 8 14:58:58 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
I'll add some more instrumentation to look at the process tree when the issue happens. Is there any more information I can produce?
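One possible shape for that instrumentation, as a sketch (the journal unit name, time window, and output paths are all assumptions to adapt):

#!/bin/sh
# run periodically (cron or a systemd timer); when the error shows up in the
# Nomad client's journal, capture the process tree and the cpuset state
if journalctl -u nomad --since '-5min' | grep -q 'no space left on device'; then
  ts=$(date +%s)
  ps -e -o pid,ppid,cgroup,comm > "/var/tmp/nomad-enospc-ps-$ts.txt"
  find /sys/fs/cgroup/nomad.slice -name 'cpuset.cpus*' -exec grep -H . {} + \
    > "/var/tmp/nomad-enospc-cpuset-$ts.txt"
fi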
Ok, in the 6.8 kernel there's a second place this error can appear (ref
I suspect we want to look at all the cpuset files at each level of the hierarchy:

for f in /sys/fs/cgroup/cpuset.*; do echo -n "$f :"; cat "$f"; done
for f in /sys/fs/cgroup/nomad.slice/cpuset.*; do echo -n "$f :"; cat "$f"; done
for f in /sys/fs/cgroup/nomad.slice/*.slice/cpuset.*; do echo -n "$f :"; cat "$f"; done
for f in /sys/fs/cgroup/nomad.slice/*.slice/*.scope/cpuset.*; do echo -n "$f :"; cat "$f"; done
I would definitely think you're onto something there, but I was never able to reproduce the specific error message we're seeing when simulating overlapping cores. I'm going to tag this issue for further attention, but we'll also see how #24304 helps once that lands.
Nomad version
Operating system and Environment details
Running Ubuntu 22.04 on Google Cloud in an n2d-standard-32 instance.
Issue
Alerts fired due to failed allocations. Upon investigation, I noticed the following log line:
Also interesting to observe is that, unlike in our other 1.7.x clients, there's overlap between the CPUs for the reserve and share slices:
Reproduction steps
Not clear how to reproduce. This happened on a single instance. All allocations that failed are from periodic jobs, running on the exec driver with no core constraints.

Expected Result
Allocations spawn successfully.
Actual Result
Allocations failed to spawn.
Nomad Client logs (if appropriate)
Nomad client configuration