cpuset: no space left on device
#23405
Comments
Hi @rodrigol-chan! Sorry to hear you're running into trouble. The error you're getting here is particularly weird:
We're writing to the
It happened again just now, on a different machine.
This Nomad client configuration now looks relevant:

client {
  gc_max_allocs           = 300
  gc_disk_usage_threshold = 80
}

And we currently have over 300 allocations:
Nomad seems to be keeping a lot of tmpfs mounts around even after the allocations are no longer running. I'm not sure if that's by design.
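A quick way to count how many of those are still mounted on a client is something like the following sketch (assuming a data_dir of /var/lib/nomad; adjust the path to match the client config):

# count tmpfs mounts that live under Nomad's allocation directories
mount -t tmpfs | grep -c '/var/lib/nomad/alloc'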
For extra context: the issue seems new with the 1.7.x upgrade. We ran this configuration on 1.6.x for about 8 months with no similar issues.
Thanks for that extra info @rodrigol-chan. Even with that large number of allocs, I'd think you'd be ok until you get to 65535 inodes. I'll dig into that a little further to see if there's something more going on.
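For a rough sense of how close a client is to that sort of ceiling, a sketch like this is enough (each cgroup is a directory here, and each one also carries a number of interface files, so the directory count is only a lower bound on inode usage):

# count cgroup directories under the cgroup2 mount
find /sys/fs/cgroup -type d | wc -l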
The mounts are left in place until the allocation is GC'd on the client. We do that so that you can debug failed allocations.
The issue still happens as of 1.8.3. Is there anything we can do to help troubleshoot this?
Hi @rodrigol-chan, sorry, I haven't been able to circle back to this and I'm currently swamped trying to land some work for our 1.9 beta next week. I suspect this is platform-specific. I think you'll want to look into whether there's anything in the host configuration that could be limiting the size of those virtual FS directories.
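A few host-side limits that are easy to check, as a sketch (these are just the obvious cgroup v2 knobs, not an exhaustive list):

# per-subtree limits on how many cgroups may exist below nomad.slice
cat /sys/fs/cgroup/nomad.slice/cgroup.max.descendants
cat /sys/fs/cgroup/nomad.slice/cgroup.max.depth
# how the cgroup2 filesystem is mounted
findmnt -t cgroup2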
Hi @rodrigol-chan! Just wanted to check in so you don't think I've forgotten this issue. I re-read through your initial report to see if there were any clues I missed.
Even ignoring the errors you're seeing, that's got to be a bug all by itself. These should never overlap. Even though we can't write to the two files atomically, we always remove from the source first and then write to the destination. So in that tiny race you should see a missing CPU, but not one counted twice. I'll look for any place where there's potentially another race condition that isn't handled correctly.
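To make that ordering concrete, here is a sketch (not Nomad's actual code; the paths and core numbers just mirror the reserve/share values quoted elsewhere in this thread) of moving core 31 from the share slice to the reserve slice. Because the source is shrunk before the destination grows, a concurrent reader can briefly see core 31 in neither file, but should never see it in both:

# shrink the source cpuset first...
echo 4-30 > /sys/fs/cgroup/nomad.slice/share.slice/cpuset.cpus
# ...then grow the destination cpuset
echo 0-3,31 > /sys/fs/cgroup/nomad.slice/reserve.slice/cpuset.cpus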
You have other allocations on the same host that do use core constraints though? If not, we're writing an empty value to the cgroup, in which case I found this Stack Exchange post which describes that scenario but has no answer. 🤦 I managed to dig up a few old issues that suggest that if… Also, I wanted to see if I could get this error outside of Nomad by echoing a bad input to the cgroup file, and wasn't able to get that same error.
I did get some interesting (but different) errors trying to write to the
One more thing I'd like you to try is the following, to make sure we've counted the cgroups correctly when trying to figure out whether it's the inodes issue:
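Judging from the output echoed back in the next comment, the requested commands were along these lines (an assumption, reconstructed from that reply):

# assumed; matches the output shown in the reply below
find /sys/fs/cgroup -depth -type d | wc -l
find /sys/fs/cgroup/nomad.slice -depth -type d | wc -l
head /sys/fs/cgroup/nomad.slice/cpuset.cpus /sys/fs/cgroup/nomad.slice/reserve.slice/cpuset.cpus /sys/fs/cgroup/nomad.slice/share.slice/cpuset.cpus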
That's correct.
Just happened again:

# find /sys/fs/cgroup -depth -type d | wc -l
81
# find /sys/fs/cgroup/nomad.slice -depth -type d | wc -l
29
# head /sys/fs/cgroup/nomad.slice/cpuset.cpus /sys/fs/cgroup/nomad.slice/reserve.slice/cpuset.cpus /sys/fs/cgroup/nomad.slice/share.slice/cpuset.cpus
==> /sys/fs/cgroup/nomad.slice/cpuset.cpus <==
0-31
==> /sys/fs/cgroup/nomad.slice/reserve.slice/cpuset.cpus <==
0-3
==> /sys/fs/cgroup/nomad.slice/share.slice/cpuset.cpus <==
4-31

Log output:
It doesn't look like the CPUs overlapped this time. The number of dying descendants is curious; I wonder if it's related:

# head /sys/fs/cgroup/nomad.slice/cgroup.stat /sys/fs/cgroup/nomad.slice/reserve.slice/cgroup.stat /sys/fs/cgroup/nomad.slice/share.slice/cgroup.stat
==> /sys/fs/cgroup/nomad.slice/cgroup.stat <==
nr_descendants 28
nr_dying_descendants 2356
==> /sys/fs/cgroup/nomad.slice/reserve.slice/cgroup.stat <==
nr_descendants 1
nr_dying_descendants 78
==> /sys/fs/cgroup/nomad.slice/share.slice/cgroup.stat <==
nr_descendants 25
nr_dying_descendants 2278
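If it is related, one way to see whether dying cgroups keep piling up between failures is to sample the counters periodically, roughly like this sketch (the interval and the choice of slices are arbitrary):

# print nomad.slice cgroup.stat counters once a minute, with timestamps
while sleep 60; do
  date
  grep -H . /sys/fs/cgroup/nomad.slice/cgroup.stat /sys/fs/cgroup/nomad.slice/*.slice/cgroup.stat
done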
Can you confirm whether the
I did look at that at failure time and from memory it was at
I can't find any
I'll double-check.
Just happened again. (It has been happening strangely often lately.) Here are the values requested:

$ head /sys/fs/cgroup/nomad.slice/reserve.slice/cgroup.max.descendants
max
$ head /sys/fs/cgroup/nomad.slice/reserve.slice/cgroup.stat
nr_descendants 1
nr_dying_descendants 11
$ head /sys/fs/cgroup/nomad.slice/reserve.slice/cpuset.cpus
0-3
$ head /sys/fs/cgroup/nomad.slice/reserve.slice/cpuset.mems
$

This doesn't look like it should be possible, though:
It might just be an artifact of how the data is collected since I don't think it's possible to do an atomic snapshot of cgroups. All Nomad cgroups
Hi @rodrigol-chan - just to clarify, is this only happening on this one specific node? Are there any tasks still running on this node that were originally created before the upgrade to Nomad 1.7? Has the node been rebooted since the upgrade to Nomad 1.7?
For some additional context, we've been investigating to figure out the circumstances in which the kernel can return this "no space left on device" error in the first place. That error is referred to as ENOSPC in the kernel source.
Here's the relevant section, with a helpful comment:

/*
 * Cpusets with tasks - existing or newly being attached - can't
 * be changed to have empty cpus_allowed or mems_allowed.
 */
ret = -ENOSPC;
if ((cgroup_is_populated(cur->css.cgroup) || cur->attach_in_progress)) {
    if (!cpumask_empty(cur->cpus_allowed) &&
        cpumask_empty(trial->cpus_allowed))
        goto out;
    if (!nodes_empty(cur->mems_allowed) &&
        nodes_empty(trial->mems_allowed))
        goto out;
}

So that suggests that we're somehow ending up in a state where the cpuset is being emptied of allowed cpus or mems while the task is still live. That's the source of @shoenig's follow-up questions above.
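Reading that check literally, the failure needs a populated cgroup whose cpuset.cpus (or cpuset.mems) goes from non-empty to empty. A hand-rolled way to poke at that path might look like the sketch below. This is an assumption about how to trigger the check, not something from this thread: it needs root, assumes cgroup v2 mounted at /sys/fs/cgroup, and may be rejected differently on other kernels (which would itself be informative):

# assumed reproduction attempt, not taken from this thread
# make sure the cpuset controller is available to children of the root cgroup
echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
# create a throwaway cgroup with an explicit cpuset and a live task in it
mkdir /sys/fs/cgroup/enospc-test
echo 0 > /sys/fs/cgroup/enospc-test/cpuset.cpus
sleep 300 &
echo $! > /sys/fs/cgroup/enospc-test/cgroup.procs
# now try to empty the cpuset while the task is still there;
# per the check above this should fail with "No space left on device"
echo "" > /sys/fs/cgroup/enospc-test/cpuset.cpus
# cleanup
kill $!; wait $! 2>/dev/null; rmdir /sys/fs/cgroup/enospc-test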
No, it happens on other nodes too, though I just noticed that it only happens on nodes where we allow periodic jobs to run. The nodes where we do not allow periodic jobs have the exact same configuration as the ones where we do, with the difference that they are preemptible instances, i.e. Google will arbitrarily power them off.
The oldest running allocation I see is from October 18th (7 days ago), whose job was submitted on October 14th. The oldest current/running job version is from 2024-06-25T14:38:16Z, a few days after the 1.7 upgrade, and the same job also contains the oldest job version that Nomad still remembers, dated 2024-05-02T09:22:13Z. The vast majority of jobs have been submitted this week since we do 20+ releases per day. All nodes run… I want to clarify that we're running the following kernel:

$ uname -a
Linux nomad-client-camel 6.8.0-1016-gcp #18~22.04.1-Ubuntu SMP Tue Oct 8 14:58:58 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
I'll add some more instrumentation to look at the process tree when the issue happens. Is there any more information I can produce?
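One possible shape for that instrumentation, as a sketch (the journal unit name, time window, and output paths are all assumptions to adapt):

#!/bin/sh
# run periodically (cron or a systemd timer); when the error shows up in the
# Nomad client's journal, capture the process tree and the cpuset state
if journalctl -u nomad --since '-5min' | grep -q 'no space left on device'; then
  ts=$(date +%s)
  ps -e -o pid,ppid,cgroup,comm > "/var/tmp/nomad-enospc-ps-$ts.txt"
  find /sys/fs/cgroup/nomad.slice -name 'cpuset.cpus*' -exec grep -H . {} + \
    > "/var/tmp/nomad-enospc-cpuset-$ts.txt"
fi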
Ok, in the 6.8 kernel there's a second place this error can appear (ref
I suspect we want to look at all the cpuset files at each level of the hierarchy:

for f in /sys/fs/cgroup/cpuset.*; do echo -n "$f :"; cat "$f"; done
for f in /sys/fs/cgroup/nomad.slice/cpuset.*; do echo -n "$f :"; cat "$f"; done
for f in /sys/fs/cgroup/nomad.slice/*.slice/cpuset.*; do echo -n "$f :"; cat "$f"; done
for f in /sys/fs/cgroup/nomad.slice/*.slice/*.scope/cpuset.*; do echo -n "$f :"; cat "$f"; done
I would definitely think you're onto something there, but I was never able to reproduce the specific error message we're seeing when simulating overlapping cores. I'm going to tag this issue for further attention, but we'll also see how #24304 helps once that lands.
Nomad version
Operating system and Environment details
Running Ubuntu 22.04 on Google Cloud in an n2d-standard-32 instance.
Issue
Alerts fired due to failed allocations. Upon investigation, I noticed the following log line:
Also interesting to observe is that, unlike in our other 1.7.x clients, there's overlap between the CPUs for the reserve and share slices:
Reproduction steps
Not clear how to reproduce. This happened on a single instance. All allocations that failed are from periodic jobs, running on the exec driver with no core constraints.

Expected Result
Allocations spawn successfully.
Actual Result
Allocations failed to spawn.
Nomad Client logs (if appropriate)
Nomad client configuration