Runaway Nomad process after Nomad client reboot #5984
Comments
Martin, there are a couple of fixes around client rebooting that will be included in 0.9.4, and we're planning to push a release candidate today. The issue where allocations come back as lost may be related to #5890, and you may also be encountering the GC failure addressed in #5905. Please try the RC when it's available! What does your client configuration look like, especially the […]
Hi Lang, OK, I'll wait for the 0.9.4 RC, try it out, and report back. As for my node config, all I have in the […]

I've gone with the default GC values, as I don't usually tweak knobs I don't fully understand. Should I have overridden some of the defaults?
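For reference, my understanding of the GC-related knobs in the `client` stanza, with what I believe are the defaults (worth double-checking against the client configuration docs for your Nomad version):

```hcl
client {
  enabled = true

  # GC tuning knobs -- values shown are the documented defaults as I
  # understand them; I have not overridden any of these.
  gc_interval              = "1m" # how often the client runs alloc GC
  gc_max_allocs            = 50   # max allocs kept before GC kicks in
  gc_parallel_destroys     = 2    # allocs destroyed concurrently
  gc_disk_usage_threshold  = 80   # percent disk usage that triggers GC
  gc_inode_usage_threshold = 70   # percent inode usage that triggers GC
}
```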
@notnoop I installed 0.9.4-rc1 on a single one of my clients and then restarted that client without first draining it. Upon reboot the client started many allocations until memory (2 GB RAM + 2 GB swap) ran out. I managed to stop the process; I'll have to try another reboot once allocs build up in the data_dir again.
@notnoop Some more details... I let allocs build up overnight on the 0.9.4-rc1 client. When I checked this morning, the […] I then proceeded to […] One of the first Nomad errors I saw when reviewing the systemd journal was this one: […]

I saw 50 of these errors (one for each alloc in the data_dir). Two lines below, I saw this: […]

I don't know where that high number (3172) comes from. I then have 796 events in this style: […]

Mixed in with 625 lines in this style: […]

Then it's 273 lines with: […]

Which finally seems to all come crashing down with: […]

and it continues for a while.
@notnoop Am I the only one to have reported such an issue?
@radcool Thank you so much for the detailed messages and debugging; that is very helpful. I'm sorry I got sidetracked with other work. This is one of our high-priority issues to address in 0.10, and I'll investigate it and follow up with questions as they come.
No worries @notnoop. I thought perhaps this was an issue on my side, not seen anywhere else. Thanks.
This fixes a bug where allocs that have been GCed get re-run after the client is restarted; a heavily used client may launch thousands of allocs on startup and get killed. The bug is that an alloc runner destroyed by GC remains in the client's alloc runner set, and it periodically gets persisted until the alloc is GCed by the server. During that time, the client state DB contains the alloc but neither its individual task statuses nor its completed state, so on restart the client assumes the alloc is in pending state and re-runs it. Here, we fix it by ensuring that destroyed alloc runners don't persist any alloc to the state DB. This is a short-term fix; we should consider revamping client state management, since storing alloc and task information non-transactionally and non-atomically, concurrently while the alloc runner is running and potentially changing state, is a recipe for bugs. Fixes #5984. Related to #5890.
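To make the mechanism concrete, here is a toy Go sketch, with invented types and names rather than Nomad's actual source, of the stale-persist problem described above and the guard the fix adds:

```go
package main

import (
	"fmt"
	"sync"
)

// Illustrative only -- not Nomad's real types. The point is the shape of
// the fix: once a runner has been destroyed by GC, periodic persistence
// becomes a no-op, so the state DB can never be repopulated with an
// alloc the server has already GCed.
type allocRunner struct {
	mu        sync.Mutex
	destroyed bool
	allocID   string
	stateDB   map[string]string // stand-in for the client's state DB
}

// Destroy marks the runner as GCed and removes its persisted state.
func (ar *allocRunner) Destroy() {
	ar.mu.Lock()
	defer ar.mu.Unlock()
	ar.destroyed = true
	delete(ar.stateDB, ar.allocID)
}

// persistState runs periodically while the runner lives. Before the fix,
// it could run after Destroy and write the alloc back without its task
// or completed state -- which a restarting client treats as "pending".
func (ar *allocRunner) persistState(snapshot string) {
	ar.mu.Lock()
	defer ar.mu.Unlock()
	if ar.destroyed {
		return // the fix: never persist a destroyed runner
	}
	ar.stateDB[ar.allocID] = snapshot
}

func main() {
	db := map[string]string{}
	ar := &allocRunner{allocID: "alloc-1", stateDB: db}
	ar.persistState("running")
	ar.Destroy()
	ar.persistState("stale write after GC") // dropped, not persisted
	fmt.Println(db)                         // map[] -- nothing to re-run on restart
}
```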
@notnoop I compiled a new Nomad binary from #6207 as you requested: […] installed it on a Nomad client, and rebooted that client. Unfortunately, once it came back up, both vCPUs shot up to 100% and memory rapidly started filling up. Perhaps I did something wrong, but whatever I did, I got the same behavior as before. It's a bit late here, but tomorrow I'll sift through the journal logs and report back.
@notnoop I did another reboot of the client running 0.10-dev this morning, and the runaway behavior did not occur this time. Upon re-reading the following text of #6207: "Here, we fix it by ensuring that destroyed alloc runners don't persist any alloc" I guess this is actually expected. Can you please confirm that going from 0.9.x to v0.10.0-dev will not immediately fix the issue after a process restart, since 0.10.0 will still try to process the allocs from the 0.9.x-created state DB, and that only once 0.10.0-dev is the running version will the issue no longer occur at subsequent reboots, since the allocs will no longer be persisted to the state DB under 0.10.0?
@radcool This reading is correct. The change doesn't recover from the already "corrupt" persisted state left by the earlier buggy client process; it ensures that state is stored correctly for future client restarts. I'll research options for recovering without locking up or slowing restores unnecessarily.
@notnoop Thanks for your help, by the way!
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Hello,
I've been experiencing an issue with Nomad and I'm not sure if it's a bug, or if I'm just abusing Nomad.
I have a Nomad cluster (0.9.3 everywhere, on CentOS 7) with 3 servers and 6 clients (3 running exec and 3 running raw_exec). All are VMs with 2 CPUs and 2 GB of memory. That might not seem like a lot (and maybe that's my problem), but I'm not running CPU-intensive workloads, so I assumed this would be OK.
I run a few micro web services, a system task (Fabio), and some periodic batch jobs, including a few parameterized periodic batch jobs that run every minute using `raw_exec`. So although I don't consider the clients taxed by any means, the list of completed jobs reported by `nomad status` is quite high.
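For concreteness, one of these jobs looks roughly like the sketch below (the job name, schedule, and command are made up, and I've left out the parameterized part):

```hcl
job "minutely-batch" {
  datacenters = ["dc1"]
  type        = "batch"

  # Launch every minute; skip a launch if the previous one is still running.
  periodic {
    cron             = "* * * * *"
    prohibit_overlap = true
  }

  group "runner" {
    task "work" {
      driver = "raw_exec"

      config {
        command = "/usr/local/bin/do-work.sh"
      }

      resources {
        cpu    = 100 # MHz
        memory = 64  # MB
      }
    }
  }
}
```

Each minutely launch leaves behind a completed allocation, which is why the completed list grows so quickly.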
The problem is that if I restart a client node with a simple `reboot` (no draining beforehand), when it boots up again Nomad starts working like crazy, bringing both CPUs to 100% and eventually running out of memory, essentially taking that client out of commission. The only way I've then found to recover the client is to forcefully reset the VM, stop Nomad as soon as the VM has booted (and before Nomad goes out of control again), delete the contents of the `data_dir`, and restart Nomad again, thereby creating a new Nomad node.
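Spelled out, that recovery dance is roughly the following (the systemd unit name and `data_dir` path are assumptions from my setup; adjust to match yours):

```shell
# After forcefully resetting the VM, race to stop Nomad before it melts down
sudo systemctl stop nomad

# Wipe the client state so it comes back as a brand-new node
sudo rm -rf /opt/nomad/data/*

sudo systemctl start nomad
```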
Something weird I've also noticed is that before the reboot, `nomad status <job_id>` shows all periodic jobs as having one allocation with status `complete`, but after the reboot I sometimes spot jobs that now have two allocations: one with status `complete` and one with status `lost`, even though I'm pretty sure the job originally had only a single allocation with status `complete`.
So, have I run into a bug, hit a Nomad operational limit (abusing it), or am I simply doing things wrong and shooting myself in the foot?
Thanks,
-Martin