Crash on restart with 0.9.1 #5840
Comments
I have a feeling we're running into this as well - it's basically a death spiral that the Nomad agent gets into, and the only way to recover is to a) increase max files to a ridiculous value -- 80k was the last I tried and it pushed us through -- or b) remove the state file and start fresh, like you suggested. In our case, it looks like we have a lot of dangling allocations for complete periodic tasks, which Nomad tries to recover in some way or another on a restart. At least sometimes when this happens, I see a whole wave of messages like the below show up in the logs:
Sometimes we just see a Nomad start followed by a kill, and no other log messages (I suspect that's Go panicking/killing it?)
Any chance you're using periodic tasks and have a whole bunch that don't appear to be fully GC'd? (I ran across this upgrading from v0.9.1 to v0.9.3, and I have a feeling we saw it when moving 0.9.0 to 0.9.1 as well)
@chrisarcand you've hit it on the nose. We're running an obscene number of batch tasks, and we have good reason to believe that the GC is incomplete/broken. My env is on 0.9.3 but I still observe the crashes. On a hunch, would you check how big your state.db is? I suspect it's grown beyond 380M in size.
@the-maldridge So we have a potential lead then! 😬 Our state files are all pretty small - most are under 5MB, with a couple around ~60MB. I'm going to guess that has something to do with the number and frequency of batch jobs running, though.
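For anyone else comparing notes, here's a minimal sketch (not an official tool) of how you could check the client state database size before a restart. The path below is an assumption based on the default data_dir layout (`<data_dir>/client/state.db` under `/var/lib/nomad`); adjust it for your installation.

```go
// check_statedb.go - report the size of the Nomad client state database.
// The path is an assumed default; adjust it for your data_dir.
package main

import (
	"fmt"
	"os"
)

func main() {
	const stateDB = "/var/lib/nomad/client/state.db" // assumed default location

	info, err := os.Stat(stateDB)
	if err != nil {
		fmt.Fprintln(os.Stderr, "stat failed:", err)
		os.Exit(1)
	}
	fmt.Printf("%s is %.1f MiB\n", stateDB, float64(info.Size())/(1<<20))
}
```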
When an alloc runner prestart hook fails, the task runners aren't invoked and they remain in a pending state. This leads to terrible results, some of which are:
* Lockup in the GC process as reported in #5861
* Lockup in the shutdown process, as TR.Shutdown() waits for WaitCh to be closed
* The alloc not being restarted/rescheduled to another node (as it's still in a pending state)
* Unexpected restart of the alloc on a client restart, potentially days/weeks after the alloc's expected start time!
Here, we treat all tasks as failed if the alloc runner prestart hook fails. This fixes the lockups and permits the alloc to be rescheduled on another node. While it's desirable to retry the alloc runner on such failures, I opted to treat that as out of scope. I'm afraid of some subtleties about alloc and task runners and their idempotency that are better handled in a follow-up PR. This might be one of the root causes for #5840.
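For readers following along, here is a minimal, generic Go sketch of the pattern that commit describes: if prestart fails, mark every task as failed and still close the wait channel so GC and shutdown don't block. The `Runner` and `TaskState` names are illustrative only, not Nomad's actual types.

```go
// Generic sketch of "treat all tasks as failed if prestart fails".
// Not Nomad's real code; names and structure are simplified.
package main

import (
	"errors"
	"fmt"
)

type TaskState struct {
	Name   string
	Failed bool
}

type Runner struct {
	tasks  []*TaskState
	waitCh chan struct{}
}

func (r *Runner) prestart() error {
	// Imagine a hook failure here (e.g. hitting the open-file limit).
	return errors.New("prestart hook failed: too many open files")
}

// Run marks all tasks failed when prestart fails, instead of leaving
// them pending forever, and always closes waitCh so shutdown/GC can proceed.
func (r *Runner) Run() {
	defer close(r.waitCh) // Shutdown() waits on this; it must always close.

	if err := r.prestart(); err != nil {
		for _, t := range r.tasks {
			t.Failed = true // lets the scheduler reschedule the alloc elsewhere
		}
		fmt.Println("prestart failed, all tasks marked failed:", err)
		return
	}
	// ... start task runners as usual ...
}

func main() {
	r := &Runner{
		tasks:  []*TaskState{{Name: "web"}, {Name: "logshipper"}},
		waitCh: make(chan struct{}),
	}
	r.Run()
	<-r.waitCh // does not hang even though prestart failed
}
```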
I've investigated this and I found two significant contributing causes:
@the-maldridge It seems that at your scale of batch jobs, a percentage of allocs (hopefully a small one) failed to start and remain un-GC'd. While the client is running, these are mostly harmless. On a client restart, though, all of these wedged allocs get started at once, resulting in the crash. The two fixes above should mitigate the problem; we should still follow up on why the allocs failed to start in the first place and address any issues that come up there.
@notnoop Thanks for taking a look into this. Can you clarify what you mean by "failed to start"? I would be happy to look into this further to try and root-cause this hiccup. I also have a state db from an unhappy node saved and can run tests against it to find these unhappy allocs.
Thanks. In this case, I believe the logs would be most helpful, ideally from before the client restart, but given the load I suspect it would occur again after a new start. I would look for logs with the following error messages. We were able to reproduce the symptoms reported here by running lots of allocations on a single node with relatively small ulimit values and seeing [1]. [1] https://github.com/hashicorp/nomad/blob/v0.9.1/client/allocrunner/alloc_runner.go#L246
Interesting. It appears I overlooked the ulimit, as most of my prod machines seem to be holding the default ulimit of 1024. I will plan to increase this to something larger during the next maintenance window. Is there a Nomad-vetted value I should target?
@the-maldridge We've been recommending a high setting, like 65k, but we haven't aimed to find a lower bound. Generally, it's proportional to the number of tasks running and the requirements of those tasks. Related to this question, @schmichael just updated our systemd unit to recommend 65k instead of unlimited to mitigate denial-of-service risks in https://github.com/hashicorp/nomad/pull/5912/files .
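As a quick way to sanity-check a node against that recommendation, here's a minimal Linux-only sketch that reads the process's RLIMIT_NOFILE. The 65536 threshold simply mirrors the value discussed above; it isn't an official lower bound.

```go
// checklimit.go - print the current open-file limits and warn if the soft
// limit is below the ~65k value discussed in this thread (Linux only).
package main

import (
	"fmt"
	"syscall"
)

func main() {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		panic(err)
	}
	fmt.Printf("open-file limit: soft=%d hard=%d\n", lim.Cur, lim.Max)
	if lim.Cur < 65536 {
		fmt.Println("soft limit is below 65536; a busy Nomad client may exhaust it")
	}
}
```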
@notnoop - we're running into these problems pretty frequently at the moment (at least, what I described in my comment above, and what we think are related to the issues you mentioned). I'm considering building a patched v0.9.3 while we wait for v0.9.4 (unless you think that's going to be in the next couple of days). I was going to cherry-pick down the PRs you mentioned: #5890 and #5905, as well as #5865, which looks interesting. Are you aware of any other contributing factors/fixes I might want to merge in?
@chrisboulton we're running a patched 0.9.3 right now in staging; it is reasonably stable, but it will be some days before we promote it to prod.
did not intend to close.
@chrisarcand I'd suggest running a patched 0.9.3. master is currently stable AFAIK and contains many bug fixes; I'd suggest scanning the changelog for changes that may affect you.
Closing this ticket as it looks to have been addressed and we haven't heard an update to the contrary in the last month. Please re-open the ticket if that's not the case and/or you have an update. Thanks!
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Operating system and Environment details
Operating system is Alpine Linux 3.9.4, up to date as of this morning. We use the docker driver and have consul adjacent to docker on all machines. Nomad and Consul are installed on the metal with no containerization.
Issue
I drained a node, due to some tasks on it failing to register health checks, with the intent of rebooting it. As much as I hate shotgun debugging, it was the fastest option. Upon the machine coming back up, the Nomad client did not re-join the cluster. I logged in and observed that there were 50+ nomad logmon processes running. I stopped Nomad, pkilled the remaining processes, and then restarted it while tailing the logs. After approximately 12 seconds Nomad crashed and began rapidly leaking goroutines, logmon processes, and docker logs processes at a rate of around 2 every 5 seconds.
Reproduction steps
Not sure; I can't come up with a triggering action. I enabled drain state, then waited for the drain to complete, then rebooted. The only thing I can come up with is that state.db is almost 700M.
EDIT: I moved the state file out of the way and the node was able to restart and rejoin the cluster. I have a copy of this file if needed, but it's very large and really too big to email, and I don't want to attach it here.
Nomad Client logs (if appropriate)
Attached are the first 10k of the 60M log file from an attempt to restart the client.
compute1-nomad-client.log