Crash on restart with 0.9.1 #5840
Comments
I have a feeling we're running into this as well - it's basically a death spiral that the Nomad agent gets into, and the only way to recover is to a) increase max files to a ridiculous value -- 80k was the last I tried and it pushed us through -- or b) remove the state file and start fresh, like you suggested. In our case, it looks like we have a lot of dangling allocations for complete periodic tasks, which Nomad tries to recover in some way or another on a restart. At least sometimes when this happens, I see a whole wave of messages like the below show up in the logs:
Sometimes we just see a Nomad start followed by a kill, and no other log messages (I suspect that's Go panicking/killing it?)
Any chance you're using periodic tasks and have a whole bunch that don't appear to be fully GC'd? (I ran across this upgrading from v0.9.1 to v0.9.3, and I have a feeling we saw it when moving 0.9.0 to 0.9.1 as well)
@chrisarcand you've hit it on the nose. We're running an obscene number of batch tasks, and we have good reason to believe that the GC is incomplete/broken. My env is on 0.9.3 but I still observe the crashes. On a hunch, would you check how big your state.db is? I suspect it's grown beyond 380M in size.
@the-maldridge So we have a potential lead then! 😬 Our state files are all pretty small - most are under 5MB, with a couple around ~60MB. I'm going to guess that has something to do with the number and frequency of batch jobs running, though.
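For anyone else comparing notes, here's a minimal sketch (not an official tool) of how you could check the client state database size before a restart. The path below is an assumption based on the default data_dir layout (`<data_dir>/client/state.db` under `/var/lib/nomad`); adjust it for your installation.

```go
// check_statedb.go - report the size of the Nomad client state database.
// The path is an assumed default; adjust it for your data_dir.
package main

import (
	"fmt"
	"os"
)

func main() {
	const stateDB = "/var/lib/nomad/client/state.db" // assumed default location

	info, err := os.Stat(stateDB)
	if err != nil {
		fmt.Fprintln(os.Stderr, "stat failed:", err)
		os.Exit(1)
	}
	fmt.Printf("%s is %.1f MiB\n", stateDB, float64(info.Size())/(1<<20))
}
```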
When an alloc runner prestart hook fails, the task runners aren't invoked and they remain in a pending state. This leads to terrible results, some of which are:
* Lockup in the GC process as reported in #5861
* Lockup in the shutdown process, as TR.Shutdown() waits for WaitCh to be closed
* The alloc not being restarted/rescheduled to another node (as it's still in a pending state)
* Unexpected restart of the alloc on a client restart, potentially days/weeks after the alloc's expected start time!
Here, we treat all tasks as failed if the alloc runner prestart hook fails. This fixes the lockups and permits the alloc to be rescheduled on another node. While it's desirable to retry the alloc runner on such failures, I opted to treat that as out of scope. I'm afraid of some subtleties about alloc and task runners and their idempotency that are better handled in a follow-up PR. This might be one of the root causes for #5840.
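For readers following along, here is a minimal, generic Go sketch of the pattern that commit describes: if prestart fails, mark every task as failed and still close the wait channel so GC and shutdown don't block. The `Runner` and `TaskState` names are illustrative only, not Nomad's actual types.

```go
// Generic sketch of "treat all tasks as failed if prestart fails".
// Not Nomad's real code; names and structure are simplified.
package main

import (
	"errors"
	"fmt"
)

type TaskState struct {
	Name   string
	Failed bool
}

type Runner struct {
	tasks  []*TaskState
	waitCh chan struct{}
}

func (r *Runner) prestart() error {
	// Imagine a hook failure here (e.g. hitting the open-file limit).
	return errors.New("prestart hook failed: too many open files")
}

// Run marks all tasks failed when prestart fails, instead of leaving
// them pending forever, and always closes waitCh so shutdown/GC can proceed.
func (r *Runner) Run() {
	defer close(r.waitCh) // Shutdown() waits on this; it must always close.

	if err := r.prestart(); err != nil {
		for _, t := range r.tasks {
			t.Failed = true // lets the scheduler reschedule the alloc elsewhere
		}
		fmt.Println("prestart failed, all tasks marked failed:", err)
		return
	}
	// ... start task runners as usual ...
}

func main() {
	r := &Runner{
		tasks:  []*TaskState{{Name: "web"}, {Name: "logshipper"}},
		waitCh: make(chan struct{}),
	}
	r.Run()
	<-r.waitCh // does not hang even though prestart failed
}
```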
I've investigated this and I found two significant contributing causes:
@the-maldridge It seems that at your scale of batch jobs, a percentage of allocs (hopefully a small one) failed to start and remain un-GC'd. While the client is running, these are mostly harmless. On a client restart, though, all of these wedged allocs get started at once, resulting in the crash. The two fixes above should mitigate the problem; we should still follow up on why the allocs failed to start in the first place and address any issues that come up there.
@notnoop Thanks for taking a look into this. Can you clarify what you mean by "failed to start"? I would be happy to look into this further to try and root-cause this hiccup. I also have a state db from an unhappy node saved and can run tests against it to find these unhappy allocs.
Thanks. In this case, I believe the logs would be most helpful, ideally from before the client restart, but given the load I suspect it would occur again after a new start. I would look for logs with the following error messages. We were able to reproduce the symptoms reported here by running lots of allocations on a single node with relatively small ulimit values and seeing [1]. [1] https://github.com/hashicorp/nomad/blob/v0.9.1/client/allocrunner/alloc_runner.go#L246
Interesting. It appears I overlooked the ulimit, as most of my prod machines seem to be holding the default ulimit of 1024. I will plan to increase this to something larger during the next maintenance window. Is there a Nomad-vetted value I should target?
@the-maldridge We've been recommending a high setting, like 65k, but we haven't aimed to find a lower bound. Generally, it's proportional to the number of tasks running and the requirements of those tasks. Related to this question, @schmichael just updated our systemd unit to recommend 65k instead of unlimited to mitigate denial-of-service risks in https://github.com/hashicorp/nomad/pull/5912/files .
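As a quick way to sanity-check a node against that recommendation, here's a minimal Linux-only sketch that reads the process's RLIMIT_NOFILE. The 65536 threshold simply mirrors the value discussed above; it isn't an official lower bound.

```go
// checklimit.go - print the current open-file limits and warn if the soft
// limit is below the ~65k value discussed in this thread (Linux only).
package main

import (
	"fmt"
	"syscall"
)

func main() {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		panic(err)
	}
	fmt.Printf("open-file limit: soft=%d hard=%d\n", lim.Cur, lim.Max)
	if lim.Cur < 65536 {
		fmt.Println("soft limit is below 65536; a busy Nomad client may exhaust it")
	}
}
```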
@notnoop - we're running into these problems pretty frequently at the moment (at least, what I described in my comment above, and what we think are related to the issues you mentioned). I'm considering building a patched v0.9.3 while we wait for v0.9.4 (unless you think that's going to be in the next couple of days). I was going to cherry-pick down the PRs you mentioned: #5890 and #5905, as well as #5865, which looks interesting. Are you aware of any other contributing factors/fixes I might want to merge in?
@chrisboulton we're running a patched 0.9.3 right now in staging; it is reasonably stable, but it will be some days before we promote it to prod.
did not intend to close.
@chrisarcand I'd suggest running a patched 0.9.3. master is currently stable AFAIK and contains many bug fixes; I'd suggest scanning the changelog for changes that may affect you.
Closing this ticket as it looks to have been addressed and we haven't heard an update to the contrary in the last month. Please re-open the ticket if that's not the case and/or you have an update. Thanks!
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Operating system and Environment details
Operating system is Alpine Linux 3.9.4, up to date as of this morning. We use the docker driver and have consul adjacent to docker on all machines. Nomad and Consul are installed on the metal with no containerization.
Issue
I drained a node, due to some tasks on it failing to register health checks, with the intent of rebooting it. As much as I hate shotgun debugging, it was the fastest option. Upon the machine coming back up, the Nomad client did not re-join the cluster. I logged in and observed that there were 50+ nomad logmon processes running. I stopped Nomad, pkilled the remaining processes, and then restarted it while tailing the logs. After approximately 12 seconds Nomad crashed and began rapidly leaking goroutines, logmon processes, and docker logs processes at a rate of around 2 every 5 seconds.
Reproduction steps
Not sure; I can't come up with a triggering action. I enabled drain state, then waited for the drain to complete, then rebooted. The only thing I can come up with is that state.db is almost 700M.
EDIT: I moved the state file out of the way and the node was able to restart and rejoin the cluster. I have a copy of this file if needed, but it's very large and really too big to email, and I don't want to attach it here.
Nomad Client logs (if appropriate)
Attached are the first 10k of the 60M log file from an attempt to restart the client.
compute1-nomad-client.log