Crash on restart with 0.9.1 #5840

Closed
the-maldridge opened this issue Jun 14, 2019 · 14 comments

@the-maldridge

the-maldridge commented Jun 14, 2019

Nomad version

$ nomad version
Nomad v0.9.1

Operating system and Environment details

The operating system is Alpine Linux 3.9.4, up to date as of this morning. We use the Docker driver and run Consul alongside Docker on all machines. Nomad and Consul are installed on the metal with no containerization.

Issue

I drained a node because some tasks on it were failing to register health checks, with the intent of rebooting it. As much as I hate shotgun debugging, it was the fastest option. When the machine came back up, the Nomad client did not re-join the cluster. I logged in and observed that there were 50+ nomad logmon processes running. I stopped Nomad, pkilled the remaining processes, and then restarted it while tailing the logs. After approximately 12 seconds Nomad crashed and began rapidly leaking goroutines and logmon and docker logs processes, at a rate of around 2 every 5 seconds.
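
A crude way to watch the leak as it happens (a sketch; it assumes the helper processes show "nomad logmon" in their command line, which they did here):

# Count logmon helper processes; re-run every few seconds to see the leak rate
pgrep -f 'nomad logmon' | wc -l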

Reproduction steps

Not sure; I can't come up with a triggering action. I enabled drain state, waited for the drain to complete, then rebooted. The only other thing I can point to is that state.db is almost 700M.

EDIT: I moved the state file out of the way and the node was able to restart and rejoin the cluster. I have a copy of this file if needed, but it's far too large to email and I don't want to attach it here.
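
For the record, the recovery amounted to roughly this (a sketch; the path assumes data_dir = /var/lib/nomad from our nomad.hcl, and Alpine's OpenRC is shown where systemd users would run systemctl):

# Stop the client, move the bolt state database aside, and restart
rc-service nomad stop
mv /var/lib/nomad/client/state.db /var/lib/nomad/client/state.db.bak
rc-service nomad start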

Nomad Client logs (if appropriate)

Attached are the first 10k of the 60M log file from an attempt to restart the client.
compute1-nomad-client.log

@chrisboulton

I have a feeling we're running into this as well. It's basically a death spiral that the Nomad agent gets into, and the only ways to recover are a) increase the max open files limit to a ridiculous value (80k was the last value I tried, and it pushed us through), or b) remove the state file and start fresh, like you suggested.

In our case, it looks like we have a lot of dangling allocations for completed periodic tasks, which Nomad tries to recover in one way or another on a restart. At least some of the time when this happens, I see a whole wave of messages like the one below show up in the logs:

Jun 27 20:10:35 x nomad[18426]: 2019-06-27T20:10:35.732Z [WARN ] client.alloc_runner.task_runner.task_hook.logmon.nomad: timed out waiting for read-side of process output pipe to close: alloc_id=57e2f950-2eac-47af-6ba5-c4552bf5b2ba task=x @module=logmon timestamp=2019-06-27T20:10:35.732Z
$ grep client.alloc_runner.task_runner.task_hook /var/log/system.log | grep periodic | grep '19:57' | wc -l
9191

$ grep client.alloc_runner.task_runner.task_hook /var/log/system.log | grep -v periodic | grep '19:57' | wc -l
49

Sometimes we just see a Nomad start followed by a kill, and no other log messages (I suspect that's Go panicking/killing it?)

Jun 27 19:56:03 x systemd[1]: Starting nomad agent...
Jun 27 19:56:03 x systemd[1]: Started nomad agent.
Jun 27 19:56:03 x nomad[12835]: ==> Loaded configuration from /etc/nomad/nomad.hcl
Jun 27 19:56:03 x nomad[12835]: ==> Starting Nomad agent...
Jun 27 19:56:19 x systemd[1]: nomad.service: main process exited, code=killed, status=9/KILL

Any chance you're using periodic tasks and have a whole bunch that don't appear to be fully GC'd?

(I ran across this upgrading from v0.9.1 to v0.9.3, and I have a feeling we saw it when moving from 0.9.0 to 0.9.1 as well.)

@the-maldridge
Author

@chrisboulton you've hit it on the nose. We're running an obscene number of batch tasks, and we have good reason to believe that the GC is incomplete/broken. My env is on 0.9.3 but I still observe the crashes. On a hunch, would you check how big your state.db is? I suspect it's grown beyond 380M in size.

@chrisboulton

@the-maldridge So we have a potential lead then! 😬

Our state files are all pretty small: most are under 5MB, with a couple around 60MB. I suspect that has something to do with the number and frequency of batch jobs we're running, though.
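
For comparison, a quick way to check the on-disk size (the path is an assumption; substitute whatever data_dir your nomad.hcl sets), plus a forced garbage-collection pass on the servers in case old periodic allocs are lingering:

# Size of the client's bolt state database
du -h /var/lib/nomad/client/state.db

# Force a server-side GC cycle of jobs, evals, allocs and nodes
nomad system gc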

notnoop pushed a commit that referenced this issue Jun 29, 2019
When an alloc runner prestart hook fails, the task runners aren't invoked
and they remain in a pending state.

This leads to terrible results, some of which are:
* Lockup in GC process as reported in #5861
* Lockup in shutdown process as TR.Shutdown() waits for WaitCh to be closed
* Alloc not being restarted/rescheduled to another node (as it's still in
  pending state)
* Unexpected restart of alloc on a client restart, potentially days/weeks after
  alloc expected start time!

Here, we treat all tasks as failed if the alloc runner prestart hook fails.
This fixes the lockups, and permits the alloc to be rescheduled on another node.

While it's desirable to retry the alloc runner on such failures, I opted to treat
that as out of scope. There are some subtleties about alloc and task runners and
their idempotency that are better handled in a follow-up PR.

This might be one of the root causes for
#5840 .
@notnoop
Contributor

notnoop commented Jul 1, 2019

I've investigated this and I found two significant contributing causes:

@the-maldridge It seems that at your scale of batch jobs, a percentage of allocs (hopefully a small one) failed to start and remain un-GC'd. While the client is running, these are mostly harmless; on client restart, however, all of these wedged allocs get started at once, resulting in the crash.

The two fixes above should mitigate the problem, though we should follow up on why the allocs failed to start in the first place and address any issues that come up there.
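
A rough way to gauge how many of these wedged allocations exist (a sketch against the HTTP API, assuming the default address and jq installed, and treating "pending" as a proxy for wedged):

# Count allocations the cluster still reports as pending on the client side
curl -s http://127.0.0.1:4646/v1/allocations | jq '[.[] | select(.ClientStatus == "pending")] | length'

Filtering on .NodeID as well would narrow this to a single client.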

@the-maldridge
Author

@notnoop Thanks for taking a look into this. Can you clarify what you mean by "failed to start"? I would be happy to look into this further to try and root-cause this hiccup. I also have a state db from an unhappy node saved and can run tests against it to find these unhappy allocs.

@notnoop
Contributor

notnoop commented Jul 1, 2019

Thanks. In this case, I believe the logs would be most helpful, ideally from before the client restart, but given the load I suspect the errors would occur again after a fresh start. I would look for log lines with the following error messages: prerun failed [1] or prestart failed [2]. These should be the triggering cause.

We were able to reproduce the symptoms reported here by running lots of allocations on a single node with relatively small ulimit values and seeing "too many open files" errors. Increasing the ulimit values seems to have helped in our testing.

[1] https://github.com/hashicorp/nomad/blob/v0.9.1/client/allocrunner/alloc_runner.go#L246
[2] https://github.com/hashicorp/nomad/blob/v0.9.1/client/allocrunner/taskrunner/task_runner.go#L406
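
For reference, something along these lines should surface them (journald shown as an assumption; adjust for syslog or a flat log file):

# Look for the triggering errors in the client logs
journalctl -u nomad | grep -E 'prerun failed|prestart failed'

# Or against a plain log file
grep -E 'prerun failed|prestart failed' /var/log/system.log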

@the-maldridge
Author

Interesting. It appears I overlooked the ulimit, as most of my prod machines seem to be holding the default of 1024. I plan to increase this to something larger during the next maintenance window. Is there a Nomad-vetted value I should target?

notnoop pushed a commit that referenced this issue Jul 2, 2019
@notnoop
Contributor

notnoop commented Jul 3, 2019

@the-maldridge We've been recommending a high setting, like 65k, but we haven't aimed to find a lower bound. Generally, it's proportional to the number of tasks running and the requirements of those tasks.

Related to this question, @schmichael just updated our recommended systemd unit to use 65k instead of unlimited, to mitigate denial-of-service risks: https://github.com/hashicorp/nomad/pull/5912/files
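
For anyone applying this on a systemd host, a sketch of the check and the override (65536 simply matches the ~65k recommendation above; the drop-in path is the standard systemd location):

# See the limit the running agent actually has
grep 'open files' /proc/$(systemctl show --value -p MainPID nomad)/limits

# Raise it via a drop-in and restart the agent
mkdir -p /etc/systemd/system/nomad.service.d
printf '[Service]\nLimitNOFILE=65536\n' > /etc/systemd/system/nomad.service.d/limits.conf
systemctl daemon-reload
systemctl restart nomad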

@chrisboulton

chrisboulton commented Jul 15, 2019

@notnoop, we're running into these problems pretty frequently at the moment (at least, what I described in my comment above, which we think is related to the issues you mentioned).

I'm considering building a patched v0.9.3 while we wait for v0.9.4 (unless you think that's going to land in the next couple of days). I was going to cherry-pick the PRs you mentioned, #5890 and #5905, as well as #5865, which looks interesting. Are you aware of any other contributing factors/fixes I might want to merge in?

@the-maldridge
Author

@chrisboulton we're running a patched 0.9.3 in staging right now; it is reasonably stable, but it will be some days before we promote it to prod.

@the-maldridge
Author

Did not intend to close.

chrisboulton pushed a commit to bigcommerce/nomad that referenced this issue Jul 15, 2019
@notnoop
Contributor

notnoop commented Jul 16, 2019

@chrisboulton I'd suggest running a patched 0.9.3. master is currently stable AFAIK and contains many bug fixes; I'd suggest scanning the changelog for changes that may affect you.

@notnoop
Contributor

notnoop commented Aug 23, 2019

Closing this ticket as it looks to have been addressed, and we haven't heard an update to the contrary in the last month. Please re-open the ticket if that's not the case and/or you have an update. Thanks!

@notnoop notnoop closed this as completed Aug 23, 2019
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 19, 2022