Nomad 0.9.0 Client deadlocks when starting after reboot #5566

Closed
schmichael opened this issue Apr 15, 2019 · 4 comments · Fixed by #5568

Comments

@schmichael
Member

schmichael commented Apr 15, 2019

Originally reported by @vkiranananda here: #2560 (comment)

Nomad version

Nomad v0.9.0 (18dd59056ee1d7b2df51256fe900a98460d3d6b9)

Operating system and Environment details

dev vagrant box

Issue

When a node (the actual VM or server) is rebooted, Nomad deadlocks on startup while trying to restore Docker tasks.

Reproduction steps

On the dev Vagrant box with Nomad 0.9.0 installed, using the devagent.hcl config file fetched in Terminal 1:

Terminal 1:

curl https://gist.githubusercontent.com/schmichael/579e22c5bd64cd9a3b05e81c3a0964e9/raw/f3bb5ed8a2aeebc1aeccf24e86f508472418592e/devagent.hcl > devagent.hcl
sudo nomad agent -config devagent.hcl

Terminal 2:

nomad init 
nomad run example.nomad
# wait for job to be running
^D # logout of vagrant box
vagrant halt
vagrant up
vagrant ssh
sudo nomad agent -config devagent.hcl
# deadlocks

Job file (if appropriate)

Example job (w/o service stanza)

Nomad Client logs (if appropriate)

Full logs + goroutine dump here: https://gist.github.com/schmichael/b0f663b293c9f2c47e2790c4f9f8fb70

@schmichael
Member Author

Killing the docker_logger unblocks the client and causes it to start a new alloc. Task events from original alloc:

2019-04-15T22:33:46Z  Killing           Sent interrupt
2019-04-15T22:33:46Z  Restarting        Task restarting in 15.954383114s
2019-04-15T22:33:46Z  Task hook failed  logmon: Reattachment process not found

@MorphBonehunter

MorphBonehunter commented Apr 16, 2019

Maybe this is related to, or a duplicate of, #5561?

@schmichael
Member Author

@MorphBonehunter Oh interesting! Yes! The only difference is that you kill all Nomad processes but the container is still running. I actually like your repro much better.

Going to close #5561 just because we've already started referencing this issue in PRs and internal comms. Sorry for missing yours and thanks for commenting. Fix is landing soon.

notnoop added a commit that referenced this issue Apr 16, 2019
Fixes #5566.

Fix a case where docker logging process may lock up nomad agent restart.

Looks like we have a case where the docker logger is started even though logmon isn't. In such a case, the fifo writer blocks indefinitely waiting for a reader, and because the open operation happens in the main goroutine, the nomad agent blocks indefinitely as well.

This fixes the issue by performing the fifo open operation in a separate goroutine instead of the main goroutine.

We should follow up independently to ensure logmon <-> dockerlogger ordering and consider having task recovery happen in non-main goroutine with some sensible timeouts.
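
The pattern described above can be sketched roughly as follows. This is a minimal, Unix-only illustration and not the actual Nomad patch: the names openFifoWriter, startReader, makeTempFifo, demoTimeout, and errOpenTimeout are all hypothetical, and the timeout is an assumption borrowed from the follow-up idea rather than part of the fix itself. It shows that opening a fifo for writing blocks until a reader attaches, and that moving the open into its own goroutine lets the caller keep going.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"syscall"
	"time"
)

// Hypothetical names for this sketch; none of them come from Nomad.
const demoTimeout = 500 * time.Millisecond

var errOpenTimeout = errors.New("timed out waiting for a fifo reader")

// openFifoWriter opens a fifo for writing in a separate goroutine, so the
// caller is not stuck if no reader (logmon, in the issue) ever attaches.
func openFifoWriter(path string, timeout time.Duration) (*os.File, error) {
	type result struct {
		f   *os.File
		err error
	}
	ch := make(chan result, 1) // buffered: the opener never blocks on send

	go func() {
		// Blocks until some process opens the fifo for reading.
		f, err := os.OpenFile(path, os.O_WRONLY, 0)
		ch <- result{f, err}
	}()

	select {
	case r := <-ch:
		return r.f, r.err
	case <-time.After(timeout):
		// The opener goroutine is still blocked. If a reader shows up
		// later, close the late file handle instead of leaking it.
		go func() {
			if r := <-ch; r.f != nil {
				r.f.Close()
			}
		}()
		return nil, errOpenTimeout
	}
}

// startReader drains the fifo in the background, standing in for logmon.
func startReader(path string) {
	go func() {
		r, err := os.Open(path)
		if err != nil {
			return
		}
		defer r.Close()
		io.Copy(io.Discard, r)
	}()
}

// makeTempFifo creates a fifo in a fresh temp dir and returns a cleanup func.
func makeTempFifo() (string, func(), error) {
	dir, err := os.MkdirTemp("", "fifo-demo")
	if err != nil {
		return "", nil, err
	}
	cleanup := func() { os.RemoveAll(dir) }
	path := filepath.Join(dir, "stdout.fifo")
	if err := syscall.Mkfifo(path, 0o600); err != nil {
		cleanup()
		return "", nil, err
	}
	return path, cleanup, nil
}

func main() {
	path, cleanup, err := makeTempFifo()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer cleanup()

	// No reader attached: a plain os.OpenFile here would hang forever.
	_, err = openFifoWriter(path, demoTimeout)
	fmt.Println("without reader:", err)

	// With a reader attached, the open completes and writes go through.
	startReader(path)
	w, err := openFifoWriter(path, demoTimeout)
	if err != nil {
		fmt.Println("with reader:", err)
		return
	}
	defer w.Close()
	fmt.Fprintln(w, "hello from the writer")
	fmt.Println("with reader: opened and wrote successfully")
}
```

Note that without the timeout branch the opener goroutine would simply stay parked until a reader appears, which matches the commit's description of moving the open off the main goroutine; the timeout and late-close handling are extras added here for illustration.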
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 24, 2022