Nomad client hangs indefinitely on startup #1202
Comments
I have seen this a couple of times. I turned on DEBUG logging, and still nothing. The only way I could get it to start up was to start killing existing docker containers that were running. |
Can you give some background on what the client was running and how it was restarted? If you can give reproduction steps that would be awesome! |
@dadgar so I can reproduce this error under the same conditions reported by @a86c6f7964. If you restart the Nomad client while you have running containers, and those containers continue to run after Nomad has exited, then the Nomad client is unable to start back up. |
@grobinson-blockchain @a86c6f7964 So I started one agent in server mode and another in client mode, ran a redis container (the one that nomad init generates) and killed the client. At that point the Nomad executor and container were still running. Then I restarted the client, which connected back to the executor, and I could even stop the container. Can I get some more help in reproducing this? |
I can try again. But I think that it has to do with having many containers.
|
Not necessarily. I've got a very simple cluster (3 server nodes, 1 client node) with a single system job, and I'm experiencing identical behaviour. Nomad Agent
Nomad Config
Job Spec
Initially everything works as expected: Nomad starts and launches the system job:
As seen above, I stopped it with SIGINT. The system job's docker container is still present after Nomad has exited:
When I issue
But the system job enters dead state and never recovers:
|
Hey @grobinson-blockchain @a86c6f7964 @wuub, So I tried everything you guys have said and still couldn't reproduce this. Restarted the client like 30 times. If one of you can reliably get into this state, can you issue a SIGQUIT ( Thanks, |
@dadgar I can repro this every time, so if you need me to try something else just let me know. The SIGQUIT goroutine dump is here, same Nomad version as before (v0.4.0-dev ('bc09a0444722617a3a0ee0daa28d24b93d9d3e5b+CHANGES')): https://gist.github.com/wuub/9f1671d70ffb041dec8f2d9c77404437 System information:
The instance is created with the following Terraform:
|
@wuub Thanks! That stack trace had what we needed. Looks like it is hung calling docker stop. Will get the timeout to work properly. Though don't know why it is hung |
@wuub So we make the following request to docker: |
I reproduced Nomad's hung state, waited a while, then issued the stop command. It took several seconds (expected, since fluentd does a graceful shutdown that takes a while to complete), but ultimately docker returned "204 No Content" and removed the container.
|
@dadgar I think I found the reason. After stracing the nomad binary a few times I noticed something like this:
When I replayed it using curl, I was able to replicate the infinite lockup:
After retrying out-of-band with curl & I'll do some spelunking to find out why it wants to wait so long and let you know. |
@dadgar Line 881 in 964e133
the author probably had something like this in mind: Line 954 in 964e133
I'll test it in a minute and send a PR if it fixes my problem |
A unit mismatch caused the docker driver to wait almost indefinitely during boot (when one or more containers were a bit uncooperative during StopContainer()). This should fix the problems described in hashicorp#1202
@wuub if the problem is fixed we should close this. Let me know |
@dadgar If I understand correctly this fix made it to 0.4.0? |
Yes, it made it into 0.4. I was waiting for confirmation since I have never been able to reproduce it. |
After more testing I can confirm that this is now solved for us :) Thanks |
I fixed this by restarting docker. |
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad version
Nomad v.0.3.2
Issue
The Nomad client hangs on
Starting Nomad agent...
at startup. It never makes progress. When started via an interactive terminal:
Nomad configuration file: