Docker containers don't exit until after 5 minute timeout #2119
Comments
Can you do How reliably does this happen? |
I can force this to happen within ~2 minutes by changing the count and running the job, waiting for the job output, running the job, "rinse and repeat". This particular change was count=2 to count=1 on my job from above. My alloc-status log was cleared for some reason so
However, I just reproduced the issue again and here's the output
In this particular case, the containers decided that ~5 minutes was enough time and got into the correct state.
I should note the version of docker I am using
Can you paste the full client logs of node |
I am running Nomad on the host itself via systemd. If you would like to schedule a screenshare or something like that, I'd be happy to oblige.
@pgporada Appreciate the offer. Hopefully we can get to the root of it async. If possible, could you reproduce this while running at log level DEBUG on both the server and the client, and paste the results?
I will get you that by tomorrow. We're about to kick off a company Christmas party. Thank you for the help so far. |
Sorry for the delay. Please see the enclosed files
Specifically, why did it take 5+ minutes to kill a container?
We're hitting the same issue - I guess that somehow the timeout in docker.go line 89 is hit for some reason, or maybe JobGCInterval or NodeGCInterval which are also set to 5 minutes. |
Thanks for the logs and report. Will get a fix! |
Just noted that issue #2133 reports a similar symptom for the raw_exec driver. |
@bluen I think they are unrelated. Docker driver has an http client with an explicit 5 minute timeout. We use this because occasionally docker engine stops being responsive and we don't want it to hang the Nomad client. |
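To illustrate the kind of timeout described in the comment above, here is a minimal sketch in Go using only the standard library. It is not Nomad's actual driver code: the helper names `newTimeoutClient` and `timedOut` are hypothetical, and the durations are shortened from 5 minutes to milliseconds so the example finishes quickly. It shows that a client-wide `Timeout` turns an unresponsive endpoint into an error instead of a hang.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"time"
)

// newTimeoutClient returns an HTTP client that abandons any request taking
// longer than d. Hypothetical helper for illustration, not Nomad's code.
func newTimeoutClient(d time.Duration) *http.Client {
	return &http.Client{Timeout: d}
}

// timedOut starts a local test server that sleeps for serverDelay before
// answering, and reports whether a client limited to clientTimeout fails
// with a timeout error.
func timedOut(clientTimeout, serverDelay time.Duration) bool {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(serverDelay) // simulate an unresponsive engine
		fmt.Fprintln(w, "ok")
	}))
	defer srv.Close()

	resp, err := newTimeoutClient(clientTimeout).Get(srv.URL)
	if err != nil {
		// http.Client wraps failures in *url.Error, which implements net.Error.
		var nerr net.Error
		return errors.As(err, &nerr) && nerr.Timeout()
	}
	resp.Body.Close()
	return false
}

func main() {
	// Slow server, impatient client: the request is abandoned.
	fmt.Println("slow server timed out:", timedOut(50*time.Millisecond, 300*time.Millisecond))
	// Fast server, patient client: the request completes normally.
	fmt.Println("fast server timed out:", timedOut(300*time.Millisecond, 10*time.Millisecond))
}
```

The trade-off is the one described in the comment: the caller never blocks forever, but a request that really is stuck is only noticed once the full timeout has elapsed.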
This PR makes GetAllocs use a blocking query and adds a sanity check to the client's watchAllocation code to ensure it gets the correct allocations. This PR fixes #2119 and #2153. The issue was that the client was talking to two different servers: one to check which allocations to pull and the other to pull those allocations. However, the latter call was not made with a blocking query, so the client would not retrieve the allocations it requested. The logging has also been improved to make the problem clearer.
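The idea behind the fix can be sketched roughly as follows. This is a simplified, in-memory model and not the code from the PR: the `server` type, its `apply` and `getAllocs` methods, and the `missing` check are all hypothetical names. It assumes a server that is behind on replication and shows how a read that blocks until a minimum index is reached returns the requested allocation, with a final check that nothing requested is absent.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// server is a hypothetical stand-in for one server's replicated state.
type server struct {
	mu     sync.Mutex
	cond   *sync.Cond
	index  uint64            // last applied index
	allocs map[string]string // alloc ID -> desired status
}

func newServer() *server {
	s := &server{allocs: map[string]string{}}
	s.cond = sync.NewCond(&s.mu)
	return s
}

// apply records a replicated write at index and wakes any blocked readers.
func (s *server) apply(index uint64, id, status string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.index = index
	s.allocs[id] = status
	s.cond.Broadcast()
}

// getAllocs blocks until the server has caught up to minIndex, then returns
// the requested allocations. A real blocking query would also cap the wait.
func (s *server) getAllocs(minIndex uint64, ids []string) map[string]string {
	s.mu.Lock()
	defer s.mu.Unlock()
	for s.index < minIndex {
		s.cond.Wait()
	}
	out := make(map[string]string)
	for _, id := range ids {
		if status, ok := s.allocs[id]; ok {
			out[id] = status
		}
	}
	return out
}

// missing is the sanity check: it lists requested IDs absent from the reply.
func missing(ids []string, got map[string]string) []string {
	var absent []string
	for _, id := range ids {
		if _, ok := got[id]; !ok {
			absent = append(absent, id)
		}
	}
	return absent
}

// pullFromLaggingServer asks a server that has not yet seen index 7 for an
// allocation that was changed at index 7.
func pullFromLaggingServer() (map[string]string, []string) {
	follower := newServer()
	ids := []string{"alloc-1"}
	go func() {
		time.Sleep(20 * time.Millisecond) // replication lag
		follower.apply(7, "alloc-1", "stop")
	}()
	got := follower.getAllocs(7, ids)
	return got, missing(ids, got)
}

func main() {
	got, absent := pullFromLaggingServer()
	fmt.Println("allocs:", got, "missing:", absent)
}
```

Without the wait on `minIndex`, the same call would return an empty result from the lagging server, which matches the symptom described in this issue: the client never sees that the allocation should stop.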
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad version
Nomad v0.5.1 on 3 servers and 3 clients
Operating system and Environment details
All 6 nodes running
Linux centos-7-srv1 3.10.0-327.36.3.el7.x86_64 #1 SMP Mon Oct 24 16:09:20 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Issue
Signal to shutdown a container isn't fired.
Reproduction steps
Create a nomad job with count set to 2.
Issue commands nomad plan phil.nomad and nomad run phil.nomad
You will see the 2 containers correctly start up on the appropriate clients.
Change the count from 2 to 0, run a nomad plan phil.nomad and nomad run phil.nomad and you will see the containers continuing to stay up for ~5 extra minutes. You will also see a
Nomad Server logs (if appropriate)
Nomad Client logs (if appropriate)
I can still hit the container that should be stopped.
When this happens, the container will hang around indefinitely. The signal(s) will be fired during the next nomad run phil.nomad assuming that I make a change to the job file. This is not exactly ideal.
Job file (if appropriate)
count=2
count=0