Job stays pending for too long #2153
Comments
Hey @mildred, How large is that image? It is most likely downloading the image. In 0.5.3 we will have drivers emit extra information to make debugging this easier (it will show up in alloc-status events). The reason I believe it to be downloading the image is that once the client marks an allocation as |
I am not sure what you would like to do with this issue. We could close it until you run into it again and can post the logs (I would suggest running with debug-level logging) so we can be more certain that the delay comes from downloading the image, or we could wait for 0.5.3, which should be out by the end of the month. |
This PR makes GetAllocs use a blocking query and adds a sanity check to the client's watchAllocation code to ensure it gets the correct allocations. This PR fixes #2119 and #2153. The issue was that the client was talking to two different servers: one to check which allocations to pull and the other to pull those allocations. However, the latter call did not use a blocking query, so the client would not retrieve the allocations it requested. The logging has also been improved to make the problem clearer.
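To illustrate the failure mode described above, here is a minimal Python sketch. It is a toy model, not Nomad's code: `Server`, `get_allocs`, and `pull_allocs` are hypothetical names, and the "blocking" behaviour is simulated by letting a lagging server apply its queued writes until it reaches the requested index.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical stand-in for one server replica (not Nomad's API)."""
    index: int = 0                                # last applied index
    allocs: dict = field(default_factory=dict)    # alloc_id -> modify index
    pending: list = field(default_factory=list)   # replicated writes not yet applied

    def apply_next(self):
        # Apply the oldest queued write and advance the local index.
        alloc_id, idx = self.pending.pop(0)
        self.allocs[alloc_id] = idx
        self.index = idx

    def get_allocs(self, alloc_ids, min_index=0):
        # A real blocking query would wait for the index to advance; this
        # toy models the wait by applying queued writes until caught up.
        while self.index < min_index and self.pending:
            self.apply_next()
        return {a: self.allocs[a] for a in alloc_ids if a in self.allocs}


def pull_allocs(server, wanted, blocking):
    """Fetch the wanted allocations and report any that are missing.

    `wanted` maps alloc_id -> modify index, as learned from another server.
    The returned `missing` list plays the role of the sanity check.
    """
    min_index = max(wanted.values()) if blocking else 0
    got = server.get_allocs(list(wanted), min_index=min_index)
    missing = sorted(set(wanted) - set(got))
    return got, missing
```

With a lagging server, the non-blocking call returns nothing and the check reports every requested allocation as missing, whereas the blocking call returns all of them.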
Closed by #2177 |
I'd suggest a feature request where docker would first download the image, and only then nomad would kill existing instances (in case of running only a single instance). Otherwise for large images there is a potential downtime while the image is being downloaded. |
@maximveksler If you're referring to downtime caused by job updates stopping old allocations before starting new ones the |
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
If you have a question, prepend your issue with [question] or preferably use the nomad mailing list. If filing a bug please include the following:
Issue
Sometimes a job spends too much time in the pending state. Sometimes it starts immediately; other times it can take more than 10 minutes, and I remember cases where it stayed pending for a few hours.
This is not necessarily a problem if the delay can be explained and reduced (for example, if it comes from downloading the docker image). Unfortunately, there is not much information on what is happening in the pending state, and one is left wondering whether the job will start at all.
To help debug this, what is happening during the pending phase? I suppose there is the placement, which shouldn't take that much time. There is also probably the docker image download (which should take the same time on each run, I suppose). Does nomad wait for anything else before running the job? Does it wait for the nodes to have enough free resources? Can it block?
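One way to see where the time goes is to look at the task events recorded on the allocation. Below is a minimal Python sketch that turns an allocation object into a per-task timeline. It assumes the allocation JSON has a `TaskStates` map whose entries carry an `Events` list with `Type` and `Time` (nanoseconds) fields; these field names should be checked against the Nomad version in use, and the sample data is made up for illustration.

```python
def event_timeline(alloc):
    """Return (task, event type, seconds since the task's first event) rows.

    `alloc` is a dict decoded from allocation JSON. The field names used
    here (TaskStates, Events, Type, Time) are assumptions to verify.
    """
    rows = []
    # TaskStates may be missing or null if the client has not reported yet.
    task_states = alloc.get("TaskStates") or {}
    for task, state in sorted(task_states.items()):
        events = sorted(state.get("Events") or [], key=lambda e: e["Time"])
        if not events:
            continue
        start = events[0]["Time"]
        for ev in events:
            rows.append((task, ev["Type"], (ev["Time"] - start) / 1e9))
    return rows


# Made-up sample: a task received at t=0 and started 300 seconds later.
BASE = 1483000000 * 10**9
sample = {
    "TaskStates": {
        "web": {
            "Events": [
                {"Type": "Started", "Time": BASE + 300 * 10**9},
                {"Type": "Received", "Time": BASE},
            ]
        }
    }
}

for task, kind, offset in event_timeline(sample):
    print(f"{task}: {kind} at +{offset:.0f}s")
```

An empty result for a pending allocation would suggest that the client has not yet reported any task state, which points at the server-to-client hand-off rather than the driver.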
Here is the alloc-status of a job that took 5 minutes to start:
Reproduction steps
Difficult to reproduce, but the jobs were submitted using the HTTP API.
Nomad Server logs (if appropriate)
Running many nomad clients with a server cluster of 3 nodes. I don't have the full logs for all (there was a reboot) but for the 2nd node, I have:
and the logs for the server node #3:
Nomad Client logs (if appropriate)
Unfortunately I no longer have access to the nomad client logs. The machine was stopped. I remember I looked at it and it was quite empty. So much so that I looked at the docker daemon logs. When I have this problem again, I'll make sure to include the logs.
Job file (if appropriate)
I don't have the exact job file, but it looks like this one: