docker driver stalls downloading images with poor DNS servers #157
Comments
Thanks for reporting this! I would like to improve this but this might be a bit tricky for us to fix for two reasons:
I think we can improve this via #277 by adding simple remote logging to the CLI, but solving this in a way that provides more immediate feedback will be difficult without some additional plumbing on our side.
Nomad retries docker pulls if they fail. If
I think I just ran into this bug. I have 154 dispatch jobs all on the same node that say they're running, but their allocations are stuck in a pending state:
And it's because they're waiting on downloading the docker image:
I checked to see if that image had been downloaded:
and it hadn't so I tried to manually pull it onto the node:
And all of the layers were downloaded, it just hadn't gotten tagged. However after I pulled it the image showed as being on the node:
However all of the jobs that are waiting on that image are still waiting, for about 45 minutes now. I saw @dadgar's comment:
so I know I have a Docker issue. The question is: how do I make Nomad work? It's still stuck with 154 jobs waiting on a Docker image that is now downloaded. Also, given that this Docker issue is now over two years old, is there any workaround for this within Nomad? Is there a parameter I could set that would have the effect of: if downloading a Docker image takes more than 30 minutes, consider it a failure? I guess my move here is to manually kill all those jobs and let them get rescheduled, but having to resolve this issue manually is rather painful and won't be a great solution when it happens at 2 am and isn't discovered for hours.
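For illustration, the "fail after 30 minutes" behaviour asked about above can be approximated outside Nomad with a small watchdog around the pull. This is only a sketch under assumptions: `run_with_deadline` and `pull_image` are hypothetical helper names, not Nomad or Docker APIs, and nothing in this thread confirms that Nomad exposes such a setting. It relies on the `timeout` argument of Python's `subprocess.run`, which kills the child process when the deadline passes.

```python
import subprocess


def run_with_deadline(cmd, timeout_s):
    """Run a command and classify the outcome as 'ok', 'failed', or 'timeout'.

    Hypothetical helper: subprocess.run kills the child and raises
    TimeoutExpired if it is still running after timeout_s seconds.
    """
    try:
        completed = subprocess.run(
            cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    return "ok" if completed.returncode == 0 else "failed"


def pull_image(image, timeout_s=30 * 60):
    """Pull an image, treating a pull longer than timeout_s as a failure.

    Assumes the docker CLI is on PATH; raises FileNotFoundError otherwise.
    """
    return run_with_deadline(["docker", "pull", image], timeout_s)
```

A caller could pre-pull the image on the node this way and only submit jobs once the result is "ok", so a stalled pull surfaces as a "timeout" instead of an allocation that stays pending.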
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
I was going through the tutorial at home, and couldn't get any of the Docker containers to start up at all. Every time I did nomad run, it would just get stuck in a "pending" state, even if I waited for 30 minutes:
I ran sudo docker images to see if the images had been downloaded at all, since I suspected that's what it got stuck on, and saw this:
So it looks like somewhere in the docker pull something stopped working, and it never got to the tagging state. In my case it seems to have been a DNS server that wasn't responding a lot of the time, causing a lot of DNS resolution timeouts. This causing docker pull to fail is maybe a Docker bug, but it would be nice if Nomad could somehow catch this happening so the allocation won't be pending forever.
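Since the root cause described here is a DNS server that often does not answer, one rough way to spot the condition before a pull is to check whether the registry hostname resolves within a deadline. The sketch below is an assumption-laden illustration, not something Nomad or Docker does: `can_resolve` is a hypothetical helper, the port 443 and the 5 second default are arbitrary choices, and the lookup runs in a daemon thread because `socket.getaddrinfo` has no timeout parameter of its own.

```python
import socket
import threading


def can_resolve(host, timeout_s=5.0, resolver=socket.getaddrinfo):
    """Return True if host resolves within timeout_s seconds, else False.

    Hypothetical helper. The resolver argument exists only so the lookup
    can be swapped out; by default it is socket.getaddrinfo.
    """
    result = {"ok": False}

    def worker():
        try:
            resolver(host, 443)
            result["ok"] = True
        except OSError:
            # socket.gaierror (resolution failure) is a subclass of OSError.
            result["ok"] = False

    # Daemon thread: a lookup that never returns will not block interpreter exit.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)
    return (not thread.is_alive()) and result["ok"]
```

Running this a few times against the registry hostname before submitting a job would show whether lookups are timing out intermittently, which matches the symptom reported above.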