Interrupted tasks in Docker fail restarting due to "container already exists" #2084
Comments
Can you try RC2? We made quite a few improvements to the Docker driver attempting to remedy this issue: https://releases.hashicorp.com/nomad/0.5.1-rc2/
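For reference, a minimal in-place upgrade sketch; the zip filename is assumed from HashiCorp's usual release naming, so verify it against the release page above:
# Fetch the RC build and swap it in for the current binary.
# Filename assumed from HashiCorp's standard naming convention.
curl -sLO https://releases.hashicorp.com/nomad/0.5.1-rc2/nomad_0.5.1-rc2_linux_amd64.zip
unzip -o nomad_0.5.1-rc2_linux_amd64.zip
sudo mv nomad /usr/local/bin/nomad
nomad version   # should report v0.5.1-rc2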
Thanks! Just tried it and I could not reproduce the issue with RC2.
Having the same core issue, although the scenario is a bit different:
Didn't help:
I installed the new version, removed the old containers, and Nomad rescheduled everything. Then I redeployed the whole cluster, and Nomad can't schedule them again.
@mlushpenko What do you mean by redeploying the whole cluster?
@dadgar Our pipeline invokes Ansible playbooks that deploy Consul first, then Nomad, dnsmasq, and Docker to several VMs. When we deploy services like GitLab, we invoke another pipeline via Jenkins that deploys containers to the Nomad cluster. But sometimes we need to update the cluster itself (Docker/Consul/Nomad); then we invoke the "base" pipeline and encounter the issues mentioned above. Redeploying the whole cluster = invoking the "base" pipeline.
@mlushpenko When you run that base pipeline, are you doing in-place upgrades of Nomad/Docker or starting a new VM? Stopping the Docker engine is not really advisable.
@dadgar In-place upgrades; within the client's legacy infrastructure, spinning up VMs on demand is not an option. Docker is restarted and reloaded if there are changes to the Docker config. Also, I'm not sure our playbooks are 100% idempotent right now, but I hope that's not a problem; in other issues I saw that Nomad should handle VM failures and container failures (I would consider restarting Docker a temporary container failure).
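One way to make such in-place upgrades safer is to drain the node before the Docker restart so Nomad moves its allocations elsewhere first. A sketch, assuming the node ID is known and using the node-drain flag form from the 0.5.x CLI (check nomad node-drain -h on your version):
# Drain the node so allocations are rescheduled before Docker restarts;
# <node-id> is a placeholder for the client's node ID.
nomad node-drain -enable <node-id>
sudo systemctl restart docker
nomad node-drain -disable <node-id>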
@weslleycamilo This is a regression due to Docker changing the returned error code and thus breaking our error handling. It will be resolved in #3513, which will be part of 0.7.1.
@dadgar Hmm, great, but what is the latest version in which it still works? Do you know? I tried version 0.6.3 and got the same error.
@weslleycamilo It depends on the Nomad and Docker Engine pairing. Docker changed their error message recently (not exactly sure in which version), so the error handling we have wasn't being triggered. The new error handling should be robust against both versions of the error message Docker returns.
@dadgar Do you know of any Nomad documentation that says which Nomad and Docker versions are compatible? It seems Nomad 0.7.0 is not production ready; it would be critical if I can't get the container back after a Docker restart or a host restart. Can I keep using Nomad 0.6.3? Which Docker version is known to work with it? At the moment I am evaluating Nomad for production, but I can't stay stuck on this issue.
@weslleycamilo There is a bug in Docker 17.09 that broke Nomad's name-conflict code path. This meant that on Docker daemon restarts, Nomad would be unable to restart the container. I've attached a test binary to the PR if you're able to give it a shot! #3551
Hi @schmichael, great job. Thank you, @dadgar.
I still see this (lately extremely often: out of 1000 dispatches, 500 will keep failing with this error, even on freshly provisioned machines). Docker version is
Is there some setting to always force Nomad to create a new container? Since these are batch jobs, there's no point in reusing the container or trying to re-attach.
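I'm not aware of such a setting. As a stopgap, a cleanup along these lines can clear the name conflicts; this is a sketch assuming the conflicting containers have already exited and nothing else on the host depends on them:
# Remove only exited containers so dispatched batch tasks can be
# recreated under the same name; -r skips the rm when nothing matches.
docker ps -aq --filter status=exited | xargs -r docker rm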
@Fuco1 this issue was closed a very long time ago. Can you open a new bug report please? |
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad version
Nomad v0.5.0
Operating system and Environment details
Ubuntu 16.04 x86_64
Docker version 1.12.1, build 23cf638 (apt install docker.io)
Issue
When tasks on a client system are interrupted, by rebooting or by stopping the Nomad and Docker services, they fail to restart: Nomad cannot start the tasks because it finds the previous container still present.
Reproduction steps
Running docker rm $(docker ps -aq) allows them to start again. For obvious reasons, adding a script that removes all Docker containers on boot would not be a good solution.
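A minimal sequence that reproduces the failure, assuming the services run as systemd units named nomad and docker:
# Interrupt running tasks by stopping both services, then bring them back.
sudo systemctl stop nomad docker
sudo systemctl start docker nomad
# Nomad now tries to recreate the task's container, but the old one is
# still present, so creation fails with a name conflict.
docker ps -a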
Nomad Server or Client logs do not contain anything relevant to this error.
Related to parts of the discussion on #2016