
Interrupted tasks in Docker fail restarting due to "container already exists" #2084

Closed · hoh opened this issue Dec 12, 2016 · 18 comments

hoh (Contributor) commented Dec 12, 2016

Nomad version

Nomad v0.5.0

Operating system and Environment details

Ubuntu 16.04 x86_64
Docker version 1.12.1, build 23cf638 (apt install docker.io)

Issue

When tasks on a client are interrupted, for example by rebooting the host or stopping the Nomad and Docker services, they fail to restart: Nomad cannot start them because it finds the previous container still present.

Reproduction steps

  • Set up two hosts, {1} with Nomad client+server and {2} with Nomad client
  • Start a job with count ≥2 so tasks are started in containers on both hosts
  • Restart host {2}
  • When Nomad tries to reschedule the tasks on {2}, their allocations fail with the following error:
12/12/16 15:21:54 UTC  Driver Failure  failed to start task 'mytask' for alloc 'cb2604b0-fafe-a05b-66f0-484caedba5ce': Failed to create container: container already exists
  • These tasks stay "Starting" and never switch to "Running" due to the above error. Removing the containers with docker rm $(docker ps -aq) allows them to start again.

For obvious reasons, adding a script that removes all Docker containers on boot would not be a good solution.

Nomad Server or Client logs do not contain anything relevant to this error.

Related to parts of the discussion on #2016

dadgar (Contributor) commented Dec 12, 2016

Can you try RC2? We made quite a few improvements to the Docker driver attempting to remedy this issue: https://releases.hashicorp.com/nomad/0.5.1-rc2/

hoh (Contributor, Author) commented Dec 15, 2016

Thanks! I just tried it and could not reproduce the issue with RC2.

hoh closed this as completed Dec 15, 2016
mlushpenko commented:

I'm having the same core issue, although the scenario is a bit different:

  • run services on Nomad
  • redeploy the whole cluster, which restarts the Docker daemon
  • the Nomad job stays in the pending state and alloc-status shows:
12/22/16 16:27:43 CET  Restarting      Task restarting in 25.881394077s
12/22/16 16:27:43 CET  Driver Failure  failed to start task 'traefik' for alloc '06fecd6d-81d9-7a16-5c82-c770743d68d8': Failed to create container: container already exists
  • After removing the container manually, Nomad is able to reschedule it

mlushpenko commented:

Didn't help:

# nomad --version
Nomad v0.5.1-rc2 ('6f2ccf22be738a31cb2153c7e43422c4ba9a0e3f+CHANGES')

# nomad status
ID               Type     Priority  Status
jenkins-master   service  50        dead
nexus            service  50        dead
registry         service  50        dead
selenium-chrome  service  50        dead
selenium-hub     service  50        dead
traefik          system   60        running
# nomad status -verbose traefik
ID          = traefik
Name        = traefik
Type        = system
Priority    = 60
Datacenters = amersfoort
Status      = running
Periodic    = false

Summary
Task Group     Queued  Starting  Running  Failed  Complete  Lost
loadbalancing  0       3         0        8       0         0

Evaluations
ID                                    Priority  Triggered By  Status    Placement Failures
34a26bf4-4147-4c25-0b8e-09881eabd0e0  60        job-register  complete  false

Allocations
ID                                    Eval ID                               Node ID                               Task Group     Desired  Status   Created At
d3b6a7d3-a436-f6ca-d470-f672fb164099  34a26bf4-4147-4c25-0b8e-09881eabd0e0  cdaaa02d-40ee-d341-197b-7eee724babfb  loadbalancing  run      pending  12/22/16 15:02:21 CET
f73affb6-6173-07b6-4e1a-1c80bcc6cd3c  34a26bf4-4147-4c25-0b8e-09881eabd0e0  ece479e0-1740-aea8-88d7-5f93c57696fc  loadbalancing  run      pending  12/22/16 15:02:21 CET
06fecd6d-81d9-7a16-5c82-c770743d68d8  34a26bf4-4147-4c25-0b8e-09881eabd0e0  ee191d9f-509f-afc7-c096-54fc7c10c8bb  loadbalancing  run      pending  12/21/16 17:02:59 CET
# nomad alloc-status -verbose d3b6a7d3-a436-f6ca-d470-f672fb164099
ID                 = d3b6a7d3-a436-f6ca-d470-f672fb164099
Eval ID            = 34a26bf4-4147-4c25-0b8e-09881eabd0e0
Name               = traefik.loadbalancing[0]
Node ID            = cdaaa02d-40ee-d341-197b-7eee724babfb
Job ID             = traefik
Client Status      = pending
Client Description = <none>
Created At         = 12/22/16 15:02:21 CET
Evaluated Nodes    = 1
Filtered Nodes     = 0
Exhausted Nodes    = 0
Allocation Time    = 16.788µs
Failures           = 0

Task "traefik" is "pending"
Task Resources
CPU      Memory   Disk  IOPS  Addresses
500 MHz  128 MiB  0 B   0     http: <IP>:9999
                              ui: <IP>:9998

Recent Events:
Time                   Type            Description
12/22/16 16:55:16 CET  Restarting      Task restarting in 30.382471444s
12/22/16 16:55:16 CET  Driver Failure  failed to start task 'traefik' for alloc 'd3b6a7d3-a436-f6ca-d470-f672fb164099': Failed to create container: container already exists
12/22/16 16:54:51 CET  Restarting      Task restarting in 25.16167252s
12/22/16 16:54:51 CET  Driver Failure  failed to start task 'traefik' for alloc 'd3b6a7d3-a436-f6ca-d470-f672fb164099': Failed to create container: container already exists
12/22/16 16:54:25 CET  Restarting      Task restarting in 25.722048638s
12/22/16 16:54:25 CET  Driver Failure  failed to start task 'traefik' for alloc 'd3b6a7d3-a436-f6ca-d470-f672fb164099': Failed to create container: container already exists
12/22/16 16:53:58 CET  Restarting      Task restarting in 27.098521822s
12/22/16 16:53:58 CET  Driver Failure  failed to start task 'traefik' for alloc 'd3b6a7d3-a436-f6ca-d470-f672fb164099': Failed to create container: container already exists
12/22/16 16:53:29 CET  Restarting      Task restarting in 29.045220368s
12/22/16 16:53:29 CET  Driver Failure  failed to start task 'traefik' for alloc 'd3b6a7d3-a436-f6ca-d470-f672fb164099': Failed to create container: container already exists

Placement Metrics
  * Score "cdaaa02d-40ee-d341-197b-7eee724babfb.binpack" = 2.587686

I installed the new version, removed the old containers, and Nomad rescheduled everything. Then I redeployed the whole cluster and Nomad can't schedule them again.

dadgar (Contributor) commented Jan 3, 2017

@mlushpenko What do you mean by "redeploy the whole cluster"?

mlushpenko commented:

@dadgar our pipeline invokes Ansible playbooks that deploy Consul first, then Nomad, dnsmasq, and Docker to several VMs.

When we deploy services like GitLab, we invoke another pipeline via Jenkins that deploys containers to the Nomad cluster. But sometimes we need to update the cluster itself (Docker/Consul/Nomad); then we invoke the "base" pipeline and encounter the issues mentioned above. "Redeploy the whole cluster" = invoke the "base" pipeline.

dadgar (Contributor) commented Jan 5, 2017

@mlushpenko When you run that base pipeline, are you doing in-place upgrades of Nomad/Docker or starting a new VM?

Stopping the Docker Engine is not really advisable.

mlushpenko commented:

@dadgar in-place upgrades: within the client's legacy infrastructure, spinning up VMs on demand is not an option.

Docker is restarted and reloaded if there are changes to the Docker config. Also, I'm not sure our playbooks are 100% idempotent right now, but I hope that's not a problem: in some other issues I saw that Nomad should handle VM failures and container failures (I would consider restarting Docker a temporary container failure).

weslleycamilo commented:

Hello,

Is there anyone who could help solve this?

I got the same error reported in this issue! I am using Nomad 0.7.0.

Here is the log I got from the UI.

[screenshot of the error from the Nomad UI]

dadgar (Contributor) commented Nov 14, 2017

@weslleycamilo This is a regression due to Docker changing the returned error code and thus breaking our error handling. It will be resolved in #3513, which will be part of 0.7.1.

weslleycamilo commented:

@dadgar Hmm, great, but what is the version in which it still works? Do you know? I tried version 0.6.3 and got the same error.

dadgar (Contributor) commented Nov 14, 2017

@weslleycamilo It depends on the Nomad and Docker Engine pairing. Docker changed their error message recently (I'm not exactly sure in which version), so the error handling we have wasn't being triggered. The new error handling should be robust against both versions of the error message Docker returns.
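
As an aside, here is a minimal Go sketch (not Nomad's actual driver code) of why matching on the daemon's error text is brittle: the check has to list every wording the daemon may return. The two substrings and the sample messages are assumptions modelled on the errors quoted in this thread.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// isNameConflict reports whether a container-create error means a container
// with the requested name already exists. String matching like this breaks
// whenever the daemon rewords the error.
func isNameConflict(err error) bool {
	if err == nil {
		return false
	}
	msg := strings.ToLower(err.Error())
	return strings.Contains(msg, "container already exists") || // older daemons
		strings.Contains(msg, "is already in use by container") // newer daemons
}

func main() {
	// Sample messages modelled on the errors quoted in this thread.
	oldStyle := errors.New("Failed to create container: container already exists")
	newStyle := errors.New(`Conflict. The container name "/mytask" is already in use by container "abc123"`)
	fmt.Println(isNameConflict(oldStyle), isNameConflict(newStyle)) // true true
}
```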

weslleycamilo commented:

@dadgar Do you know of any Nomad documentation that says which Nomad and Docker versions are compatible?

It seems Nomad 0.7.0 is not production ready. It would be critical if I can't get the Docker container back after restarting Docker or restarting the host.

Can I keep using Nomad 0.6.3? Which Docker versions work with it? At the moment I am testing Nomad to go to production with it, but I believe I shouldn't stay stuck on this issue.

schmichael (Member) commented:

@weslleycamilo There is a bug in Docker 17.09 that broke Nomad's name conflict code path. This meant that on Docker daemon restarts Nomad would be unable to restart the container. I've attached a test binary to the PR if you're able to give it a shot: #3551
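
For context, a hypothetical sketch (not the fix in #3551) of what a name-conflict recovery path can look like with the Docker Go client (github.com/docker/docker/client): find the stale container that holds the name and force-remove it so the task can be recreated. The function name and the placeholder container name "mytask" are illustrative assumptions.

```go
package main

import (
	"context"
	"fmt"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/api/types/filters"
	"github.com/docker/docker/client"
)

// removeStaleContainer finds the container currently holding `name` and
// force-removes it so the task can be recreated under that name.
// Note: the "name" filter does substring matching, so a real implementation
// would verify an exact match before removing anything.
func removeStaleContainer(ctx context.Context, cli *client.Client, name string) error {
	containers, err := cli.ContainerList(ctx, types.ContainerListOptions{
		All:     true, // after a daemon restart the stale container is usually stopped
		Filters: filters.NewArgs(filters.Arg("name", name)),
	})
	if err != nil {
		return err
	}
	if len(containers) == 0 {
		return fmt.Errorf("name conflict reported but no container named %q found", name)
	}
	return cli.ContainerRemove(ctx, containers[0].ID,
		types.ContainerRemoveOptions{Force: true})
}

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv)
	if err != nil {
		panic(err)
	}
	// "mytask" is a placeholder for whatever container name the create call conflicted on.
	if err := removeStaleContainer(context.Background(), cli, "mytask"); err != nil {
		fmt.Println("recovery failed:", err)
	}
}
```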

weslleycamilo commented:

Hi @schmichael,
I've been testing it for a while and it is working properly now.

Great job.
Thank you.

Thank you @dadgar

Fuco1 (Contributor) commented Mar 22, 2021

I still see this (lately extremely often: out of 1000 dispatches, 500 will keep failing with this error, even on freshly provisioned machines).

Docker version 20.10.3, build 48d30b5, and Nomad v1.0.3 (08741d9f2003ec26e44c72a2c0e27cdf0eadb6ee).

Is there some setting to force Nomad to always create a new container? Since these are batch jobs, there's no point in reusing the container or trying to re-attach.

tgross (Member) commented Mar 22, 2021

@Fuco1 this issue was closed a very long time ago. Can you open a new bug report please?

github-actions bot commented:

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked this issue as resolved and limited conversation to collaborators Oct 21, 2022