
Interrupted tasks in Docker fail restarting due to "container already exists" #2084

Closed · hoh opened this issue Dec 12, 2016 · 18 comments

hoh (Contributor) commented Dec 12, 2016

Nomad version

Nomad v0.5.0

Operating system and Environment details

Ubuntu 16.04 x86_64
Docker version 1.12.1, build 23cf638 (apt install docker.io)

Issue

When tasks on a client are interrupted, for example by rebooting the host or stopping the Nomad and Docker services, they fail to restart: Nomad cannot start them because it finds the previous container still present.

Reproduction steps

  • Set up two hosts, {1} with Nomad client+server and {2} with Nomad client
  • Start a job with count ≥2 so tasks are started in containers on both hosts
  • Restart host {2}
  • When Nomad tries to reschedule the tasks on {2}, their allocations fail with the following error:
12/12/16 15:21:54 UTC  Driver Failure  failed to start task 'mytask' for alloc 'cb2604b0-fafe-a05b-66f0-484caedba5ce': Failed to create container: container already exists
  • These tasks stay "Starting" and never switch to "Running" due to the above error. Removing the containers with docker rm $(docker ps -aq) allows them to start again.

For obvious reasons, adding a script that removes all Docker containers on boot would not be a good solution.

Nomad Server or Client logs do not contain anything relevant to this error.

Related to parts of the discussion on #2016

dadgar (Contributor) commented Dec 12, 2016

Can you try RC2? We made quite a few improvements to the Docker driver attempting to remedy this issue: https://releases.hashicorp.com/nomad/0.5.1-rc2/

hoh (Contributor, Author) commented Dec 15, 2016

Thanks! I just tried it and could not reproduce the issue with RC2.

hoh closed this as completed Dec 15, 2016
mlushpenko commented:

I'm having the same core issue, although the scenario is a bit different:

  • run services on Nomad
  • redeploy the whole cluster, which restarts the Docker daemon
  • the Nomad job stays in the pending state and alloc-status shows:
12/22/16 16:27:43 CET  Restarting      Task restarting in 25.881394077s
12/22/16 16:27:43 CET  Driver Failure  failed to start task 'traefik' for alloc '06fecd6d-81d9-7a16-5c82-c770743d68d8': Failed to create container: container already exists
  • After removing the container manually, Nomad is able to reschedule it

mlushpenko commented:

Didn't help:

# nomad --version
Nomad v0.5.1-rc2 ('6f2ccf22be738a31cb2153c7e43422c4ba9a0e3f+CHANGES')

# nomad status
ID               Type     Priority  Status
jenkins-master   service  50        dead
nexus            service  50        dead
registry         service  50        dead
selenium-chrome  service  50        dead
selenium-hub     service  50        dead
traefik          system   60        running
# nomad status -verbose traefik
ID          = traefik
Name        = traefik
Type        = system
Priority    = 60
Datacenters = amersfoort
Status      = running
Periodic    = false

Summary
Task Group     Queued  Starting  Running  Failed  Complete  Lost
loadbalancing  0       3         0        8       0         0

Evaluations
ID                                    Priority  Triggered By  Status    Placement Failures
34a26bf4-4147-4c25-0b8e-09881eabd0e0  60        job-register  complete  false

Allocations
ID                                    Eval ID                               Node ID                               Task Group     Desired  Status   Created At
d3b6a7d3-a436-f6ca-d470-f672fb164099  34a26bf4-4147-4c25-0b8e-09881eabd0e0  cdaaa02d-40ee-d341-197b-7eee724babfb  loadbalancing  run      pending  12/22/16 15:02:21 CET
f73affb6-6173-07b6-4e1a-1c80bcc6cd3c  34a26bf4-4147-4c25-0b8e-09881eabd0e0  ece479e0-1740-aea8-88d7-5f93c57696fc  loadbalancing  run      pending  12/22/16 15:02:21 CET
06fecd6d-81d9-7a16-5c82-c770743d68d8  34a26bf4-4147-4c25-0b8e-09881eabd0e0  ee191d9f-509f-afc7-c096-54fc7c10c8bb  loadbalancing  run      pending  12/21/16 17:02:59 CET
# nomad alloc-status -verbose d3b6a7d3-a436-f6ca-d470-f672fb164099
ID                 = d3b6a7d3-a436-f6ca-d470-f672fb164099
Eval ID            = 34a26bf4-4147-4c25-0b8e-09881eabd0e0
Name               = traefik.loadbalancing[0]
Node ID            = cdaaa02d-40ee-d341-197b-7eee724babfb
Job ID             = traefik
Client Status      = pending
Client Description = <none>
Created At         = 12/22/16 15:02:21 CET
Evaluated Nodes    = 1
Filtered Nodes     = 0
Exhausted Nodes    = 0
Allocation Time    = 16.788µs
Failures           = 0

Task "traefik" is "pending"
Task Resources
CPU      Memory   Disk  IOPS  Addresses
500 MHz  128 MiB  0 B   0     http: <IP>:9999
                              ui: <IP>:9998

Recent Events:
Time                   Type            Description
12/22/16 16:55:16 CET  Restarting      Task restarting in 30.382471444s
12/22/16 16:55:16 CET  Driver Failure  failed to start task 'traefik' for alloc 'd3b6a7d3-a436-f6ca-d470-f672fb164099': Failed to create container: container already exists
12/22/16 16:54:51 CET  Restarting      Task restarting in 25.16167252s
12/22/16 16:54:51 CET  Driver Failure  failed to start task 'traefik' for alloc 'd3b6a7d3-a436-f6ca-d470-f672fb164099': Failed to create container: container already exists
12/22/16 16:54:25 CET  Restarting      Task restarting in 25.722048638s
12/22/16 16:54:25 CET  Driver Failure  failed to start task 'traefik' for alloc 'd3b6a7d3-a436-f6ca-d470-f672fb164099': Failed to create container: container already exists
12/22/16 16:53:58 CET  Restarting      Task restarting in 27.098521822s
12/22/16 16:53:58 CET  Driver Failure  failed to start task 'traefik' for alloc 'd3b6a7d3-a436-f6ca-d470-f672fb164099': Failed to create container: container already exists
12/22/16 16:53:29 CET  Restarting      Task restarting in 29.045220368s
12/22/16 16:53:29 CET  Driver Failure  failed to start task 'traefik' for alloc 'd3b6a7d3-a436-f6ca-d470-f672fb164099': Failed to create container: container already exists

Placement Metrics
  * Score "cdaaa02d-40ee-d341-197b-7eee724babfb.binpack" = 2.587686

I installed the new version, removed the old containers, and Nomad rescheduled everything. Then I redeployed the whole cluster and Nomad can't schedule them again.

dadgar (Contributor) commented Jan 3, 2017

@mlushpenko What do you mean by "redeploy the whole cluster"?

mlushpenko commented:

@dadgar our pipeline invokes Ansible playbooks that deploy Consul first, then Nomad, dnsmasq, and Docker to several VMs.

When we deploy services like GitLab, we invoke another pipeline via Jenkins that deploys containers to the Nomad cluster. But sometimes we need to update the cluster itself (Docker/Consul/Nomad); then we invoke the "base" pipeline and encounter the issues mentioned above. "Redeploy the whole cluster" = invoke the "base" pipeline.

dadgar (Contributor) commented Jan 5, 2017

@mlushpenko When you run that base pipeline, are you doing in-place upgrades of Nomad/Docker or starting a new VM?

Stopping the Docker Engine is not really advisable.

mlushpenko commented:

@dadgar in-place upgrades: within the client's legacy infrastructure, spinning up VMs on demand is not an option.

Docker is restarted and reloaded if there are changes to the Docker config. Also, I'm not sure our playbooks are 100% idempotent right now, but I hope that's not a problem: in some other issues I saw that Nomad should handle VM failures and container failures (I would consider restarting Docker a temporary container failure).

weslleycamilo commented:

Hello,

Is there anyone who could help solve this?

I got the same error reported in this issue! I am using Nomad 0.7.0.

Here is the log I got from the UI.

[screenshot of the error from the Nomad UI]

dadgar (Contributor) commented Nov 14, 2017

@weslleycamilo This is a regression due to Docker changing the returned error code and thus breaking our error handling. It will be resolved in #3513, which will be part of 0.7.1.

weslleycamilo commented:

@dadgar Hmm, great, but what is the version in which it still works? Do you know? I tried version 0.6.3 and got the same error.

dadgar (Contributor) commented Nov 14, 2017

@weslleycamilo It depends on the Nomad and Docker Engine pairing. Docker changed their error message recently (I'm not exactly sure in which version), so the error handling we have wasn't being triggered. The new error handling should be robust against both versions of the error message Docker returns.
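
As an aside, here is a minimal Go sketch (not Nomad's actual driver code) of why matching on the daemon's error text is brittle: the check has to list every wording the daemon may return. The two substrings and the sample messages are assumptions modelled on the errors quoted in this thread.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// isNameConflict reports whether a container-create error means a container
// with the requested name already exists. String matching like this breaks
// whenever the daemon rewords the error.
func isNameConflict(err error) bool {
	if err == nil {
		return false
	}
	msg := strings.ToLower(err.Error())
	return strings.Contains(msg, "container already exists") || // older daemons
		strings.Contains(msg, "is already in use by container") // newer daemons
}

func main() {
	// Sample messages modelled on the errors quoted in this thread.
	oldStyle := errors.New("Failed to create container: container already exists")
	newStyle := errors.New(`Conflict. The container name "/mytask" is already in use by container "abc123"`)
	fmt.Println(isNameConflict(oldStyle), isNameConflict(newStyle)) // true true
}
```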

weslleycamilo commented:

@dadgar Do you know of any Nomad documentation that says which Nomad and Docker versions are compatible?

It seems Nomad 0.7.0 is not production ready. It would be critical if I can't get the Docker container back after restarting Docker or restarting the host.

Can I keep using Nomad 0.6.3? Which Docker versions work with it? At the moment I am testing Nomad to go to production with it, but I believe I shouldn't stay stuck on this issue.

schmichael (Member) commented:

@weslleycamilo There is a bug in Docker 17.09 that broke Nomad's name conflict code path. This meant that on Docker daemon restarts Nomad would be unable to restart the container. I've attached a test binary to the PR if you're able to give it a shot: #3551
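
For context, a hypothetical sketch (not the fix in #3551) of what a name-conflict recovery path can look like with the Docker Go client (github.com/docker/docker/client): find the stale container that holds the name and force-remove it so the task can be recreated. The function name and the placeholder container name "mytask" are illustrative assumptions.

```go
package main

import (
	"context"
	"fmt"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/api/types/filters"
	"github.com/docker/docker/client"
)

// removeStaleContainer finds the container currently holding `name` and
// force-removes it so the task can be recreated under that name.
// Note: the "name" filter does substring matching, so a real implementation
// would verify an exact match before removing anything.
func removeStaleContainer(ctx context.Context, cli *client.Client, name string) error {
	containers, err := cli.ContainerList(ctx, types.ContainerListOptions{
		All:     true, // after a daemon restart the stale container is usually stopped
		Filters: filters.NewArgs(filters.Arg("name", name)),
	})
	if err != nil {
		return err
	}
	if len(containers) == 0 {
		return fmt.Errorf("name conflict reported but no container named %q found", name)
	}
	return cli.ContainerRemove(ctx, containers[0].ID,
		types.ContainerRemoveOptions{Force: true})
}

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv)
	if err != nil {
		panic(err)
	}
	// "mytask" is a placeholder for whatever container name the create call conflicted on.
	if err := removeStaleContainer(context.Background(), cli, "mytask"); err != nil {
		fmt.Println("recovery failed:", err)
	}
}
```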

weslleycamilo commented:

Hi @schmichael,
I've been testing it for a while and it is working properly now.

Great job.
Thank you.

Thank you @dadgar

Fuco1 (Contributor) commented Mar 22, 2021

I still see this (lately extremely often: out of 1000 dispatches, 500 will keep failing with this error, even on freshly provisioned machines).

Docker version 20.10.3, build 48d30b5, and Nomad v1.0.3 (08741d9f2003ec26e44c72a2c0e27cdf0eadb6ee).

Is there some setting to force Nomad to always create a new container? Since these are batch jobs, there's no point in reusing the container or trying to re-attach.

tgross (Member) commented Mar 22, 2021

@Fuco1 this issue was closed a very long time ago. Can you open a new bug report please?

github-actions bot commented:

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked this issue as resolved and limited conversation to collaborators Oct 21, 2022