Scheduling Jobs with Docker Fails with Error "bind: address already in use" #2402
Comments
+1 to this, happening from time to time in my test cluster too. I have not yet been able to catch what goes wrong, but I suspect that Nomad is trying to allocate a TCP port that is in the TIME_WAIT state. I see this issue on v0.5.2.
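(For anyone trying to confirm that suspicion: a minimal check, assuming `ss` is available on the host; the port number is a placeholder, not one taken from the logs.)

```sh
# Look for sockets in TIME_WAIT on the port Nomad failed to bind
# (replace 23456 with the port from the "bind: address already in use" error).
ss -tan state time-wait '( sport = :23456 or dport = :23456 )'
```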
@xytis Can you show the output of |
@xytis It might be too late to check now, but is anything bound to that port?
I cannot be certain, but the host is running both Docker and Nomad. I suspected they could have caused the collision on their own due to some race condition. At the time of reporting, the port was free.
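(A hedged sketch of how to check, at failure time, whether anything still holds the port, for example a leftover docker-proxy from the previous allocation; the port number is a placeholder.)

```sh
# Show any listener on the port together with the owning process (run as root for -p).
ss -tlnp 'sport = :23456'
# Alternative view that includes the process name (e.g. docker-proxy):
lsof -nP -iTCP:23456 -sTCP:LISTEN
```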
Happened to me again. This time an allocation failed on the first attempt for some other reason and was retried; the second attempt ran into this issue.
@gokhansengun can you share the output of |
@dadgar I am running a test cluster where I test lots of things. Sorry, I cleaned everything up this time, but it is happening nearly daily; I will update here with the info.
@dadgar I did not need to wait until tomorrow; it happened again, logs below.
|
@gokhansengun If the allocations are all still there, or the next time it happens, can you run:
|
@dadgar I added the logs to a secret Gist and shared it with you on Gitter.
We see this issue about once a week in our production environment. I would like to see Nomad intercept this error and attempt to restart the container on another random port. Is this possible?
@gokhansengun Did changing the available port range fix this for you? We talked about that on Gitter.
@dadgar Sorry Alex, I did not update you on this. I waited for 0.5.6 before making the change you suggested, and since then the issue has not occurred again. I am not sure whether the cause was being on a pre-0.5.6 version or the refactoring I did across the job dispatching.
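(For readers landing here: the exact change suggested on Gitter is not shown in the thread. A plausible sketch, assuming the idea was to keep other port users away from the range Nomad assigns dynamically, is to move the kernel's ephemeral range via the `net.ipv4.ip_local_port_range` sysctl, or to tell the Nomad client not to hand out a conflicting range at all using the standard `reserved` stanza below; the concrete port range is illustrative only.)

```hcl
# Nomad client configuration: reserve ports so the scheduler never assigns them.
# The range here is an example, not the one discussed on Gitter.
client {
  reserved {
    reserved_ports = "8000-8999"
  }
}
```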
@gokhansengun Thanks for the update. @grobinson-blockchain What version of Nomad are you on?
We are on Nomad v0.5.6.
We are facing the same issue on Nomad v0.5.6, with a job of 120 task groups with one task each. One of the tasks did not start for this same reason.
Not sure if this is fixed. Running Nomad v0.8.0, I saw this in a deployment of a Docker task with a count of 70:
I can't get this thing out of the pending state =/. I had to drain the node it was on and then disable the drain. Any ideas?
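(For anyone else stuck in the same state, the workaround described above corresponds roughly to the commands below; the node ID is a placeholder, and on Nomad 0.8 the older `nomad node-drain` spelling also works.)

```sh
nomad node status                     # find the ID of the node holding the stuck allocation
nomad node drain -enable <node-id>    # drain: move allocations off the node
nomad node drain -disable <node-id>   # re-enable scheduling on the node
```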
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad version
Nomad v0.5.4
Operating system and Environment details
Issue
Reproduced #1814
Reproduction steps
Happens in production. Not reproduced in a test environment; I suspect that Nomad performs an allocation on a port which was just released from a previous allocation.
Logs
Note that the Nomad client is not responsive during the deploy attempts (that might be a separate bug).