Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nomad client hangs indefinitely on startup #1202

Closed
ghost opened this issue May 24, 2016 · 19 comments
Closed

Nomad client hangs indefinitely on startup #1202

ghost opened this issue May 24, 2016 · 19 comments

Comments

@ghost
Copy link

ghost commented May 24, 2016

Nomad version

Nomad v.0.3.2

Issue

The Nomad client hangs on Starting Nomad agent... at startup. It never makes progress.

● nomad.service - nomad
   Loaded: loaded (/etc/systemd/system/nomad.service; enabled)
   Active: active (running) since Tue 2016-05-24 11:13:51 UTC; 1min 50s ago
 Main PID: 7206 (nomad)
   CGroup: /system.slice/nomad.service
           └─7206 /usr/local/bin/nomad agent -config /etc/nomad.d

May 24 11:13:51 backend-eu-west-1 systemd[1]: Started nomad.
May 24 11:13:51 backend-eu-west-1 nomad[7206]: Loaded configuration from /etc/nomad.d/config.json
May 24 11:13:51 backend-eu-west-1 nomad[7206]: ==> Starting Nomad agent...

When started via an interactive terminal:

/usr/local/bin/nomad agent -config /etc/nomad.d -log-level=DEBUG
    Loaded configuration from /etc/nomad.d/config.json
    ==> Starting Nomad agent...

Nomad configuration file:

{
  "advertise": {
    "http": "10.133.0.5:4646",
    "rpc": "10.133.0.5:4647",
    "serf": "10.133.0.5:4648"
  },
  "bind_addr": "0.0.0.0",
  "client": {
    "enabled": true,
    "servers": [
      "consul-master-eu-west-1",
      "consul-master-eu-west-2",
      "consul-master-eu-west-3"
    ]
  },
  "data_dir": "/var/nomad",
  "datacenter": "europe-west1",
  "region": "europe"
}
@camerondavison
Copy link
Contributor

I have seen this a couple of times. I turned on DEBUG logging, and still nothing. The only way I could get it to start up was to start killing existing docker containers that were running.

@dadgar
Copy link
Contributor

dadgar commented May 24, 2016

Can you give background on to what the client was running, how it was restarted? If you can give reproduction steps that would be awesome!

@ghost
Copy link
Author

ghost commented May 24, 2016

@dadgar so I can reproduce this error under the same conditions reported by @a86c6f7964. If you restart the Nomad client whilst you have running containers and those containers continue to run after Nomad has exited then the Nomad client is unable to start back up.

@diptanu
Copy link
Contributor

diptanu commented Jun 12, 2016

@grobinson-blockchain @a86c6f7964 So I started an agent in the server mode and another in the client mode, ran a redis container(the one that nomad init generates) and killed the client. At that point the nomad executor and container were still running, then I restarted the client, which connected back with the executor and I could even stop the container.

Can I get some more help in reproducing this?

@camerondavison
Copy link
Contributor

I can try again. But I think that it has to do with having many containers
running.

On Sun, Jun 12, 2016, 9:35 AM Diptanu Choudhury [email protected]
wrote:

@grobinson-blockchain https://github.com/grobinson-blockchain
@a86c6f7964 https://github.com/a86c6f7964 So I started an agent in the
server mode and another in the client mode, ran a redis container(the one
that nomad init generates) and killed the client. At that point the nomad
executor and container were still running, then I restarted the client,
which connected back with the executor and I could even stop the container.

Can I get some more help in reproducing this?


You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub
#1202 (comment),
or mute the thread
https://github.com/notifications/unsubscribe/AAKGBVWf6-Ab-T1JhbiaQbg02l6H7pNAks5qLBk5gaJpZM4IlYOS
.

@wuub
Copy link
Contributor

wuub commented Jun 15, 2016

I can try again. But I think that it has to do with having many containers running.

Not necessarily. I've got a very simple cluster (3 server nodes, 1 client node). With a single system job and I'm experiencing identical behaviour.

Nomad Agent

Nomad v0.4.0-dev ('bc09a0444722617a3a0ee0daa28d24b93d9d3e5b+CHANGES')

Nomad Config

region = "us"
datacenter = "us-central1"

bind_addr = "0.0.0.0"
data_dir = "/var/nomad"

advertise {
    http = "10.xxx.0.39:4646"
    rpc = "10.xxx.0.39:4647"
}

client {
    enabled = true
    network_speed = 10000  # AFAICT GCE has 10GBe network
    servers = ["consul-1-us-central1", "consul-2-us-central1", "consul-3-us-central1"]
    options {
        "docker.auth.config" = "/root/.docker/config.json"
        "driver.raw_exec.enable" = "1"
    }
}

Job Spec

job "internal-fluentd" {
    region = "us"
    datacenters = ["us-central1"]

    type = "system"

    group "fluentd" {
        task "fluentd" {
            driver = "docker"

            config {
                image = "gcr.io/.../fluentd:emiter"
                port_map {
                    forward = 24224
                    http = 9880
                    syslog = 5140
                }
            }

            service {
                name = "${TASKGROUP}"
                tags = ["emiter"]
                port = "http"
                check {
                    name = "alive"
                    type = "tcp"
                    interval = "10s"
                    timeout = "2s"
                }
            }

            resources {
                cpu = 500 
                memory = 128 
                network {
                    mbits = 10
                    port "http" {
                    }
                    port "forward" {    
                    }
                }
            }
        }
    }
}

Initially everything works as expected, Nomad starts and launched system job:

Jun 15 17:42:23 apps-1 nomad[1432]: ==> Nomad agent started! Log data will stream in below:
Jun 15 17:42:23 apps-1 nomad[1432]:     2016/06/15 17:42:21 [INFO] client: using state directory /var/nomad/client
Jun 15 17:42:23 apps-1 nomad[1432]:     2016/06/15 17:42:21 [INFO] client: using alloc directory /var/nomad/alloc
Jun 15 17:42:23 apps-1 nomad[1432]:     2016/06/15 17:42:21 [INFO] fingerprint.cgroups: cgroups are available
Jun 15 17:42:23 apps-1 nomad[1432]:     2016/06/15 17:42:21 [WARN] fingerprint.network: Unable to parse Speed in output of '/sbin/ethtool ens4'
Jun 15 17:42:36 apps-1 nomad[1432]:     2016/06/15 17:42:36 [INFO] fingerprint.consul: consul agent is available
Jun 15 17:42:37 apps-1 nomad[1432]:     2016/06/15 17:42:37 [INFO] driver.docker: created container aff71e6f83b0113db88d6fdf322c4f5bf5f94778c4f68b088d7a527ff7d2117d
Jun 15 17:42:37 apps-1 nomad[1432]:     2016/06/15 17:42:37 [INFO] driver.docker: created container 7207b7c63d284d35fcba38c97f0e362fad74aa64e8ba9b6db396d3a0324e9625
Jun 15 17:42:38 apps-1 nomad[1432]:     2016/06/15 17:42:38 [INFO] driver.docker: started container 7207b7c63d284d35fcba38c97f0e362fad74aa64e8ba9b6db396d3a0324e9625
Jun 15 17:42:38 apps-1 nomad[1432]:     2016/06/15 17:42:38 [WARN] client: error fetching stats of task fluentd: stats collection hasn't started yet
Jun 15 17:42:38 apps-1 nomad[1432]:     2016/06/15 17:42:38 [INFO] driver.docker: started container aff71e6f83b0113db88d6fdf322c4f5bf5f94778c4f68b088d7a527ff7d2117d
Jun 15 17:42:38 apps-1 nomad[1432]:     2016/06/15 17:42:38 [WARN] client: error fetching stats of task fluentd: stats collection hasn't started yet
Jun 15 17:42:43 apps-1 nomad[1432]:     2016/06/15 17:42:43 [INFO] driver.docker: stopped container aff71e6f83b0113db88d6fdf322c4f5bf5f94778c4f68b088d7a527ff7d2117d
Jun 15 17:49:53 apps-1 systemd[1]: Stopping Nomad Agent...
Jun 15 17:49:53 apps-1 nomad[1432]: ==> Caught signal: interrupt
Jun 15 17:49:53 apps-1 nomad[1432]:     2016/06/15 17:49:53 [INFO] agent: requesting shutdown
Jun 15 17:49:53 apps-1 nomad[1432]:     2016/06/15 17:49:53 [INFO] client: shutting down
Jun 15 17:49:53 apps-1 nomad[1432]:     2016/06/15 17:49:53 [INFO] client.rpcproxy: shutting down
Jun 15 17:49:53 apps-1 nomad[1432]:     2016/06/15 17:49:53 [ERR] client: RPC failed to server 10.xxx.0.35:4647: rpc error: EOF
Jun 15 17:49:53 apps-1 nomad[1432]:     2016/06/15 17:49:53 [ERR] client: failed to query for node allocations: rpc error: EOF
Jun 15 17:49:53 apps-1 nomad[1432]:     2016/06/15 17:49:53 [INFO] agent: shutdown complete
Jun 15 17:49:53 apps-1 systemd[1]: nomad.service: Main process exited, code=exited, status=1/FAILURE
Jun 15 17:49:53 apps-1 systemd[1]: Stopped Nomad Agent.

As seen above, I stopped it w/ SIGINT.

System job docker container is present after nomad has exited:

root@apps-1:/home/wuub# docker ps
CONTAINER ID        IMAGE                                           COMMAND                CREATED             STATUS              PORTS                                                                                                                            NAMES
7207b7c63d28        gcr.io/internal/internal-fluentd:emiter   "/bin/sh -c fluentd"   7 minutes ago       Up 7 minutes        5140/tcp, 10.xxx.0.39:21284->9880/tcp, 10.xxx.0.39:21284->9880/udp, 10.xxx.0.39:54833->24224/tcp, 10.xxx.0.39:54833->24224/udp   fluentd-4fe5acc1-fcfa-d7d8-a25b-8459b42d8489

When I issue systemctl nomad start things don't go according to plan

-- Logs begin at Wed 2016-06-15 17:42:16 UTC, end at Wed 2016-06-15 17:50:27 UTC. --
Jun 15 17:42:21 apps-1 systemd[1]: Started Nomad Agent.
Jun 15 17:42:21 apps-1 nomad[1432]:     Loaded configuration from /etc/nomad.d/nomad.hcl
Jun 15 17:42:21 apps-1 nomad[1432]: ==> Starting Nomad agent...
Jun 15 17:42:23 apps-1 nomad[1432]: ==> Nomad agent configuration:
Jun 15 17:42:23 apps-1 nomad[1432]:                  Atlas: <disabled>
Jun 15 17:42:23 apps-1 nomad[1432]:                 Client: true
Jun 15 17:42:23 apps-1 nomad[1432]:              Log Level: INFO
Jun 15 17:42:23 apps-1 nomad[1432]:                 Region: us (DC: us-central1)
Jun 15 17:42:23 apps-1 nomad[1432]:                 Server: false
Jun 15 17:42:23 apps-1 nomad[1432]: ==> Nomad agent started! Log data will stream in below:
Jun 15 17:42:23 apps-1 nomad[1432]:     2016/06/15 17:42:21 [INFO] client: using state directory /var/nomad/client
Jun 15 17:42:23 apps-1 nomad[1432]:     2016/06/15 17:42:21 [INFO] client: using alloc directory /var/nomad/alloc
Jun 15 17:42:23 apps-1 nomad[1432]:     2016/06/15 17:42:21 [INFO] fingerprint.cgroups: cgroups are available
Jun 15 17:42:23 apps-1 nomad[1432]:     2016/06/15 17:42:21 [WARN] fingerprint.network: Unable to parse Speed in output of '/sbin/ethtool ens4'
Jun 15 17:42:36 apps-1 nomad[1432]:     2016/06/15 17:42:36 [INFO] fingerprint.consul: consul agent is available
Jun 15 17:42:37 apps-1 nomad[1432]:     2016/06/15 17:42:37 [INFO] driver.docker: created container aff71e6f83b0113db88d6fdf322c4f5bf5f94778c4f68b088d7a527ff7d2117d
Jun 15 17:42:37 apps-1 nomad[1432]:     2016/06/15 17:42:37 [INFO] driver.docker: created container 7207b7c63d284d35fcba38c97f0e362fad74aa64e8ba9b6db396d3a0324e9625
Jun 15 17:42:38 apps-1 nomad[1432]:     2016/06/15 17:42:38 [INFO] driver.docker: started container 7207b7c63d284d35fcba38c97f0e362fad74aa64e8ba9b6db396d3a0324e9625
Jun 15 17:42:38 apps-1 nomad[1432]:     2016/06/15 17:42:38 [WARN] client: error fetching stats of task fluentd: stats collection hasn't started yet
Jun 15 17:42:38 apps-1 nomad[1432]:     2016/06/15 17:42:38 [INFO] driver.docker: started container aff71e6f83b0113db88d6fdf322c4f5bf5f94778c4f68b088d7a527ff7d2117d
Jun 15 17:42:38 apps-1 nomad[1432]:     2016/06/15 17:42:38 [WARN] client: error fetching stats of task fluentd: stats collection hasn't started yet
Jun 15 17:42:43 apps-1 nomad[1432]:     2016/06/15 17:42:43 [INFO] driver.docker: stopped container aff71e6f83b0113db88d6fdf322c4f5bf5f94778c4f68b088d7a527ff7d2117d
Jun 15 17:49:53 apps-1 systemd[1]: Stopping Nomad Agent...
Jun 15 17:49:53 apps-1 nomad[1432]: ==> Caught signal: interrupt
Jun 15 17:49:53 apps-1 nomad[1432]:     2016/06/15 17:49:53 [INFO] agent: requesting shutdown
Jun 15 17:49:53 apps-1 nomad[1432]:     2016/06/15 17:49:53 [INFO] client: shutting down
Jun 15 17:49:53 apps-1 nomad[1432]:     2016/06/15 17:49:53 [INFO] client.rpcproxy: shutting down
Jun 15 17:49:53 apps-1 nomad[1432]:     2016/06/15 17:49:53 [ERR] client: RPC failed to server 10.xxx.0.35:4647: rpc error: EOF
Jun 15 17:49:53 apps-1 nomad[1432]:     2016/06/15 17:49:53 [ERR] client: failed to query for node allocations: rpc error: EOF
Jun 15 17:49:53 apps-1 nomad[1432]:     2016/06/15 17:49:53 [INFO] agent: shutdown complete
Jun 15 17:49:53 apps-1 systemd[1]: nomad.service: Main process exited, code=exited, status=1/FAILURE
Jun 15 17:49:53 apps-1 systemd[1]: Stopped Nomad Agent.
Jun 15 17:49:53 apps-1 systemd[1]: nomad.service: Unit entered failed state.
Jun 15 17:49:53 apps-1 systemd[1]: nomad.service: Failed with result 'exit-code'.
Jun 15 17:50:27 apps-1 systemd[1]: Started Nomad Agent.
Jun 15 17:50:27 apps-1 nomad[3491]:     Loaded configuration from /etc/nomad.d/nomad.hcl
Jun 15 17:50:27 apps-1 nomad[3491]: ==> Starting Nomad agent...

<<<<LOCK>>>>
Nomad just sits here. No ports opened, nothing. 
But If i stop the remaining docker container: 
$ docker stop fluentd-4fe5acc1-fcfa-d7d8-a25b-8459b42d8489

<<<<LOCKDOWN IMMEDIATLY ENDS>>>> 

Jun 15 17:51:52 apps-1 nomad[3491]: ==> Nomad agent configuration:
Jun 15 17:51:52 apps-1 nomad[3491]:                  Atlas: <disabled>
Jun 15 17:51:52 apps-1 nomad[3491]:                 Client: true
Jun 15 17:51:52 apps-1 nomad[3491]:              Log Level: INFO
Jun 15 17:51:52 apps-1 nomad[3491]:                 Region: us (DC: us-central1)
Jun 15 17:51:52 apps-1 nomad[3491]:                 Server: false
Jun 15 17:51:52 apps-1 nomad[3491]: ==> Nomad agent started! Log data will stream in below:
Jun 15 17:51:52 apps-1 nomad[3491]:     2016/06/15 17:50:27 [INFO] client: using state directory /var/nomad/client
Jun 15 17:51:52 apps-1 nomad[3491]:     2016/06/15 17:50:27 [INFO] client: using alloc directory /var/nomad/alloc
Jun 15 17:51:52 apps-1 nomad[3491]:     2016/06/15 17:50:27 [INFO] fingerprint.cgroups: cgroups are available
Jun 15 17:51:52 apps-1 nomad[3491]:     2016/06/15 17:50:27 [INFO] fingerprint.consul: consul agent is available
Jun 15 17:51:52 apps-1 nomad[3491]:     2016/06/15 17:50:27 [WARN] fingerprint.network: Unable to parse Speed in output of '/sbin/ethtool ens4'
Jun 15 17:51:52 apps-1 nomad[3491]:     2016/06/15 17:50:27 [INFO] driver.docker: re-attaching to docker process: 7207b7c63d284d35fcba38c97f0e362fad74aa64e8ba9b6db396d3a0324e9625
Jun 15 17:51:52 apps-1 nomad[3491]:     2016/06/15 17:50:27 [INFO] driver.docker: couldn't re-attach to the plugin process: error creating rpc client for executor plugin: Reattachment process not found
Jun 15 17:51:52 apps-1 nomad[3491]:     2016/06/15 17:51:52 [ERR] client: failed to open handle to task 'fluentd' for alloc '4fe5acc1-fcfa-d7d8-a25b-8459b42d8489': error creating rpc client for executor plugin: Reattachment process not fo
Jun 15 17:51:52 apps-1 nomad[3491]:     2016/06/15 17:51:52 [INFO] driver.docker: a container with the name fluentd-4fe5acc1-fcfa-d7d8-a25b-8459b42d8489 already exists; will attempt to purge and re-create
Jun 15 17:51:52 apps-1 nomad[3491]:     2016/06/15 17:51:52 [INFO] driver.docker: purged container 7207b7c63d284d35fcba38c97f0e362fad74aa64e8ba9b6db396d3a0324e9625
Jun 15 17:51:52 apps-1 nomad[3491]:     2016/06/15 17:51:52 [INFO] driver.docker: created container b2d1ad00ceec817b694fd4aef12968527700828e363a324e0daed645406c0e63
Jun 15 17:51:52 apps-1 nomad[3491]:     2016/06/15 17:51:52 [INFO] driver.docker: started container b2d1ad00ceec817b694fd4aef12968527700828e363a324e0daed645406c0e63
Jun 15 17:51:52 apps-1 nomad[3491]:     2016/06/15 17:51:52 [WARN] client: error fetching stats of task fluentd: stats collection hasn't started yet
Jun 15 17:51:58 apps-1 nomad[3491]:     2016/06/15 17:51:58 [INFO] driver.docker: stopped container b2d1ad00ceec817b694fd4aef12968527700828e363a324e0daed645406c0e63
Jun 15 17:52:12 apps-1 nomad[3491]:     2016/06/15 17:52:12 [ERR] driver.docker: failed to create container from image gcr.io/internal/internal-fluentd:emiter: no such image
Jun 15 17:52:12 apps-1 nomad[3491]:     2016/06/15 17:52:12 [ERR] client: failed to start task 'fluentd' for alloc '5f41d451-11e5-764c-5978-cde0b7b99567': Failed to create container from image gcr.io/internal/internal-fluentd:emiter
Jun 15 17:52:12 apps-1 nomad[3491]:     2016/06/15 17:52:12 [INFO] client: Not restarting task: fluentd for alloc: 5f41d451-11e5-764c-5978-cde0b7b99567

But the system job enters dead state and never recovers:

root@apps-1:/home/wuub# nomad status
ID                Type    Priority  Status
internal-fluentd  system  50        dead

@dadgar
Copy link
Contributor

dadgar commented Jun 16, 2016

Hey @grobinson-blockchain @a86c6f7964 @wuub,

So I tried everything you guys have said and still couldn't reproduce this. Restarted the client like 30 times. If one of you can reliably get into this state, can you issue a SIGQUIT (kill -3 <pid>) and give me the stack dump + what version of Nomad you are on (preferably the 0.4 - rc).

Thanks,
Alex

@wuub
Copy link
Contributor

wuub commented Jun 17, 2016

@dadgar I can repro this every time, so if you need me to try something else just let me know.

SIGQUIT goroutines dump is here, same nomad version as before (v0.4.0-dev ('bc09a0444722617a3a0ee0daa28d24b93d9d3e5b+CHANGES')

https://gist.github.com/wuub/9f1671d70ffb041dec8f2d9c77404437

System information:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 16.04 LTS
Release:    16.04
Codename:   xenial

$ docker info 
Containers: 1
 Running: 1
 Paused: 0
 Stopped: 0
Images: 4
Server Version: 1.11.2
Storage Driver: overlay
 Backing Filesystem: extfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge null host
Kernel Version: 4.4.0-24-generic
Operating System: Ubuntu 16.04 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 3.613 GiB
Name: apps-1
ID: [...]
Docker Root Dir: /var/lib/docker
Debug mode (client): false
Debug mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support

$ sudo apt show docker-engine
Package: docker-engine
Version: 1.11.2-0~xenial
Priority: optional
Section: admin
Maintainer: Docker <[email protected]>
Installed-Size: 73.4 MB
Depends: iptables, init-system-helpers (>= 1.18~), lsb-base (>= 4.1+Debian11ubuntu7), libapparmor1 (>= 2.6~devel), libc6 (>= 2.17), libdevmapper1.02.1 (>= 2:1.02.97), libltdl7 (>= 2.4.6), libseccomp2 (>= 2.1.0), libsystemd0
Recommends: aufs-tools, ca-certificates, cgroupfs-mount | cgroup-lite, git, xz-utils, apparmor
Conflicts: docker (<< 1.5~), docker-engine-cs, docker.io, lxc-docker, lxc-docker-virtual-package
Homepage: https://dockerproject.org
Download-Size: 14.5 MB
APT-Manual-Installed: yes
APT-Sources: https://apt.dockerproject.org/repo ubuntu-xenial/main amd64 Packages

Instance is created with following terraform :

variable "apps-servers" {
  default = 1
}


resource "google_compute_disk" "apps" {
  count = "${var.apps-servers}"
  name  = "apps-disk-${count.index + 1}"
  type  = "pd-ssd"
  zone  = "${lookup(var.instance_zones, count.index)}"
  image = "ubuntu-1604-xenial-v20160516a"
  size  = 10
}

resource "google_compute_instance" "apps" {
  count        = "${var.apps-servers}"
  name         = "apps-${count.index + 1}"
  machine_type = "n1-standard-1"
  zone         = "${lookup(var.instance_zones, count.index)}"
  tags         = ["apps"]

  disk {
    disk        = "${element(google_compute_disk.apps.*.name, count.index)}"
    auto_delete = false
  }

  network_interface {
    network       = "default"
    access_config = {
    }
  }

  can_ip_forward = true

  service_account {
    scopes = [
      "logging-write",
      "https://www.googleapis.com/auth/monitoring.write",
    ]
  }
}

@dadgar
Copy link
Contributor

dadgar commented Jun 17, 2016

@wuub Thanks! That stack trace had what we needed. Looks like it is hung calling docker stop. Will get the timeout to work properly. Though don't know why it is hung

@dadgar
Copy link
Contributor

dadgar commented Jun 17, 2016

@wuub So we make the following request to docker: POST /containers/<container-id>/stop?t=10. Could you do the same and see what happens when Nomad is in the hung state. Could potentially expose a docker engine bug

@wuub
Copy link
Contributor

wuub commented Jun 17, 2016

I reproduced nomad's hanged state, waited a while, then issued stop command. It took several seconds (expected, since fluentd does graceful shutdown that takes a while to complete), but ultimately docker retuned "204 No Content" and removed the container.
After that Nomad resumed boot almost immediately and correctly recreated fluentd container.

docker ps
CONTAINER ID        IMAGE                                                      COMMAND                  CREATED             STATUS              PORTS                                                                                                                                                  NAMES
51ba1f9dd0e3        gcr.io/internal/internal-fluentd:emiter              "/bin/sh -c fluentd"     5 minutes ago       Up 5 minutes        5140/tcp, 24220/tcp, 24231/tcp, 10.240.0.39:38270->9880/tcp, 10.240.0.39:38270->9880/udp, 10.240.0.39:20685->24224/tcp, 10.240.0.39:20685->24224/udp   fluentd-e64646d8-a538-7c3f-5c6e-7daddeacebb2

# curl --request POST -v --unix-socket /var/run/docker.sock  http:/containers/51ba1f9dd0e3af05fb4f8ca44d3e4261498ef5616d0b6d2ce23e3c31190fb340/stop?t=10
*   Trying /var/run/docker.sock...
* Connected to http (/var/run/docker.sock) port 80 (#0)
> POST /containers/51ba1f9dd0e3af05fb4f8ca44d3e4261498ef5616d0b6d2ce23e3c31190fb340/stop?t=10 HTTP/1.1
> Host: http
> User-Agent: curl/7.47.0
> Accept: */*
> 
< HTTP/1.1 204 No Content
< Server: Docker/1.11.2 (linux)
< Date: Fri, 17 Jun 2016 21:44:23 GMT
< 
* Connection #0 to host http left intact

@wuub
Copy link
Contributor

wuub commented Jun 28, 2016

@dadgar I think I found the reason, after stracing nomad binary a few times I noticed something like this:

[pid  2537] write(5, "POST /containers/9386048b8f46e851416e7d3c6042a369f80ba5ff42477126534232e1597b2220/stop?t=5000000000000000000 HTTP/1.1\r\nHost: unix.sock\r\nUser-Agent: go-dockerclient\r\nContent-Length: 0\r\nContent-Type: pl"..., 254) = 254

When I replayed it using curl, I was able to replicate inifnite lockup:

> root@apps-1:/home/wuub# curl --request POST -v --unix-socket /var/run/docker.sock  http:/containers/9386048b8f46e851416e7d3c6042a369f80ba5ff42477126534232e1597b2220/stop?t=5000000000000000000 
*   Trying /var/run/docker.sock...
* Connected to http (/var/run/docker.sock) port 80 (#0)
> POST /containers/9386048b8f46e851416e7d3c6042a369f80ba5ff42477126534232e1597b2220/stop?t=5000000000000000000 HTTP/1.1
> Host: http
> User-Agent: curl/7.47.0
> Accept: */*
> 

After retrying out-of-band with curl & t=10 nomad job that was stuck in the background unlocked and booted quasi successfully.

I'll do some spelunking to find out why it wants to wait so long and let you know.

@wuub
Copy link
Contributor

wuub commented Jun 28, 2016

@dadgar
Pretty sure it's this line:
client.StopContainer(pid.ContainerID, uint(pid.KillTimeout*time.Second))

if e := client.StopContainer(pid.ContainerID, uint(pid.KillTimeout*time.Second)); e != nil {

the author probably had something like this in mind:
h.client.StopContainer(h.containerID, uint(h.killTimeout.Seconds()))

err := h.client.StopContainer(h.containerID, uint(h.killTimeout.Seconds()))

I'll test it in a minute and send PR if fixes my problem

wuub added a commit to wuub/nomad that referenced this issue Jun 28, 2016
Unit mismatch caused docker driver to wait almost indefinitely during boot 
(when one or more containers were a bit uncooperative during StopContainer())
This should fix problems described in  hashicorp#1202
@dadgar
Copy link
Contributor

dadgar commented Jun 28, 2016

@wuub if the problem is fixed we should close this. Let me know

@wuub
Copy link
Contributor

wuub commented Jun 29, 2016

@dadgar If I understand correctly this fix made it to 0.4.0?
@grobinson-blockchain it would be great if you could give it a try and confirm that it fixes your issues as well

@dadgar
Copy link
Contributor

dadgar commented Jun 29, 2016

Yes it made it into 0.4 was waiting for confirmation since I have never been able to reproduce

@wuub
Copy link
Contributor

wuub commented Jul 1, 2016

After more testing I can confirm that this is now solved for us :) Thanks

@ghost ghost closed this as completed Jul 1, 2016
@tarasvaskiv
Copy link

I fix this by restarting docker, sudo systemctl restart docker.service
this issue was fixed in #5568

@github-actions
Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 23, 2022
This issue was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

5 participants