Allocations not stopping after modifying a job #6877

Closed
cjonesy opened this issue Dec 18, 2019 · 10 comments

Labels: theme/consul/connect (Consul Connect integration), type/bug

@cjonesy
Contributor

cjonesy commented Dec 18, 2019

Nomad version

Nomad v0.10.2 (0d2d6e3)

Operating system and Environment details

CentOS Linux 7

Issue

When making certain changes to a jobfile and re-running the job, allocations can get stuck in a running state. I believe all allocations should be stopped when you run nomad stop example, but that does not happen after some jobfile changes.

Reproduction steps

  • Kick off the job and wait for it to start: nomad run example.nomad
  • Now modify the job (see jobfile below): nomad run example_modified.nomad
  • Stop the job: nomad stop example
  • Notice one allocation is still running: nomad job status example
  • Restart the nomad client: sudo systemctl restart nomad
  • Notice the allocation is now stopped: nomad job status example

Job file (if appropriate)

Initial jobfile:

job "example" {
  datacenters = ["dal12"]

  group "redis1" {
    network {
      mode = "bridge"
    }

    service {
      name = "example-redis1"
      port = 6379

      connect {
        sidecar_service {}
      }
    }

    task "redis1" {
      driver = "docker"

      config {
        image = "redis:5.0.5"
      }

      resources {
        cpu    = 500 # MHz
        memory = 1024 # MB
      }
    }
  }

  group "redis2" {
    network {
      mode = "bridge"
    }

    service {
      name = "example-redis2"
      port = 6379

      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "example-redis1"
              local_bind_port  = 6378
            }
          }
        }
      }
    }

    task "redis2" {
      driver = "docker"

      config {
        image = "redis:5.0.5"
      }

      resources {
        cpu    = 500 # MHz
        memory = 1024 # MB
      }
    }
  }
}

Modified Jobfile

(added a port to the redis1 group's network block, then referenced it in the service block)

job "example" {
  datacenters = ["dal12"]

  group "redis1" {
    network {
      mode = "bridge"
      port "tcp" {}
    }

    service {
      name = "example-redis1"
      port = "tcp"

      connect {
        sidecar_service {}
      }
    }

    task "redis1" {
      driver = "docker"

      config {
        image = "redis:5.0.5"
      }

      resources {
        cpu    = 500 # MHz
        memory = 1024 # MB
      }
    }
  }

  group "redis2" {
    network {
      mode = "bridge"
    }

    service {
      name = "example-redis2"
      port = 6379

      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "example-redis1"
              local_bind_port  = 6378
            }
          }
        }
      }
    }

    task "redis2" {
      driver = "docker"

      config {
        image = "redis:5.0.5"
      }

      resources {
        cpu    = 500 # MHz
        memory = 1024 # MB
      }
    }
  }
}

Nomad Client logs (if appropriate)

Nomad Client Node 1

Dec 18 15:40:01 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:01.664-0600 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=a195a6e4-cded-eea5-16b1-f6c5a9291588 task=connect-proxy-example-redis2 path=/data/nomad/alloc/a195a6e4-cded-eea5-16b1-f6c5a9291588/alloc/logs/.connect-proxy-example-redis2.stdout.fifo @module=logmon timestamp=2019-12-18T15:40:01.664-0600
Dec 18 15:40:01 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:01.664-0600 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=a195a6e4-cded-eea5-16b1-f6c5a9291588 task=connect-proxy-example-redis2 @module=logmon path=/data/nomad/alloc/a195a6e4-cded-eea5-16b1-f6c5a9291588/alloc/logs/.connect-proxy-example-redis2.stderr.fifo timestamp=2019-12-18T15:40:01.664-0600
Dec 18 15:40:01 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:01.665-0600 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=a195a6e4-cded-eea5-16b1-f6c5a9291588 task=redis2 path=/data/nomad/alloc/a195a6e4-cded-eea5-16b1-f6c5a9291588/alloc/logs/.redis2.stdout.fifo @module=logmon timestamp=2019-12-18T15:40:01.665-0600
Dec 18 15:40:01 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:01.665-0600 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=a195a6e4-cded-eea5-16b1-f6c5a9291588 task=redis2 path=/data/nomad/alloc/a195a6e4-cded-eea5-16b1-f6c5a9291588/alloc/logs/.redis2.stderr.fifo @module=logmon timestamp=2019-12-18T15:40:01.665-0600
Dec 18 15:40:01 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:01.717-0600 [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=26475e14c95b71f02d9cba75eb22ed9fb0b57e0fdf722516e50b4e136af897dc
Dec 18 15:40:01 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:01.789-0600 [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=1b2760e87124fc0cf90b338bc6c9b9603a78fff774a4ee8ea640a57e9a5d3bc9
Dec 18 15:40:01 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:01.933-0600 [INFO]  client.driver_mgr.docker: started container: driver=docker container_id=26475e14c95b71f02d9cba75eb22ed9fb0b57e0fdf722516e50b4e136af897dc
Dec 18 15:40:01 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:01.982-0600 [INFO]  client.driver_mgr.docker: started container: driver=docker container_id=1b2760e87124fc0cf90b338bc6c9b9603a78fff774a4ee8ea640a57e9a5d3bc9
Dec 18 15:40:27 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:27.078-0600 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=88e30d68-bb6f-e532-0605-e9c83644c312 task=redis1 @module=logmon path=/data/nomad/alloc/88e30d68-bb6f-e532-0605-e9c83644c312/alloc/logs/.redis1.stdout.fifo timestamp=2019-12-18T15:40:27.077-0600
Dec 18 15:40:27 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:27.078-0600 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=88e30d68-bb6f-e532-0605-e9c83644c312 task=redis1 @module=logmon path=/data/nomad/alloc/88e30d68-bb6f-e532-0605-e9c83644c312/alloc/logs/.redis1.stderr.fifo timestamp=2019-12-18T15:40:27.078-0600
Dec 18 15:40:27 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:27.078-0600 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=88e30d68-bb6f-e532-0605-e9c83644c312 task=connect-proxy-example-redis1 @module=logmon path=/data/nomad/alloc/88e30d68-bb6f-e532-0605-e9c83644c312/alloc/logs/.connect-proxy-example-redis1.stdout.fifo timestamp=2019-12-18T15:40:27.078-0600
Dec 18 15:40:27 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:27.079-0600 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=88e30d68-bb6f-e532-0605-e9c83644c312 task=connect-proxy-example-redis1 @module=logmon path=/data/nomad/alloc/88e30d68-bb6f-e532-0605-e9c83644c312/alloc/logs/.connect-proxy-example-redis1.stderr.fifo timestamp=2019-12-18T15:40:27.079-0600
Dec 18 15:40:27 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:27.124-0600 [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=ebdf5b40e60ee4a577c97e84806d27e96e0f8e13326d603ca2bf963caa88d469
Dec 18 15:40:27 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:27.198-0600 [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=625b58f8da98cd4e97b2d9a9803d20d801e3e2a69c6808d3daa23ad64392a820
Dec 18 15:40:27 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:27.326-0600 [INFO]  client.driver_mgr.docker: started container: driver=docker container_id=ebdf5b40e60ee4a577c97e84806d27e96e0f8e13326d603ca2bf963caa88d469
Dec 18 15:40:27 <node1_fqdn> nomad[98743]: 2019-12-18T15:40:27.393-0600 [INFO]  client.driver_mgr.docker: started container: driver=docker container_id=625b58f8da98cd4e97b2d9a9803d20d801e3e2a69c6808d3daa23ad64392a820
Dec 18 15:41:03 <node1_fqdn> nomad[98743]: 2019-12-18T15:41:03.732-0600 [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task with unrecoverable error: alloc_id=88e30d68-bb6f-e532-0605-e9c83644c312 task=connect-proxy-example-redis1 error="container stopped"
Dec 18 15:41:03 <node1_fqdn> nomad[98743]: 2019-12-18T15:41:03.737-0600 [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task with unrecoverable error: alloc_id=88e30d68-bb6f-e532-0605-e9c83644c312 task=redis1 error="container stopped"
Dec 18 15:41:03 <node1_fqdn> nomad[98743]: 2019-12-18T15:41:03.749-0600 [INFO]  client.gc: marking allocation for GC: alloc_id=88e30d68-bb6f-e532-0605-e9c83644c312

Nomad Client Node 3

Dec 18 15:40:01 <node3_fqdn> nomad[4507]: 2019-12-18T15:40:01.497-0600 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=14be6c9c-01b3-2ab8-717c-77486c64cbed task=connect-proxy-example-redis1 @module=logmon path=/data/nomad/alloc/14be6c9c-01b3-2ab8-717c-77486c64cbed/alloc/logs/.connect-proxy-example-redis1.stdout.fifo timestamp=2019-12-18T15:40:01.497-0600
Dec 18 15:40:01 <node3_fqdn> nomad[4507]: 2019-12-18T15:40:01.497-0600 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=14be6c9c-01b3-2ab8-717c-77486c64cbed task=connect-proxy-example-redis1 @module=logmon path=/data/nomad/alloc/14be6c9c-01b3-2ab8-717c-77486c64cbed/alloc/logs/.connect-proxy-example-redis1.stderr.fifo timestamp=2019-12-18T15:40:01.497-0600
Dec 18 15:40:01 <node3_fqdn> nomad[4507]: 2019-12-18T15:40:01.497-0600 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=14be6c9c-01b3-2ab8-717c-77486c64cbed task=redis1 @module=logmon path=/data/nomad/alloc/14be6c9c-01b3-2ab8-717c-77486c64cbed/alloc/logs/.redis1.stdout.fifo timestamp=2019-12-18T15:40:01.497-0600
Dec 18 15:40:01 <node3_fqdn> nomad[4507]: 2019-12-18T15:40:01.497-0600 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=14be6c9c-01b3-2ab8-717c-77486c64cbed task=redis1 @module=logmon path=/data/nomad/alloc/14be6c9c-01b3-2ab8-717c-77486c64cbed/alloc/logs/.redis1.stderr.fifo timestamp=2019-12-18T15:40:01.497-0600
Dec 18 15:40:01 <node3_fqdn> nomad[4507]: 2019-12-18T15:40:01.539-0600 [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=85b298db651344f50999b199a28260b57376211aea2ea2bf2efe4b177488516c
Dec 18 15:40:01 <node3_fqdn> nomad[4507]: 2019-12-18T15:40:01.603-0600 [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=304271a6074ee19e068579bc67dccab7ffa29d104c9748ea6e7b6733c62c3f83
Dec 18 15:40:01 <node3_fqdn> nomad[4507]: 2019-12-18T15:40:01.697-0600 [INFO]  client.driver_mgr.docker: started container: driver=docker container_id=85b298db651344f50999b199a28260b57376211aea2ea2bf2efe4b177488516c
Dec 18 15:40:01 <node3_fqdn> nomad[4507]: 2019-12-18T15:40:01.772-0600 [INFO]  client.driver_mgr.docker: started container: driver=docker container_id=304271a6074ee19e068579bc67dccab7ffa29d104c9748ea6e7b6733c62c3f83
Dec 18 15:40:25 <node3_fqdn> nomad[4507]: 2019-12-18T15:40:25.861-0600 [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task with unrecoverable error: alloc_id=14be6c9c-01b3-2ab8-717c-77486c64cbed task=connect-proxy-example-redis1 error="container stopped"
Dec 18 15:40:25 <node3_fqdn> nomad[4507]: 2019-12-18T15:40:25.920-0600 [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task with unrecoverable error: alloc_id=14be6c9c-01b3-2ab8-717c-77486c64cbed task=redis1 error="container stopped"
Dec 18 15:40:25 <node3_fqdn> nomad[4507]: 2019-12-18T15:40:25.928-0600 [INFO]  client.gc: marking allocation for GC: alloc_id=14be6c9c-01b3-2ab8-717c-77486c64cbed

Nomad Server logs (if appropriate)

Nothing of interest was logged on the servers during this time.

@tgross tgross self-assigned this Dec 19, 2019
@tgross
Member

tgross commented Dec 19, 2019

Hi @cjonesy and thanks for reporting this! I was able to reproduce using these job files in a local development environment.

After running the job update:

▶ nomad job status example
ID            = example
Name          = example
Submit Date   = 2019-12-19T09:09:47-05:00
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
redis1      0       0         1        0       1         0
redis2      0       0         1        0       0         0

Latest Deployment
ID          = e8799bae
Status      = running
Description = Deployment is running

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
redis1      1        1       0        0          2019-12-19T14:19:47Z
redis2      1        1       0        0          2019-12-19T14:19:47Z

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
270c6c11  a804c158  redis1      1        run      running   10s ago    8s ago
c95b4e06  a804c158  redis1      0        stop     complete  2m13s ago  9s ago
eb69149c  a804c158  redis2      1        run      running   2m13s ago  10s ago

▶ docker ps
CONTAINER ID        IMAGE                                      COMMAND                  CREATED             STATUS              PORTS               NAMES
d53063fd75b1        envoyproxy/envoy:v1.11.2                   "/docker-entrypoint.…"   35 seconds ago      Up 34 seconds                           connect-proxy-example-redis1-270c6c11-e057-565e-fa40-a86a3e6bf670
9da1af1c2135        redis:5.0.5                                "docker-entrypoint.s…"   35 seconds ago      Up 34 seconds                           redis1-270c6c11-e057-565e-fa40-a86a3e6bf670
0903c5986424        gcr.io/google_containers/pause-amd64:3.0   "/pause"                 36 seconds ago      Up 34 seconds                           nomad_init_270c6c11-e057-565e-fa40-a86a3e6bf670
446990b5a95a        redis:5.0.5                                "docker-entrypoint.s…"   2 minutes ago       Up 2 minutes                            redis2-eb69149c-0a59-f550-05b9-f021a8158de0
43ea68684b33        envoyproxy/envoy:v1.11.2                   "/docker-entrypoint.…"   2 minutes ago       Up 2 minutes                            connect-proxy-example-redis2-eb69149c-0a59-f550-05b9-f021a8158de0
fb589efa3afc        gcr.io/google_containers/pause-amd64:3.0   "/pause"                 2 minutes ago       Up 2 minutes                            nomad_init_eb69149c-0a59-f550-05b9-f021a8158de0

After running the job stop:

▶ nomad job status example
ID            = example
Name          = example
Submit Date   = 2019-12-19T09:09:47-05:00
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = dead (stopped)
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
redis1      0       0         0        0       2         0
redis2      0       0         1        0       0         0

Latest Deployment
ID          = e8799bae
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
redis1      1        1       1        0          2019-12-19T14:19:59Z
redis2      1        1       1        0          2019-12-19T14:19:57Z

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
270c6c11  a804c158  redis1      1        stop     complete  1m58s ago  45s ago
c95b4e06  a804c158  redis1      0        stop     complete  4m1s ago   1m57s ago
eb69149c  a804c158  redis2      1        stop     running   4m1s ago   45s ago

▶ docker ps
CONTAINER ID        IMAGE                                      COMMAND                  CREATED             STATUS              PORTS               NAMES
446990b5a95a        redis:5.0.5                                "docker-entrypoint.s…"   3 minutes ago       Up 3 minutes                            redis2-eb69149c-0a59-f550-05b9-f021a8158de0
43ea68684b33        envoyproxy/envoy:v1.11.2                   "/docker-entrypoint.…"   3 minutes ago       Up 3 minutes                            connect-proxy-example-redis2-eb69149c-0a59-f550-05b9-f021a8158de0
fb589efa3afc        gcr.io/google_containers/pause-amd64:3.0   "/pause"                 3 minutes ago       Up 3 minutes                            nomad_init_eb69149c-0a59-f550-05b9-f021a8158de0

Relevant window of the logs:

    2019-12-19T14:10:59.805Z [DEBUG] http: request complete: method=DELETE path=/v1/job/example?purge=false duration=467.067µs
    2019-12-19T14:10:59.805Z [DEBUG] worker.service_sched: reconciled current state with desired state: eval_id=39c2244c-b8b4-5f9a-410f-b01a372d720a job_id=example namespace=default results="Total changes: (place 0) (destructive 0) (inplace 0) (stop 2)
Desired Changes for "redis1": (place 0) (inplace 0) (destructive 0) (stop 1) (migrate 0) (ignore 0) (canary 0)
Desired Changes for "redis2": (place 0) (inplace 0) (destructive 0) (stop 1) (migrate 0) (ignore 0) (canary 0)"
    2019-12-19T14:10:59.806Z [DEBUG] http: request complete: method=GET path=/v1/evaluation/39c2244c-b8b4-5f9a-410f-b01a372d720a duration=85.613µs
    2019-12-19T14:10:59.807Z [DEBUG] http: request complete: method=GET path=/v1/evaluation/39c2244c-b8b4-5f9a-410f-b01a372d720a/allocations duration=70.672µs
    2019-12-19T14:10:59.807Z [DEBUG] worker: submitted plan for evaluation: eval_id=39c2244c-b8b4-5f9a-410f-b01a372d720a
    2019-12-19T14:10:59.807Z [DEBUG] worker.service_sched: setting eval status: eval_id=39c2244c-b8b4-5f9a-410f-b01a372d720a job_id=example namespace=default status=complete
    2019-12-19T14:10:59.807Z [DEBUG] worker: updated evaluation: eval="<Eval "39c2244c-b8b4-5f9a-410f-b01a372d720a" JobID: "example" Namespace: "default">"
    2019-12-19T14:10:59.807Z [DEBUG] worker: ack evaluation: eval_id=39c2244c-b8b4-5f9a-410f-b01a372d720a
    2019-12-19T14:10:59.808Z [DEBUG] client: updated allocations: index=43 total=3 pulled=2 filtered=1
    2019-12-19T14:10:59.808Z [DEBUG] client: allocation updates: added=0 removed=0 updated=2 ignored=1
    2019-12-19T14:10:59.808Z [DEBUG] client: allocation updates applied: added=0 removed=0 updated=2 ignored=1 errors=0
    2019-12-19T14:10:59.916Z [DEBUG] client: updated allocations: index=45 total=3 pulled=1 filtered=2
    2019-12-19T14:10:59.916Z [DEBUG] client: allocation updates: added=0 removed=0 updated=1 ignored=2
    2019-12-19T14:10:59.916Z [DEBUG] client.alloc_runner: Discarding allocation update: alloc_id=eb69149c-0a59-f550-05b9-f021a8158de0 skipped_modify_index=43 new_modify_index=43
    2019-12-19T14:10:59.916Z [DEBUG] client: allocation updates applied: added=0 removed=0 updated=1 ignored=2 errors=0
    2019-12-19T14:10:59.979Z [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task with unrecoverable error: alloc_id=270c6c11-e057-565e-fa40-a86a3e6bf670 task=connect-proxy-example-redis1 error="container stopped"
    2019-12-19T14:10:59.982Z [DEBUG] client.driver_mgr.docker.docker_logger: plugin process exited: driver=docker path=/opt/gopath/bin/nomad pid=6262
    2019-12-19T14:10:59.982Z [DEBUG] client.driver_mgr.docker.docker_logger: plugin exited: driver=docker
    2019-12-19T14:10:59.985Z [DEBUG] client.driver_mgr.docker: image id reference count decremented: driver=docker image_id=sha256:72e91d8680d853b874d9aedda3a4b61048630d2043dd490ff36f5b0929f69874 references=3
    2019-12-19T14:10:59.986Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=270c6c11-e057-565e-fa40-a86a3e6bf670 task=connect-proxy-example-redis1 path=/opt/gopath/bin/nomad pid=6069
    2019-12-19T14:10:59.986Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin exited: alloc_id=270c6c11-e057-565e-fa40-a86a3e6bf670 task=connect-proxy-example-redis1
    2019-12-19T14:10:59.986Z [DEBUG] client.alloc_runner.task_runner: task run loop exiting: alloc_id=270c6c11-e057-565e-fa40-a86a3e6bf670 task=connect-proxy-example-redis1
    2019-12-19T14:11:00.062Z [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task with unrecoverable error: alloc_id=270c6c11-e057-565e-fa40-a86a3e6bf670 task=redis1 error="container stopped"
    2019-12-19T14:11:00.064Z [DEBUG] client.driver_mgr.docker.docker_logger: plugin process exited: driver=docker path=/opt/gopath/bin/nomad pid=6229
    2019-12-19T14:11:00.064Z [DEBUG] client.driver_mgr.docker.docker_logger: plugin exited: driver=docker
    2019-12-19T14:11:00.069Z [DEBUG] client.driver_mgr.docker: image id reference count decremented: driver=docker image_id=sha256:63130206b0fa808e4545a0cb4a1f14f6d40b8a7e2e6fda0a31fd326c2ac0971c references=2
    2019-12-19T14:11:00.069Z [INFO]  client.gc: marking allocation for GC: alloc_id=270c6c11-e057-565e-fa40-a86a3e6bf670
    2019-12-19T14:11:00.071Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=270c6c11-e057-565e-fa40-a86a3e6bf670 task=redis1 path=/opt/gopath/bin/nomad pid=6068
    2019-12-19T14:11:00.071Z [DEBUG] client.alloc_runner.task_runner.task_hook.logmon: plugin exited: alloc_id=270c6c11-e057-565e-fa40-a86a3e6bf670 task=redis1
    2019-12-19T14:11:00.071Z [DEBUG] client.alloc_runner.task_runner: task run loop exiting: alloc_id=270c6c11-e057-565e-fa40-a86a3e6bf670 task=redis1
    2019-12-19T14:11:00.120Z [DEBUG] client: updated allocations: index=46 total=3 pulled=1 filtered=2
    2019-12-19T14:11:00.120Z [DEBUG] client: allocation updates: added=0 removed=0 updated=1 ignored=2
    2019-12-19T14:11:00.120Z [DEBUG] client.alloc_runner: Discarding allocation update: alloc_id=eb69149c-0a59-f550-05b9-f021a8158de0 skipped_modify_index=43 new_modify_index=43
    2019-12-19T14:11:00.120Z [DEBUG] client: allocation updates applied: added=0 removed=0 updated=1 ignored=2 errors=0
    2019-12-19T14:11:00.316Z [DEBUG] client: updated allocations: index=47 total=3 pulled=1 filtered=2
    2019-12-19T14:11:00.316Z [DEBUG] client: allocation updates: added=0 removed=0 updated=1 ignored=2
    2019-12-19T14:11:00.316Z [DEBUG] client.alloc_runner: Discarding allocation update: alloc_id=eb69149c-0a59-f550-05b9-f021a8158de0 skipped_modify_index=43 new_modify_index=43
    2019-12-19T14:11:00.316Z [DEBUG] client: allocation updates applied: added=0 removed=0 updated=1 ignored=2 errors=0
    2019-12-19T14:11:00.337Z [DEBUG] consul.sync: sync complete: registered_services=0 deregistered_services=2 registered_checks=0 deregistered_checks=0
    2019-12-19T14:11:00.519Z [DEBUG] client: updated allocations: index=48 total=3 pulled=1 filtered=2
    2019-12-19T14:11:00.519Z [DEBUG] client: allocation updates: added=0 removed=0 updated=1 ignored=2
    2019-12-19T14:11:00.519Z [DEBUG] client.alloc_runner: Discarding allocation update: alloc_id=eb69149c-0a59-f550-05b9-f021a8158de0 skipped_modify_index=43 new_modify_index=43
    2019-12-19T14:11:00.519Z [DEBUG] client: allocation updates applied: added=0 removed=0 updated=1 ignored=2 errors=0
    2019-12-19T14:11:00.813Z [DEBUG] http: request complete: method=GET path=/v1/evaluation/39c2244c-b8b4-5f9a-410f-b01a372d720a duration=189.924µs
    2019-12-19T14:11:00.814Z [DEBUG] http: request complete: method=GET path=/v1/evaluation/39c2244c-b8b4-5f9a-410f-b01a372d720a/allocations duration=88.981µs

This log entry looks promising for figuring out what happened here: client.alloc_runner: Discarding allocation update: alloc_id=eb69149c-0a59-f550-05b9-f021a8158de0 skipped_modify_index=43 new_modify_index=43

@tgross tgross added this to the near-term milestone Dec 19, 2019
@tgross
Member

tgross commented Dec 19, 2019

With a bit of debugging I've determined at this point that the task runner loop itself seems to be ok, and the scheduler is sending the update w/ a desired status of stop to the allocation as we'd expect.

The "discarding allocation update" log message is coming from alloc_runner.go#L697. When we get an update for an allocation, we check to make sure we don't have one queued up that's more recent. In this case, we find that there's one queued up with the same modify index as the one we have, which is definitely a clue.
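
To make the mechanism concrete, here is a minimal, self-contained sketch of that kind of single-slot update queue. The names and structure are simplified for illustration only; this is not Nomad's actual alloc_runner code.

// Minimal sketch of the update-coalescing check described above. Names are
// illustrative; this is not Nomad's actual alloc_runner implementation.
package main

import "fmt"

type alloc struct {
    ID          string
    ModifyIndex uint64
    Desired     string
}

type runner struct {
    updateCh chan *alloc // capacity 1: at most one pending update
}

// update drops any stale queued update in favor of the newer one, emitting a
// message shaped like the "Discarding allocation update" line in the logs.
func (r *runner) update(a *alloc) {
    select {
    case old := <-r.updateCh:
        fmt.Printf("Discarding allocation update: alloc_id=%s skipped_modify_index=%d new_modify_index=%d\n",
            a.ID, old.ModifyIndex, a.ModifyIndex)
    default:
    }
    r.updateCh <- a
}

func main() {
    r := &runner{updateCh: make(chan *alloc, 1)}
    // The destructive update is queued but (per the bug) never dequeued...
    r.update(&alloc{ID: "eb69149c", ModifyIndex: 43, Desired: "run"})
    // ...so the later stop update finds it still sitting there at the same index.
    r.update(&alloc{ID: "eb69149c", ModifyIndex: 43, Desired: "stop"})
}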

What I've found is that when we submit the update for stop, it's never dequeued, because we never returned from (ar *allocRunner) handleAllocUpdate for the first update. This explains why we only see the behavior after job changes.

With that information in hand and some trace-level debugging, I've narrowed it down to the call into (h *groupServiceHook) Update: we never come back out of it. I suspect this may be related to #6459 in some way. (cc @shoenig as a heads up)
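
For readers following along, here is a hedged sketch of why a hook Update that never returns produces exactly this symptom: the runner processes updates serially, so if the handler for the first (destructive) update blocks inside the hook, the queued stop update is never dequeued. Again, the names are illustrative, not Nomad's actual code.

// Sketch of the failure mode: a hook Update that blocks keeps the runner's
// update loop from ever picking up the next queued update.
package main

import (
    "fmt"
    "time"
)

type groupHook struct{ block chan struct{} }

// Update stands in for (h *groupServiceHook) Update never coming back out.
func (h *groupHook) Update() {
    <-h.block // blocks forever in this sketch
}

func main() {
    updates := make(chan string, 1) // like the single-slot update queue above
    h := &groupHook{block: make(chan struct{})}

    go func() {
        for u := range updates {
            fmt.Println("handling:", u)
            h.Update() // the first update never returns from the hook
        }
    }()

    updates <- "destructive change (added port)"
    time.Sleep(100 * time.Millisecond)
    updates <- "desired status: stop" // queued, but never dequeued
    time.Sleep(100 * time.Millisecond)
    fmt.Println("updates still pending:", len(updates))
}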

@tgross tgross added theme/consul/connect Consul Connect integration and removed theme/client labels Dec 19, 2019
@tgross tgross modified the milestones: near-term, 0.10.3 Dec 20, 2019
@tgross
Member

tgross commented Dec 20, 2019

After a quick chat w/ @shoenig I've assigned it to him to dig into further while working on #6459

@tgross tgross removed their assignment Dec 20, 2019
@tommyalatalo

Just chiming in here and seconding this issue. This happens even with the Consul Connect example with the count-api and count-dashboard services, and I've had this happen with other jobs as well.

Not only do allocations not always stop, but the alloc directory in Nomad's folder structure remains, which seems to make Nomad re-register the services in Consul as well, making it impossible to get rid of them. I also tried stopping Nomad and forcefully removing that alloc folder on the host with rm -rf, but the /secrets folder in that directory is busy and the command exits with the error "Device or resource busy: .../secrets".
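
For anyone hitting that busy-mount error: each task's /secrets directory is mounted as a tmpfs, which is why a plain rm -rf fails. Below is a rough cleanup sketch (Linux-only, run as root with the Nomad agent stopped); the alloc path is just an example taken from the logs above, not something the client provides.

// Hedged cleanup sketch: unmount the per-task secrets tmpfs mounts under a
// leaked alloc dir, then remove the directory. Linux-only; run as root with
// the Nomad agent stopped. The alloc dir path here is only an example.
package main

import (
    "fmt"
    "os"
    "path/filepath"
    "syscall"
)

func main() {
    allocDir := "/data/nomad/alloc/a195a6e4-cded-eea5-16b1-f6c5a9291588" // example

    // Each task's secrets dir is a tmpfs mount; unmount them first.
    secretsDirs, err := filepath.Glob(filepath.Join(allocDir, "*", "secrets"))
    if err != nil {
        panic(err)
    }
    for _, dir := range secretsDirs {
        if err := syscall.Unmount(dir, 0); err != nil {
            fmt.Fprintf(os.Stderr, "unmount %s: %v\n", dir, err)
        }
    }

    // With the mounts gone, the directory can be removed.
    if err := os.RemoveAll(allocDir); err != nil {
        panic(err)
    }
}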

shoenig added a commit that referenced this issue Feb 19, 2020
Fix a bug where consul service definitions would not be updated if changes
were made to the service in the Nomad job. Currently this only fixes the
bug for cases where the fix is a matter of updating consul agent's service
registration. There is a related bug where destructive changes are required
(see #6877), which will be fixed in another PR.

The enable_tag_override configuration setting for the parent service is
applied to the sidecar service.

Fixes #6459
@shoenig shoenig modified the milestones: 0.10.4, 0.11.0 Feb 20, 2020
@schmichael schmichael removed this from the 0.11.0 milestone Apr 9, 2020
@exFalso

exFalso commented Jun 15, 2020

We just ran into this, and it's really bad. The declared/reported state should always match the runtime state; isn't that the point of Nomad?
Instead, we're seeing allocations leaking, and even nomad agent restarts do not remove them. These allocations are not reported in nomad status, you cannot query them with nomad alloc status, and they persist through kills and are kept around until we forcefully remove the corresponding allocation directories (!!). Also, we're not using Connect, so this is a general Nomad bug.

I had a short look at the implementation; it seems there are several sources of truth for whether an allocation should exist (bad!), and these leaking allocations are not cleaned up from the boltdb state properly.

So far I've been a fan of Nomad, but leaking processes... we don't have many agents at the moment, so we can clean up the mess manually, but what will happen when we scale to hundreds of nodes?

@exFalso

exFalso commented Jul 1, 2020

Small update: clearing the boltdb entries (while the agent is stopped, of course) also works, and is less pesky than trying to remove the allocation directories with their busy mounted volumes (they will be picked up by GC).
The boltdb file is at <agent_dir>/client/state.db.
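
If you want to look before you leap, here is a small read-only sketch using the go.etcd.io/bbolt library to list the top-level buckets in that file. The bucket layout is internal to Nomad and may change between versions, so treat this as exploratory only; run it with the agent stopped, and the path below is just an example.

// Read-only inspection of the client state.db; lists top-level buckets only.
// The data_dir path is an example; adjust it to your agent's configuration.
package main

import (
    "fmt"

    bolt "go.etcd.io/bbolt"
)

func main() {
    db, err := bolt.Open("/var/nomad/client/state.db", 0600, &bolt.Options{ReadOnly: true})
    if err != nil {
        panic(err)
    }
    defer db.Close()

    err = db.View(func(tx *bolt.Tx) error {
        return tx.ForEach(func(name []byte, b *bolt.Bucket) error {
            fmt.Printf("bucket: %s\n", name)
            return nil
        })
    })
    if err != nil {
        panic(err)
    }
}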

@marvreichmann

I have the same problem, but no access to the nodes themselves. Is there any way to solve this without restarting the client?

@shoenig
Member

shoenig commented Jul 10, 2020

@cjonesy I'm able to reproduce this with versions as late as v0.10.5, but not with v0.11.0 or later (which coincides with the fix that went in for #6459). Are you able to upgrade to a recent version of Nomad and confirm whether the problem persists?

@cjonesy
Contributor Author

cjonesy commented Jul 15, 2020

@shoenig We recently upgraded to 0.11.3 and the problem no longer occurs. Sounds like #6459 has indeed fixed the issue.

Thank you!

@github-actions

github-actions bot commented Nov 5, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 5, 2022