Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nomad not cleaning consul services/healthcheck after failed deploy [0.9.x] #5770

Closed
MagicMicky opened this issue Jun 3, 2019 · 8 comments · Fixed by #5821
Closed

Nomad not cleaning consul services/healthcheck after failed deploy [0.9.x] #5770

MagicMicky opened this issue Jun 3, 2019 · 8 comments · Fixed by #5821

Comments

@MagicMicky
Copy link

Nomad version

Noticed on 0.9.0. Problem present on 0.9.2-rc1 and 0.9.1 as well

nomad version
Nomad v0.9.2-rc1 (98e75251316173bd2e6f6392dceea5f2d2182e17)

Operating system and Environment details

Running on Ubuntu
Nomad single node, both client and server
Consul single node, both client and server (v1.4.4) - tested with v1.5.1 as well
Vault single node (v1.1.0)
Tried with and without ACLs enabled (both nomad and consul)

Issue

When provisioning a job, that contains a template, referencing to a vault secret, the following actions happen:

  • Deployments start
  • job gets scheduled, task gets allocated
  • Service gets registered in consul
  • Nomad deployment fails, with error:
Template failed: vault.read(aws/creds/my-creds): vault.read(aws/creds/my-creds): Error making API request. URL: GET http://127.0.0.1:8200/v1/aws/creds/my-creds Code: 403. Errors: * 1 error occurred: * permission denied
  • Healthcheck is still registered in consul, failing

Reproduction steps

  • Provide a cluster with a running nomad, consul and vault
  • Run nomad, ensure it's registered properly in consul
  • Try nomad run my-job.nomad (see file given below).
  • You should see nomad registering a first healthcheck as failing in consul (all good here)
  • Allocation fails
  • See consul services, that still lists the service with a failing healthchecks.

Job file (if appropriate)

This has been tested and reproduced with the following job file:

job "my-task2" {
  datacenters = ["mickael"]
  type = "service"
  group "my-task2" {
    count = 1
    task "my-task" {
      driver = "docker"
      shutdown_delay = "10s"
      resources {
        memory = 1500
        cpu = 2000
        network {
          port "service_port" {
          }
        }
      }

      template {
        data = <<EOH
          {{ with secret "aws/creds/my-task" }}
AWS_ACCESS_KEY_ID={{ .Data.access_key }}
AWS_SECRET_ACCESS_KEY={{ .Data.secret_key }}
          {{ end }}
        EOH
        destination = "secrets/creds.env"
        env         = true
        change_mode = "restart"
      }

      config {
        image        = "nginx"
        port_map {
          service_port = 80
        }
        labels {
          build_id = "1"
        }
        dns_servers = [
          "172.17.0.1"
        ]
      }

      service {
        name = "my-task"
        tags = [
                 "nomad",
                 "my-task",
          ]
        port = "service_port"
        check {
          name     = "serviceHealth_my-task"
          type     = "http"
          path     = "/"
          interval = "10s"
          timeout  = "2s"
        }
      }

      vault {
        policies = [
                    "nomad_my-task",
                   ]
        change_mode = "restart"
      }
    }
  }
}

Also reproduced with a custom upgrade/reschedule policy, which lead to two allocation in total, and to two failing healthchecks registered in consul. (We could match the consul HC with nomad allocations via the port registered.)

    reschedule {
        attempts       = 15
        interval       = "30m"
        delay          = "5s"
        delay_function = "exponential"
        max_delay      = "180s"
        unlimited      = false
    }

    update {
      max_parallel = 1
      min_healthy_time = "10s"
      healthy_deadline = "570s"
      progress_deadline = "575s"
      auto_revert = true
    }

Logs

Not much information was found in the logs.
Same for consul, logs are clean.
nomad.log

Consul configured:

  • With/without ACLs whitelist mode (tested both)
  • with encryption

Nomad configured:

  • With/Without ACLs (tested both)
  • TLS enabled
  • with encryption
  • vault and consul integration enabled
@jorgemarey
Copy link
Contributor

jorgemarey commented Jun 6, 2019

Hi, this happened to me too. I was able to reproduce it consistently by doing the following:

# Start vault server (tested with v1.1.0)
$ vault server -dev -dev-root-token-id=root

# Start consul server (tested with v1.5.1)
$ consul agent -dev

# Start nomad server and client (tested with 0.9.2)
$ nomad agent -dev -vault-enabled -vault-token=root -vault-address=http://127.0.0.1:8200

Use a job (similar to the one provided by @MagicMicky

job "my-job" {
  datacenters = ["dc1"]
  type        = "service"

  group "my-group" {
    count = 1

    task "my-task" {
      driver         = "raw_exec"
      shutdown_delay = "10s"

      resources {
        memory = 150
        cpu    = 200

        network {
          port "service" {}
        }
      }

      template {
        data = <<EOH
          {{ with secret "aws/creds/my-task" }}
AWS_ACCESS_KEY_ID={{ .Data.access_key }}
AWS_SECRET_ACCESS_KEY={{ .Data.secret_key }}
          {{ end }}
        EOH

        destination = "secrets/creds.env"
        env         = true
        change_mode = "restart"
      }

      config {
        command = "echo"
        args    = ["${NOMAD_PORT_service}"]
      }

      service {
        name = "my-task-service"
        tags = ["nomad", "my-task"]
        port = "service"

        check {
          name     = "serviceHealth_my-task-failing"
          type     = "http"
          path     = "/"
          interval = "10s"
          timeout  = "2s"
        }
      }

      vault {
        policies = ["nomad_my-task"]
        change_mode = "restart"
      }
    }
  }
}
# Run the job
nomad run job.hcl

In the first run it won't get inserted in consul, so it works fine, but if we change something in the job (like the memory for example) and run it again, it will be registered and remain there.

failing

If we keep modifiing something that makes the task run again, new checks appears

failing2

@jorgemarey
Copy link
Contributor

Hi again, I launched nomad with TRACE logging and saw the following:

nomad agent -dev -vault-enabled -vault-token=root -vault-address=http://127.0.0.1:8200 -log-level=trace | grep consul
# nomad run job.hcl

2019/06/06 12:14:45.771551 [DEBUG] (runner) final config: {"Consul":{"Address":"","Auth":{"Enabled":false,"Username":"","Password":""},"Retry":{"Attempts":12,"Backoff":250000000,"MaxBackoff":60000000000,"Enabled":true},"SSL":{"CaCert":"","CaPath":"","Cert":"","Enabled":false,"Key":"","ServerName":"","Verify":true},"Token":"","Transport":{"DialKeepAlive":30000000000,"DialTimeout":30000000000,"DisableKeepAlives":false,"IdleConnTimeout":90000000000,"MaxIdleConns":100,"MaxIdleConnsPerHost":5,"TLSHandshakeTimeout":10000000000}},"Dedup":{"Enabled":false,"MaxStale":2000000000,"Prefix":"consul-template/dedup/","TTL":15000000000},"Exec":{"Command":"","Enabled":false,"Env":{"Blacklist":[],"Custom":[],"Pristine":false,"Whitelist":[]},"KillSignal":2,"KillTimeout":30000000000,"ReloadSignal":null,"Splay":0,"Timeout":0},"KillSignal":2,"LogLevel":"WARN","MaxStale":2000000000,"PidFile":"","ReloadSignal":1,"Syslog":{"Enabled":false,"Facility":"LOCAL0"},"Templates":[{"Backup":false,"Command":"","CommandTimeout":30000000000,"Contents":"          {{ with secret \"aws/creds/my-task\" }}\nAWS_ACCESS_KEY_ID={{ .Data.access_key }}\nAWS_SECRET_ACCESS_KEY={{ .Data.secret_key }}\n          {{ end }}\n        ","CreateDestDirs":true,"Destination":"/tmp/NomadClient772661944/db9dba9b-8348-505d-0a9b-148576a25fb4/my-task/secrets/creds.env","ErrMissingKey":false,"Exec":{"Command":"","Enabled":false,"Env":{"Blacklist":[],"Custom":[],"Pristine":false,"Whitelist":[]},"KillSignal":2,"KillTimeout":30000000000,"ReloadSignal":null,"Splay":0,"Timeout":30000000000},"Perms":420,"Source":"","Wait":{"Enabled":false,"Min":0,"Max":0},"LeftDelim":"{{","RightDelim":"}}"}],"Vault":{"Address":"http://127.0.0.1:8200","Enabled":true,"Grace":15000000000,"Namespace":"","RenewToken":false,"Retry":{"Attempts":12,"Backoff":250000000,"MaxBackoff":60000000000,"Enabled":true},"SSL":{"CaCert":"","CaPath":"","Cert":"","Enabled":false,"Key":"","ServerName":"","Verify":false},"Transport":{"DialKeepAlive":30000000000,"DialTimeout":30000000000,"DisableKeepAlives":false,"IdleConnTimeout":90000000000,"MaxIdleConns":100,"MaxIdleConnsPerHost":5,"TLSHandshakeTimeout":10000000000},"UnwrapToken":false},"Wait":{"Enabled":false,"Min":0,"Max":0}}

# modify job and  nomad run

    2019-06-06T12:15:09.461+0200 [TRACE] client.alloc_runner.task_runner: running update hook: alloc_id=db9dba9b-8348-505d-0a9b-148576a25fb4 task=my-task name=consul_services start="2019-06-06 12:15:09.461388264 +0200 CEST m=+103.632854633"
    2019-06-06T12:15:09.461+0200 [TRACE] client.alloc_runner.task_runner: finished update hooks: alloc_id=db9dba9b-8348-505d-0a9b-148576a25fb4 task=my-task name=consul_services end="2019-06-06 12:15:09.461460589 +0200 CEST m=+103.632926941" duration=72.308µs
    2019-06-06T12:15:09.465+0200 [DEBUG] consul.sync: sync complete: registered_services=1 deregistered_services=0 registered_checks=1 deregistered_checks=0
    2019/06/06 12:15:09.499951 [DEBUG] (runner) final config: {"Consul":{"Address":"","Auth":{"Enabled":false,"Username":"","Password":""},"Retry":{"Attempts":12,"Backoff":250000000,"MaxBackoff":60000000000,"Enabled":true},"SSL":{"CaCert":"","CaPath":"","Cert":"","Enabled":false,"Key":"","ServerName":"","Verify":true},"Token":"","Transport":{"DialKeepAlive":30000000000,"DialTimeout":30000000000,"DisableKeepAlives":false,"IdleConnTimeout":90000000000,"MaxIdleConns":100,"MaxIdleConnsPerHost":5,"TLSHandshakeTimeout":10000000000}},"Dedup":{"Enabled":false,"MaxStale":2000000000,"Prefix":"consul-template/dedup/","TTL":15000000000},"Exec":{"Command":"","Enabled":false,"Env":{"Blacklist":[],"Custom":[],"Pristine":false,"Whitelist":[]},"KillSignal":2,"KillTimeout":30000000000,"ReloadSignal":null,"Splay":0,"Timeout":0},"KillSignal":2,"LogLevel":"WARN","MaxStale":2000000000,"PidFile":"","ReloadSignal":1,"Syslog":{"Enabled":false,"Facility":"LOCAL0"},"Templates":[{"Backup":false,"Command":"","CommandTimeout":30000000000,"Contents":"          {{ with secret \"aws/creds/my-task\" }}\nAWS_ACCESS_KEY_ID={{ .Data.access_key }}\nAWS_SECRET_ACCESS_KEY={{ .Data.secret_key }}\n          {{ end }}\n        ","CreateDestDirs":true,"Destination":"/tmp/NomadClient772661944/f70e1b93-14a8-2091-5b46-ef1d1b962501/my-task/secrets/creds.env","ErrMissingKey":false,"Exec":{"Command":"","Enabled":false,"Env":{"Blacklist":[],"Custom":[],"Pristine":false,"Whitelist":[]},"KillSignal":2,"KillTimeout":30000000000,"ReloadSignal":null,"Splay":0,"Timeout":30000000000},"Perms":420,"Source":"","Wait":{"Enabled":false,"Min":0,"Max":0},"LeftDelim":"{{","RightDelim":"}}"}],"Vault":{"Address":"http://127.0.0.1:8200","Enabled":true,"Grace":15000000000,"Namespace":"","RenewToken":false,"Retry":{"Attempts":12,"Backoff":250000000,"MaxBackoff":60000000000,"Enabled":true},"SSL":{"CaCert":"","CaPath":"","Cert":"","Enabled":false,"Key":"","ServerName":"","Verify":false},"Transport":{"DialKeepAlive":30000000000,"DialTimeout":30000000000,"DisableKeepAlives":false,"IdleConnTimeout":90000000000,"MaxIdleConns":100,"MaxIdleConnsPerHost":5,"TLSHandshakeTimeout":10000000000},"UnwrapToken":false},"Wait":{"Enabled":false,"Min":0,"Max":0}}

# nomad stop -purge job

    2019-06-06T12:15:47.406+0200 [TRACE] client.alloc_runner.task_runner: running update hook: alloc_id=f70e1b93-14a8-2091-5b46-ef1d1b962501 task=my-task name=consul_services start="2019-06-06 12:15:47.406246758 +0200 CEST m=+141.577713188"
    2019-06-06T12:15:47.406+0200 [TRACE] client.alloc_runner.task_runner: finished update hooks: alloc_id=f70e1b93-14a8-2091-5b46-ef1d1b962501 task=my-task name=consul_services end="2019-06-06 12:15:47.40648447 +0200 CEST m=+141.577950904" duration=237.716µs
    2019-06-06T12:15:47.416+0200 [DEBUG] consul.sync: sync complete: registered_services=1 deregistered_services=0 registered_checks=1 deregistered_checks=0

In the end the service is still registered with 2 checks.

failing

I guess this has to do with the service updatehook being called.

@aphill70
Copy link

aphill70 commented Jun 7, 2019

We are seeing this as well

@toadstule
Copy link

We were able to reproduce this behavior as well using this simple job:

job "test" {
    region = "a1"
    datacenters = ["a1"]
    type = "service"
    group "tg" {
        count = 1
        task "tt" {
            driver = "docker"
            config {
                image = "does/not/exist"
            }
            service {
                port = "http"
            }
            resources {
                network {
                    port "http" {}
                }
            }
            vault {
                policies = ["doesnotexist"]
            }
        }
    }
}

The docker image does not exist, so the job continues to restart. With each restart, we get another service entry in Consul. (Because there is no defined check, they all show up as healthy services in Consul.)

nomad job stop -purge test will stop the job and remove it from Nomad, but the service entries remain in Consul.

The problem does not show up if we remove the vault stanza -- even though the policy doesn't exist and we never try to access a Vault secret.

Service entries in Consul are cleaned up if the Nomad client is restarted on the node where the job was attempting to run.

Screen Shot 2019-06-07 at 2 00 17 PM

Screen Shot 2019-06-07 at 2 00 26 PM

endocrimes added a commit that referenced this issue Jun 12, 2019
We currently only run cleanup Service Hooks when a task is either
Killed, or Exited. However, due to the implementation of a task runner,
tasks are only Exited if they every correctly started running, which is
not true when you recieve an error early in the task start flow, such as
not being able to pull secrets from Vault.

This updates the service hook to also call consul deregistration
routines during a task Stop lifecycle event, to ensure that any
registered checks and services are cleared in such cases.

fixes #5770
@toadstule
Copy link

@endocrimes -- thank you so much!

@mikeblum
Copy link

Is this merged into the 0.9.4 RC?

@toadstule
Copy link

This was released in 0.9.3 (https://github.com/hashicorp/nomad/blob/master/CHANGELOG.md#093-june-12-2019)

@github-actions
Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 20, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
6 participants