Sometimes template stanza does not reload allocations #22224

Open
OfficerKoo opened this issue May 23, 2024 · 5 comments
@OfficerKoo

Nomad version

nomad_1.6.1

Operating system and Environment details

AWS, Amazon Linux 2, Production

Issue

We use the Nomad Vault integration for database secrets in our application. Sometimes Nomad re-renders the secrets file but fails to trigger a restart, so the updated environment is not provisioned to the task.

Usually only one allocation in a task group is affected, but it happens across all jobs that use the database secrets engine. A manual restart fixes the problem.

The issue occurs only in production; we have not been able to reproduce it on our staging cluster.

We checked other issues, such as #6112, but they do not fit our profile: our jobs do not depend on other jobs, they simply use the Vault database secrets engine.

Reproduction steps

Expected Result

Actual Result

Job file (if appropriate)

A sample job is below; all of the jobs that use database secrets are affected.

job "analytics-svc" {
  namespace   = "production"
  type        = "service"
  datacenters = ["redacted"]

  update {
    max_parallel      = 1
    min_healthy_time  = "10s"
    healthy_deadline  = "3m"
    progress_deadline = "10m"
    auto_revert       = true
    canary            = 0
  }

  migrate {
    max_parallel     = 1
    health_check     = "checks"
    min_healthy_time = "10s"
    healthy_deadline = "5m"
  }

  group "analytics" {
    count = 2

    network {
      port "http" {
        to = 80
      }
    }

    scaling {
      enabled = true
      min     = local.scaled_min
      max     = 2

      policy {
        cooldown            = "2m"
        evaluation_interval = "1m"

        check "avg_allocs_cpu_usage_percentage" {
          source = "prometheus"
          query  = "avg(avg_over_time(nomad_client_allocs_cpu_total_ticks{exported_job=~\"analytics-svc\", namespace=~\"production\"}[2m])) / avg(avg_over_time(nomad_client_allocs_cpu_allocated{exported_job=~\"analytics-svc\", namespace=~\"production\"}[2m])) * 100"

          strategy "target-value" {
            target = 70
          }
        }

        check "avg_allocs_memory_usage_percentage" {
          source = "prometheus"
          query  = "avg(avg_over_time(nomad_client_allocs_memory_usage{exported_job=~\"analytics-svc\", namespace=~\"production\"}[2m])) / avg(avg_over_time(nomad_client_allocs_memory_allocated{exported_job=~\"analytics-svc\", namespace=~\"production\"}[2m])) * 100"

          strategy "target-value" {
            target = 70
          }
        }

      }
    }


    restart {
      attempts = 2
      interval = "30m"
      delay    = "15s"
      mode     = "fail"
    }

    task "service" {
      vault {
        policies = ["service"]
      }

      driver = "docker"

      config {
        image      = "redacted"
        force_pull = true

        auth {
          username       = "redacted"
          password       = "redacted"
          server_address = "registry.gitlab.com"
        }

        ports = ["http"]

        labels {
          environment = "production"
          service     = "${NOMAD_GROUP_NAME}"
        }

        logging {
          type = "fluentd"

          config {
            tag                   = "${NOMAD_TASK_NAME}.${NOMAD_JOB_NAME}"
            labels                = "environment,service"
            fluentd-address       = "tcp://127.0.0.1:9997"
            fluentd-async-connect = true
          }
        }
      }

      env {
        APP_ALLOWED_ORIGINS = "*"
        ROUTING_URL         = "redacted"
        APOLLO_GRAPH_REF    = "redacted"
        DB_HOST             = "redacted"
        DB_DATABASE         = "analytics"
        DB_STAGING_DATABASE = "analytics_staging"
      }

      template {
        change_mode = "restart"
        destination = "secrets/.env"
        env         = true
        splay       = "1m"
        wait {
          min = "1m"
        }

        data = <<EOF
{{ with secret "kv/data/ci/prod/env" }}
AUTH0_AUDIENCE={{ .Data.data.AUTH0_AUDIENCE }}
AUTH0_DOMAIN={{ .Data.data.AUTH0_DOMAIN }}
APOLLO_KEY={{ .Data.data.APOLLO_KEY }}{{end}}

{{ with secret "kv/data/ci/production/m2m" }}
MACHINE_AUTH_TOKEN={{ .Data.data.MACHINE_AUTH_TOKEN }}{{ end }}

{{ with secret "prod-database/creds/analytics" }}
DB_USER={{ .Data.username }}
DB_PASSWORD={{ .Data.password }}{{ end }}

EOF
      }

      resources {
        cpu    = 400
        memory = 250
      }

      service {
        name = "${NOMAD_JOB_NAME}"
        port = "http"

        tags = [
          "${NOMAD_NAMESPACE}",
          "${NOMAD_GROUP_NAME}",
          "${NOMAD_TASK_NAME}",
          "urlprefix-/production/${NOMAD_GROUP_NAME} strip=/production/${NOMAD_GROUP_NAME}",
          "scrape_service",
        ]

        check {
          name     = "alive"
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

2024-05-21T12:36:07.620Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=11145863-fb01-daea-f3dd-607a2a5ea73a task=service type=Started msg="Task started by client" failed=false
2024-05-21T12:36:05.622Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=11145863-fb01-daea-f3dd-607a2a5ea73a task=service type=Driver msg="Downloading image" failed=false
2024-05-21T12:36:05.615Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=11145863-fb01-daea-f3dd-607a2a5ea73a task=service type=Restarting msg="Task restarting in 0s" failed=false
2024-05-21T12:36:05.615Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=11145863-fb01-daea-f3dd-607a2a5ea73a task=service reason="" delay=0s
2024-05-21T12:36:05.600Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=11145863-fb01-daea-f3dd-607a2a5ea73a task=service type=Terminated msg="Exit Code: 137, Exit Message: \"Docker container exited with non-zero exit code: 137\"" failed=false
2024-05-21T12:36:00.211Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=11145863-fb01-daea-f3dd-607a2a5ea73a task=service type="Restart Signaled" msg="User requested running tasks to restart" failed=false
2024-05-21T11:13:04.862Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=11145863-fb01-daea-f3dd-607a2a5ea73a task=service type=Started msg="Task started by client" failed=false
2024-05-21T11:13:03.623Z [INFO]  agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/11145863-fb01-daea-f3dd-607a2a5ea73a/service/secrets/.env"
2024-05-21T11:13:02.395Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=11145863-fb01-daea-f3dd-607a2a5ea73a task=service type=Driver msg="Downloading image" failed=false
2024-05-21T11:13:02.387Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=11145863-fb01-daea-f3dd-607a2a5ea73a task=service type=Restarting msg="Task restarting in 0s" failed=false
2024-05-21T11:13:02.386Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=11145863-fb01-daea-f3dd-607a2a5ea73a task=service reason="" delay=0s
2024-05-21T11:13:02.357Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=11145863-fb01-daea-f3dd-607a2a5ea73a task=service type=Terminated msg="Exit Code: 137, Exit Message: \"Docker container exited with non-zero exit code: 137\"" failed=false
2024-05-21T11:12:56.853Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=11145863-fb01-daea-f3dd-607a2a5ea73a task=service type="Restart Signaled" msg="Template with change_mode restart re-rendered" failed=false
2024-05-21T11:12:45.308Z [INFO]  agent: (runner) rendered "(dynamic)" => "/opt/nomad/data/alloc/11145863-fb01-daea-f3dd-607a2a5ea73a/service/secrets/.env"
@tgross
Member

tgross commented May 23, 2024

@OfficerKoo your logs are mangled such that timestamps don't line up (and are in reverse order). But it looks like they show the template runner restarting the task at 2024-05-21T11:12:56.853Z, and the client was still trying to download the Docker image to start that task when the next re-render happened at 2024-05-21T11:13:03.623Z. In other words, it didn't restart the task for the second template re-render simply because the task from the re-render a few seconds earlier hadn't started yet.

Then I see a gap of 2 hours before there's a user-initiated restart. Most of what I can tell here from the logs looks right... Nomad can't restart a task that hasn't yet started. If I'm missing something, it might help if you could show this with debug-level logs that have consistent timestamps.

@OfficerKoo
Author

@tgross sorry about that, it looks like the logs got twisted up when I was exporting them. You are right, the second re-render happens a second after the Docker image starts downloading. What can we do to circumvent this? I already tried adding a 1 minute splay and a minimum 1 minute wait.

We will try to enable debug logs on one client, but that will probably happen next week.

@tgross
Member

tgross commented May 24, 2024

You are right, the second re-render happens a second after the Docker image starts downloading. What can we do to circumvent this? I already tried adding a 1 minute splay and a minimum 1 minute wait.

Ok, just wanted to make sure I was reading that correctly. The splay should help with this, so long as the splay length is longer than the window between the two re-renders + the task start.
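For illustration, here is a minimal sketch of a more conservative template stanza along those lines, assuming the window between the two re-renders plus the task start stays under a few minutes (the splay and wait values are placeholders, not values recommended in this thread):

template {
  change_mode = "restart"
  destination = "secrets/.env"
  env         = true

  # splay adds a random delay of up to this value before the change_mode action
  # runs, so it should comfortably exceed the observed re-render-to-start window.
  splay = "3m"

  # wait debounces rapid re-renders; max caps how long rendering can be deferred.
  wait {
    min = "1m"
    max = "3m"
  }

  data = <<EOF
{{ with secret "prod-database/creds/analytics" }}
DB_USER={{ .Data.username }}
DB_PASSWORD={{ .Data.password }}{{ end }}
EOF
}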

Taking a quick pass over the code, it looks like we call the Restart method and then immediately return to the loop that's watching for re-renders. That Restart method returns an error if the task isn't running... but the template re-render loop ignores the error, so it's never reported! So at the very least there's a bug where we should be logging the error so that you as the operator know what the heck is going on. I suspect this may have changed at some point, because now that I'm seeing this all together it looks like you could be running into what was described in #5459.

But fixing the log wouldn't fix the underlying problem... we could drop re-render events such that the task start races with any environment set by `template.env = true` (as you've done here). I suspect we want to queue up re-render events that we get from the template runner, ideally coalescing any that were received in the same restart window.

We will try to enable debug logs on one client, but that will probably happen next week.

Thanks! If you look at the logs and see anything you feel is too sensitive to share publicly, you can email them to [email protected], which is only visible to members of the Nomad engineering team.

@ygersie
Contributor

ygersie commented Jun 13, 2024

Hate to hijack the issue, but I've occasionally seen workloads not receiving a signal after updates as well. In my case I have a workload with dynamic certs. When the Nomad agent is restarted, templates with dynamic certs are always refreshed, but in some cases the signal event isn't triggered.

I see in the log:

2024-06-03T14:47:15.810Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=f8c245e8-c418-dc35-b3b3-43f558a41d74 task=task-1 type=Received msg="Task received by client" failed=false
2024-06-03T14:47:15.810Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=f8c245e8-c418-dc35-b3b3-43f558a41d74 task=task2 type=Received msg="Task received by client" failed=false
2024-06-03T14:47:16.965Z [INFO]  agent: (runner) rendered "(dynamic)" => "/data/podman/root/nomad/data/alloc/f8c245e8-c418-dc35-b3b3-43f558a41d74/task-1/secrets/server.key"
2024-06-03T14:47:16.976Z [INFO]  agent: (runner) rendered "(dynamic)" => "/data/podman/root/nomad/data/alloc/f8c245e8-c418-dc35-b3b3-43f558a41d74/task-2/secrets/server.crt"

As you can see, I'm missing the expected re-rendered message:

[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=f8c245e8-c418-dc35-b3b3-43f558a41d74 task=precursor type=Signaling msg="Template re-rendered" failed=false

It doesn't happen often, but often enough to occasionally break production workloads due to expired certificates. I also haven't been able to reproduce it by restarting the same agent that had the issue before.

The job's template splay is set to 0, so under normal circumstances a re-render would immediately trigger a signal.
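For reference, a minimal sketch of the kind of template stanza being described, assuming a Vault PKI-style secret and SIGHUP as the reload signal (the secret path, role, common name, and signal are illustrative assumptions, not taken from the actual job):

template {
  change_mode   = "signal"
  change_signal = "SIGHUP"   # assumed reload signal; the real job may use another
  destination   = "secrets/server.key"
  splay         = "0s"       # matches the splay = 0 described above

  data = <<EOF
{{ with secret "pki/issue/example-role" "common_name=service.example.internal" }}
{{ .Data.private_key }}{{ end }}
EOF
}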

@OfficerKoo
Author

@tgross Hi, we enabled debug logs but didn't catch anything obviously suspicious. It does look like the error happens when the container restarts, but the only suspicious part in the logs is:

2024-06-12 18:09:49.063	agent: (runner) checking template 8d86a9f697ea6be20429ce0f1cffc9ae
2024-06-12 18:09:49.063	agent: (runner) missing data for 4 dependencies
2024-06-12 18:09:49.064	agent: (runner) missing dependency: vault.read(kv/data/ci/prod/env)
2024-06-12 18:09:49.064	agent: (runner) missing dependency: vault.read(kv/data/ci/production/m2m)
2024-06-12 18:09:49.064	agent: (runner) missing dependency: vault.read(prod-database/creds/selectors)
2024-06-12 18:09:49.064	agent: (runner) missing dependency: vault.read(prod-database/creds/uploads)
2024-06-12 18:09:49.067	agent: (runner) add used dependency vault.read(kv/data/ci/prod/env) to missing since isLeader but do not have a watcher
2024-06-12 18:09:49.070	agent: (runner) add used dependency vault.read(kv/data/ci/production/m2m) to missing since isLeader but do not have a watcher
2024-06-12 18:09:49.070	agent: (runner) add used dependency vault.read(prod-database/creds/selectors) to missing since isLeader but do not have a watcher
2024-06-12 18:09:49.071	agent: (runner) add used dependency vault.read(prod-database/creds/uploads) to missing since isLeader but do not have a watcher
2024-06-12 18:09:49.071	agent: (runner) was not watching 4 dependencies

This happens right after the final config is rendered.
I also noticed that in the final config both wait and splay are set to 0 or disabled, which is definitely not the case in the job definition.
I attached the full logs; the allocation ID is 63378fde-1456-96b8-2fc5-09cbbb288624 and the template ID is 8d86a9f697ea6be20429ce0f1cffc9ae.
Explore-logs-2024-06-14 16_12_32.txt

Projects
Status: Needs Roadmapping
Development

No branches or pull requests

3 participants