periodic.prohibit_overlap has unexpected behavior with blocked templating #12478

Closed
tgross opened this issue Apr 6, 2022 · 5 comments
Labels
stage/waiting-reply theme/batch Issues related to batch jobs and scheduling type/bug

tgross commented Apr 6, 2022

In #12147 (comment) @next-jesusmanuelnavarro wrote:

I think I may have found a similar situation that leads to the opposite result. In my case it is also a batch job (meant to run daily).

The job "hangs" at the templating step (in my case, because it can't find a vault-based secret which is not there) and never retries.

I mean:

  1. On the first run, it tries to render the template with the Vault secret (some/path/secret) and stays in the templating state, never failing.
  2. If I then create the 'some/path/secret' secret in Vault, the batch job doesn't launch 24 hours later because of the first "blocked" run.

Can this behaviour be another aspect of this issue?

The 'periodic' and 'restart' stanzas from my job follow (while testing this I set the job to run every 5 minutes). The current restart mode is "fail", but I also tested with 'mode=delay' with the same results:

periodic {
	cron             = "*/5 * * * *"
	prohibit_overlap = "true"
}
(...)
restart {
	interval = "5m"
	attempts = 3
	delay    = "30s"
	mode     = "fail"
}
@tgross tgross added type/bug stage/waiting-reply theme/batch Issues related to batch jobs and scheduling labels Apr 6, 2022

tgross commented Apr 6, 2022

@next-jesusmanuelnavarro it sounds like your issue is because you've got prohibit_overlap here with a job that can't ever fail. I'm not sure there's a way around that other than ensuring that the job can be failed if it's blocked waiting for Vault secrets. Can you describe your use case for not having the batch job fail if it's waiting for secrets?

@next-jesusmanuelnavarro

First of all, thanks a lot for being so kind as to open this issue yourself.

The 'prohibit_overlap = "true"' is intentional: because of the nature of the job, no two instances should ever run at the same time.

Regarding the failure, those Vault secrets are used in the HCL itself, i.e.:

 template {
   data = <<EOH
     {{ with secret "some/nonexistent/vault/path" }}
     USERNAME = "{{ .Data.data.username }}"
     PASSWORD = "{{ .Data.data.password }}"
     {{ end }}
   EOH
   destination = "${NOMAD_SECRETS_DIR}/file.env"
   env = true
 }

The above leaves the task stuck:

  1. Received: Task received by client
  2. Task setup: Building Task Directory
  3. Template: Missing: vault.read(some/nonexistent/vault/path) (and stays forever in "pending" state)

Since the job never gets the chance to run, it can't fail either.
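
To make the report easier to reproduce, the fragments quoted above can be assembled into one jobspec. This is only a sketch: the job, group, and task names, the `busybox:1` image, the command, and the `my-policy` Vault policy are placeholders that do not come from the original job, and the env lines are written without spaces around `=` as a conservative choice.

```hcl
# Hypothetical reproduction. All names, the image, and the Vault policy
# are placeholders, not taken from the reporter's actual job.
job "periodic-batch" {
  datacenters = ["dc1"]
  type        = "batch"

  periodic {
    cron             = "*/5 * * * *"
    prohibit_overlap = true
  }

  group "group" {
    restart {
      interval = "5m"
      attempts = 3
      delay    = "30s"
      mode     = "fail"
    }

    task "task" {
      driver = "docker"

      config {
        image   = "busybox:1"
        command = "/bin/sh"
        args    = ["-c", "echo running; sleep 10"]
      }

      vault {
        policies = ["my-policy"]
      }

      # The secret path does not exist, so the template never renders
      # and the task never reaches the "Started" event.
      template {
        data        = <<EOH
{{ with secret "some/nonexistent/vault/path" }}
USERNAME={{ .Data.data.username }}
PASSWORD={{ .Data.data.password }}
{{ end }}
EOH
        destination = "secrets/file.env"
        env         = true
      }
    }
  }
}
```

With `prohibit_overlap` set, a child job whose task is still waiting on the template counts as running, which matches the behaviour described in the first comment: later launches are skipped for as long as that first run stays blocked.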


tgross commented Apr 15, 2022

Hi @next-jesusmanuelnavarro ok, that all makes sense. This blocking is a feature of consul-template. If you take a look at the templating language docs you'll see there are some options for setting default values. You might want to consider something like that.
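
As an illustration of what "default values" look like in the templating language, consul-template provides `keyOrDefault` for Consul KV lookups, which renders a fallback instead of blocking when the key is absent. The sketch below assumes a Consul KV key named `myapp/username`, which is hypothetical. Note that this function reads from Consul, not Vault; the thread does not confirm an equivalent for the `secret` function, so treat this as an example of the mechanism rather than a fix for the Vault case.

```hcl
# Sketch only: "myapp/username" is a hypothetical Consul KV key.
# keyOrDefault applies to Consul KV, not to Vault secrets.
template {
  data        = <<EOH
USERNAME={{ keyOrDefault "myapp/username" "nobody" }}
EOH
  destination = "secrets/file.env"
  env         = true
}
```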

I thought that the new template.wait configuration that was added in #11606 (for Nomad 1.2.4) might help but apparently that doesn't apply here because Vault doesn't support blocking queries.
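
For reference, the task-level `wait` block mentioned above looks roughly like the following. The durations, the Consul KV key, and the destination are arbitrary illustrations. As I understand it, `wait` bounds how long the template runner lets its data settle before rendering, so it is not a timeout for a dependency that never appears, which fits the observation that it does not help with a missing Vault secret.

```hcl
# Sketch of the template.wait block (Nomad 1.2.4 and later).
# The key, destination, and durations are placeholders.
template {
  data        = <<EOH
KEY={{ key "myapp/key" }}
EOH
  destination = "local/index.txt"

  wait {
    min = "5s"
    max = "30s"
  }
}
```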

That being said, when I tried to reproduce the same setup, the task eventually times out after 5 minutes. Can you share a more complete jobspec, your client configuration, the full template configuration (including any wait blocks), and vault configuration, redacted as necessary? Also, which version of Nomad are you using?


Here's my setup:

Nomad config for Vault

Running Vault in dev mode:

vault server -dev \
             -dev-listen-address 0.0.0.0:8200 \
             $@

Nomad vault config:

vault {
  enabled = true
  address = "http://127.0.0.1:8200"
  token = "$token"
  allow_unauthenticated = true
}

Resulting node attributes:

$ nomad node status -verbose e60 | grep vault
vault.accessible                 = true
vault.cluster_id                 = b234a255-5a34-6eb5-1d03-9af61444592a
vault.cluster_name               = vault-cluster-4dd62f68
vault.version                    = 1.10.0

Vault setup

$ export VAULT_TOKEN=$token
$ export VAULT_ADDR=http://127.0.0.1:8200

$ vault secrets enable -path=secrets kv-v2
Success! Enabled the kv-v2 secrets engine at: secrets/

# make sure we have a valid secret so we can verify we're set up correctly
$ vault kv put secrets/myapp key=xyzzy
Key                Value
---                -----
created_time       2022-04-15T20:09:38.10578154Z
custom_metadata    <nil>
deletion_time      n/a
destroyed          false
version            1

$ vault policy write myproject ./policy.hcl
Success! Uploaded policy: myproject

$ vault policy read myproject
path "secrets/data/myapp" {
  capabilities = ["read"]
}

First we'll use a working jobspec to demonstrate the happy case:

jobspec
job "batch" {
  datacenters = ["dc1"]
  type        = "batch"

  group "group" {

    task "task" {

      driver = "docker"

      config {
        image   = "busybox:1"
        command = "/bin/sh"
        args    = ["-c", "cat local/index.txt; sleep 300; echo done"]
      }

      vault {
        policies = ["myproject"]
      }

      template {
        data        = <<EOH
     {{ with secret "secrets/myapp" }}
     KEY = "{{ .Data.data.key }}"
     {{ end }}
   EOH
        destination = "local/index.txt"
      }

      resources {
        cpu    = 128
        memory = 128
      }

    }
  }
}

Run our batch job with a working secret:

$ nomad job run ./batch.nomad
==> 2022-04-15T16:21:23-04:00: Monitoring evaluation "b019500f"
    2022-04-15T16:21:23-04:00: Evaluation triggered by job "batch"
==> 2022-04-15T16:21:24-04:00: Monitoring evaluation "b019500f"
    2022-04-15T16:21:24-04:00: Allocation "dd7b0f77" created: node "e60eac0d", group "group"
    2022-04-15T16:21:24-04:00: Evaluation status changed: "pending" -> "complete"
==> 2022-04-15T16:21:24-04:00: Evaluation "b019500f" finished with status "complete"

$ nomad alloc status dd7
ID                  = dd7b0f77-9c38-a197-b161-6b1f36e893d3
...
Recent Events:
Time                       Type        Description
2022-04-15T16:21:24-04:00  Started     Task started by client
2022-04-15T16:21:23-04:00  Task Setup  Building Task Directory
2022-04-15T16:21:23-04:00  Received    Task received by client

$ nomad alloc fs dd7 task/local/index.txt

     KEY = "xyzzy"

Then I updated the job to have a non-existent secret and it blocks as expected:

$ nomad job run ./batch.nomad
==> 2022-04-15T16:25:44-04:00: Monitoring evaluation "963bdf1f"
    2022-04-15T16:25:44-04:00: Evaluation triggered by job "batch"
    2022-04-15T16:25:44-04:00: Allocation "4068a6bd" created: node "e60eac0d", group "group"
==> 2022-04-15T16:25:45-04:00: Monitoring evaluation "963bdf1f"
    2022-04-15T16:25:45-04:00: Evaluation status changed: "pending" -> "complete"
==> 2022-04-15T16:25:45-04:00: Evaluation "963bdf1f" finished with status "complete"

$ nomad alloc status 406
ID                  = 4068a6bd-a329-4e6d-4b6f-9fb7bf3077ca
...

Recent Events:
Time                       Type        Description
2022-04-15T16:25:52-04:00  Template    Missing: vault.read(secrets/nonexistent)
2022-04-15T16:25:49-04:00  Task Setup  Building Task Directory
2022-04-15T16:25:44-04:00  Received    Task received by client
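
The edited jobspec for this run is not shown. Judging from the `Missing: vault.read(secrets/nonexistent)` event, the change presumably amounted to pointing the template at a path that was never written, along these lines (a reconstruction, not the exact file that was used):

```hcl
# Presumed change to the template block of the jobspec above:
# "secrets/nonexistent" was never created in Vault.
template {
  data        = <<EOH
     {{ with secret "secrets/nonexistent" }}
     KEY = "{{ .Data.data.key }}"
     {{ end }}
   EOH
  destination = "local/index.txt"
}
```

The `myproject` policy only grants read on `secrets/data/myapp`, so this path is both missing and outside the policy, which is consistent with the 403 that shows up once the template gives up.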

But after 5 minutes I get a permissions failure as the token gets revoked:

$ nomad alloc status 406
ID                   = 4068a6bd-a329-4e6d-4b6f-9fb7bf3077ca
Eval ID              = 963bdf1f
Name                 = batch.group[0]
Node ID              = e60eac0d
Node Name            = proxy
Job ID               = batch
Job Version          = 3
Client Status        = failed
Client Description   = Failed tasks
Desired Status       = stop
Desired Description  = alloc was rescheduled because it failed
Created              = 5m20s ago
Modified             = 6s ago
Replacement Alloc ID = d80056e5

Task "task" is "dead"
Task Resources
CPU      Memory   Disk     Addresses
128 MHz  128 MiB  300 MiB

Task Events:
Started At     = N/A
Finished At    = 2022-04-15T20:30:53Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2022-04-15T16:30:53-04:00  Killing     Sent interrupt. Waiting 5s before force killing
2022-04-15T16:30:53-04:00  Killing     Template failed: vault.read(secrets/nonexistent): vault.read(secrets/nonexistent): Error making API request.

URL: GET http://127.0.0.1:8200/v1/secrets/data/nonexistent
Code: 403. Errors:

* 1 error occurred:
        * permission denied
2022-04-15T16:25:52-04:00  Template    Missing: vault.read(secrets/nonexistent)
2022-04-15T16:25:49-04:00  Task Setup  Building Task Directory
2022-04-15T16:25:44-04:00  Received    Task received by client


tgross commented May 2, 2022

We've answered this issue as best we can, so I'm going to close it out. If you have more information about the questions I've asked above, please feel free to reopen the issue! Thanks!

@tgross tgross closed this as completed May 2, 2022

github-actions bot commented Oct 8, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 8, 2022