
Automatically evaluate system jobs on a definable interval #11067

Closed
benvanstaveren opened this issue Aug 19, 2021 · 3 comments

@benvanstaveren

Proposal

It would be nice if Nomad would periodically (on a definable interval) re-evaluate system jobs. Re-evaluation is currently triggered when a node rejoins the cluster, but not in a few other cases we've observed.

Use-cases

A few times now we've had issues where Vault (or Consul) became unavailable on a node. This caused a running system job allocation to fail and not be restarted, even though the job spec says to never fail the job. Eventually you end up in a state where Nomad reports the job status as 'running' but it has zero allocations.

When Consul (or Vault) came back, the job was not re-evaluated, and no allocations were placed. It would be nice if job re-evaluation (for system jobs) were triggered on the return of Consul or Vault, or at a set interval.

Attempted Solutions

We currently have a cron job that runs nomad job eval every 30 minutes, but this isn't ideal.
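
For reference, the workaround looks roughly like the following crontab entry (the job name and Nomad address are placeholders for our setup, not anything specific from this issue):

# Hypothetical crontab entry: force a re-evaluation of the system job
# every 30 minutes. Job name and NOMAD_ADDR are placeholders.
*/30 * * * * NOMAD_ADDR=http://127.0.0.1:4646 nomad job eval my-system-job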

@lgfa29
Contributor

lgfa29 commented Aug 19, 2021

Thanks for the report @benvanstaveren.

I believe this was recently fixed by #11007, which will be available in the next release.

To test it I used this job:

job "system-test" {
  datacenters = ["dc1"]
  type        = "system"

  group "sleep" {
    task "sleep" {
      driver = "raw_exec"

      config {
        command = "/bin/bash"
        args    = ["${NOMAD_TASK_DIR}/sleep.sh"]
      }

      template {
        data = <<EOF
#!/usr/bin/env bash

while true;
do
  echo {{ key "test" }};
  sleep 5;
done
        EOF

        destination = "local/sleep.sh"
        change_mode = "restart"
      }
    }
  }
}

And these steps:

  1. nomad run system-test.nomad
  2. consul kv put test 1
  3. Wait for alloc to run
  4. Kill Consul
  5. Wait for alloc to fail
  6. Start Consul again
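
Between steps, the allocation state can be checked with the usual status commands (the job name matches the spec above; the alloc ID is a placeholder):

nomad job status system-test   # job summary and allocation list
nomad alloc status <alloc-id>  # details for a specific allocation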

In Nomad v1.1.3 the allocation stayed in the failed status, but with a build from main the allocation did get restarted.

If possible, could you test a build from main and see if your problem is fixed?
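
A rough sketch of building a dev binary from main (assuming a Go toolchain and the repo's usual make targets; adjust paths as needed):

git clone https://github.com/hashicorp/nomad.git
cd nomad
make bootstrap   # install build tooling, if not already done
make dev         # build a development binary into ./bin
./bin/nomad version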

@benvanstaveren
Author

That actually looks exactly like the issue we've been having. I'm currently on "vacation" (yeah, I make GitHub issues on vacation :D) so I won't be able to test right now, but I'm going to go out on a limb and guess this will solve our problem, so I'll close the issue :)

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 16, 2022