Nomad job using template stanza permanently dead after transient consul errors #2623
Comments
Have you filtered the logs in some way? It is very hard to tell what is happening.
@dadgar I am now monitoring that service to know the exact time it stops working. Next week I can then find out from the logs exactly what happened, as it happens every week.
@dadgar OK, like clockwork it happened again :) The services are down again since 06:48:15 UTC
nginx should be running on nomad-loadbalancer, but it is "dead"; also scorecard is not running. From what I see there could have been transient network problems for a few minutes at that time. Probably consul returned 500s because it could not reach any server. Some logs on nomad-loadbalancer0 (times are in UTC):
Nomad did not log anything else on that host after 06:47:46 UTC. Any other ideas where to look for what went wrong?
I can confirm there was a transient network issue at that time and the virtual machines were probably isolated from each other for up to 10 minutes.
nginx and scorecard services have in common that they are using consul_template service discovery and/or key-value store functionality. All other services don't do that.
The consul cluster is now healthy by the way, but it wasn't at the time the service stopped. Expected behaviour is:
Also I'd expect recovery once consul returns to a functioning state, even if 2 is not the case ...
This is a no-go for production IMO.
@dadgar sorry for my last comment, not very helpful of me to complain ... do you have enough info to understand what's going on?
@stefreak Thanks for the report. I understand what is happening now. We have a default retry rate that is not exposed. For 0.6.0 I would like to expose the retry rate so it can be extended and optionally set to retry indefinitely.
Solution is to expose the retry rate.
OK, thanks, makes sense. What I don't understand is why it's not the default to retry indefinitely, maybe with a maximum retry interval like 5 minutes? IMO transient errors should not result in permanent loss of service ... Also, failing to contact consul should not kill the job – it does not if I use consul_template directly.
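For reference, when consul-template runs standalone, its own configuration exposes exactly this kind of retry control. A minimal sketch with illustrative values (this is consul-template's config file, not the Nomad template stanza):

```hcl
# consul-template config file (standalone usage), illustrative values
consul {
  address = "127.0.0.1:8500"

  retry {
    # Keep retrying failed Consul requests instead of giving up.
    enabled     = true
    # 0 retries indefinitely; a positive number caps the attempts.
    attempts    = 0
    # Exponential backoff between attempts, capped at max_backoff.
    backoff     = "250ms"
    max_backoff = "1m"
  }
}
```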
The defaults of consul-template, together with a traditional setup like systemd's restart behaviour, already recover from transient errors. If a nomad job fails for any reason, nomad (at least with restart mode delay) will also try to restart it forever, which is the default and IMO should also be the default for the template stanza.
@dadgar Exposing retry will not be enough to fix all instances of this; I now encountered another instance of this issue with my patched version of nomad (retries forever). For some reason it still stopped retrying... I think I did not see the message. (Background: the VM had stalled CPUs and I/O issues because of a hypervisor problem.)
@dadgar This time restarting nomad resolved the issue. What can I do to get more debug info next time?
@stefreak Your Consul and Nomad cluster lost quorum. The retry behavior would also have covered this:
Restarting Nomad was just a coincidence; the problem with the template was that Consul queries were failing because of lack of quorum.
@dadgar I think this is not covered, because this was using my patched version of nomad with unlimited retry on all servers and clients: https://github.com/stefreak/nomad/tree/v0.5.6-patched
@stefreak At least in the logs provided, it does not appear like the consul-template gave up on retrying. It was just that it could not render because Consul did not have a leader.
@dadgar I captured the logs one day later and there were no more log entries. So somehow nomad locked up and the job was down, I think for another reason. Not sure if I can reproduce this behaviour though; if I have a lot of time some day I will try by simulating a very slow / locked-up storage.
I believe that I've run into this issue as well. Nomad 0.5.6. Consul 0.8.5. I updated my test cluster yesterday to Consul 0.8.5 and noticed my service jobs died around that same time. The previous allocations have all been GCd this morning, so I just have the remote logs to go on. The logs show attempts with retries to the health and KV APIs all reaching maximum retries at/around 5s:
We regularly upgrade Consul to each point release. We've had these service jobs end up dead like this in the past, but haven't correlated it to the consul agent restart. I'll watch for it more closely on future upgrades. This is also a test cluster with smaller consul and nomad instances and the baseline raft performance settings for those smaller systems. The larger production systems may be able to restart the local consul agent within the ~5 seconds' worth of retries currently available.
I just did some tests for this on a nomad worker in my test cluster and can reproduce the issue. With 2 task groups of the same application running on 2 different workers, introduce consul agent availability issues on one of the workers:
Test 1: Restart the local consul agent (using systemd "restart").
Test 2: Stop the local consul agent, wait ~7s, start the local consul agent.
This job is a "service" job and does not currently specify a "restart" stanza. I'm looking at the restart stanza docs and I think the default restart policy should apply here.
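For context, a group-level restart stanza with mode = "delay" looks roughly like the sketch below (illustrative values, hypothetical job, not the actual test job). With mode = "delay" the task keeps being restarted after the attempts are exhausted rather than being marked as failed:

```hcl
# Group-level restart policy, illustrative values
group "app" {
  restart {
    attempts = 2      # restarts allowed within the interval
    interval = "30m"
    delay    = "15s"  # wait between restarts
    # "delay": once attempts are exhausted, wait for the interval to
    # elapse and keep trying instead of marking the task as failed.
    mode     = "delay"
  }
}
```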
I tried adding a "restart" stanza to the task groups in my test job, but that doesn't change the behavior. The "Test 2" scenario still fails tasks every time. I've also noticed that the "Test 1" scenario will sometimes fail tasks depending on how long it takes the consul agent to come back up. @stefreak @dadgar Let me know if there's anything else I can test or repro for this. Having the consul-template retry controls available would be great for tuning certain jobs, but I'd also like to understand whether nomad should be restarting these tasks once the local consul agent comes up again. Should the nomad agent's overall consul retry options be configurable (more than 5 retries / 5 seconds, etc.)? Should the tasks be killed in the first place because the consul agent is down? Mine use the docker driver, if it matters.
Just to update, I would like to get this fixed in 0.6.1 |
@dadgar does that mean you are working on it right now? Just asking so we don't duplicate the effort |
@stefreak Nope, haven't started yet. Busy with polishing 0.6.0, and I have other things I will be personally tackling in 0.6.1.
Hit this issue but with a slightly different behaviour. I have a couple ...
Worth mentioning that I'm on Nomad v0.8.0.
Any updates on this? I had a full Nomad outage because we lost Consul leadership, and all the running jobs entered the failed state because Nomad was unable to read from the Consul K/V store to render the templates. It would be nice to have a way to tell Nomad to do nothing to running allocations if template rendering fails.
@danlsgiga We've managed to pinpoint the code that causes this in nomad. See my comments from a couple of days ago on #11209. In short: there is no way to prevent a full task outage if you depend on the restart or signal change_mode for any of your templates. Changing it to noop should prevent that, but may be undesirable if you need template reloading.
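The noop workaround mentioned above is set per template in the job file; a minimal sketch with hypothetical paths:

```hcl
# Per-template setting inside a task, hypothetical paths
template {
  source      = "local/app.conf.tpl"
  destination = "local/app.conf"
  # "noop": re-renders (and render failures) never restart or signal
  # the task; the trade-off is the task won't pick up config changes.
  change_mode = "noop"
}
```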
Reopening per user request (@jcdyer). This comment in #11209 contains the code in question and this comment contains a suggested change in behavior. We'll look into the suggested change and consider the impact. |
Closed by #13041 |
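Note: recent Nomad releases also expose consul-template's retry settings in the client agent configuration. The keys below reflect that config shape as I understand it, so verify the exact names against the documentation for your Nomad version:

```hcl
# Nomad client agent configuration; key names should be checked
# against the docs for the Nomad version in use
client {
  template {
    consul_retry {
      attempts    = 0       # 0 = retry Consul failures indefinitely
      backoff     = "250ms"
      max_backoff = "1m"
    }
  }
}
```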
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
Nomad version
Output from `nomad version`:
Operating system and Environment details
Ubuntu 16.04 LTS
Issue
Every week my job reproducibly turns "dead":
I am not sure yet what the cause is, that's why I provide so much info and logs here.
I suspect it could be connected with the fact that all nodes in the nomad cluster reboot regularly, automatically.
Nomad cluster looks like this:
servicehosts are consul and nomad servers, the rest are consul and nomad clients.
Reproduction steps
Not sure yet
Nomad Server & Client logs (if appropriate)
see attachment
scorecard.log.txt
Job file (if appropriate)
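The original job file was not preserved in this thread. Purely as a stand-in, a minimal hypothetical service job that renders an nginx upstream from Consul via the template stanza (not the reporter's actual job) could look like:

```hcl
# Hypothetical minimal job, not the reporter's actual job file
job "nginx" {
  datacenters = ["dc1"]
  type        = "service"

  group "lb" {
    task "nginx" {
      driver = "docker"

      config {
        image   = "nginx:stable"
        volumes = ["local/nginx.conf:/etc/nginx/conf.d/default.conf"]
      }

      template {
        destination = "local/nginx.conf"
        change_mode = "restart"
        data        = <<EOT
upstream scorecard {
{{- range service "scorecard" }}
  server {{ .Address }}:{{ .Port }};
{{- end }}
}
server {
  listen 80;
  location / {
    proxy_pass http://scorecard;
  }
}
EOT
      }
    }
  }
}
```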