Nomad job using template stanza permanently dead after transient consul errors #2623

Closed
stefreak opened this issue May 6, 2017 · 33 comments · Fixed by #11606

Comments

@stefreak

stefreak commented May 6, 2017

Nomad version

Output from nomad version

Nomad v0.5.6

Operating system and Environment details

Ubuntu 16.04 LTS

Issue

Every week my job reproducibly turns "dead":

ID            = scorecard
Name          = scorecard
Type          = service
Priority      = 50
Datacenters   = cbk1
Status        = dead
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
web         2       0         0        0       0         0

Allocations
No allocations placed

I am not sure yet what the cause is, which is why I am providing so much info and so many logs here.

I suspect it could be connected with the fact that all nodes in the nomad cluster reboot regularly, automatically.

Nomad cluster looks like this:

syseleven@nomad-servicehost1:~$ consul members
Node                 Address            Status  Type    Build  Protocol  DC
nomad-loadbalancer0  192.168.2.13:8301  alive   client  0.8.1  2         cbk1
nomad-node0          192.168.2.15:8301  alive   client  0.8.1  2         cbk1
nomad-node1          192.168.2.14:8301  alive   client  0.8.1  2         cbk1
nomad-servicehost0   192.168.2.11:8301  alive   server  0.8.1  2         cbk1
nomad-servicehost1   192.168.2.10:8301  alive   server  0.8.1  2         cbk1
nomad-servicehost2   192.168.2.12:8301  alive   server  0.8.1  2         cbk1

servicehosts are consul and nomad servers, the rest are consul and nomad clients.

Reproduction steps

Not sure yet

Nomad Server & Client logs (if appropriate)

see attachment
scorecard.log.txt

Job file (if appropriate)

job "scorecard" {
  region      = "global"
  datacenters = ["cbk1"]
  priority    = 50

  type = "service"

  group "web" {
    count = 2

    task "scorecard-nginx" {
      driver = "docker"

      config {
        image = "nginx:latest"
        volumes = [
          "default.conf:/etc/nginx/conf.d/default.conf",
          "public_html:/var/www"
        ]
        port_map {
          http = 80
        }
      }

      service {
        name = "scorecard"
        port = "http"
        check {
          name     = "alive"
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }

      resources {
        cpu    = 2000 # 2 GHz
        memory = 256 # 256 MB
        network {
          mbits = 10
          port "http" {
          }
        }
      }

      template {
        destination   = "default.conf"
        change_mode   = "restart"
        data          = <<EOH

                        server {
                            listen       80;
                            server_name  localhost;

                            error_page   500 502 503 504  /50x.html;
                            index index.html;
                            location / {
                                root   /var/www;
                            }
                            location = /50x.html {
                                root   /usr/share/nginx/html;
                            }
                        }
                        EOH
      }

      artifact {
        source = "{{ HTTPS_ARTIFACT_URL }}/scorecard/public_html.tgz"
        destination = "."
      }

      template {
        source        = "public_html/index.html.tpl"
        destination   = "public_html/index.html"
        change_mode   = "noop"
      }
    }
  }
}
@dadgar
Contributor

dadgar commented May 8, 2017

Have you filtered the logs in some way? It is very hard to tell what is happening.

@stefreak
Author

@dadgar I am now monitoring that service to know the exact time it stops working. Next week I can then find out from the logs exactly what happened as it happens every week.

@stefreak
Author

stefreak commented May 14, 2017

@dadgar OK, like clockwork it happened again :) The services are down again since 06:48:15 UTC

nomad-loadbalancer0:~# nomad status
ID                 Type     Priority  Status
gitlab-runner      service  100       running
hubot              service  50        running
nginx              service  50        dead
rocketchat         service  50        running
scorecard          service  50        dead
scorecard-counter  batch    50        running

nginx should be running on nomad-loadbalancer, but it is "dead"; also scorecard is not running.

A nomad stop {nginx,scorecard}; nomad run {nginx,scorecard}.nomad would now fix the problem

From what I see there could have been transient network problems for a few minutes at that time. Probably consul returned 500s because it could not reach any server.

Some logs on nomad-loadbalancer0 (time is in utc):

May 14 06:42:20 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:42:20.140961 [DEBUG] http: Request /v1/agent/servers (202.961µs)
May 14 06:42:25 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:42:25.669056 [DEBUG] client: updated allocations at index 104937 (total 4) (pulled 0) (filtered 4)
May 14 06:42:25 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:42:25.669228 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 4)
May 14 06:42:30 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:42:30.142007 [DEBUG] http: Request /v1/agent/servers (28.333µs)
May 14 06:42:40 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:42:40.143386 [DEBUG] http: Request /v1/agent/servers (357.05µs)
May 14 06:42:50 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:42:50.145011 [DEBUG] http: Request /v1/agent/servers (27.24µs)
May 14 06:43:00 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:43:00.146532 [DEBUG] http: Request /v1/agent/servers (241.384µs)
May 14 06:43:10 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:43:10.147919 [DEBUG] http: Request /v1/agent/servers (226.812µs)
May 14 06:43:20 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:43:20.149039 [DEBUG] http: Request /v1/agent/servers (31.036µs)
May 14 06:43:30 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:43:30.150695 [DEBUG] http: Request /v1/agent/servers (26.596µs)
May 14 06:43:40 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:43:40.152461 [DEBUG] http: Request /v1/agent/servers (273.432µs)
May 14 06:43:50 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:43:50.154147 [DEBUG] http: Request /v1/agent/servers (210.807µs)
May 14 06:44:00 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:44:00.155406 [DEBUG] http: Request /v1/agent/servers (289.979µs)
May 14 06:44:10 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:44:10.156772 [DEBUG] http: Request /v1/agent/servers (37.66µs)
May 14 06:44:20 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:44:20.158156 [DEBUG] http: Request /v1/agent/servers (30.768µs)
May 14 06:44:30 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:44:30.159419 [DEBUG] http: Request /v1/agent/servers (220.105µs)
May 14 06:44:40 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:44:40.160583 [DEBUG] http: Request /v1/agent/servers (31.92µs)
May 14 06:44:50 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:44:50.162294 [DEBUG] http: Request /v1/agent/servers (257.235µs)
May 14 06:45:00 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:45:00.163968 [DEBUG] http: Request /v1/agent/servers (204.202µs)
May 14 06:45:10 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:45:10.165050 [DEBUG] http: Request /v1/agent/servers (27.475µs)
May 14 06:45:20 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:45:20.166408 [DEBUG] http: Request /v1/agent/servers (340.736µs)
May 14 06:45:30 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:45:30.168033 [DEBUG] http: Request /v1/agent/servers (28.915µs)
May 14 06:45:40 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:45:40.169261 [DEBUG] http: Request /v1/agent/servers (241.882µs)
May 14 06:45:50 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:45:50.170523 [DEBUG] http: Request /v1/agent/servers (243.546µs)
May 14 06:46:00 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:46:00.173718 [DEBUG] http: Request /v1/agent/servers (29.964µs)
May 14 06:46:10 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:46:10.175084 [DEBUG] http: Request /v1/agent/servers (247.54µs)
May 14 06:46:20 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:46:20.176312 [DEBUG] http: Request /v1/agent/servers (32.698µs)
May 14 06:46:30 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:46:30.180016 [DEBUG] http: Request /v1/agent/servers (2.268684ms)
May 14 06:46:40 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:46:40.184494 [DEBUG] http: Request /v1/agent/servers (33.153µs)
May 14 06:46:50 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:46:50.185996 [DEBUG] http: Request /v1/agent/servers (32.395µs)
May 14 06:47:00 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:47:00.187530 [DEBUG] http: Request /v1/agent/servers (37.275µs)
May 14 06:47:46 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:47:10.189581 [DEBUG] http: Request /v1/agent/servers (38.058µs)
May 14 06:47:46 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:47:16 [WARN] (view) health.service(scorecard|passing): Unexpected response code: 500 (rpc error: EOF) (retry attempt 1 after "250ms")
May 14 06:47:46 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:47:16 [WARN] (view) kv.block(scorecard/toilworkPercent): Unexpected response code: 500 (retry attempt 1 after "250ms")
May 14 06:47:46 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:47:16 [WARN] (view) kv.block(scorecard/midonetCountdown): Unexpected response code: 500 (retry attempt 1 after "250ms")
May 14 06:47:46 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:47:16 [WARN] (view) kv.block(scorecard/lastIncident): Unexpected response code: 500 (retry attempt 1 after "250ms")
May 14 06:47:46 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:47:16 [WARN] (view) kv.block(scorecard/errorbudget): Unexpected response code: 500 (retry attempt 1 after "250ms")
May 14 06:47:46 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:47:16 [WARN] (view) health.service(rocketchat|passing): Unexpected response code: 500 (rpc error: EOF) (retry attempt 1 after "250ms")
May 14 06:47:46 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:47:20.191015 [DEBUG] http: Request /v1/agent/servers (29.747µs)
May 14 06:47:46 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:47:30.192322 [DEBUG] http: Request /v1/agent/servers (31.648µs)
May 14 06:47:46 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:47:40.193297 [DEBUG] http: Request /v1/agent/servers (36.328µs)
May 14 06:47:46 nomad-loadbalancer0 nomad[1211]:     2017/05/14 06:47:46.091343 [DEBUG] client: RPC failed to server 192.168.2.10:4647: rpc error: stream closed

Nomad did not log anything else on that host after 06:47:46 UTC

Any other ideas where to look for what went wrong?

@stefreak
Author

I can confirm there was a transient network issue at that time and the virtual machines were probably isolated from each other for up to 10 minutes.

@stefreak
Author

nginx and scorecard services have in common that they are using consul_template service discovery and / or key value store functionality. All other services don't do that.

@stefreak
Author

stefreak commented May 14, 2017

The Consul cluster is healthy now, by the way, but it wasn't at the time the service stopped. The expected behaviour is:

  1. If you start the service for the first time and Consul is not available, Nomad waits until Consul is available, renders the templates, and spawns the jobs.
  2. If the service is running and consul_template has already rendered a file, then when Consul is not reachable Nomad simply does nothing and leaves the service as-is, waiting for Consul to become available again (leaving the job with potentially stale configuration instead of killing it).

also I'd expect recovery once consul returns to functioning state even if 2 is not the case ...
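
The two expectations above can be sketched roughly as follows. This is only an illustration of the desired behaviour, not Nomad's or consul-template's actual code: `StaleTolerantRenderer` and its `fetch` callback are hypothetical names, and a `ConnectionError` stands in for any transient Consul failure.

```python
from typing import Callable, Optional


class StaleTolerantRenderer:
    """Hypothetical helper: keeps the last good render when the backend fails."""

    def __init__(self, fetch: Callable[[], str]) -> None:
        # `fetch` is an assumed callback that returns the rendered template
        # text, or raises ConnectionError while the backend is unreachable.
        self._fetch = fetch
        self.last_rendered: Optional[str] = None

    def refresh(self) -> bool:
        """Try to re-render; return True only if the content changed."""
        try:
            rendered = self._fetch()
        except ConnectionError:
            # Backend unreachable: keep whatever was rendered before and
            # report "no change", so the caller never restarts or kills
            # the task because of a transient error.
            return False
        changed = rendered != self.last_rendered
        self.last_rendered = rendered
        return changed
```

With this shape, a caller would only act on `change_mode` when `refresh()` returns `True`. On first start `last_rendered` stays `None` until the first successful fetch, which corresponds to waiting for Consul before spawning the task.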

@stefreak
Author

This is a no-go for production IMO

@stefreak
Author

@dadgar sorry for my last comment, not very helpful of me to complain .... do you have enough info to understand what's going on?

@dadgar
Contributor

dadgar commented May 16, 2017

@stefreak Thanks for the report. I understand what is happening now. We have a default retry rate that is not exposed. For 0.6.0 I would like to expose the retry rate so it can be extended and optionally set to retry indefinitely.

@dadgar
Contributor

dadgar commented May 16, 2017

Solution is to expose the retry blocks: https://github.com/hashicorp/consul-template/blob/606581f922c6c796300e9221ef6af7e6ce2e2d1b/README.md#configuration-file-format

@stefreak
Author

stefreak commented May 16, 2017

OK, thanks, makes sense.

What I don't understand is why it's not the default to retry indefinitely, maybe with a maximum retry interval like 5 minutes? IMO transient errors should not result in permanent loss of service ...

Also, failing to contact consul should not kill the job – it does not if I use consul_template directly.

@stefreak
Author

stefreak commented May 16, 2017

The defaults of consul-template, together with a traditional setup such as systemd's Restart=on-failure, lead to indefinite retries with an exponential backoff and a maximum wait time of 8 seconds, which is pretty reasonable IMO.

If a nomad job fails for any reason, nomad (at least with restart mode delay) will also try to restart it forever, which is the default and IMO should also be the default for the template stanza.
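
For illustration, here is a minimal sketch of such a capped exponential backoff. It assumes the delay starts at 250ms and doubles on each attempt, which is consistent with the retry messages quoted in this thread; the 8 second cap is the figure mentioned above. `backoff_delays` is a hypothetical helper, not a consul-template or Nomad function.

```python
from itertools import islice
from typing import Iterator, Optional


def backoff_delays(base: float = 0.25, cap: float = 8.0,
                   max_attempts: Optional[int] = None) -> Iterator[float]:
    """Yield the wait in seconds before each retry attempt.

    The delay doubles on every attempt and is clamped at `cap`.
    max_attempts=None means "retry forever".
    """
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        yield min(base * (2 ** attempt), cap)
        attempt += 1


# A bounded schedule of five attempts: the task is given up on afterwards.
finite = list(backoff_delays(max_attempts=5))
# The first eight delays of the unbounded schedule, which plateaus at the cap.
forever = list(islice(backoff_delays(), 8))
```

Under these assumptions a bounded schedule of five attempts covers only `sum(finite)` seconds of outage in total, while the unbounded one keeps polling at the cap interval until the backend returns.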

@stefreak stefreak changed the title Nomad job using consul kv permanently dead probably after transient consul errors Nomad job using template stanza permanently dead after transient consul errors May 16, 2017
stefreak added a commit to stefreak/nomad that referenced this issue May 22, 2017
stefreak added a commit to stefreak/nomad that referenced this issue May 22, 2017
@stefreak
Author

stefreak commented May 29, 2017

@dadgar exposing retry will not be enough to fix all instances of this. I have now encountered another instance of this issue with my patched version of nomad (which retries forever).

For some reason it still stopped retrying...

I think I did not see the message yamux: keepalive failed: i/o deadline reached before

(background is the VM had stalled CPUs and I/O issues because of a hypervisor problem)

May 28 06:47:18 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:47:18.348451 [DEBUG] client: RPC failed to server 192.168.2.11:4647: rpc error: No cluster leader
May 28 06:47:23 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:47:23.376923 [DEBUG] client: RPC failed to server 192.168.2.12:4647: rpc error: No cluster leader
May 28 06:47:23 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:47:23.575040 [DEBUG] http: Request /v1/agent/servers (255.197µs)
May 28 06:47:28 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:47:28.391676 [DEBUG] client: RPC failed to server 192.168.2.10:4647: rpc error: No cluster leader
May 28 06:47:28 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:47:28.391723 [ERR] client: heartbeating failed. Retrying in 18.876272335s: failed to update status: 3 error(s) occurred:
May 28 06:47:28 nomad-loadbalancer0 nomad[6984]: * RPC failed to server 192.168.2.11:4647: rpc error: No cluster leader
May 28 06:47:28 nomad-loadbalancer0 nomad[6984]: * RPC failed to server 192.168.2.12:4647: rpc error: No cluster leader
May 28 06:47:28 nomad-loadbalancer0 nomad[6984]: * RPC failed to server 192.168.2.10:4647: rpc error: No cluster leader
May 28 06:47:28 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:47:28.393873 [DEBUG] client.consul: bootstrap contacting following Consul DCs: ["cbk1"]
May 28 06:47:33 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:47:33.576862 [DEBUG] http: Request /v1/agent/servers (24.23µs)
May 28 06:47:43 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:47:43.580115 [DEBUG] http: Request /v1/agent/servers (1.700357ms)
May 28 06:47:50 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:47:50 [ERR] yamux: keepalive failed: i/o deadline reached
May 28 06:47:50 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:47:50.918163 [ERR] client.consul: error discovering nomad servers: 3 error(s) occurred:
May 28 06:47:50 nomad-loadbalancer0 nomad[6984]: * rpc error: No cluster leader
May 28 06:47:50 nomad-loadbalancer0 nomad[6984]: * rpc error: No cluster leader
May 28 06:47:50 nomad-loadbalancer0 nomad[6984]: * rpc error: EOF
May 28 06:47:53 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:47:53.581774 [DEBUG] http: Request /v1/agent/servers (183.909µs)
May 28 06:47:59 nomad-loadbalancer0 nomad[6984]: 2017/05/28 06:47:59 [ERR] yamux: keepalive failed: i/o deadline reached
May 28 06:48:00 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:48:00 [WARN] (view) health.service(rocketchat|passing): Unexpected response code: 500 (rpc error: EOF) (retry attempt 2 after "500ms")
May 28 06:48:01 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:48:01 [WARN] (view) health.service(rocketchat|passing): Unexpected response code: 500 (rpc error: rpc error: stream closed) (retry attemp
May 28 06:48:03 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:48:03.583229 [DEBUG] http: Request /v1/agent/servers (32.213µs)
May 28 06:48:09 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:48:09 [WARN] (view) health.service(rocketchat|passing): Unexpected response code: 500 (rpc error: No cluster leader) (retry attempt 4 aft
May 28 06:48:13 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:48:13.584402 [DEBUG] http: Request /v1/agent/servers (25.852µs)
May 28 06:48:23 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:48:23 [ERR] yamux: keepalive failed: i/o deadline reached
May 28 06:48:23 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:48:23.164905 [DEBUG] client: RPC failed to server 192.168.2.11:4647: rpc error: EOF
May 28 06:48:23 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:48:23.164933 [DEBUG] client: RPC failed to server 192.168.2.11:4647: rpc error: EOF
May 28 06:48:23 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:48:23.585810 [DEBUG] http: Request /v1/agent/servers (179.973µs)
May 28 06:48:28 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:48:28.174336 [DEBUG] client: RPC failed to server 192.168.2.12:4647: rpc error: No cluster leader
May 28 06:48:29 nomad-loadbalancer0 nomad[6984]: 2017/05/28 06:48:29 [ERR] yamux: keepalive failed: i/o deadline reached
lines 956-1001/1001 (END)

@stefreak
Author

@dadgar this time restarting nomad resolved the issue. What can I do to get more debug info next time?

@dadgar
Contributor

dadgar commented May 30, 2017

@stefreak Your Consul and Nomad cluster lost quorum. The retry behavior would also have covered this:

May 28 06:48:00 nomad-loadbalancer0 nomad[6984]:     2017/05/28 06:48:00 [WARN] (view) health.service(rocketchat|passing): Unexpected response code: 500 (rpc error: EOF) (retry attempt 2 after "500ms")

Restarting Nomad was just a coincidence; the problem with the template was that Consul queries were failing because of the lack of quorum.

@stefreak
Author

@dadgar I think this is not covered, because this was using my patched version of nomad with unlimited retry on all servers and clients: https://github.com/stefreak/nomad/tree/v0.5.6-patched

@dadgar
Contributor

dadgar commented Jun 20, 2017

@stefreak At least in the logs provided, it does not appear that consul-template gave up on retrying. It was just that it could not render because Consul did not have a leader.

@stefreak
Author

@dadgar I captured the logs one day later and there were no more log entries (lines 956-1001/1001 (END))

So somehow nomad locked up and the job was down, I think for another reason.

Not sure if I can reproduce this behaviour though, if I have a lot of time some day I will try by simulating a very slow / locked up storage.

@stevenscg

I believe that I've run into this issue as well. Nomad 0.5.6. Consul 0.8.5.

I updated my test cluster yesterday to Consul 0.8.5 and noticed my service jobs died around that same time. The previous allocations have all been GCd this morning, so I just have the remote logs to go on.

The logs show attempts with retries to the health and KV APIs all reaching maximum retries at/around 5s:

2017/07/05 21:27:35 [ERR] (view) health.service(foo-php-fpm|passing): Get http://169.254.1.1:8500/v1/health/service/foo-php-fpm?index=30674436&passing=1&stale=&wait=60000ms: dial tcp 169.254.1.1:8500: getsockopt: connection refused (exceeded maximum retries)
....
2017/07/05 21:27:31 [WARN] (view) health.service(foo-php-fpm|passing): Get http://169.254.1.1:8500/v1/health/service/foo-php-fpm?index=30674436&passing=1&stale=&wait=60000ms: dial tcp 169.254.1.1:8500: getsockopt: connection refused (retry attempt 5 after "4s")

We regularly upgrade Consul to each point release. We've had these service jobs end up dead like this in the past, but haven't correlated it to the consul agent restart. I'll watch for it more closely on future upgrades.

This is also a test cluster with smaller consul and nomad instances and the baseline raft performance settings for those smaller systems. The larger production systems may be able to restart the local consul agent within the ~5 seconds' worth of retries currently available.

@stevenscg

I just did some tests for this on a nomad worker in my test cluster and can reproduce the issue.

With 2 task groups of the same application running on 2 different workers, introduce consul agent availability issues on one of the workers....

Test 1: Restart the local consul agent (using systemd "restart")
This was handled with no issues reported and both task groups remained healthy.

Test 2: Stop the local consul agent, wait ~7s, start the local consul agent
This resulted in the tasks failing on the worker node.
Logs show similar patterns of retries through approximately 5s and then failure.

This job is a "service" job and does not currently specify a "restart" stanza. I'm looking at the restart stanza docs and I think the default restart policy should apply here.

@stevenscg

I tried adding a "restart" stanza to the task groups in my test job, but that doesn't change the behavior.

The "Test 2" scenario still fails tasks every time. I've also noticed that the "Test 1" scenario will sometimes fail tasks depending on how long it takes the consul agent to come back up.

@stefreak @dadgar Let me know if there's anything else I can test or repro for this.

Having the consul-template retry controls available would be great for tuning certain jobs, but I'd also like to understand if nomad should be restarting these tasks once the local consul agent comes up again.

Should the nomad agent's overall consul retry options be configurable (more than 5 retries / 5 seconds, etc.)?

Should the tasks be killed in the first place because the consul agent is down? Mine use the docker driver, if it matters.

@dadgar
Contributor

dadgar commented Jul 14, 2017

Just to update, I would like to get this fixed in 0.6.1

@stefreak
Author

@dadgar does that mean you are working on it right now? Just asking so we don't duplicate the effort

@dadgar
Contributor

dadgar commented Jul 14, 2017

@stefreak Nope haven't started yet. Busy with polishing 0.6.0 and have other things I will be personally tackling in 0.6.1

@danlsgiga
Contributor

Hit this issue, but with slightly different behaviour. I have a couple of tree | explode calls on consul keys to render a template. As soon as I try to add a new key that has subkeys expected by the template, the allocations fail to render the template, and even with change_mode="noop" all the allocations become DEAD; only resubmitting the job makes them run again. I'd expect Nomad to stop / restart the allocation only if the template renders successfully.

@danlsgiga
Contributor

Worth mentioning that I'm on Nomad v0.8.0

@danlsgiga
Contributor

Any updates on this? I had a full Nomad outage because we lost Consul leadership, and all the running jobs entered the failed state because Nomad was unable to read from the Consul K/V to render the templates.

It would be nice to have a way to tell Nomad to do nothing to running allocations if template rendering fails.

@jcdyer

jcdyer commented Dec 21, 2021

@danlsgiga We've managed to pinpoint the code that causes this in nomad. See my comments from a couple of days ago on #11209. In short: there is no way to prevent a full task outage if you depend on the restart or signal change_mode for any of your templates. Changing it to noop should prevent that, but may be undesirable if you need template reloading.

@DerekStrickland
Contributor

Reopening per user request (@jcdyer). This comment in #11209 contains the code in question and this comment contains a suggested change in behavior. We'll look into the suggested change and consider the impact.

@tgross
Member

tgross commented Aug 22, 2022

Closed by #13041

@tgross tgross closed this as completed Aug 22, 2022
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 21, 2022
7 participants