Not reaching vault makes nomad re-schedule all allocations #2689
Hey @jorgemarey, Looking at the logs it appears that the allocations are being restarted because the nodes are unreachable by the Nomad servers (we use heartbeating to detect if clients are alive). You can see this in the server logs:
When Vault is down, Nomad will not start rescheduling allocations. What the client will do is keep attempting to renew the token until it expires. After that it will attempt to retrieve a new token, and once it has one it will take action based on the configured Vault change_mode. I just tried this and you get continued renewal attempts as expected. The logs start right after taking Vault down:
So I think something else is going on in your test environment!
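(For readers unfamiliar with the liveness model described above: Nomad servers consider a client dead when its heartbeats stop arriving within a TTL. The following is a minimal sketch of that TTL-based tracking pattern; the types and names are hypothetical, chosen for illustration, and are not Nomad's actual implementation.)

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// heartbeatTracker marks a node as down if it fails to heartbeat
// within the TTL -- a sketch of the liveness model described above,
// not Nomad's real code.
type heartbeatTracker struct {
	mu       sync.Mutex
	lastSeen map[string]time.Time
	ttl      time.Duration
}

// Heartbeat records a liveness signal from a client node.
func (h *heartbeatTracker) Heartbeat(nodeID string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.lastSeen[nodeID] = time.Now()
}

// Expired returns the nodes whose heartbeats have TTL'd; a server
// would transition these to "down" and reschedule their allocations.
func (h *heartbeatTracker) Expired() []string {
	h.mu.Lock()
	defer h.mu.Unlock()
	var down []string
	for id, seen := range h.lastSeen {
		if time.Since(seen) > h.ttl {
			down = append(down, id)
		}
	}
	return down
}

func main() {
	t := &heartbeatTracker{lastSeen: map[string]time.Time{}, ttl: 50 * time.Millisecond}
	t.Heartbeat("node-1")
	time.Sleep(100 * time.Millisecond) // simulate a missed heartbeat window
	fmt.Println(t.Expired())           // prints [node-1]
}
```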
Hi @dadgar, thank you for your answer. At first I thought that too: that allocations are being restarted due to the nodes being unreachable (as seen in the logs). But it felt weird that all of the nodes had lost contact at the same time, so I tried cutting the communication between the servers and vault, and it happened again. The behaviour I was expecting is what you describe. I'll try to see if there's something else going on (maybe it's something weird in my environment). Thanks a lot.
Hi @dadgar, sorry to bother you again with this, but I keep seeing this behaviour on the cluster I have set up. As you said, allocations are being restarted because the nodes are unreachable by the Nomad servers. That is what is happening with the allocations. The thing is, I don't know why all nodes are marked as unreachable at the same time when there's a problem reaching vault. If I block the connection to vault, all nodes are marked as down by the server. I was able to reproduce this (although not on my laptop with the dev server). Every time I do that, all nodes are set as unreachable within seconds. Some logs (after blocking the connection to vault). The nomad server leader:
These messages keep happening until I re-enable the connection to vault. Node 283e35b3-d42c-4b38-8864-4cc12c9dfa63:
Node 40625c55-98b6-4e4e-8a45-2b9ac171fdd1:
The rest of the nodes have the same kind of logs. I found the following weird:
Maybe I have some problem in my configuration, the set-up, or something else, but I keep seeing this (nodes unreachable) every time there's a network problem reaching vault. Can you think of something that could trigger this kind of behaviour? Thanks!
@jorgemarey I will reopen and try to reproduce post 0.6.0 and if it is happening, I will get a fix for 0.6.1!
Thanks a lot! I'll upgrade when it's released and test this.
Hi @dadgar, I just noticed this again on a cluster with nomad v0.6.2, consul 0.9.2 and vault 0.8.2. We just had a problem with consul (used as the storage backend for vault), and at that time we started seeing all allocations being re-scheduled. Some logs:
It seems to be the same problem. When our nomad cluster can't reach vault correctly, it marks all nodes as down. I don't know why this happens; maybe it's a problem with our set-up...
@jorgemarey Sorry, I haven't had time to reproduce this yet. Would you mind trying to distill it down to the simplest possible reproduction steps?
@jorgemarey I got this to reproduce! Thanks!
This PR removes locking around commonly accessed node attributes that do not need to be locked. The locking could cause nodes to TTL as the heartbeat code path was acquiring a lock that could be held for an excessively long time. An example of this is when Vault is inaccessible, since the fingerprint is run with a lock held but the Vault fingerprinter makes the API calls with a large timeout. Fixes #2689
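(To make the failure mode the PR describes concrete, here is a minimal, hypothetical Go sketch of the pattern: a single lock guards the node attributes, fingerprinting holds that lock across a slow external call, and the heartbeat path, which needs the same lock, stalls long enough for the server-side TTL to expire. All names here are illustrative assumptions, not Nomad's actual code.)

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type client struct {
	mu    sync.Mutex // one lock guarding the node attributes
	attrs map[string]string
}

// fingerprint simulates the buggy pattern: the lock is held for the
// full duration of a slow external call (e.g. Vault with a large timeout).
func (c *client) fingerprint() {
	c.mu.Lock()
	defer c.mu.Unlock()
	time.Sleep(2 * time.Second) // stand-in for a blocked Vault API call
	c.attrs["vault.accessible"] = "false"
}

// heartbeat needs the same lock to read node attributes, so it cannot
// proceed until fingerprint releases it -- long enough to miss the TTL.
func (c *client) heartbeat() time.Duration {
	start := time.Now()
	c.mu.Lock()
	defer c.mu.Unlock()
	return time.Since(start)
}

func main() {
	c := &client{attrs: map[string]string{}}
	go c.fingerprint()
	time.Sleep(10 * time.Millisecond) // let fingerprint grab the lock first
	fmt.Printf("heartbeat blocked for %s\n", c.heartbeat()) // ~2s
}
```

The fix described above removes that shared lock from the commonly read attributes, so a fingerprinter blocked on Vault can no longer delay the heartbeat path.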
Awesome! Thanks a lot @dadgar!
Hi,
I noticed that sometimes all the allocations placed on my nomad cluster are re-scheduled simultaneously. This happens very rarely, but it does happen. I suspect it is related to the Vault integration.
The setup is the following:
3 instances with nomad server and consul agent
3 instances with consul server and nomad client
a bunch of other instances with consul agent and nomad client.
I'm running consul 0.7.5 and nomad 0.5.6.
On another instance I have vault (just one server), and I have nomad configured to use that vault.
The log output when this happens:
logs-client1.txt (client with some allocations)
logs-sched1.txt
logs-sched2.txt
logs-sched3.txt (leader)
This happens when vault is not reachable, maybe due to some kind of network issue. Then nomad starts rescheduling all allocations (all of them were running fine).
I'm able to reproduce this behaviour just by setting a security group on the vault machine that denies access to the port where vault is running.
I don't know if this behaviour is expected, but it doesn't feel right that a momentary failure to reach vault should make nomad start moving allocations around.