Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ACL token rotation causes check status inconsistency #4372

Open
danaps opened this issue Jul 10, 2018 · 34 comments
Open

ACL token rotation causes check status inconsistency #4372

danaps opened this issue Jul 10, 2018 · 34 comments
Labels
theme/acls ACL and token generation theme/health-checks Health Check functionality type/bug Feature does not function as expected type/docs Documentation needs to be created/updated/clarified

Comments

@danaps
Copy link

danaps commented Jul 10, 2018

Edit(banks)

Note that the thread concludes a root cause that may not apply to this original issue related to ACLs. See #4372 (comment) for a summary of the issue this represents now


Hi folks,
we run Consul 1.2.0 and encountered the following issue:
in the consul UI, when looking at some node Health check we see it is failing.

`curl http://localhost:8500/v1/health/node/ip-10-1-7-170

{
"Node": "ip-10-1-7-170",
"CheckID": "c49af2eedf67fdbf7135045051aad72c5f1d8a4c",
"Name": "Nomad Client HTTP Check",
"Status": "critical",
"Notes": "",
"Output": "",
"ServiceID": "_nomad-client-vsvkabaz4izafp6d5ejwyhrurtl2pdeq",
"ServiceName": "nomad-client",
"ServiceTags": [
"http"
],
"Definition": {},
"CreateIndex": 170787117,
"ModifyIndex": 170787117
},

`
after checking underlying API call /v1/agent/checks on the node the check is actually passing.

` curl http://localhost:8500/v1/agent/checks

"c49af2eedf67fdbf7135045051aad72c5f1d8a4c": {
"Node": "ip-10-1-7-170",
"CheckID": "c49af2eedf67fdbf7135045051aad72c5f1d8a4c",
"Name": "Nomad Client HTTP Check",
"Status": "passing",
"Notes": "",
"Output": "HTTP GET http://0.0.0.0:4646/v1/agent/health?type=client: 200 OK Output: {"client":{"message":"ok","ok":true}}",
"ServiceID": "_nomad-client-vsvkabaz4izafp6d5ejwyhrurtl2pdeq",
"ServiceName": "nomad-client",
"ServiceTags": [
"http"
],
"Definition": {},
"CreateIndex": 0,
"ModifyIndex": 0
},
`

@burdandrei
Copy link
Contributor

Let me rephrase this:
After running 1.2.0 in staging for a week, a week ago we updated our production cluster to 1.2.0.
For monitoring, we use Telegraf that send the data to InfluxDB. After some time we noticed that in grafana we can see less healthy hosts, that are running. After all the debugging we found that some checks that we see in UI (Both official and Hashi-UI) as critical are shown as passing of we're checking it on the agent locally curl http://localhost:8500/v1/agent/checks.
It looks like the UI is using "consensus" endpoint to show the status checks: curl http://localhost:8500/v1/health/node/:node
After comparing the outputs we were able to see that the same check is healthy on agent level, and not healthy on cluster level.

Restarting the consul client updated and syncronises health on cluster level to reflect the real values.
@preetapan @banks this looks like pretty major bug

@banks
Copy link
Member

banks commented Jul 11, 2018

Thanks for the report @danaps and @burdandrei. Are you positive this is a change in behaviour with 1.2.0? Do you have any way to run 1.0.7 on your staging cluster for a day or so to confirm the behaviour goes away?

I ask not to dismiss it but to try to establish if it's a new issue - as far as I know nothing specifically changed in the code path related to agent's updating statuses with the servers. The one major change is the new UI became default and may present the data in a way that makes something that always happened more noticeable (wild speculation but an example of why we should clear up the link between version and bug).

Can you also provide some additional information to help us debug:

  1. How many clients/services/checks are you running?
  2. What percentage of checks are in this inconsistent state?
  3. Do checks that get into this state stay inconsistent indefinitely or "heal" themselves after 10 mins or an hour or a day?
  4. Is it all the same types of checks that are failing?
  5. Are you using ACLs with default deny? If yes are you using Vault or something else to rotate the tokens?
  6. Can you please upload the logs for your cluster leader and at least one agent that has checks in an inconsistent state? A gist works but if you don't want to share them publicly then we can provide a private email address to send them too.

Thanks!

@banks
Copy link
Member

banks commented Jul 11, 2018

For reference for folks who aren't aware, the local agent state and global cluster state are kept in sync by a process we call "anti entropy" syncing. You can read more here: https://www.consul.io/docs/internals/anti-entropy.html#anti-entropy-1

Check status is in general only eventually consistent which is why I asked for details on whether it fixes itself although I assume not as it wasn't mentioned.

The issue is likely to be related to the AE sync mechanism failing which might be a bug but also might be caused by network issues, ACL token issues and several other factors some noted in docs. How long it takes to sync can be impacted by cluster size, number of services and checks running on local agent, available network connectivity etc.

@burdandrei
Copy link
Contributor

Thanks for the answer @banks

  1. we got 158 clients in one DC (112 of them running 1.2.0), others are various old version (don't ask ;))
  2. I can't tell you, but we can see it only on nodes that we're using as nomad "shared" machines, means running a lot of services that are starting and stopping
    2.1 part of the checks are ttl checks that we're setting with an application, and as I can see even when we deregister the application's service, because of checks where added separately, they are staying in the client with associated ServiceID, that is not on the agent anymore /shrug
  3. No they are staying in the wrong state forever
  4. Checks are not failing, most of them are HTTP checks that when we http://localhost:8500/v1/agent/checks they are shown as passing, but http://localhost:8500/v1/health/node/:node they are critical
  5. ACL default policy is Deny. we got couple of applications that are running with dynamic consul acls, but acl_agent_token is static
  6. sure, what logs? we just rotated all the servers to be on the safe side, so the log is pretty empty right now:
Jul 11 08:54:06 ip-10-1-7-187 consul[9452]: ==> Starting Consul agent...
Jul 11 08:54:06 ip-10-1-7-187 consul[9452]: ==> Consul agent running!
Jul 11 08:54:06 ip-10-1-7-187 consul[9452]:            Version: 'v1.2.0'
Jul 11 08:54:06 ip-10-1-7-187 consul[9452]:            Node ID: '722ee3f5-3aec-81d2-6d08-62414e5470ff'
Jul 11 08:54:06 ip-10-1-7-187 consul[9452]:          Node name: 'ip-10-1-7-187'
Jul 11 08:54:06 ip-10-1-7-187 consul[9452]:         Datacenter: 'us-east-1' (Segment: '<all>')
Jul 11 08:54:06 ip-10-1-7-187 consul[9452]:             Server: true (Bootstrap: false)
Jul 11 08:54:06 ip-10-1-7-187 consul[9452]:        Client Addr: [0.0.0.0] (HTTP: 8500, HTTPS: -1, DNS: 8600)
Jul 11 08:54:06 ip-10-1-7-187 consul[9452]:       Cluster Addr: 10.1.7.187 (LAN: 8301, WAN: 8302)
Jul 11 08:54:06 ip-10-1-7-187 consul[9452]:            Encrypt: Gossip: true, TLS-Outgoing: false, TLS-Incoming: false
Jul 11 08:54:06 ip-10-1-7-187 consul[9452]: ==> Log data will now stream in as it occurs:
Jul 11 08:54:07 ip-10-1-7-187 consul[9452]: 2018/07/11 08:54:07 [DEBUG] raft-net: 10.1.7.187:8300 accepted connection from: 10.1.1.152:20331
Jul 11 08:54:07 ip-10-1-7-187 consul[9452]: WARNING: LAN keyring exists but -encrypt given, using keyring
Jul 11 08:54:07 ip-10-1-7-187 consul[9452]: WARNING: WAN keyring exists but -encrypt given, using keyring
Jul 11 08:54:07 ip-10-1-7-187 consul[9452]: 2018/07/11 08:54:07 [DEBUG] raft-net: 10.1.7.187:8300 accepted connection from: 10.1.1.152:20333
Jul 11 08:54:08 ip-10-1-7-187 consul[9452]: 2018/07/11 08:54:08 [DEBUG] raft-net: 10.1.7.187:8300 accepted connection from: 10.1.1.152:20337
Jul 11 08:55:20 ip-10-1-7-187 consul[9452]:     2018/07/11 08:55:20 [ERR] memberlist: Failed fallback ping: write tcp 10.1.7.187:27295->10.1.7.247:8301: i/o timeout
Jul 11 08:55:20 ip-10-1-7-187 consul[9452]: memberlist: Failed fallback ping: write tcp 10.1.7.187:27295->10.1.7.247:8301: i/o timeout
Jul 11 08:59:05 ip-10-1-7-187 consul[9452]:     2018/07/11 08:59:05 [ERR] yamux: Failed to write header: write tcp 10.1.7.187:8300->10.1.3.171:37593: use of closed network connection
Jul 11 08:59:05 ip-10-1-7-187 consul[9452]: yamux: Failed to write header: write tcp 10.1.7.187:8300->10.1.3.171:37593: use of closed network connection
Jul 11 09:08:09 ip-10-1-7-187 consul[9452]: 2018/07/11 09:08:09 [DEBUG] raft-net: 10.1.7.187:8300 accepted connection from: 10.1.7.171:20149
Jul 11 09:19:58 ip-10-1-7-187 consul[9452]: 2018/07/11 09:19:58 [ERR] raft-net: Failed to decode incoming command: read tcp 10.1.7.187:8300->10.1.1.152:20333: read: connection reset by peer
Jul 11 09:20:06 ip-10-1-7-187 consul[9452]: 2018/07/11 09:20:06 [DEBUG] raft-net: 10.1.7.187:8300 accepted connection from: 10.1.7.195:6020
Jul 11 09:20:07 ip-10-1-7-187 consul[9452]: 2018/07/11 09:20:07 [DEBUG] raft-net: 10.1.7.187:8300 accepted connection from: 10.1.1.115:22475
Jul 11 09:20:10 ip-10-1-7-187 consul[9452]: 2018/07/11 09:20:10 [DEBUG] raft-net: 10.1.7.187:8300 accepted connection from: 10.1.3.170:33638
Jul 11 09:20:10 ip-10-1-7-187 consul[9452]:     2018/07/11 09:20:10 [ERR] http: Request GET /v1/kv/vault/core/cluster/local/info, error: No cluster leader from=127.0.0.1:24123
Jul 11 09:20:10 ip-10-1-7-187 consul[9452]: http: Request GET /v1/kv/vault/core/cluster/local/info, error: No cluster leader from=127.0.0.1:24123
Jul 11 09:20:11 ip-10-1-7-187 consul[9452]: 2018/07/11 09:20:11 [DEBUG] raft-net: 10.1.7.187:8300 accepted connection from: 10.1.3.251:9630

Our production is really not happy now =(

@burdandrei
Copy link
Contributor

and why 1.0.7? we had 1.1.0 running for a pretty long time in prod.

@banks
Copy link
Member

banks commented Jul 11, 2018

and why 1.0.7? we had 1.1.0 running for a pretty long time in prod.

ENOTENOUGHCOFFEE :D yeah I just meant previous version.

For logs, the most useful ones would be on the agent where the checks are failing - I suspect the issue is that it's not able to sync it's state with servers and it should log about why.

Checks are not failing

All checks when registered start in critical state until the succeed first time so it's possible the initial registration is working but then subsequent updates are not

ACL default policy is Deny. we got couple of applications that are running with dynamic consul acls, but acl_agent_token is static

Interesting, I ask that because anti-entropy requires that the token the application registers with remains valid otherwise it breaks anti-entropy. It might not be the cause but it would def show in logs if it is. We don't do a great job of documenting this or how to work with rotating tokens currently - it's on the near-term list to rework.

If you can post client agent (with "failing" checks) logs for the period in which they appeared as failing that would be great.

Also, do the checks ever show as passing in the Catalog? i.e. do they show as critical the entire life once registered or are they passing for some time and then flip to critical and never recover?

@burdandrei
Copy link
Contributor

I just ran over the cluster and updated all the cluster to 1.2.0 =)
failing checks are remaining critical in catalog sometimes even after the client restart.
checking further with the consistent cluster

@burdandrei
Copy link
Contributor

Just checked and I saw machine with agent/server discrepancy are fixed after a couple of minutes.
Will continue to monitor and update you.
P.S. @banks forgot to tell you- your talk on HashiDays was Awesome =)

@banks
Copy link
Member

banks commented Jul 11, 2018

Interesting. Keep us posted.

Thanks for the compliment @burdandrei ❤️Always nice to get good feedback 😄.

@banks banks added the waiting-reply Waiting on response from Original Poster or another individual in the thread label Jul 11, 2018
@burdandrei
Copy link
Contributor

Hi again! =)
Today's news:
I found a service that was deregistered from the agent, and when I'm querying /agent/services endpoint I can't see it, there is no check in /agent/checks endpoint too.
I can see the node in /catalog/service/:service-name endpoint as passing, I can see checks in /health/checks/:service-name as passing as well as in /health/node/:node

After running some consul monitor --log-lovel=debug i can see a lot of these:

2018/07/12 09:35:17 [ERR] consul: "Catalog.Deregister" RPC failed to server 10.1.3.170:8300: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:17 [WARN] agent: Deregistering service "hive-cache-master:10.1.7.93:29367" failed. rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:17 [ERR] agent: failed to sync changes: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:19 [ERR] consul: "Catalog.Deregister" RPC failed to server 10.1.7.187:8300: rpc error making call: ACL not found
2018/07/12 09:35:19 [WARN] agent: Deregistering service "bronto-redis-slave:10.1.7.93:29550" failed. rpc error making call: ACL not found
2018/07/12 09:35:19 [ERR] agent: failed to sync changes: rpc error making call: ACL not found
2018/07/12 09:35:19 [ERR] consul: "Catalog.Deregister" RPC failed to server 10.1.3.251:8300: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:19 [WARN] agent: Deregistering service "bronto-redis-slave:10.1.7.93:29550" failed. rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:19 [ERR] agent: failed to sync changes: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:20 [ERR] consul: "Catalog.Deregister" RPC failed to server 10.1.7.195:8300: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:20 [WARN] agent: Deregistering service "bronto-redis-slave:10.1.7.93:29550" failed. rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:20 [ERR] agent: failed to sync changes: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:22 [ERR] consul: "Catalog.Deregister" RPC failed to server 10.1.1.115:8300: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:22 [WARN] agent: Deregistering service "bronto-redis-slave:10.1.7.93:29550" failed. rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:22 [ERR] agent: failed to sync changes: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:24 [ERR] consul: "Catalog.Deregister" RPC failed to server 10.1.3.170:8300: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:24 [WARN] agent: Deregistering service "bronto-redis-slave:10.1.7.93:29550" failed. rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:24 [ERR] agent: failed to sync changes: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:24 [ERR] consul: "Catalog.Deregister" RPC failed to server 10.1.7.187:8300: rpc error making call: ACL not found
2018/07/12 09:35:24 [WARN] agent: Deregistering service "bronto-redis-slave:10.1.7.93:29550" failed. rpc error making call: ACL not found
2018/07/12 09:35:24 [ERR] agent: failed to sync changes: rpc error making call: ACL not found
2018/07/12 09:35:25 [ERR] consul: "Catalog.Deregister" RPC failed to server 10.1.3.251:8300: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:25 [WARN] agent: Deregistering service "hive-cache-master:10.1.7.93:29367" failed. rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:25 [ERR] agent: failed to sync changes: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:27 [ERR] consul: "Catalog.Deregister" RPC failed to server 10.1.7.195:8300: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:27 [WARN] agent: Deregistering service "hive-cache-master:10.1.7.93:29367" failed. rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:27 [ERR] agent: failed to sync changes: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:29 [ERR] consul: "Catalog.Deregister" RPC failed to server 10.1.1.115:8300: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:29 [WARN] agent: Deregistering service "bronto-redis-slave:10.1.7.93:29550" failed. rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:29 [ERR] agent: failed to sync changes: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:29 [ERR] consul: "Catalog.Deregister" RPC failed to server 10.1.3.170:8300: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:29 [WARN] agent: Deregistering service "bronto-redis-slave:10.1.7.93:29550" failed. rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:29 [ERR] agent: failed to sync changes: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:30 [ERR] consul: "Catalog.Deregister" RPC failed to server 10.1.7.187:8300: rpc error making call: ACL not found
2018/07/12 09:35:30 [WARN] agent: Deregistering service "hive-cache-master:10.1.7.93:29367" failed. rpc error making call: ACL not found
2018/07/12 09:35:30 [ERR] agent: failed to sync changes: rpc error making call: ACL not found
2018/07/12 09:35:32 [ERR] consul: "Catalog.Deregister" RPC failed to server 10.1.3.251:8300: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:32 [WARN] agent: Deregistering service "bronto-redis-slave:10.1.7.93:29550" failed. rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:32 [ERR] agent: failed to sync changes: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:34 [ERR] consul: "Catalog.Deregister" RPC failed to server 10.1.7.195:8300: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:34 [WARN] agent: Deregistering service "bronto-redis-slave:10.1.7.93:29550" failed. rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:34 [ERR] agent: failed to sync changes: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:34 [ERR] consul: "Catalog.Deregister" RPC failed to server 10.1.1.115:8300: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:34 [WARN] agent: Deregistering service "hive-cache-master:10.1.7.93:29367" failed. rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:34 [ERR] agent: failed to sync changes: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:35 [ERR] consul: "Catalog.Deregister" RPC failed to server 10.1.3.170:8300: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:35 [WARN] agent: Deregistering service "bronto-redis-slave:10.1.7.93:29550" failed. rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:35 [ERR] agent: failed to sync changes: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:37 [ERR] consul: "Catalog.Deregister" RPC failed to server 10.1.7.187:8300: rpc error making call: ACL not found
2018/07/12 09:35:37 [WARN] agent: Deregistering service "hive-cache-master:10.1.7.93:29367" failed. rpc error making call: ACL not found
2018/07/12 09:35:37 [ERR] agent: failed to sync changes: rpc error making call: ACL not found
2018/07/12 09:35:39 [ERR] consul: "Catalog.Deregister" RPC failed to server 10.1.3.251:8300: rpc error making call: rpc error making call: ACL not found
2018/07/12 09:35:39 [WARN] agent: Deregistering service "hive-cache-master:10.1.7.93:29367" failed. rpc error making call: rpc error making call: ACL not 

Interesting moment: normally in our cluster nomad is the one who registers/deregisters the services in Consul WITHOUT token (Using anonymous token). These services and check where registered with resec - redis sidecar that runs HA redis based on Consul Session locks. Because it has to aquire lock and write to KV it runs with the following ACL:

key "resec/" {
  policy = "write"
}
session "" {
  policy = "write"
}
service "" {
  policy = "write"
}

The token that he receives is temporrary generated by Vault. So maybe there is a correlation to the thing you said about Anti Entropy.
@banks feel free to write/chat/email/call me for further details, cause now it's a bit of deal breaker.
@jippi I know you're running resec too, but without ACLs, have you seen stuck services in the cluster?

@burdandrei
Copy link
Contributor

UPD: after updating ALL the instances in the cluster to 1.2.0 and then draining and terminating all the instances that had not synced services checks I can say that for the last 3 hours everything looks OK.
In the monitoring, we can see that we had a couple of issues with consul leader running ~100% CPU, that was caused by hashicorp/vault#4143 . After bringing everything to normal low cpu/low memory/ low network everything looks ok.

@banks
Copy link
Member

banks commented Jul 12, 2018

Thanks @burdandrei. Sounds like my ACL hunch was right.

This is a known issue (at least I hypothesised it existed several months ago, first time we've seen it bite in the wild) and it has been this way sicne forever - 1.2.0 upgrade must have been bad timing.

The issue is that Vault backend expires tokens (i.e. deleted them from Consul) but the agent that you registered with doesn't know anything about that currently and so it just stops being able to perform antientroy. The only thing it can do here is log that error.

The "fix" if you want to use Vault correctly (which is not documented anywhere and one of the reasons we are going to be doing a lot of work on ACLs and Vault integration soon) is that whenever Vault expires your ACL token, your app (or whatever registered your app in catalog) needs to re-register itself with the new token.

@danaps I'm going to rename this issue to reflect the real problem @burdandrei has since there has now been a lot of context about his case. Can you confirm whether ACLs being deleted is also the cause of the issue for you?

If not, do you mind opening a separate issue with the same info (including why you think it's not ACLs and the agent that is "failing" logs) so we can keep track of what you are seeing please?

@banks banks changed the title check status inconsistency ACL token rotation causes check status inconsistency Jul 12, 2018
@banks banks added type/bug Feature does not function as expected type/docs Documentation needs to be created/updated/clarified and removed waiting-reply Waiting on response from Original Poster or another individual in the thread labels Jul 12, 2018
@burdandrei
Copy link
Contributor

@banks WOW!
We will need to rethink how to run resec now. Sounds like static ACL can be a solution.
And @DPascal and I are talking about the same issue. We're working in the same company and I asked her to open the issue

@burdandrei
Copy link
Contributor

Sorry for necroposting @banks, but should this still be an issue in 1.4 or we can start using the dynamic tokens for registering/deregistering services?

@banks
Copy link
Member

banks commented Feb 18, 2019

@burdandrei the underlying issue here is the same in 1.4. There's no real way to fix it at a fundamental level - if we do something magic and auto-rotate tokens or something so that it never fails then it effectively defies the security benefit you get from rotating since an attacker could use the same mechanism to prolong the life of their stolen token too.

So we prefer to keep it explicit. One thing we've not done yet is fix the docs to make this more clear. There is work going on due to be released in the next month or two that has the potential to help with this issue, but ultimately it still needs the application (or some other tooling) to actively manage the rotation which means re-registering with the new token when it rotates. In some common cases we hope to eventually provide tooling that can be used to do that so you don't have to solve the issue by hand, but it will necessarily be one opinionated way to run things and won't fit everyone's workflow. Better docs on how to do this yourself will come as part of that though.

The short answer is that 1.4.0 ACLs don't really change this: you still need to have something re-register the application instance with it's local agent with a new token before the old one is revoked.

@jippi
Copy link
Contributor

jippi commented Feb 18, 2019

@banks Thanks for the answer. Is this something you and the Nomad team is working on improving as well? Based on what @burdandrei told me, Nomad do not handle this - either at all, or not nicely / seamless

@burdandrei
Copy link
Contributor

Thanks for answer @banks, we're just thinking of the right way to distribute consul tokens to nomad clients.
Do I understand right that if we generate the token per nomad client and it's renewed during all the instance lifecycle we're in a good place, cause the token is scoped to the consul agent where it registers the services?

@banks
Copy link
Member

banks commented Feb 18, 2019 via email

@danlsgiga
Copy link

danlsgiga commented May 8, 2019

I'm facing the very same issue after I integrated Vault to generate on the fly Consul ACL tokens. In our case, we update the acl_default_token via the cli command: consul acl set-agent-token default xxxx. But after Vault rotates the Consul ACL and we inject a the new Consul ACL token via the method above, Nomad calls starts to fail with the same errors and we start to see duplicate entries in Consul for the Catalog since Nomad was unable to deregister the previous task allocation health checks in Consul

[ERR] consul: "Catalog.Deregister" RPC failed to server 1.2.3.4:8300: rpc error making call: rpc error making call: ACL not found

Restarting the consul agent seems to solve the issue until our out-of-band process generate new Consul ACL tokens via Vault.

Update: Restart the consul agent is not sufficient in this case... we have to restart the nomad client as well.

@danlsgiga
Copy link

danlsgiga commented May 13, 2019

I've done some tests to validate the Nomad behaviour once a new ACL token is injected to the Consul agent and the previous one is revoked and I noticed the following:

  • After the ACL tokes are recycled, the first call for updates by Nomad fail with the ACL not found errors.
  • The only side-effects remaining after are errors for [ERR] consul: "Catalog.Deregister" RPC failed to server 1.2.3.4:8300: rpc error making call: rpc error making call: Unknown check '82501dafc1680911b77851756cc2e6aab4fa084c'
  • The ACL not found errors will only happen when vault recycle the Consul ACL token again.

So, it seems that Nomad only tries to refresh its "ACL cache" after the first failure and is able to get things sorted after that.

Does anyone else see the same behaviour in your environments? @banks would that be a possible nomad behaviour in this scenario?

UPDATE: That seems to apply to each Nomad job / Service though.

@banks
Copy link
Member

banks commented May 14, 2019 via email

@danlsgiga
Copy link

Sure does... looking forward for the enhancements to come. Right now we have a single systemd job to handle these issues in a scheduled way, so we should be covered for now.

Thanks for the detailed feedback.

@elcomtik
Copy link

elcomtik commented Dec 1, 2021

The "fix" if you want to use Vault correctly (which is not documented anywhere and one of the reasons we are going to be doing a lot of work on ACLs and Vault integration soon) is that whenever Vault expires your ACL token, your app (or whatever registered your app in catalog) needs to re-register itself with the new token.

@banks any news about the documentation of proper usage? We have several clusters, which are experiencing this issue occasionally.

Our setup consists of the cluster with vault, consul, and nomad servers. Nomad client nodes are accompanied by vault agent and consul agent. Vault agent templates nomad and consul configuration with ACL tokens. Vault agent actively renews tokens, which seems to work OK. Occasionally it stops syncing service instances to the consul catalog. Nomad client restart seems to fix this, however, I would like to be able to prevent such an issue.

EDIT: added component versions
consul: 1.9.4
vault: 1.6.2
nomad: 1.0.4

@schmichael
Copy link
Member

Hi @elcomtik, Nomad engineer here 👋

Just to clarify: are you saying that Vault agent re-templates Nomad's agent configuration file with a new Consul Token, SIGHUP's Nomad, and sometimes Nomad fails to use the new token when registering Consul services? If so this sounds like a bug with Nomad's SIGHUP handling. Please file an issue over at: https://github.com/hashicorp/nomad/issues/new?assignees=&labels=type%2Fbug&template=bug_report.md

@burdandrei
Copy link
Contributor

@elcomtik correct me if I'm wrong.
Vault agent is running and providing Consul ACL for Nomad agent.
Nomad agent is starting and running as usual registering the services using the provided Consul token.
Things go wrong when Consul ACL token reaches Max TTL and gets rotated because for Consul to handle it correct everything registered with the initial token should be de-registered and re-registered using the new dynamic token.
From technical point of view, Nomad agent can listen to USR1 or USR2 to handle this specific case, but it's up to you to decide, @schmichael

@elcomtik
Copy link

elcomtik commented Dec 6, 2021

@elcomtik correct me if I'm wrong. Vault agent is running and providing Consul ACL for Nomad agent. Nomad agent is starting and running as usual registering the services using the provided Consul token. Things go wrong when Consul ACL token reaches Max TTL and gets rotated because for Consul to handle it correct everything registered with the initial token should be de-registered and re-registered using the new dynamic token. From technical point of view, Nomad agent can listen to USR1 or USR2 to handle this specific case, but it's up to you to decide, @schmichael

Yes, the vault agent is providing an ACL token for the nomad agent. I suppose that it goes wrong at max TTL or failed renewal. Then it needs to be rotated by a vault agent and reloaded by a nomad agent.

Vault agent templates it and does restart by

 command     = "systemctl restart nomad-client"

Systemd service is inspired by https://learn.hashicorp.com/tutorials/nomad/production-deployment-guide-vm-with-consul with little customization.

So currently I'm sending SIGINT and restarting, @schmichael mentioned it should be SIGHUPed, you @burdandrei are recommending USR1 or USR2 signals. Are they equivalent are some of them is better?

@burdandrei
Copy link
Contributor

@elcomtik USR1 or USR2 are not implemented in nomad right now.
Nomad agent restart is not enough because by default it's not draining/de-registering anything running on the node.

Here's how I see the flow for this:

  • Vault agent updates Consul ACL token in the file
  • Nomad receives a signal to
    • deregister everything registered with old token(using old token)
    • register everything back using the new one
  • Profit

@elcomtik
Copy link

elcomtik commented Dec 6, 2021

OK, should I try to change systemctl restart for systemctl reload? I might to test it and give you feedback when it's done.

@elcomtik
Copy link

elcomtik commented Dec 6, 2021

@elcomtik USR1 or USR2 are not implemented in nomad right now. Nomad agent restart is not enough because by default it's not draining/de-registering anything running on the node.

@burdandrei few questions emerged:

  • is the nomad agent able to deregister services if the old consul token is already expired?
  • we observed that the nomad service has defined checks, but the consul catalog was reporting them healthy. I would suppose that they would be failing. How is it related to nomad agent ACL token rotation?

@elcomtik
Copy link

@burdandrei I'm just testing this approach with SIGHUP signal. Unfortunately, I came into this bug hashicorp/nomad#3885.

I'm reloading it by vault-agent service, soI wanted to patch this by defining systemd service order. However, this is useless because systemd unit for nomad is using Type simple, which doesn't wait for service startup. I tried to use types forking and notify, however, nomad doesn't support them also.

There is an issue that is aiming for a similar feature hashicorp/nomad#4466, but it seems to be stale for a long time.

I will patch this with some sleep before starting vault-agent, however, I will raise another issue in nomad repo to address this.

@elcomtik
Copy link

For anyone interested, I resolved this issue with Nomad not supporting systemd service types forking or notify by using configuration Restart=always instead of Restart=on-failure. Not clean, but it works for me.

@ninjadude333
Copy link

hi, any updates ?

@jkirschner-hashicorp
Copy link
Contributor

I believe this may be fixed by #16097, which will be included in the upcoming Consul 1.16 release.

CC @pglass

@pglass
Copy link

pglass commented May 24, 2023

#16097 was included in Consul 1.15, which is released. That change helps with failed deregistrations of checks and services, which now use the agent token (always) so deregistration should "just work" if the agent token is valid.

However, that change doesn't help with token expirations causing failed registrations and updates of checks and services. The agent remembers the token used to register a check (or service). After that token expires, subsequent anti-entropy updates to the Consul servers for that check (or service) will fail due the expired token. (The local Consul agent will know the state of the check, but that won't be reflected in the Consul catalog.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
theme/acls ACL and token generation theme/health-checks Health Check functionality type/bug Feature does not function as expected type/docs Documentation needs to be created/updated/clarified
Projects
None yet
Development

No branches or pull requests