connect sidecar gets "unauthenticated: ACL not found" #9785
Comments
Hi @fredwangwang thanks for reporting. Do you by any chance see the de-registration of the service(s) in question in the Consul logs? My initial guess is Consul de-registered the service for some reason (e.g. Nomad was running in |
Pretty sure that when this happens the service registrations are still in Consul and reporting "Healthy" for Sadly, I didn't keep enough records, so I cannot show proof, but since I was looking into the problem right before I opened the issue, my certainty is high. The Nomad servers we are running are not in |
@shoenig - we verified that consul health checks on the node were all healthy so, from a UI perspective, everything looked good on the node. The only indication of error was the logs being emitted from the sidecar container. The issue @fredwangwang linked to was rectified by restarting just the container/task allocation #4226 #9491 but in this case that didn't work. We tried manually deleting the secret si_token too, and restarting that container (that made it a lot worse lol). In the end just restarting the entire group, not just a single task allocation, resolved the issue |
Just a thought: Assertions:
(I hope the above 3 assertions are correct)
Then the question becomes: why do the Nomad servers decide to revoke the SI tokens for a specific Nomad client? |
The SI Tokens are actually generated per instance of each service (i.e. alloc). If you look at the
Looking through https://www.nomadproject.io/docs/operations/nomad-agent#lifecycle |
@shoenig are there any more tests you'd like us to run when we encounter this to give you more info? |
This has happened again, here is the log from the nomad leader node:
The result we've seen is that connect sidecar on node
|
Hi we've just experienced this issue again. |
@tgross Is there any update on the issue? I am facing a similar problem: when the pod starts up, the Envoy sidecar fails with the error. This happens while performing maintenance on AKS nodes, where I need to move the connect-inject-enabled pods to newer nodes.
|
The problem still exists on nomad v1.4.2 |
The problem still exists in consul 1.18.1 and nomad 1.7.6: [2024-04-16 08:30:49.192][1][warning][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:152] DeltaAggregatedResources gRPC config stream to local_agent closed: 16, unauthenticated: ACL not found |
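For anyone scanning sidecar logs for this failure, here is a minimal sketch of matching the Envoy warning quoted above. The regular expression and the function names are assumptions made for illustration, not part of Nomad, Consul, or Envoy; the only facts relied on are the log line as posted and that gRPC status code 16 is UNAUTHENTICATED.

```python
import re
from typing import Optional

# Hypothetical pattern, written against the single log line quoted in this thread.
STREAM_CLOSED = re.compile(
    r"gRPC config stream to (?P<target>\S+) closed: (?P<code>\d+), (?P<message>.*)$"
)

GRPC_UNAUTHENTICATED = 16  # gRPC status code for UNAUTHENTICATED


def parse_stream_closed(line: str) -> Optional[dict]:
    """Extract target, gRPC status code, and message from a 'stream closed' log line."""
    match = STREAM_CLOSED.search(line)
    if match is None:
        return None
    return {
        "target": match["target"],
        "code": int(match["code"]),
        "message": match["message"].strip(),
    }


def is_acl_not_found(line: str) -> bool:
    """True if the line reports the xDS stream closing with 'ACL not found'."""
    parsed = parse_stream_closed(line)
    return (
        parsed is not None
        and parsed["code"] == GRPC_UNAUTHENTICATED
        and "ACL not found" in parsed["message"]
    )
```

Feeding each sidecar log line through `is_acl_not_found` would flag allocations whose sidecar has lost its token, which the health checks in Consul did not surface according to the reports above.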
Since I've migrated to workload identities, I have a lot of errors like this. Services are randomly (well, it seems random) unreachable because their connect sidecar just fails with the same error
Not sure I ever encountered this before using workload identities, but now it's very frequent (to the point that Consul Connect is not usable at all). I'm using nomad 1.8.4 (servers are 1.9.0 but clients are kept at 1.8.4 because of #24181 ), consul 1.19.2 (will upgrade to 1.20.0 soon) |
@dani that's likely fixed (indirectly) by #24166 which is shipping in 1.9.1. In any case, this is a very old issue and should have been closed out when we closed #23381. I know it's probably a little frustrating to see what looks like a similar issue get closed when you're facing something with the same symptoms, but we've been making steady progress on whack-a-mole-ing this unfortunate feature set and it'd be helpful to have new issues spawned from new builds rather than adding to a not-quite-the-same issue that's just going to muddy debugging it. |
Nomad version
Nomad v1.0.1 (c9c68aa)
Operating system and Environment details
linux with Consul v1.9.1
Issue
All connect sidecars on a specific Nomad node are getting:
Almost at the same time.
Looking into the container, we found that the Consul token in /secrets/si_token is no longer valid (consul acl token read -self ==> ACL not found).
Note that we have other Nomad nodes running mesh jobs as well, but they are not hitting the issue.
One obvious difference we found is that Nomad was restarted recently on the node having the issue.
Might be a red herring, but is it related to #4226 #9491?
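The token check described above (read /secrets/si_token, then ask Consul about it) can be sketched in a few lines for anyone who wants to run it repeatedly. This is a rough sketch, not an official tool: it assumes the local Consul agent is reachable at http://127.0.0.1:8500, that it answers an unknown token on the `/v1/acl/token/self` endpoint with HTTP 403 and the text "ACL not found", and the function names are made up for illustration.

```python
import urllib.error
import urllib.request


def classify(status: int, body: str) -> str:
    """Map a Consul response to 'valid', 'revoked', or 'unknown'."""
    if status == 200:
        return "valid"
    # Assumption: a deleted or unknown token yields 403 with 'ACL not found'.
    if status == 403 and "ACL not found" in body:
        return "revoked"
    return "unknown"


def check_si_token(
    token_path: str = "/secrets/si_token",
    consul_addr: str = "http://127.0.0.1:8500",  # assumed agent address
    timeout: float = 5.0,
) -> str:
    """Read the SI token from disk and ask Consul whether it still exists."""
    with open(token_path, encoding="utf-8") as handle:
        token = handle.read().strip()

    request = urllib.request.Request(
        f"{consul_addr}/v1/acl/token/self",
        headers={"X-Consul-Token": token},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return classify(response.status, response.read().decode("utf-8", "replace"))
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx; the error object still carries status and body.
        return classify(err.code, err.read().decode("utf-8", "replace"))
```

Running `check_si_token()` inside an affected sidecar task and inside a healthy one on another node should show whether only the restarted node's tokens come back as "revoked".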