
connect sidecar gets "unauthenticated: ACL not found" #9785

Closed
fredwangwang opened this issue Jan 12, 2021 · 13 comments

@fredwangwang
Contributor

Nomad version

Nomad v1.0.1 (c9c68aa)

Operating system and Environment details

linux with Consul v1.9.1

Issue

All connect sidecars on a specific Nomad node are getting:

[2021-01-12 00:46:43.510][7][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] StreamAggregatedResources gRPC config stream closed: 16, unauthenticated: ACL not found

All of them hit this at almost the same time.
Looking into the container, I found that the Consul token in /secrets/si_token is not valid anymore (consul acl token read -self ==> ACL not found).
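For reference, a minimal way to run that check against a live sidecar task (the alloc ID and task name below are placeholders, not values from this cluster):

# grab the SI token Nomad wrote into the sidecar task's secrets dir
SI_TOKEN=$(nomad alloc exec -task <sidecar-task> <alloc-id> cat /secrets/si_token)

# ask Consul to resolve that token; on an affected node this returns "ACL not found"
consul acl token read -self -token "$SI_TOKEN"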

Note that we have other Nomad nodes running mesh jobs as well, but they are not hitting this issue.
One obvious difference we found is that the affected node had Nomad restarted recently.

Might be a red herring, but is it related to #4226 #9491?

@shoenig
Member

shoenig commented Jan 12, 2021

Hi @fredwangwang, thanks for reporting. Do you by any chance see the de-registration of the service(s) in question in the Consul logs? My initial guess is that Consul de-registered the service for some reason (e.g. Nomad was running in -dev mode, it was removed by anti-entropy, it was removed by deregister_critical_service_after, etc.), thus invalidating the token (I'm not sure that's right).
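A rough way to look for such a de-registration, assuming Consul runs under a systemd unit named consul and using a placeholder service name:

# is the service (and its sidecar) still registered in the catalog?
consul catalog services | grep -i <service-name>

# did the local Consul agent deregister anything around that time?
journalctl -u consul --since "1 hour ago" | grep -i deregister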

@fredwangwang
Contributor Author

fredwangwang commented Jan 12, 2021

Pretty sure when this happens the service registrations are still in Consul and reporting "Healthy" for Connect Sidecar Listening.
It only starts to show "Failing Checks" after I restart the connect sidecar container: it cannot communicate with Consul using the invalid token, so the connect port does not open properly and the check reports Connection Refused.

Sadly I didn't keep enough records, so I cannot show proof, but since I was looking into the problem right before I opened the issue, my certainty is high.

The Nomad servers we are running are not in -dev mode. I don't know much about anti-entropy, but if it only applies to dev mode then I don't believe that's it.

@idrennanvmware
Contributor

@shoenig - we verified that the Consul health checks on the node were all healthy, so from a UI perspective everything looked good on the node. The only indication of error was the logs being emitted from the sidecar container. The issue @fredwangwang linked to (#4226, #9491) was rectified by restarting just the container/task allocation, but in this case that didn't work. We tried manually deleting the si_token secret too, and restarting that container (that made it a lot worse lol).

In the end, restarting the entire group, not just a single task allocation, resolved the issue.
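For anyone comparing notes, a rough sketch of those restart paths from the CLI (alloc ID and task name are placeholders). Since the SI token appears to be tied to the allocation, an in-place restart should reuse the token already in /secrets/si_token, while stopping the allocation makes the scheduler replace it:

# restart just the sidecar task in place (did not help in this case)
nomad alloc restart <alloc-id> <sidecar-task>

# restart every task in the allocation in place
nomad alloc restart <alloc-id>

# stop the allocation so the scheduler replaces it; the replacement
# allocation gets its own SI token written to /secrets/si_token
nomad alloc stop <alloc-id>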

@fredwangwang
Contributor Author

Just a thought:

Assertions:

  1. the SI token is managed by the Nomad Servers (nomad client --> RPC --> nomad server)
  2. the SI token does not have an expiration
  3. the SI token is only revoked when the Nomad Servers explicitly issue a consulACLs.RevokeTokens call

(I hope the above 3 assertions are correct)
Given they are correct:

  1. It rules out Consul being faulty, as only the Nomad Servers can revoke the SI token (for which we are seeing ACL not found)
  2. The Nomad Servers decided it was a good idea to revoke the SI tokens for a specific Nomad Client

Then the question becomes: why do the Nomad Servers decide to revoke the SI tokens for a specific Nomad Client?

@shoenig
Member

shoenig commented Jan 13, 2021

The SI Tokens are actually generated per instance of each service (i.e. alloc). If you look at the Description you can see what each token was made for:

_nomad_si [e579c705-1394-51e1-7b25-7bc7dd3ff846] [f9edace0-0ca7-bfe4-09fd-fc25321971fc] [frontend]
           (cluster ID)                           (alloc ID)                             (service)

Looking through node_endpoint.go, I suspect Nomad determined the Client to be in NodeStatusDown during the restart, which triggers a revocation of all SI tokens in use by that Client. I was under the impression that all the allocs up to that point would be recycled, but that might have been a bad assumption.

https://www.nomadproject.io/docs/operations/nomad-agent#lifecycle
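A rough way to check both halves of that theory from the CLI (the alloc and node IDs are placeholders, and jq is assumed to be available):

# does Consul still hold an SI token whose description mentions this alloc?
consul acl token list -format=json \
  | jq '.[] | select(.Description | contains("<alloc-id>"))'

# did Nomad record the client going down around the restart? check the node events
nomad node status -verbose <node-id>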

@ianmdrennan

@shoenig are there any more tests you'd like us to run when we encounter this to give you more info?

@fredwangwang
Contributor Author

fredwangwang commented Feb 4, 2021

This has happened again; here is the log from the Nomad leader node:

2021-02-04T01:44:31.357Z [ERROR] nomad.raft: failed to heartbeat to: peer=10.0.0.94:4647 error="msgpack decode error [pos 1087125]: read tcp 10.0.0.95:47658->10.0.0.94:4647: i/o timeout"
2021-02-04T01:44:41.312Z [ERROR] nomad.raft: failed to appendEntries to: peer="{Voter 10.0.0.94:4647 10.0.0.94:4647}" error="msgpack decode error [pos 0]: read tcp 10.0.0.95:38406->10.0.0.94:4647: i/o timeout"
2021-02-04T01:44:41.473Z [ERROR] nomad.raft: failed to heartbeat to: peer=10.0.0.94:4647 error="msgpack decode error [pos 0]: read tcp 10.0.0.95:38408->10.0.0.94:4647: i/o timeout"
2021-02-04T01:44:51.323Z [ERROR] nomad.raft: failed to appendEntries to: peer="{Voter 10.0.0.94:4647 10.0.0.94:4647}" error="msgpack decode error [pos 0]: read tcp 10.0.0.95:38480->10.0.0.94:4647: i/o timeout"
2021-02-04T01:44:51.600Z [ERROR] nomad.raft: failed to heartbeat to: peer=10.0.0.94:4647 error="msgpack decode error [pos 0]: read tcp 10.0.0.95:38484->10.0.0.94:4647: i/o timeout"
2021-02-04T01:45:00.473Z [INFO]  nomad: memberlist: Marking csl1.global as failed, suspect timeout reached (0 peer confirmations)
2021-02-04T01:45:01.335Z [ERROR] nomad.raft: failed to appendEntries to: peer="{Voter 10.0.0.94:4647 10.0.0.94:4647}" error="msgpack encode error: write tcp 10.0.0.95:38628->10.0.0.94:4647: i/o timeout"
2021-02-04T01:45:01.793Z [ERROR] nomad.raft: failed to heartbeat to: peer=10.0.0.94:4647 error="msgpack decode error [pos 0]: read tcp 10.0.0.95:38634->10.0.0.94:4647: i/o timeout"
2021-02-04T01:45:09.323Z [INFO]  nomad.raft: removed peer, stopping replication: peer=10.0.0.94:4647 last-index=1462494
2021-02-04T01:45:11.357Z [ERROR] nomad.raft: failed to appendEntries to: peer="{Voter 10.0.0.94:4647 10.0.0.94:4647}" error="msgpack encode error: write tcp 10.0.0.95:38728->10.0.0.94:4647: i/o timeout"
2021-02-04T01:45:11.949Z [ERROR] nomad.raft: failed to heartbeat to: peer=10.0.0.94:4647 error="msgpack decode error [pos 0]: read tcp 10.0.0.95:38732->10.0.0.94:4647: i/o timeout"
2021-02-04T01:45:21.399Z [ERROR] nomad.raft: failed to appendEntries to: peer="{Voter 10.0.0.94:4647 10.0.0.94:4647}" error="msgpack encode error: write tcp 10.0.0.95:38820->10.0.0.94:4647: i/o timeout"
2021-02-04T01:45:22.180Z [ERROR] nomad.raft: failed to heartbeat to: peer=10.0.0.94:4647 error="msgpack decode error [pos 0]: read tcp 10.0.0.95:38824->10.0.0.94:4647: i/o timeout"
2021-02-04T01:45:31.480Z [ERROR] nomad.raft: failed to appendEntries to: peer="{Voter 10.0.0.94:4647 10.0.0.94:4647}" error="msgpack encode error: write tcp 10.0.0.95:38916->10.0.0.94:4647: i/o timeout"
2021-02-04T01:45:32.459Z [ERROR] nomad.raft: failed to heartbeat to: peer=10.0.0.94:4647 error="msgpack decode error [pos 0]: read tcp 10.0.0.95:38920->10.0.0.94:4647: i/o timeout"

The result we've seen is that the connect sidecars on node 10.0.0.94 started to show:

error creating bootstrap configuration for Connect proxy sidecar: alloc_id=830ac7a6-586c-f8f9-3996-5ff809bba2f4 task=connect-proxy-api error="exit status 1" stderr="==> Failed looking up sidecar proxy info for _nomad-task-830ac7a6-586c-f8f9-3996-5ff809bba2f4-group-api-group-api-api_http: Unexpected response code: 403 (ACL not found)

@fredwangwang
Contributor Author

Hi, we've just experienced this issue again.
Envoy is spamming the log: [2021-02-25 18:22:28.256][8][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] StreamAggregatedResources gRPC config stream closed: 16, unauthenticated: ACL not found

And during that time we saw a lot of missed Nomad heartbeats:
[screenshot: graph of missed Nomad heartbeats]

@MageshSrinivasulu

MageshSrinivasulu commented Sep 20, 2022

@tgross Is there any update on this issue? I am facing a similar problem: when the pod starts up, the Envoy sidecar fails with the error below. This happens while performing maintenance on AKS nodes, where I need to move the connect-inject enabled pods to newer nodes.

[2022-09-20 11:28:42.859][1][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] DeltaAggregatedResources gRPC config stream closed: 16, unauthenticated: ACL not found

@NOBLES5E
Contributor

NOBLES5E commented Nov 19, 2022

The problem still exists on nomad v1.4.2

@cr0c0dylus

The problem still exists in consul 1.18.1 and nomad 1.7.6:

[2024-04-16 08:30:49.192][1][warning][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:152] DeltaAggregatedResources gRPC config stream to local_agent closed: 16, unauthenticated: ACL not found

@dani

dani commented Oct 15, 2024

Since I've migrated to workload identities, I have a lot of errors like this. Services become randomly (well, it seems random) unreachable because their connect sidecar just fails with the same error:

[2024-10-15 11:58:01.236][1][warning][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:155] DeltaAggregatedResources gRPC config stream to local_agent closed: 16, unauthenticated: ACL not found

I'm not sure I ever encountered this before using workload identities, but now it's very frequent (to the point that Consul Connect is not usable at all). I'm using Nomad 1.8.4 (servers are 1.9.0, but clients are kept on 1.8.4 because of #24181) and Consul 1.19.2 (will upgrade to 1.20.0 soon).

@tgross
Member

tgross commented Oct 15, 2024

@dani that's likely fixed (indirectly) by #24166 which is shipping in 1.9.1. In any case, this is a very old issue and should have been closed out when we closed #23381.

I know it's probably a little frustrating to see what looks like a similar issue get closed when you're facing something with the same symptoms, but we've been making steady progress on whack-a-mole-ing this unfortunate feature set, and it'd be helpful to have new issues spawned from new builds rather than adding to a not-quite-the-same issue that's just going to muddy debugging it.

@tgross tgross closed this as completed Oct 15, 2024
@github-project-automation github-project-automation bot moved this from Needs Roadmapping to Done in Nomad - Community Issues Triage Oct 15, 2024