
connect sidecar gets "unauthenticated: ACL not found" #9785

Closed
fredwangwang opened this issue Jan 12, 2021 · 13 comments

@fredwangwang
Contributor

Nomad version

Nomad v1.0.1 (c9c68aa)

Operating system and Environment details

linux with Consul v1.9.1

Issue

All connect sidecars on a specific Nomad node are getting:

[2021-01-12 00:46:43.510][7][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] StreamAggregatedResources gRPC config stream closed: 16, unauthenticated: ACL not found

All of them hit this at almost the same time.
Looking into the container, I found that the Consul token in /secrets/si_token is not valid anymore (consul acl token read -self ==> ACL not found).
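For reference, a minimal way to run that check against a live sidecar task (the alloc ID and task name below are placeholders, not values from this cluster):

# grab the SI token Nomad wrote into the sidecar task's secrets dir
SI_TOKEN=$(nomad alloc exec -task <sidecar-task> <alloc-id> cat /secrets/si_token)

# ask Consul to resolve that token; on an affected node this returns "ACL not found"
consul acl token read -self -token "$SI_TOKEN"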

Note that we have other Nomad nodes running mesh jobs as well, but they are not hitting this issue.
One obvious difference we found is that the affected node had Nomad restarted recently.

Might be a red herring, but is it related to #4226 #9491?

@shoenig
Member

shoenig commented Jan 12, 2021

Hi @fredwangwang, thanks for reporting. Do you by any chance see the de-registration of the service(s) in question in the Consul logs? My initial guess is that Consul de-registered the service for some reason (e.g. Nomad was running in -dev mode, it was removed by anti-entropy, it was removed by deregister_critical_service_after, etc.), thus invalidating the token (I'm not sure that's right).
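A rough way to look for such a de-registration, assuming Consul runs under a systemd unit named consul and using a placeholder service name:

# is the service (and its sidecar) still registered in the catalog?
consul catalog services | grep -i <service-name>

# did the local Consul agent deregister anything around that time?
journalctl -u consul --since "1 hour ago" | grep -i deregister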

@fredwangwang
Contributor Author

fredwangwang commented Jan 12, 2021

Pretty sure when this happens the service registrations are still in Consul and reporting "Healthy" for Connect Sidecar Listening.
It only starts to show "Failing Checks" after I restart the connect sidecar container: it cannot communicate with Consul using the invalid token, so the connect port does not open properly and the check reports Connection Refused.

Sadly I didn't keep enough records, so I cannot show proof, but since I was looking into the problem right before I opened the issue, my certainty is high.

The Nomad servers we are running are not in -dev mode. I don't know much about anti-entropy, but if it only applies to dev mode then I don't believe that's it.

@idrennanvmware
Contributor

@shoenig - we verified that the Consul health checks on the node were all healthy, so from a UI perspective everything looked good on the node. The only indication of error was the logs being emitted from the sidecar container. The issue @fredwangwang linked to (#4226, #9491) was rectified by restarting just the container/task allocation, but in this case that didn't work. We tried manually deleting the si_token secret too, and restarting that container (that made it a lot worse lol).

In the end, restarting the entire group, not just a single task allocation, resolved the issue.
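For anyone comparing notes, a rough sketch of those restart paths from the CLI (alloc ID and task name are placeholders). Since the SI token appears to be tied to the allocation, an in-place restart should reuse the token already in /secrets/si_token, while stopping the allocation makes the scheduler replace it:

# restart just the sidecar task in place (did not help in this case)
nomad alloc restart <alloc-id> <sidecar-task>

# restart every task in the allocation in place
nomad alloc restart <alloc-id>

# stop the allocation so the scheduler replaces it; the replacement
# allocation gets its own SI token written to /secrets/si_token
nomad alloc stop <alloc-id>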

@fredwangwang
Contributor Author

Just a thought:

Assertions:

  1. the SI token is managed by the Nomad Servers (nomad client --> RPC --> nomad server)
  2. the SI token does not have an expiration
  3. the SI token is only revoked when the Nomad Servers explicitly issue a consulACLs.RevokeTokens call

(I hope the above 3 assertions are correct)
Given they are correct:

  1. It rules out Consul being faulty, as only the Nomad Servers can revoke the SI token (for which we are seeing ACL not found)
  2. The Nomad Servers decided it was a good idea to revoke the SI tokens for a specific Nomad Client

Then the question becomes: why do the Nomad Servers decide to revoke the SI tokens for a specific Nomad Client?

@shoenig
Member

shoenig commented Jan 13, 2021

The SI Tokens are actually generated per instance of each service (i.e. alloc). If you look at the Description you can see what each token was made for:

_nomad_si [e579c705-1394-51e1-7b25-7bc7dd3ff846] [f9edace0-0ca7-bfe4-09fd-fc25321971fc] [frontend]
           (cluster ID)                           (alloc ID)                             (service)

Looking through node_endpoint.go, I suspect Nomad determined the Client to be in NodeStatusDown during the restart, which triggers a revocation of all SI tokens in use by that Client. I was under the impression that all the allocs up to that point would be recycled, but that might have been a bad assumption.

https://www.nomadproject.io/docs/operations/nomad-agent#lifecycle
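A rough way to check both halves of that theory from the CLI (the alloc and node IDs are placeholders, and jq is assumed to be available):

# does Consul still hold an SI token whose description mentions this alloc?
consul acl token list -format=json \
  | jq '.[] | select(.Description | contains("<alloc-id>"))'

# did Nomad record the client going down around the restart? check the node events
nomad node status -verbose <node-id>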

@ianmdrennan

@shoenig are there any more tests you'd like us to run when we encounter this to give you more info?

@fredwangwang
Contributor Author

fredwangwang commented Feb 4, 2021

This has happened again; here is the log from the Nomad leader node:

2021-02-04T01:44:31.357Z [ERROR] nomad.raft: failed to heartbeat to: peer=10.0.0.94:4647 error="msgpack decode error [pos 1087125]: read tcp 10.0.0.95:47658->10.0.0.94:4647: i/o timeout"
2021-02-04T01:44:41.312Z [ERROR] nomad.raft: failed to appendEntries to: peer="{Voter 10.0.0.94:4647 10.0.0.94:4647}" error="msgpack decode error [pos 0]: read tcp 10.0.0.95:38406->10.0.0.94:4647: i/o timeout"
2021-02-04T01:44:41.473Z [ERROR] nomad.raft: failed to heartbeat to: peer=10.0.0.94:4647 error="msgpack decode error [pos 0]: read tcp 10.0.0.95:38408->10.0.0.94:4647: i/o timeout"
2021-02-04T01:44:51.323Z [ERROR] nomad.raft: failed to appendEntries to: peer="{Voter 10.0.0.94:4647 10.0.0.94:4647}" error="msgpack decode error [pos 0]: read tcp 10.0.0.95:38480->10.0.0.94:4647: i/o timeout"
2021-02-04T01:44:51.600Z [ERROR] nomad.raft: failed to heartbeat to: peer=10.0.0.94:4647 error="msgpack decode error [pos 0]: read tcp 10.0.0.95:38484->10.0.0.94:4647: i/o timeout"
2021-02-04T01:45:00.473Z [INFO]  nomad: memberlist: Marking csl1.global as failed, suspect timeout reached (0 peer confirmations)
2021-02-04T01:45:01.335Z [ERROR] nomad.raft: failed to appendEntries to: peer="{Voter 10.0.0.94:4647 10.0.0.94:4647}" error="msgpack encode error: write tcp 10.0.0.95:38628->10.0.0.94:4647: i/o timeout"
2021-02-04T01:45:01.793Z [ERROR] nomad.raft: failed to heartbeat to: peer=10.0.0.94:4647 error="msgpack decode error [pos 0]: read tcp 10.0.0.95:38634->10.0.0.94:4647: i/o timeout"
2021-02-04T01:45:09.323Z [INFO]  nomad.raft: removed peer, stopping replication: peer=10.0.0.94:4647 last-index=1462494
2021-02-04T01:45:11.357Z [ERROR] nomad.raft: failed to appendEntries to: peer="{Voter 10.0.0.94:4647 10.0.0.94:4647}" error="msgpack encode error: write tcp 10.0.0.95:38728->10.0.0.94:4647: i/o timeout"
2021-02-04T01:45:11.949Z [ERROR] nomad.raft: failed to heartbeat to: peer=10.0.0.94:4647 error="msgpack decode error [pos 0]: read tcp 10.0.0.95:38732->10.0.0.94:4647: i/o timeout"
2021-02-04T01:45:21.399Z [ERROR] nomad.raft: failed to appendEntries to: peer="{Voter 10.0.0.94:4647 10.0.0.94:4647}" error="msgpack encode error: write tcp 10.0.0.95:38820->10.0.0.94:4647: i/o timeout"
2021-02-04T01:45:22.180Z [ERROR] nomad.raft: failed to heartbeat to: peer=10.0.0.94:4647 error="msgpack decode error [pos 0]: read tcp 10.0.0.95:38824->10.0.0.94:4647: i/o timeout"
2021-02-04T01:45:31.480Z [ERROR] nomad.raft: failed to appendEntries to: peer="{Voter 10.0.0.94:4647 10.0.0.94:4647}" error="msgpack encode error: write tcp 10.0.0.95:38916->10.0.0.94:4647: i/o timeout"
2021-02-04T01:45:32.459Z [ERROR] nomad.raft: failed to heartbeat to: peer=10.0.0.94:4647 error="msgpack decode error [pos 0]: read tcp 10.0.0.95:38920->10.0.0.94:4647: i/o timeout"

The result we've seen is that the connect sidecars on node 10.0.0.94 started to show:

error creating bootstrap configuration for Connect proxy sidecar: alloc_id=830ac7a6-586c-f8f9-3996-5ff809bba2f4 task=connect-proxy-api error="exit status 1" stderr="==> Failed looking up sidecar proxy info for _nomad-task-830ac7a6-586c-f8f9-3996-5ff809bba2f4-group-api-group-api-api_http: Unexpected response code: 403 (ACL not found)

@fredwangwang
Contributor Author

Hi, we've just experienced this issue again.
Envoy is spamming the log: [2021-02-25 18:22:28.256][8][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] StreamAggregatedResources gRPC config stream closed: 16, unauthenticated: ACL not found

And during that time we saw a lot of missed Nomad heartbeats:
[screenshot: graph of missed Nomad heartbeats]

@MageshSrinivasulu

MageshSrinivasulu commented Sep 20, 2022

@tgross Is there any update on this issue? I am facing a similar problem: when the pod starts up, the Envoy sidecar fails with the error below. This happens while performing maintenance on AKS nodes, where I need to move the connect-inject enabled pods to newer nodes.

[2022-09-20 11:28:42.859][1][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:101] DeltaAggregatedResources gRPC config stream closed: 16, unauthenticated: ACL not found

@NOBLES5E
Contributor

NOBLES5E commented Nov 19, 2022

The problem still exists on nomad v1.4.2

@cr0c0dylus

The problem still exists in consul 1.18.1 and nomad 1.7.6:

[2024-04-16 08:30:49.192][1][warning][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:152] DeltaAggregatedResources gRPC config stream to local_agent closed: 16, unauthenticated: ACL not found

@dani

dani commented Oct 15, 2024

Since I've migrated to workload identities, I have a lot of errors like this. Services become randomly (well, it seems random) unreachable because their connect sidecar just fails with the same error:

[2024-10-15 11:58:01.236][1][warning][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:155] DeltaAggregatedResources gRPC config stream to local_agent closed: 16, unauthenticated: ACL not found

I'm not sure I ever encountered this before using workload identities, but now it's very frequent (to the point that Consul Connect is not usable at all). I'm using Nomad 1.8.4 (servers are 1.9.0, but clients are kept on 1.8.4 because of #24181) and Consul 1.19.2 (will upgrade to 1.20.0 soon).

@tgross
Member

tgross commented Oct 15, 2024

@dani that's likely fixed (indirectly) by #24166 which is shipping in 1.9.1. In any case, this is a very old issue and should have been closed out when we closed #23381.

I know it's probably a little frustrating to see what looks like a similar issue get closed when you're facing something with the same symptoms, but we've been making steady progress on whack-a-mole-ing this unfortunate feature set, and it'd be helpful to have new issues spawned from new builds rather than adding to a not-quite-the-same issue that's just going to muddy debugging it.

@tgross tgross closed this as completed Oct 15, 2024
@github-project-automation github-project-automation bot moved this from Needs Roadmapping to Done in Nomad - Community Issues Triage Oct 15, 2024