Consul service deregistration gets "ACL not found" w/ Workload Identities #23494

Closed
sijo13 opened this issue Jul 3, 2024 · 6 comments


sijo13 commented Jul 3, 2024

Product Version

OS : CentOS Linux release 7.9.2009
Nomad Version : v1.7.3
Consul Version: v1.17.3

Issue Description

We are getting an "Unexpected response code: 403 (ACL not found)" error in the Nomad logs. Because of it, services in Consul are not registered or deregistered until the Nomad service is restarted.
Our Nomad cluster runs 100+ jobs, and the issue shows up intermittently on one or two of them. A different job is affected each time.

We are using the JWT mechanism outlined in "Consul ACL with Nomad Workload Identities | Nomad | HashiCorp Developer" to authenticate Nomad workloads against Consul.

The "ACL not found" issue is observed only in our Test and UAT environments, where we have enabled WI to integrate with Consul.

We checked the token used for the service check in /opt/consul/checks (the Consul data dir) and found that the token used in the check no longer exists in Consul.
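
For anyone wanting to repeat that check, here is a rough Python sketch. It assumes each file under the checks directory is a JSON document with a top-level "Token" field, and that Consul's /v1/acl/token/self endpoint answers 403 for a token it no longer knows. Both are assumptions on my side and not confirmed in this thread. The function names and the Consul address in the usage note are hypothetical.

```python
import json
import urllib.error
import urllib.request
from pathlib import Path


def tokens_in_checks_dir(checks_dir):
    """Map each persisted check file name to the ACL token stored in it.

    Assumes every check file is JSON with a top-level "Token" field.
    """
    tokens = {}
    for path in sorted(Path(checks_dir).iterdir()):
        if not path.is_file():
            continue
        try:
            data = json.loads(path.read_text())
        except (OSError, ValueError):
            continue  # skip unreadable or non-JSON files
        token = data.get("Token") if isinstance(data, dict) else None
        if token:
            tokens[path.name] = token
    return tokens


def token_exists(consul_addr, token, timeout=5.0):
    """Return False if Consul answers 403 for this token, True on success.

    Assumes a 403 from /v1/acl/token/self means the token is gone.
    Any other HTTP error is re-raised so it is not mistaken for that case.
    """
    req = urllib.request.Request(
        f"{consul_addr}/v1/acl/token/self",
        headers={"X-Consul-Token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout):
            return True
    except urllib.error.HTTPError as err:
        if err.code == 403:
            return False
        raise
```

Calling tokens_in_checks_dir("/opt/consul/checks") and then token_exists("http://127.0.0.1:8500", token) for each entry would list the checks whose token Consul rejects. The files hold secrets, so this is best run as the Consul user on the affected client.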

The issue is fixed only after restarting the Nomad service.

Reproduction Steps

We tried to reproduce the issue in a local dev environment. However, locally, service registration and deregistration happen as expected.

Nomad server and client config snippet

Nomad client:

consul {
  address = "<address>"
  token = "<token>"
}

Nomad Server:

consul {
  address = "<Address>"
  token = "<token>"

  service_identity {
    aud = ["consul.io"]
    ttl = "1h"
  }

  task_identity {
    aud = ["consul.io"]
    ttl = "1h"
  }
}

Nomad log snippet:

":"Task started by client","task":"testing","type":"Started"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2024-07-01T16:01:15.745096+01:00","alloc_id":"b060c3fb-c027-300f-305f-02e29b9cc055","failed":false,"msg":"Exit Code: 1, Exit Message: \"Docker container exited with non-zero exit code: 1\"","task":"testing","type":"Terminated"}
{"@level":"info","@message":"plugin process exited","@module":"client.driver_mgr.docker.docker_logger","@timestamp":"2024-07-01T16:01:15.748890+01:00","driver":"docker","id":"99053","plugin":"/usr/bin/nomad"}
{"@level":"info","@message":"not restarting task","@module":"client.alloc_runner.task_runner","@timestamp":"2024-07-01T16:01:15.755206+01:00","alloc_id":"b060c3fb-c027-300f-305f-02e29b9cc055","reason":"Exceeded allowed attempts 2 in interval 30m0s and mode is \"fail\"","task":"testing"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2024-07-01T16:01:15.755250+01:00","alloc_id":"b060c3fb-c027-300f-305f-02e29b9cc055","failed":true,"msg":"Exceeded allowed attempts 2 in interval 30m0s and mode is \"fail\"","task":"testing","type":"Not Restarting"}
{"@level":"info","@message":"plugin process exited","@module":"client.alloc_runner.task_runner.task_hook.logmon","@timestamp":"2024-07-01T16:01:15.765191+01:00","alloc_id":"b060c3fb-c027-300f-305f-02e29b9cc055","id":"98186","plugin":"/usr/bin/nomad","task":"testing"}
{"@level":"info","@message":"(runner) stopping","@module":"agent","@timestamp":"2024-07-01T16:01:15.766256+01:00"}
{"@level":"info","@message":"(runner) received finish","@module":"agent","@timestamp":"2024-07-01T16:01:15.768239+01:00"}
{"@level":"info","@message":"marking allocation for GC","@module":"client.gc","@timestamp":"2024-07-01T16:01:15.768265+01:00","alloc_id":"b060c3fb-c027-300f-305f-02e29b9cc055"}
{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2024-07-01T16:01:15.770237+01:00","alloc_id":"b060c3fb-c027-300f-305f-02e29b9cc055","failed":false,"msg":"Unhealthy because of failed task","task":"testing","type":"Alloc Unhealthy"}
{"@level":"warn","@message":"failed to update services in Consul","@module":"consul.sync","@timestamp":"2024-07-01T16:01:15.796522+01:00","error":"Unexpected response code: 403 (ACL not found)"}
{"@level":"warn","@message":"unable to fingerprint consul","@module":"client.fingerprint_mgr.consul","@timestamp":"2024-07-01T16:01:56.816331+01:00","attribute":"consul.partition","cluster":"default"}
{"@level":"error","@message":"still unable to update services in Consul","@module":"consul.sync","@timestamp":"2024-07-01T16:02:10.829798+01:00","error":"Unexpected response code: 403 (ACL not found)","failures":10}
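
To find when a client starts hitting this, a small filter over the JSON log can help. This is only a sketch: it assumes one JSON object per log line with the "@module", "@level", "@timestamp", "error" and "failures" fields seen in the snippet above, and the function name is made up for illustration.

```python
import json


def consul_sync_acl_failures(lines):
    """Yield (timestamp, level, failures) for consul.sync "ACL not found" entries."""
    for line in lines:
        try:
            entry = json.loads(line)
        except ValueError:
            continue  # truncated or wrapped lines are not valid JSON
        if not isinstance(entry, dict):
            continue
        if entry.get("@module") != "consul.sync":
            continue
        if "ACL not found" not in str(entry.get("error", "")):
            continue
        # "failures" is only present on the repeated-failure entries
        yield entry.get("@timestamp"), entry.get("@level"), entry.get("failures")
```

Passing an open log file to it (the file is iterated line by line) gives the timestamps of each failed sync, which can then be lined up against the task events for the allocation that stopped just before.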

ngcmac commented Jul 8, 2024

Hi, I believe this may be the same issue I reported here: #16616 (comment).

Thanks.

Author

sijo13 commented Jul 8, 2024

@ngcmac, yes, it's the same behaviour we see in our Nomad infrastructure too.

The job gets restarted because of a connectivity issue with Vault/Consul, and it then fails to deregister the service in Consul because of the token issue.

However, the issue occurs intermittently, and we couldn't replicate the behaviour in a local dev mode setup.

tgross changed the title from "403 (ACL not found) | Workload Identities" to "Consul service deregistration gets "ACL not found" w/ Workload Identities" on Jul 8, 2024
Member

tgross commented Jul 8, 2024

Thanks @sijo13 and @ngcmac. I'll mark this for further investigation.

tgross moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage on Jul 8, 2024
tgross added the stage/accepted label (Confirmed, and intend to work on. No timeline commitment though.) on Jul 8, 2024

drofloh commented Aug 30, 2024

Hi, I was wondering if there has been any progress on this, or a resolution? As we get closer to 1.9 and workload identities being required, it would be nice to know things will work.
Thanks.

Member

tgross commented Sep 3, 2024

Hi @drofloh, it's still in the queue. But check out the deprecation notices we updated a few weeks ago here: https://developer.hashicorp.com/nomad/docs/release-notes/nomad/upcoming#nomad-1-10-0-lts. WI won't be required for Consul/Vault until Nomad 1.10.0.

tgross self-assigned this on Oct 9, 2024
Member

tgross commented Oct 9, 2024

I'm picking this up, but I'm going to close it out as strictly a duplicate of #20159. That issue has more data around debugging so it'll be easier to track the work there.

tgross closed this as not planned on Oct 9, 2024
github-project-automation bot moved this from Needs Roadmapping to Done in Nomad - Community Issues Triage on Oct 9, 2024