Consul ACL token is not recreated if deleted #20185

Closed
lgfa29 opened this issue Mar 21, 2024 · 5 comments · Fixed by #24167

@lgfa29
Contributor

lgfa29 commented Mar 21, 2024

Nomad version

Nomad v1.7.6
BuildDate 2024-03-12T07:27:36Z
Revision 594fedbfbc4f0e532b65e8a69b28ff9403eb822e

Issue

When using Consul with workload identities, if the ACL token is deleted from Consul it is never recreated and causes the sync loop to fail and exit early, skipping other updates.

Reproduction steps

  1. Start a Consul agent with ACL enabled.
    # consul.hcl
    
    acl = {
      enabled                  = true
      default_policy           = "deny"
      enable_token_persistence = true
    }
    consul agent -dev -config ./consul.hcl
    
  2. Bootstrap Consul ACL system.
    consul acl bootstrap
    
  3. Start a Nomad dev agent with the following configuration.
    # nomad.hcl
    
    consul {
      enabled = true
    
      service_identity {
        aud = ["consul.io"]
        ttl = "1h"
      }
    
      task_identity {
        aud = ["consul.io"]
        ttl = "1h"
      }
    }
    CONSUL_HTTP_TOKEN=... nomad agent -dev -config ./nomad.hcl
    
  4. Configure Consul JWT auth method for Nomad.
    CONSUL_HTTP_TOKEN=... nomad setup consul -y
    
  5. Register job with Consul service.
    # example.nomad.hcl
    
    job "example" {
      group "cache" {
        network {
          port "db" {
            to = 6379
          }
        }
      
        service {
          name = "redis"
          port = "db"
        }
      
        task "redis" {
          driver = "docker"
      
          config {
            image = "redis:7"
            ports = ["db"]
          }
        }
      }
    }
    nomad run example.nomad.hcl
    
  6. Delete the Consul ACL token created for the service.
  7. Modify the job service with a non-destructive update and run the job again.
    job "example" {
      group "cache" {
        network {
          port "db" {
            to = 6379
          }
        }
    
        service {
          name = "redis"
          port = "db"
    
    +     meta {
    +       test = "1"
    +     }
        }
    
        task "redis" {
          driver = "docker"
    
          config {
            image = "redis:7"
            ports = ["db"]
          }
        }
      }
    }
    nomad run example.nomad.hcl
    

Expected Result

A new token is created and the service meta is updated.

Actual Result

Service is not updated and the sync loop fails, exiting early and preventing other updates as well.

2024-03-21T16:11:46.601-0400 [ERROR] consul.sync: still unable to update services in Consul: failures=10 error="Unexpected response code: 403 (ACL not found)"
@jorgemarey
Contributor

Hi @lgfa29, it's important for us to have this fixed before starting to use identities for Consul on Nomad. Is there any way I can help fix this?

We set a low max_ttl on the Consul auth method and hit this. Given how this works, I guess we shouldn't set a max_ttl on the Nomad auth method in Consul, but in any case we can't accept being unable to update services on a node if a token is deleted for any reason.

@tgross
Member

tgross commented Jun 27, 2024

Hi @jorgemarey! Luiz has moved on from HashiCorp, but I've flagged this for prioritization. There's a bit of an architectural challenge here in that the consul_hook that requests tokens from Consul happens early in the alloc setup. There's not really a good way to propagate failures back up to that consul_hook, unless we decide to continually poll Consul (which I suspect will melt Consul and run into caching issues on the agent anyways).

For example, we could pass that token to a template runner and the template runner will hit an error, but there's currently no message channel available for us to say "uh oh, that token is now gone" and get it recreated (and not just recreated but re-polled, see #23381). I'd also have to look into what would happen here to, e.g., an Envoy sidecar proxy.

@jorgemarey
Contributor

Hi @tgross, thanks for the information.

Maybe there could be a channel in the AllocHookResources struct for this kind of event?

@tgross
Member

tgross commented Jun 28, 2024

With the caveat that I haven't dug in too far here, yeah the AllocHookResources is how we usually communicate between hooks.
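
The channel idea floated above can be sketched in Go as follows. This is purely illustrative and assumes nothing about Nomad's real `AllocHookResources`: `hookResources`, `tokenEvent`, `reportInvalid`, and `recreateNext` are hypothetical names, and the `login` callback stands in for whatever would exchange a workload identity for a new Consul token. A consumer such as a template runner reports a dead token, and the hook that owns token creation receives the event and stores a replacement.

```go
package main

import (
	"fmt"
	"sync"
)

// tokenEvent is a hypothetical message saying a Consul token is no longer valid.
type tokenEvent struct {
	Identity string // name of the workload identity the token was derived from
	Reason   string // e.g. the error returned by Consul
}

// hookResources is a hypothetical stand-in for state shared between allocation
// hooks. It is NOT Nomad's AllocHookResources.
type hookResources struct {
	mu           sync.Mutex
	tokens       map[string]string
	tokenInvalid chan tokenEvent
}

func newHookResources() *hookResources {
	return &hookResources{
		tokens:       make(map[string]string),
		tokenInvalid: make(chan tokenEvent, 8),
	}
}

// reportInvalid is what a token consumer would call after a 403. The send is
// non-blocking so a slow receiver never stalls the reporter; it returns false
// if the buffer is full and the event was dropped.
func (r *hookResources) reportInvalid(ev tokenEvent) bool {
	select {
	case r.tokenInvalid <- ev:
		return true
	default:
		return false
	}
}

// recreateNext blocks until one event arrives, obtains a new token through the
// supplied login function, and stores it for other hooks to read.
func (r *hookResources) recreateNext(login func(identity string) string) tokenEvent {
	ev := <-r.tokenInvalid
	tok := login(ev.Identity)
	r.mu.Lock()
	r.tokens[ev.Identity] = tok
	r.mu.Unlock()
	return ev
}

// token returns the currently stored token for an identity.
func (r *hookResources) token(identity string) string {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.tokens[identity]
}

func main() {
	res := newHookResources()
	res.tokens["service/redis"] = "old-token"

	// A consumer notices that the token is gone.
	res.reportInvalid(tokenEvent{Identity: "service/redis", Reason: "403 (ACL not found)"})

	// The token-owning hook reacts by logging in again (stubbed here).
	res.recreateNext(func(id string) string { return "new-token-for-" + id })
	fmt.Println(res.token("service/redis"))
}
```

In a real hook, `recreateNext` would run in a loop alongside a shutdown signal. The sketch covers only the signaling path and does not address how a new token would reach an already running Envoy sidecar.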

@tgross
Member

tgross commented Oct 10, 2024

I've got a docs PR up (#24167) which will close this issue as wontfix. Although we could build out a facility for recreating the Consul tokens, Consul doesn't support refreshing tokens. This means it's impossible to update the Envoy sidecar proxies for Consul Service Mesh workloads without tearing down the proxy, which is precisely what's happening when the tokens get deleted anyways.

tgross closed this as not planned on Oct 10, 2024
github-project-automation bot moved this from In Progress to Done in Nomad - Community Issues Triage on Oct 10, 2024
tgross added a commit that referenced this issue Oct 10, 2024
As of #24166, Nomad agents will use their own token to deregister services and
checks from Consul. This returns the deregistration path to the pre-Workload
Identity workflow. Expand the documentation to make clear why certain ACL
policies are required for clients.

Additionally, we did not explicitly call out that auth methods should not set an
expiration on Consul tokens. Nomad does not have a facility to refresh these
tokens if they expire. Even if Nomad could, there's no way to re-inject them
into Envoy sidecars for Consul Service Mesh without recreating the task anyways,
which is what happens today. Warn users that they should not set an expiration.

Closes: #20185 (wontfix)
Ref: https://hashicorp.atlassian.net/browse/NET-10262
tgross added a commit that referenced this issue Oct 11, 2024
