Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Variable Interpolation error with Consul Workload Identity #20025

Closed
MorphBonehunter opened this issue Feb 22, 2024 · 12 comments · Fixed by #20344
Closed

Variable Interpolation error with Consul Workload Identity #20025

MorphBonehunter opened this issue Feb 22, 2024 · 12 comments · Fixed by #20344
Assignees
Labels
stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/consul type/bug
Milestone

Comments

@MorphBonehunter
Copy link

MorphBonehunter commented Feb 22, 2024

Nomad version

Output from nomad version

Nomad v1.7.5
BuildDate 2024-02-13T15:10:13Z
Revision 5f5d4646198d09b8f4f6cb90fb5d50b53fa328b8

Operating system and Environment details

Arch Linux

Issue

Today i try to migrate to the workload identity mechanism.
While the vault migration went well, the consul part fails on the variable interpolation stuff.

For easiness and simple change possibility i write my jobfiles with extensive use of NOMAD_TASK_NAME variable,
this doesn't work anymore after migration if you use this in the consul service definition part:

    task "ara" {
      template {
      ...
      }
      driver = "docker"
      config {
        image = ...
        hostname = "${NOMAD_TASK_NAME}"
        command = "/opt/ara/ara-init.sh"
        mount {
          type = "volume"
          source = "${NOMAD_TASK_NAME}"
          target = "/opt/ara/data"
        }
        ports = [
          "http"
        ]
        labels {
          cpu_shares = "${NOMAD_CPU_LIMIT}"
          container_hostname = "${NOMAD_TASK_NAME}"
        }
      }
      service {
        name = "${NOMAD_TASK_NAME}"
        tags = [
          "${node.class}",
          "traefik.enable=true",
          "traefik.http.routers.${NOMAD_TASK_NAME}.entryPoints=https",
          "traefik.http.routers.${NOMAD_TASK_NAME}.tls=true",
          "traefik.http.routers.${NOMAD_TASK_NAME}.rule=Host(`${NOMAD_TASK_NAME}....`)",
          "traefik.http.routers.${NOMAD_TASK_NAME}.middlewares=compression@file,secureheaders@file,local-access-only@file",
          "traefik.tags=host_${attr.unique.hostname}"
        ]
        port = "http"
        check {
          name = "${NOMAD_TASK_NAME}"
          type = "http"
          method = "HEAD"
          path = "/"
          interval = "20s"
          timeout = "5s"
          check_restart {
            limit = 3
            grace = "10s"
          }
        }
      }
    }

After that the following error occurs:

failed to setup alloc: pre-run hook "consul" failed: 1 error occurred: * failed to derive Consul token for service ${NOMAD_TASK_NAME}: Unexpected response code: 500 (rpc error making call: invalid bind name: "${nomad_task_name}")

Reproduction steps

Try to use NOMAD_TASK_NAME in service name definition with active workload identity settings.

Expected Result

The job should start like it does before the migration.

Actual Result

The job fails.

I'm not sure if this is a corner case and the usage of variable interpolation in this scenario isn't something i should do, but as it worked before i assume this is a bug.

@tgross
Copy link
Member

tgross commented Feb 22, 2024

Hi @MorphBonehunter! Yeah that looks like a bug for sure (and a regression in any case). We must be missing the interpolation step in the new code. I'll dig into this to verify and then mark this issue for roadmapping (or push a fix sooner rather than later if it's simple).

@tgross tgross self-assigned this Feb 22, 2024
@tgross tgross added this to the 1.7.x milestone Feb 22, 2024
@tgross
Copy link
Member

tgross commented Feb 23, 2024

Hi @MorphBonehunter, I had a chance to investigate this further and it looks like our new consul_hook that we use to derive Consul tokens for WI is indeed missing a call to interpolate. But the identity_hook that signs the workload identity also needs that interpolation step, otherwise we just end up with an error earlier in the process.

I've run out of time for this week to finish my fix (see my working branch), but I should be able to pick it back up next week.

@tgross tgross added the stage/accepted Confirmed, and intend to work on. No timeline committment though. label Feb 23, 2024
@MorphBonehunter
Copy link
Author

Hi @tgross thanks for the update.
Is there any possibility that this also could happen in the vault hook? I do not have an sample for this and also no idea where this could be used but maybe 🤔

@tgross
Copy link
Member

tgross commented Feb 28, 2024

Yeah, it'll almost certainly impact Vault as well because of the missing interpolation in the identity_hook. So that'll get picked up in the same work, but I'll double-check (and ensure we have test coverage) that the vault_hook is working fine with interpolation.

@MorphBonehunter
Copy link
Author

Hey @tgross are there any news on this issue?

@tgross
Copy link
Member

tgross commented Apr 5, 2024

Sorry @MorphBonehunter I didn't get a chance to follow-thru on this. I've got a branch that fixes the issue in the Consul hook but not the WI hook (which is a little trickier). I'll try to get this resolved in the next week.

@tgross
Copy link
Member

tgross commented Apr 9, 2024

Hey @MorphBonehunter I've got a draft PR up #20344 which I've still got to do some end-to-end testing on. I'll hopefully have that ready for review tomorrow with an eye towards getting into the next patch release. Thanks for your patience!

@MorphBonehunter
Copy link
Author

Hi @tgross, that sounds great. I didn't want to create any pressure here, i just want to ask :)

@tgross
Copy link
Member

tgross commented Apr 10, 2024

So the patch I've got up is close but not quite enough, and I'm running into a design issue I'm going to have to work out. I can get all the signing requests from the client to be interpolated just fine. But when it comes to getting those identities signed, we send an "identity handle" up to the server and tell it to sign the identity from the jobspec that matches that handle.

The problem is that the identity handle in the request is interpolated so it won't match the identity handle from the identity in the server's version of the jobspec (which is derived from the service block here). And on the server we can't interpolate anything other than information the server knows that's generic to the jobspec. Ex. we could interpolate $NOMAD_TASK_NAME but we can't interpolate $NOMAD_NODE_ID because that's per-alloc. Even if we could fix the handle, the identity block on the server would also need to be interpolated so that we're not signing for a Consul service named literally ${NOMAD_TASK_NAME}.

I'm bringing this back to the team for some internal discussion around what to do with it, and will report back once I know more.

tgross added a commit that referenced this issue Apr 10, 2024
Services can have some of their string fields interpolated. The new Workload
Identity flow doesn't interpolate the services before requesting signed
identities or using those identities to get Consul tokens.

Add support for interpolation to the WID manager and the Consul tokens hook by
providing both with a taskenv builder.

This changeset also makes two related test improvements:
* Remove the mock WID manager, which was only used in the Consul hook tests and
  isn't necessary so long as we provide the real WID manager with the mock signer
  and never call `Run` on it.
* Fixed swapped expect-vs-actual assertions on the `consul_hook` tests.

Fixes: #20025
@tgross
Copy link
Member

tgross commented Apr 11, 2024

Ok, I've got that problem fixed in #20344 by adding a field to the WIHandle struct that we pass in the signing request that includes the client-interpolated version of the text, without mangling the original handle. I've smoke-tested that and I've marked the PR for review by my colleagues.

tgross added a commit that referenced this issue Apr 11, 2024
Services can have some of their string fields interpolated. The new Workload
Identity flow doesn't interpolate the services before requesting signed
identities or using those identities to get Consul tokens.

Add support for interpolation to the WID manager and the Consul tokens hook by
providing both with a taskenv builder.

This changeset also makes two related test improvements:
* Remove the mock WID manager, which was only used in the Consul hook tests and
  isn't necessary so long as we provide the real WID manager with the mock signer
  and never call `Run` on it.
* Fixed swapped expect-vs-actual assertions on the `consul_hook` tests.

Fixes: #20025
tgross added a commit that referenced this issue Apr 11, 2024
Services can have some of their string fields interpolated. The new Workload
Identity flow doesn't interpolate the services before requesting signed
identities or using those identities to get Consul tokens.

Add support for interpolation to the WID manager and the Consul tokens hook by
providing both with a taskenv builder. Add an "interpolate workload" field to
the WI handle to allow passing the original workload name to the server so the
server can find the correct service to sign.

This changeset also makes two related test improvements:
* Remove the mock WID manager, which was only used in the Consul hook tests and
  isn't necessary so long as we provide the real WID manager with the mock
  signer and never call `Run` on it. It wasn't feasible to exercise the correct
  behavior without this refactor, as the mocks were bypassing the new code.
* Fixed swapped expect-vs-actual assertions on the `consul_hook` tests.

Fixes: #20025
tgross added a commit that referenced this issue Apr 11, 2024
Services can have some of their string fields interpolated. The new Workload
Identity flow doesn't interpolate the services before requesting signed
identities or using those identities to get Consul tokens.

Add support for interpolation to the WID manager and the Consul tokens hook by
providing both with a taskenv builder. Add an "interpolate workload" field to
the WI handle to allow passing the original workload name to the server so the
server can find the correct service to sign.

This changeset also makes two related test improvements:
* Remove the mock WID manager, which was only used in the Consul hook tests and
  isn't necessary so long as we provide the real WID manager with the mock
  signer and never call `Run` on it. It wasn't feasible to exercise the correct
  behavior without this refactor, as the mocks were bypassing the new code.
* Fixed swapped expect-vs-actual assertions on the `consul_hook` tests.

Fixes: #20025
tgross added a commit that referenced this issue Apr 11, 2024
…es into release/1.7.x (#20363)

Services can have some of their string fields interpolated. The new Workload
Identity flow doesn't interpolate the services before requesting signed
identities or using those identities to get Consul tokens.

Add support for interpolation to the WID manager and the Consul tokens hook by
providing both with a taskenv builder. Add an "interpolate workload" field to
the WI handle to allow passing the original workload name to the server so the
server can find the correct service to sign.

This changeset also makes two related test improvements:
* Remove the mock WID manager, which was only used in the Consul hook tests and
  isn't necessary so long as we provide the real WID manager with the mock
  signer and never call `Run` on it. It wasn't feasible to exercise the correct
  behavior without this refactor, as the mocks were bypassing the new code.
* Fixed swapped expect-vs-actual assertions on the `consul_hook` tests.

Fixes: #20025

Co-authored-by: Tim Gross <[email protected]>
@MorphBonehunter
Copy link
Author

@tgross Thanks for figuring this out, great work.

philrenaud pushed a commit that referenced this issue Apr 18, 2024
Services can have some of their string fields interpolated. The new Workload
Identity flow doesn't interpolate the services before requesting signed
identities or using those identities to get Consul tokens.

Add support for interpolation to the WID manager and the Consul tokens hook by
providing both with a taskenv builder. Add an "interpolate workload" field to
the WI handle to allow passing the original workload name to the server so the
server can find the correct service to sign.

This changeset also makes two related test improvements:
* Remove the mock WID manager, which was only used in the Consul hook tests and
  isn't necessary so long as we provide the real WID manager with the mock
  signer and never call `Run` on it. It wasn't feasible to exercise the correct
  behavior without this refactor, as the mocks were bypassing the new code.
* Fixed swapped expect-vs-actual assertions on the `consul_hook` tests.

Fixes: #20025
Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 30, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/consul type/bug
Projects
Development

Successfully merging a pull request may close this issue.

2 participants