vault: fix token revocation during workflow migration #19689

lgfa29 · 2024-01-09T19:35:38Z

When transitioning from the legacy token-based workflow to the new JWT workflow for Vault, the previous code would instantiate a no-op Vault client if the server configuration had a default_identity block.

This no-op client returned an error when some of its operations were called, such as LookupToken and RevokeTokens. The original intention was that, in the new JWT workflow, none of these methods should be called, so returning an error could help surface potential bugs.

But the RevokeTokens and MarkForRevocation methods are called even in the JWT flow. When a leadership transition happens, the new server looks for unused Vault accessors in state and tries to revoke them. Similarly, the RevokeTokens method is called every time clients make the Node.UpdateStatus and Node.UpdateAlloc RPCs, as the Nomad server tries to find unused Vault tokens for the node/alloc.

The new JWT flow does not require Nomad servers to contact Vault, so a server may have no Vault address or token configured, and without them RevokeTokens and MarkForRevocation cannot complete. This commit changes the logic to use the no-op Vault client when either one is not set. It also updates the no-op client to log a warning instead of returning an error when these methods are called, so operators are made aware that there are Vault tokens created by Nomad that have not been force-expired.

There are also updates to the documentation of the migration process. When migrating an existing cluster to the new workload identity based flow, Nomad operators must first upgrade the Nomad version without removing any of the existing Vault configuration. Removing that configuration too early can prevent Nomad servers from managing and cleaning up existing Vault tokens during a leadership transition and during node or alloc updates.

Operators must also resubmit all jobs with a vault block so they are updated with an identity for Vault. Skipping this step may cause allocations to fail if their Vault token expires (if, for example, the Nomad client stops running for TTL/2) or if they are rescheduled. In both cases the new client will try to follow the legacy flow, which fails if the Nomad server configuration for Vault has already been updated to remove the Vault address and token.

nomad/server.go

nomad/vault_noop.go

lgfa29 · 2024-01-09T19:55:18Z

website/content/docs/integrations/vault/acl.mdx

+* Resubmit Nomad jobs that need access to Vault to redeploy them with a new
+  workload identity for Vault.


I was trying to think of ways we could help with this. Two ideas I had were:

1. A command that looks for the following artifacts and tells users whether their clusters are ready for the migration:
   * Jobs with vault blocks but no Vault identities (and maybe templates with Vault-related functions, since a vault block is not always required to access Vault).
   * Unused VaultAccessors in the state store (potentially cleaning them up without needing a leadership transition).
2. Some kind of metric that operators can monitor and wait until it reaches zero. But I'm not sure yet exactly what to measure.

I like the idea of a command... maybe we could bake this into a -check flag for nomad setup vault?

A metric would be nice, but the only reasonable place for it would be in the state store and we'd have to repopulate it during snapshot restore. Kind of messy.

Cool! I will try to think a bit more about where to place the command. I was thinking somewhere under nomad operator, to indicate that this may require a management token since it needs to look all over the state store.

It could also be used for other upgrade checks in the future.

tgross

LGTM once the comments I've made are resolved.

nomad/server.go

nomad/vault_noop.go

tgross · 2024-01-09T20:17:07Z

I like the idea of a command... maybe we could bake this into a -check flag for nomad setup vault?

A metric would be nice, but the only reasonable place for it would be in the state store and we'd have to repopulate it during snapshot restore. Kind of messy.

When transitioning from the legacy token-based workflow to the new JWT workflow for Vault, the previous code would instantiate a no-op Vault client if the server configuration had a `default_identity` block. This no-op client returned an error when some of its operations were called, such as `LookupToken` and `RevokeTokens`. The original intention was that, in the new JWT workflow, none of these methods should be called, so returning an error could help surface potential bugs.

But the `RevokeTokens` and `MarkForRevocation` methods _are_ called even in the JWT flow. When a leadership transition happens, the new server looks for unused Vault accessors in state and tries to revoke them. Similarly, the `RevokeTokens` method is called every time clients make the `Node.UpdateStatus` and `Node.UpdateAlloc` RPCs, as the Nomad server tries to find unused Vault tokens for the node/alloc.

The new JWT flow does not require Nomad servers to contact Vault, and `RevokeTokens` and `MarkForRevocation` cannot complete without a Vault token, so this commit changes the logic to use the no-op Vault client when no token is configured. It also updates the no-op client to log instead of returning an error when these methods are called, so operators are made aware that there are Vault tokens created by Nomad that have not been force-expired.

When migrating an existing cluster to the new workload identity based flow, Nomad operators must first upgrade the Nomad version without removing any of the existing Vault configuration. Removing that configuration too early can prevent Nomad servers from managing and cleaning up existing Vault tokens during a leadership transition and during node or alloc updates. Operators must also resubmit all jobs with a `vault` block so they are updated with an `identity` for Vault. Skipping this step may cause allocations to fail if their Vault token expires (if, for example, the Nomad client stops running for TTL/2) or if they are rescheduled. In both cases the new client will try to follow the legacy flow, which fails if the Nomad server configuration for Vault has already been updated to remove the Vault address and token.

Lord-Y · 2024-01-15T15:40:47Z

We just hit this problem. We had to roll back our config to one without workload identity. We'll wait for the next release.

lgfa29 added the backport/1.7.x label (backport to 1.7.x release line) Jan 9, 2024

lgfa29 commented Jan 9, 2024

nomad/server.go (outdated)

lgfa29 commented Jan 9, 2024

nomad/vault_noop.go (outdated)

lgfa29 force-pushed the b-fix-vault-noop-client branch from 89afc21 to 9d394e4 Compare January 9, 2024 19:50

lgfa29 commented Jan 9, 2024

lgfa29 requested review from pkazmierczak and tgross January 9, 2024 19:56

tgross approved these changes Jan 9, 2024

lgfa29 force-pushed the b-fix-vault-noop-client branch from 9d394e4 to 2dac161 Compare January 10, 2024 01:47

lgfa29 force-pushed the b-fix-vault-noop-client branch from 2dac161 to 984f7fc Compare January 10, 2024 03:07

lgfa29 added 3 commits January 9, 2024 22:17

changelog: add entry for #19689

315acaf

lgfa29 force-pushed the b-fix-vault-noop-client branch from 984f7fc to 315acaf Compare January 10, 2024 03:17

vault: remove unused variable

3aa0561

lgfa29 merged commit 5267eec into main Jan 10, 2024
21 checks passed

lgfa29 deleted the b-fix-vault-noop-client branch January 10, 2024 18:28

hc-github-team-nomad-core mentioned this pull request Jan 10, 2024

Backport of vault: fix token revocation during workflow migration into release/1.7.x #19694

Merged

lgfa29 mentioned this pull request Jan 11, 2024

vault: remove revoked Vault accessors from state #19706

Merged

hc-github-team-nomad-core mentioned this pull request Jan 11, 2024

Backport of vault: remove revoked Vault accessors from state into release/1.7.x #19719

Merged
