Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

keyring: keys stuck in rekeying after rotate -full #19368

Closed
tgross opened this issue Dec 7, 2023 · 4 comments · Fixed by #23577
Closed

keyring: keys stuck in rekeying after rotate -full #19368

tgross opened this issue Dec 7, 2023 · 4 comments · Fixed by #23577
Assignees
Milestone

Comments

@tgross
Copy link
Member

tgross commented Dec 7, 2023

In #19340 @sbihel reported a behavior where keys would get stuck in rekeying after running nomad operator root keyring rotate -full. I was able to confirm this and build a quick unit test to cover it, but I don't yet have a fix for the underlying problem.

#19340 covered another critical bug and was automatically closed once the fix was merged. This issue is a follow-up.

@tgross
Copy link
Member Author

tgross commented Dec 7, 2023

Note that following Nomad 1.7.0, we don't easily know when it's safe to delete keys anymore. We don't have the WI in the state store so there's no easy way to map from a set of allocations with WI's signed by a given key to that key. We do have the SignedIdentities map in the allocation, but reading this would involve unpacking all the JWTs in that map for each allocation we want to check.

@tgross
Copy link
Member Author

tgross commented Jan 9, 2024

Ref #19669

@tgross tgross modified the milestones: 1.7.x, 1.8.x Jun 4, 2024
@tgross tgross self-assigned this Jul 10, 2024
tgross added a commit that referenced this issue Jul 12, 2024
When a root key is rotated, the servers immediately start signing Workload
Identities with the new active key. But workloads may be using those WI tokens
to sign into external services, which may not have had time to fetch the new
public key and which might try to fetch new keys as needed.

Add support for prepublishing keys. Prepublished keys will be visible in the
JWKS endpoint but will not be used for signing or encryption until their
`PublishTime`. Update the periodic key rotation to prepublish keys at half the
`root_key_rotation_threshold` window, and promote prepublished keys to active
after the `PublishTime`.

This changeset also fixes two bugs in periodic root key rotation and garbage
collection, both of which can't be safely fixed without implementing
prepublishing:

* Periodic root key rotation would never happen because the default
  `root_key_rotation_threshold` of 720h exceeds the 72h maximum window of the FSM
  time table. We now compare the `CreateTime` against the wall clock time instead
  of the time table. (We expect to remove the time table in future work, ref
  #16359)
* Root key garbage collection could GC keys that were used to sign
  identities. We now wait until `root_key_rotation_threshold` +
  `root_key_gc_threshold` before GC'ing a key.
* When rekeying a root key, the core job did not mark the key as inactive after
  the rekey was complete.

Ref: https://hashicorp.atlassian.net/browse/NET-10398
Ref: https://hashicorp.atlassian.net/browse/NET-10280
Fixes: #19669
Fixes: #23528
Fixes: #19368
tgross added a commit that referenced this issue Jul 19, 2024
When a root key is rotated, the servers immediately start signing Workload
Identities with the new active key. But workloads may be using those WI tokens
to sign into external services, which may not have had time to fetch the new
public key and which might try to fetch new keys as needed.

Add support for prepublishing keys. Prepublished keys will be visible in the
JWKS endpoint but will not be used for signing or encryption until their
`PublishTime`. Update the periodic key rotation to prepublish keys at half the
`root_key_rotation_threshold` window, and promote prepublished keys to active
after the `PublishTime`.

This changeset also fixes two bugs in periodic root key rotation and garbage
collection, both of which can't be safely fixed without implementing
prepublishing:

* Periodic root key rotation would never happen because the default
  `root_key_rotation_threshold` of 720h exceeds the 72h maximum window of the FSM
  time table. We now compare the `CreateTime` against the wall clock time instead
  of the time table. (We expect to remove the time table in future work, ref
  #16359)
* Root key garbage collection could GC keys that were used to sign
  identities. We now wait until `root_key_rotation_threshold` +
  `root_key_gc_threshold` before GC'ing a key.
* When rekeying a root key, the core job did not mark the key as inactive after
  the rekey was complete.

Ref: https://hashicorp.atlassian.net/browse/NET-10398
Ref: https://hashicorp.atlassian.net/browse/NET-10280
Fixes: #19669
Fixes: #23528
Fixes: #19368
@tgross tgross closed this as completed in 2f43534 Jul 19, 2024
tgross added a commit that referenced this issue Jul 19, 2024
…23651)

When a root key is rotated, the servers immediately start signing Workload Identities with the new active key. But workloads may be using those WI tokens to sign into external services, which may not have had time to fetch the new public key and which might try to fetch new keys as needed.

Add support for prepublishing keys. Prepublished keys will be visible in the JWKS endpoint but will not be used for signing or encryption until their `PublishTime`. Update the periodic key rotation to prepublish keys at half the `root_key_rotation_threshold` window, and promote prepublished keys to active after the `PublishTime`.

This changeset also fixes three bugs in periodic root key rotation and garbage collection, none of which can be safely fixed without implementing prepublishing:

* Periodic root key rotation would never happen because the default `root_key_rotation_threshold` of 720h exceeds the 72h maximum window of the FSM time table. We now compare the `CreateTime` against the wall clock time instead of the time table. (We expect to remove the time table in future work, ref #16359)
* Root key garbage collection could GC keys that were used to sign identities. We now wait until `root_key_rotation_threshold` + `root_key_gc_threshold` before GC'ing a key.
*  When rekeying a root key, the core job did not mark the key as inactive after the rekey was complete.

Ref: https://hashicorp.atlassian.net/browse/NET-10398
Ref: https://hashicorp.atlassian.net/browse/NET-10280
Fixes: #19669
Fixes: #23528
Fixes: #19368

Co-authored-by: Tim Gross <[email protected]>
@tgross tgross modified the milestones: 1.8.x, 1.8.3 Jul 19, 2024
@tgross
Copy link
Member Author

tgross commented Jul 19, 2024

Fixed in #23577 and will ship in the next regular release of Nomad 1.8.x, with backports to Nomad 1.7.x/1.6.x Enterprise.

Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 20, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant