RSA key lost on leader election (1.7.0) #19340

Closed
sbihel opened this issue Dec 7, 2023 · 12 comments · Fixed by #19350

sbihel commented Dec 7, 2023

Nomad version

v1.7.0-rc.1

Operating system and Environment details

Ubuntu 22.04
Consul v1.17.0

Issue

I am deploying a fresh cluster and setting up workload identities for Consul and Vault. I bootstrapped the cluster with Nomad v1.7.0-beta.2, but now all the nodes are running v1.7.0-rc.1.

By default, the root keyring contained a single ed25519 key. I then triggered a full rotation to generate an RSA key, but encountered a few issues:

  1. with multiple keys in the keyring, Nomad appears to pick one at random (?), which results in many `failed to verify id token signature` errors from Consul;
  2. the old key never leaves the rekeying state, even though I don't appear to be using variables; and
  3. after replacing the servers (hoping to address 2.), the new key is now an ed25519 key even though it kept the same UUID.
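
For reference, the rotation above was done with the standard keyring CLI; a rough sketch of those steps (assuming the `nomad operator root keyring` subcommands available in 1.7 and a reachable server):

    # generate a new root key and re-encrypt existing variables with it
    nomad operator root keyring rotate -full

    # inspect the keyring: each key's ID and state (active, rekeying, ...)
    nomad operator root keyring list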

Reproduction steps

Deploy a fresh cluster using the latest Nomad version. Perhaps using v1.7.0-beta.2 at first caused the server to enter a bad state -- I can try to re-bootstrap the cluster if you think this might be the root cause.

Expected Result

  1. Root key rotation is applied permanently.
  2. Root key on a fresh cluster should probably be RS256 directly.

Actual Result

  1. Root keyring rotation does not stick.
  2. Rekeying never stops.

Nomad Server logs (if appropriate)

When a new server joins the cluster, the only relevant logs appear to be:

[WARN]  nomad.keyring.replicator: failed to fetch key from current leader, trying peers: key=128ba7c1-baa0-3bc6-c20f-833b97a1fbe2 error=<nil>
[ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=128ba7c1-baa0-3bc6-c20f-833b97a1fbe2 error="rpc error: no such key \"128ba7c1-baa0-3bc6-c20f-833b97a1fbe2\" in keyring"
[ERROR] nomad.keyring.replicator: failed to fetch key from any peer: rpc error: no such key "128ba7c1-baa0-3bc6-c20f-833b97a1fbe2" in keyring: key=128ba7c1-baa0-3bc6-c20f-833b97a1fbe2

But the 128ba7c1 key had long since been removed. (I had previously done a standard, non-full rotation.)
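
As a side note, to check which key IDs a given server actually holds on disk you can list its keystore directory (the data_dir below is a placeholder; each root key is stored as `<key-id>.nks.json`):

    # compare this listing across servers to spot keys that were never replicated
    ls /nomad/data/server/keystore/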

Nomad Client logs (if appropriate)

Error logs related to the id token verification by Consul:

msg=
| failed to setup alloc: pre-run hook "consul" failed: 2 errors occurred:
| \t* failed to derive Consul token for service my-service: Unexpected response code: 500 (rpc error making call: error verifying token: failed to verify id token signature)
| \t* failed to derive Consul token for task my-service: Unexpected response code: 500 (rpc error making call: rpc error making call: error verifying token: failed to verify id token signature)
|
failed=true
sbihel added the type/bug label Dec 7, 2023
tgross self-assigned this Dec 7, 2023
tgross (Member) commented Dec 7, 2023

Hi @sbihel! Thanks for this report. The 1.7.0 GA has just been published, unfortunately, so I'll want to get a fast-follow patch together if I can track this down. A couple questions that might help me narrow down the behavior more quickly:

  • Did the allocations that are now failing exist before the upgrade? Allocations are supposed to work across upgrades without breaking, but that's always a challenging bit.
  • Was the keyring rotation before or after upgrading to 1.7.0-beta.2? Or did the upgrade happen when you replaced the servers?
  • Are the "failed to fetch key" errors from after replacing the servers or after the keyring rotation?

sbihel (Author) commented Dec 7, 2023

Of course @tgross:

  • They did not exist before. After rotating the keys and updating the auth method on Consul, I tried stopping and purging the job multiple times to see if that would solve anything, but it did not help.
  • I believe I rotated the key after introducing rc.1 -- which I did by replacing a server (instead of upgrading an existing one).
  • They're from replacing the servers.

tgross (Member) commented Dec 7, 2023

Hi @sbihel! So it looks like I've got a fix in #19350 for the signing validation problem you're seeing. The bug was that the GetKey RPC we use to replicate keys after rotation was not including the RSA key, so only the ed25519 key was being distributed. The leader signs all workload identities with the RS256 algorithm, so whenever the Consul auth method hit a follower that was missing the key, verification failed.

The attempts to replicate a non-existent key and the keys stuck in the rekeying state appear to be a separate problem. I'm still investigating that, but the signing bug is the most critical.
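
To see the mismatch concretely, you can decode the header of a workload identity JWT and check which algorithm and key ID it claims; a rough shell sketch, where `WI_TOKEN` is a placeholder for a workload identity token obtained from an allocation:

    # the JWT header is base64url-encoded: swap in the standard alphabet and restore padding
    h=$(printf '%s' "$WI_TOKEN" | cut -d. -f1 | tr '_-' '/+')
    while [ $(( ${#h} % 4 )) -ne 0 ]; do h="${h}="; done
    printf '%s' "$h" | base64 -d | jq .
    # expect "alg": "RS256"; the "kid" it names must be known to whichever server
    # Consul or Vault happens to query, which is what broke on followers here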

tgross added a commit that referenced this issue Dec 7, 2023
When we added an RSA key for signing Workload Identities, we added it to the
keystore serialization but did not also add it to the `GetKey` RPC. This means
that when a key is rotated, the RSA key will not come along. The Nomad leader
signs all Workload Identities, but external consumers of WI (like Consul or
Vault) will verify the WI against any of the servers. If the request to verify
hits a follower, the follower will not have the RSA private key and cannot use
the existing ed25519 key to verify WIs with the `RS256` algorithm.

Add the RSA key material to the `GetKey` RPC.

Also remove an extraneous write to disk that happens for each key each time we
restart the Nomad server.

Fixes: #19340
sbihel (Author) commented Dec 7, 2023

Interesting, but I checked on all servers that .well-known/jwks.json returned the new RSA key. Would that contradict your fix? If it doesn't, it would explain why only a fraction of requests failed.

tgross (Member) commented Dec 7, 2023

Ah, good catch! But as it turns out /.well-known/jwks.json gets turned into the Keyring.ListPublic RPC, and that's forwarded to the leader! So the mechanism is actually much simpler, and I can reproduce this exactly without doing upgrades or key rotation:

  • Run 3 servers
  • curl http://nomad.example.com:4646/.well-known/jwks.json against any server and see the RSA key
  • Restart the leader so that a leader election happens
  • curl the same endpoint again and see only the ed25519 key, because the new leader doesn't have the RSA key from the old leader

So that's a slightly different story than described in #19350 but the same result and same fix. I've updated the PR description.
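
The jwks.json check from those repro steps can be scripted, for example (the address is a placeholder and `jq` is assumed to be installed):

    # list the key IDs and key types served by the cluster's JWKS endpoint; the
    # request is forwarded to the leader, so it reflects the leader's keystore
    curl -s http://nomad.example.com:4646/.well-known/jwks.json \
      | jq -r '.keys[] | "\(.kid) \(.kty)"'
    # a healthy 1.7 cluster lists an RSA key ("kty": "RSA"); after a leader election
    # onto a server missing that key, only the ed25519 entry remains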

tgross added this to the 1.7.x milestone Dec 7, 2023
tgross changed the title from "Root keyring rotation gets reverted" to "RSA key lost on leader election (1.7.0)" Dec 7, 2023
tgross added a commit that referenced this issue Dec 7, 2023
tgross (Member) commented Dec 7, 2023

I've merged #19350 and that'll get shipped in Nomad 1.7.1 as soon as feasible. I've renamed this issue to make it easier for folks to find if they've been caught out by it.

tgross (Member) commented Dec 7, 2023

I've also opened #19367 and #19368 to follow up on the other two keyring-related behaviors you've described here. I'll follow up on those separately.

sbihel (Author) commented Dec 7, 2023

Brilliant, thank you.

roman-vynar (Contributor) commented Dec 8, 2023

Same thing here: after upgrading Nomad from 1.3.5 to 1.7.0, it got stuck with the key problem:

    2023-12-08T13:09:27.671Z [WARN]  nomad.keyring.replicator: failed to fetch key from current leader, trying peers: key=90c514a7-7612-ea5e-52f0-6982db90a995 error="rpc error: Permission denied"
    2023-12-08T13:09:27.674Z [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=90c514a7-7612-ea5e-52f0-6982db90a995 error="rpc error: Permission denied"
    2023-12-08T13:09:27.674Z [ERROR] nomad.keyring.replicator: failed to fetch key from any peer: rpc error: Permission denied: key=90c514a7-7612-ea5e-52f0-6982db90a995

Neither a regular rotation nor a full rotation helped.

Fixed by manually copying /nomad/data/server/keystore/90c514a7-7612-ea5e-52f0-6982db90a995.nks.json from the leader to the other servers and restarting Nomad.
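
For anyone stuck on an affected release before they can upgrade, that workaround looks roughly like this (hostnames, the data directory, and the systemd unit name are placeholders; substitute your own key ID):

    # on the leader: copy the on-disk key material for the missing key to each follower
    scp /nomad/data/server/keystore/90c514a7-7612-ea5e-52f0-6982db90a995.nks.json \
        follower1:/nomad/data/server/keystore/

    # then restart Nomad on each follower so it loads the copied key
    ssh follower1 'sudo systemctl restart nomad'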

tgross pinned this issue Dec 8, 2023
tgross moved this issue from the 1.7.x milestone to the 1.7.1 milestone Dec 8, 2023
tgross (Member) commented Dec 8, 2023

We've shipped a patch for this bug in Nomad 1.7.1.

kaspergrubbe commented

@tgross Thank you so much for fixing this. I am 99% sure I was hit by this bug, and I ended up rebuilding my whole cluster, only to then find this issue.

It's completely my fault for missing this in the release notes, but would it be possible to be a bit louder about these footguns in the changelog in the future? And maybe link to a spelled-out guide on how to fix it? (Maybe one exists, but I didn't find it.)

github-actions locked this issue as resolved and limited conversation to collaborators Jan 2, 2025