RSA key lost on leader election (1.7.0) #19340
Comments
Hi @sbihel! Thanks for this report. The 1.7.0 GA has just been published, unfortunately, so I'll want to get a fast-follow patch together if I can track this down. A couple questions that might help me narrow down the behavior more quickly:
Of course @tgross:
Hi @sbihel! So it looks like I've got a fix in #19350 for the signing validation problem you're seeing. The bug was that the Trying to replicate a non-existing key and not cleaning up
When we added an RSA key for signing Workload Identities, we added it to the keystore serialization but did not also add it to the `GetKey` RPC. This means that when a key is rotated, the RSA key will not come along.
The Nomad leader signs all Workload Identities, but external consumers of WI (like Consul or Vault) will verify the WI against any of the servers. If the request to verify hits a follower, the follower will not have the RSA private key and cannot use the existing ed25519 key to verify WIs with the `RS256` algorithm.
Add the RSA key material to the `GetKey` RPC. Also remove an extraneous write to disk that happens for each key each time we restart the Nomad server.
Fixes: #19340
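For anyone trying to picture the failure mode in the PR description above, here is a toy Python model of it. This is only a sketch and not Nomad's code: `RootKey`, `get_key_response`, `replicate`, and `can_verify` are hypothetical names, the key material is placeholder bytes, and no real cryptography is performed. It only shows how a replication response that omits one field leaves a follower unable to handle `RS256`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RootKey:
    """Hypothetical stand-in for one entry in a server's keystore."""
    key_id: str
    ed25519_key: bytes
    rsa_key: Optional[bytes] = None  # only needed for RS256


def get_key_response(key: RootKey, include_rsa: bool) -> dict:
    """Model the payload a follower receives when it replicates a key.

    include_rsa=False mimics the bug: the RSA material exists on the
    leader but is left out of the response.
    """
    payload = {"key_id": key.key_id, "ed25519_key": key.ed25519_key}
    if include_rsa:
        payload["rsa_key"] = key.rsa_key
    return payload


def replicate(payload: dict) -> RootKey:
    """Build the follower's copy of the key from the response payload."""
    return RootKey(
        key_id=payload["key_id"],
        ed25519_key=payload["ed25519_key"],
        rsa_key=payload.get("rsa_key"),
    )


def can_verify(key: RootKey, alg: str) -> bool:
    """Report whether this copy holds the material the algorithm needs."""
    if alg == "RS256":
        return key.rsa_key is not None
    if alg == "EdDSA":
        return bool(key.ed25519_key)
    return False


# Placeholder values, not real key material.
leader_key = RootKey("example-key", b"ed25519-material", b"rsa-material")

buggy_follower = replicate(get_key_response(leader_key, include_rsa=False))
fixed_follower = replicate(get_key_response(leader_key, include_rsa=True))

print(can_verify(leader_key, "RS256"))      # leader holds the RSA key
print(can_verify(buggy_follower, "RS256"))  # follower copy lacks it
print(can_verify(fixed_follower, "RS256"))  # the fix carries it along
```

In this model the buggy follower still has the same key ID and ed25519 material as the leader, which is why the keyring can look consistent across servers while `RS256` verification fails on followers.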
Interesting, but I checked on all servers that
Ah, good catch! But as it turns out
So that's a slightly different story than described in #19350 but the same result and same fix. I've updated the PR description.
I've merged #19350 and that'll get shipped in Nomad 1.7.1 as soon as feasible. I've renamed this issue to make it easier for folks to find if they've been caught out by it.
Brilliant, thank you.
Same thing here: I upgraded Nomad from 1.3.5 to 1.7.0 and it got stuck with the key problem:
Rotation, even a full rotation, didn't help. Fixed by manually copying
We've shipped a patch for this bug in Nomad 1.7.1.
@tgross Thank you so much for fixing this. I am 99% sure that I was hit by this bug and ended up rebuilding my whole cluster, and only then did I find this. It's completely my fault for missing this in the release notes. But would it be possible to be a bit louder about these footguns in the changelog in the future? And maybe link to a spelled-out guide on how to fix it? (Maybe it is out there, but I didn't find it.)
Nomad version
v1.7.0-rc.1
Operating system and Environment details
Ubuntu `22.04`
Consul `v1.17.0`
Issue
I am deploying a fresh cluster and setting up workload identities for Consul and Vault. Bootstrapped the cluster with Nomad `v1.7.0-beta.2` but now all the nodes are using `v1.7.0-rc.1`.
By default, the root keyring contained a single ed25519 key. I then triggered a full rotation to generate an RSA key, but then encountered a few issues:
`failed to verify id token signature` from Consul;
`rekeying` state, even though I don't appear to use variables; and
Reproduction steps
Deploy a fresh cluster using the latest Nomad version. Perhaps using `v1.7.0-beta.2` at first caused the server to enter a bad state -- I can try to re-bootstrap the cluster if you think this might be the root cause.
Expected Result
Actual Result
Nomad Server logs (if appropriate)
When a new server joins the cluster, the only relevant logs appear to be:
But the `128ba7c1` key was long removed. (I had previously done a non-`-full` rotation.)
Nomad Client logs (if appropriate)
Error logs related to the id token verification by Consul: