Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

identity: default to RS256 for new workload ids #18882

Merged
merged 23 commits into from
Oct 31, 2023
Merged

identity: default to RS256 for new workload ids #18882

merged 23 commits into from
Oct 31, 2023

Conversation

schmichael
Copy link
Member

@schmichael schmichael commented Oct 27, 2023

OIDC mandates the support of the RS256 signing algorithm so in order to maximize workload identity's usefulness this change switches from using the EdDSA signing algorithm to RS256.

Old keys will continue to use EdDSA but new keys will use RS256. The EdDSA generation code was left in place because it's fast and cheap and I'm not going to lie I hope we get to use it again.

Test Updates

Most of our Variables and Keyring tests had a subtle assumption in them that the keyring would be initialized by the time the test server had elected a leader. ed25519 key generation is so fast that the fact that it was happening asynchronously with server startup didn't seem to cause problems. Sadly rsa key generation is so slow that basically all of these tests failed.

I added a new testutil.WaitForKeyring helper to replace testutil.WaitForLeader in cases where the keyring must be initialized before the test may continue. However this is mostly used in the nomad/ package.

In the api and command/agent packages I decided to switch their helpers to wait for keyring initialization by default. This will slow down tests a bit, but allow those packages to not be as concerned with subtle server readiness details. On my machine rsa key generation takes 63ms, so hopefully the difference isn't significant on CI runners.

TODO

  • Docs and changelog entries.
  • Upgrades - right now upgrades won't get RS256 keys until their root key rotates either manually or after ~30 days.
  • Observability - I'm not sure there's a way for operators to see if they're using EdDSA or RS256 unless they inspect a key. The JWKS endpoint can be inspected to see if EdDSA will be used for new identities, but it doesn't technically define which key is active. If upgrades can be fixed to automatically rotate keys, we probably don't need to worry about this.

Requiem for ed25519

When workload identities were first implemented we did not immediately consider OIDC compliance. Consul, Vault, and many other third parties support JWT auth methods without full OIDC compliance. For the machine<-->machine use cases workload identity is intended to fulfill, OIDC seemed like a bigger risk than asset.

EdDSA/ed25519 is the signing algorithm we chose for workload identity JWTs because of all these lovely properties:

  1. Deterministic keys that can be derived from our preexisting root keys. This was perhaps the biggest factor since we already had a root encryption key around from which we could derive a signing key.
  2. Wonderfully compact: 64 byte private key, 32 byte public key, 64 byte signatures. Just glorious.
  3. No parameters. No choices of encodings. It's all well-defined by RFC 8032.
  4. Fastest performing signing algorithm! We don't even care that much about the performance of our chosen algorithm, but what a free bonus!
  5. Arguably one of the most secure signing algorithms widely available. Not just from a cryptanalysis perspective, but from an API and usage perspective too.

Life was good with ed25519, but sadly it could not last.

IDPs, such as AWS's IAM OIDC Provider, love OIDC. They have OIDC implemented for humans, so why not reuse that OIDC support for machines as well? Since OIDC mandates RS256, many implementations don't bother implementing other signing algorithms (or at least not advertising their support). A quick survey of OIDC Discovery endpoints revealed only 2 out of 10 OIDC providers advertised support for anything other than RS256:

RS256 only:

OIDC mandates the support of the RS256 signing algorithm so in order to
maximize workload identity's usefulness this change switches from using
the EdDSA signing algorithm to RS256.

Old keys will continue to use EdDSA but new keys will use RS256. The
EdDSA generation code was left in place because it's fast and cheap and
I'm not going to lie I hope we get to use it again.

**TODO**

- [ ] Tests - 🔜
- [ ] Upgrades - right now upgrades won't get RS256 keys until their root
  key rotates either manually or after ~30 days.
- [ ] Observability - I'm not sure there's a way for operators to see if
  they're using EdDSA or RS256 unless they inspect a key. The JWKS
  endpoint can be inspected to see if EdDSA will be used for new
  identities, but it doesn't technically define which key is active. If
  upgrades can be fixed to automatically rotate keys, we probably don't
  need to worry about this.

**Requiem for ed25519**

When workload identities were first implemented we did not immediately
consider OIDC compliance. Consul, Vault, and many other third parties
support JWT auth methods without full OIDC compliance. For the
machine<-->machine use cases workload identity is intended to fulfill,
OIDC seemed like a bigger risk than asset.

EdDSA/ed25519 is the signing algorithm we chose for workload identity JWTs
because of all these lovely properties:

1. Deterministic keys that can be derived from our preexisting root
   keys. This was perhaps the biggest factor since we already had a root
   encryption key around from which we could derive a signing key.
2. Wonderfully compact: 64 byte private key, 32 byte public key, 64 byte
   signatures. Just glorious.
3. No parameters. No choices of encodings. Its all well-defined by RFC
   8032.
4. Fastest performing signing algorithm! We don't even care that much
   about the performance of our chosen algorithm, but what a free bonus!
5. Arguably one of the most secure signing algorithms widely available.
   Not just from a cryptanalysis perspective, but from an API and usage
   perspective.

Life was good with ed25519, but sadly it could not last.

IDPs, such as AWS's IAM OIDC Provider, love OIDC. They have OIDC
implemented for humans, so why not reuse that OIDC support for machines
as well? Since OIDC mandates RS256, many implementations don't bother
implementing other signing algorithms (or at least not advertising their
support). A quick survey of OIDC Discovery endpoints revealed only 2 out
of 10 OIDC providers advertised support for anything other than RS256:

- [PayPal](https://www.paypalobjects.com/.well-known/openid-configuration) supports HS256
- [Yahoo](https://api.login.yahoo.com/.well-known/openid-configuration) supports ES256

RS256 only:

- [GitHub](https://token.actions.githubusercontent.com/.well-known/openid-configuration)
- [GitLab](https://gitlab.com/.well-known/openid-configuration)
- [Google](https://accounts.google.com/.well-known/openid-configuration)
- [Intuit](https://developer.api.intuit.com/.well-known/openid_configuration)
- [Microsoft](https://login.microsoftonline.com/fabrikamb2c.onmicrosoft.com/v2.0/.well-known/openid-configuration)
- [SalesForce](https://login.salesforce.com/.well-known/openid-configuration)
- [SimpleLogin (acquired by ProtonMail)](https://app.simplelogin.io/.well-known/openid-configuration/)
- [TFC](https://app.terraform.io/.well-known/openid-configuration)
add test that mimics upgrading
testutil/wait.go Outdated
Comment on lines 156 to 161
// As of 1.7 the RSA key generation slows down keyring initialization so
// much that tests relying on the keyring being initialized fail unless
// they wait until the key was generated
var resp structs.KeyringListPublicResponse
err = rpc("Keyring.ListPublic", args, &resp)
return err == nil, err
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oof. RSA key generation is so much slower than ed25519 that it causes heaps of tests to fail. Anything that expects the keyring to be initialized but doesn't explicitly wait for that to occur fails with a keyring has not been initialized yet error.

I wrote a quick benchmark for comparing rsa.GenerateKey vs ed25519.NewKeyFromSeed and got:

~/src/benchcrypto$ go test -bench .
goos: linux
goarch: amd64
pkg: github.com/schmichael/benchcrypto
cpu: 12th Gen Intel(R) Core(TM) i9-12900H
BenchmarkRSA-20               16          63629942 ns/op
BenchmarkEd25519-20       109570             10977 ns/op
PASS
ok      github.com/schmichael/benchcrypto       3.883s

That's 63ms to generate an RSA key vs 11 microseconds for ed25519.

Generating a new RSA key only occurs every 30 days, so the latency in production environments doesn't matter.

However this is a pretty big hassle for tests. Technically this was a race condition just waiting to cause flakiness, so maybe I shouldn't feel too bad?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say that many of the tests that spin up the server and need to wait for the leader are tests that take on the order of seconds and not milliseconds anyways, so... 🤷 But it looks like in practice a lot of the test groups are taking a lot longer to run. But it's hard to tell for sure until they're all passing, because some of them will hang waiting until failure.

Maybe it makes sense to split out a testutil.WaitForKeyring method so that tests that care about the keyring can wait and those that don't (because they never create Allocations or Variables) can skip that wait?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to change something like TestConfigForServer to use Ed25519 by default since I think most tests don't care about the exact algorithm being used?

Or maybe preload test servers with hardcoded RSA material so they don't have to be computed all the time?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm exploring the testutil.WaitForKeyring route first as the fact that our tests have hidden races in them is is going to bother me regardless. 😅

Luiz's options are good fallbacks (or could even be nice optimizations in addition to WaitForKeyring), but all fall into the category of widening the differences between test servers and reality. For 99.9% of tests that extra difference isn't worth worrying about, so I don't think it's a big concern.

Thanks for helping me brainstorm this!

Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this is looking good. It looks like we might have discovered a possible bug in locking based on the panic we're seeing in TestVariablesEndpoint_Apply_LockAcquireUpsertAndRelease, in addition to whatever other testing race conditions there are to mop up here.

Observability - I'm not sure there's a way for operators to see if they're using EdDSA or RS256 unless they inspect a key.

We could add a flag to the RootMeta struct that tells us if there's a RS256 key available for that keyset. That would show up in the nomad operator keyring root list command (and equivalent API) and would allow Keyring.List to continue to just read from state. Technically that only tells us that the key could be used for RS256, not that we're necessarily signing anything with it, but there's no user-configurable knob here to pick between the two I think?

Along those lines, having an option for identity blocks to sign identities with a specific key could let people who don't care about OIDC use only the ed25519 keys, but the identity block is in the hands of the job submitter and not the cluster admin so maybe that's a bad idea.

If upgrades can be fixed to automatically rotate keys, we probably don't need to worry about this.

That would be cool! It feels like it'd be tricky to coordinate across leader elections during an upgrade.

testutil/wait.go Outdated
Comment on lines 156 to 161
// As of 1.7 the RSA key generation slows down keyring initialization so
// much that tests relying on the keyring being initialized fail unless
// they wait until the key was generated
var resp structs.KeyringListPublicResponse
err = rpc("Keyring.ListPublic", args, &resp)
return err == nil, err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say that many of the tests that spin up the server and need to wait for the leader are tests that take on the order of seconds and not milliseconds anyways, so... 🤷 But it looks like in practice a lot of the test groups are taking a lot longer to run. But it's hard to tell for sure until they're all passing, because some of them will hang waiting until failure.

Maybe it makes sense to split out a testutil.WaitForKeyring method so that tests that care about the keyring can wait and those that don't (because they never create Allocations or Variables) can skip that wait?

Comment on lines +299 to +302
// waitForServers waits for the Nomad server's HTTP API to become available,
// and then waits for the keyring to be intialized. This implies a leader has
// been elected and Raft writes have occurred.
func (s *TestServer) waitForServers() {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this and the corresponding change in testutil/server.go I decided to switch all tests to wait for the keyring to be initialized.

I did this because I think these tests are more like integration tests and less concerned with the implementation details than nomad/ package tests. Therefore I thought it appropriate to let these tests block until servers are ready by default.

Comment on lines -82 to +80
err = msgpackrpc.CallWithCodec(codec, "Keyring.List", listReq, &listResp)
require.NoError(t, err)
errCh <- msgpackrpc.CallWithCodec(codec, "Keyring.List", listReq, &listResp)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was just to make the race detector happy.

logger.Error("could not initialize keyring: %v", err)
logger.Error("could not initialize keyring", "error", err)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated cosmetic fixups. Just noticed them in the logs when unbreaking tests.

@schmichael
Copy link
Member Author

Marked Ready for Review. I'll get a PR up tomorrow for a holistic set of changelog entries and docs for the identity{}/OIDC changes in 1.7.

and a weird double t.Parallel call
Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM once the last couple of tests are fixed up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants