AWS KMS Server Keymanager #2066
Conversation
Signed-off-by: Mariano Kunzi <[email protected]>
Signed-off-by: Andres Gomez Coronel <[email protected]>
- adds alias usage
- adds error handling and logging
- major refactoring

Signed-off-by: Mariano Kunzi <[email protected]>
Signed-off-by: Andres Gomez Coronel <[email protected]>
- Tests converted into table-driven tests
- Changes to use test suite assertions

Signed-off-by: Maximiliano Churichi <[email protected]>
- adds pagination to aliases listing
- adds optional key prefix to configuration
- makes access_key_id and secret_access_key optional
- fixes weird case of aliases without a key
- more logging

Signed-off-by: Mariano Kunzi <[email protected]>
Signed-off-by: Maximiliano Churichi <[email protected]>
Signed-off-by: Mariano Kunzi <[email protected]>
Hi @kunzimariano! I apologize for the latency in reviewing this PR. We're heads down preparing for the 1.0 release, but should be able to get to this in the next week or so.
Thank you @kunzimariano! It's great to see this coming!
I have some initial comments.
if !hasOldEntry {
//create alias
_, err = p.kmsClient.CreateAliasWithContext(ctx, &kms.CreateAliasInput{
I'm not sure if it's safe to assume that p.kmsClient will not be nil.
I believe that the call to BuiltIn() should guarantee that it is not nil. If that isn't enough, would checking for nil and returning an error be enough?
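For illustration, a guard along these lines is what I had in mind; a minimal sketch where checkClient is a hypothetical helper and the Plugin type is simplified, not code from this PR:

```go
package awskms

import "errors"

// kmsClient is a stand-in for whatever KMS client interface the plugin uses.
type kmsClient interface{}

type Plugin struct {
	kmsClient kmsClient
}

// checkClient is a hypothetical helper: it returns an error instead of
// letting a nil client panic at call time.
func (p *Plugin) checkClient() error {
	if p.kmsClient == nil {
		return errors.New("aws_kms: kms client is not initialized")
	}
	return nil
}
```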
Signed-off-by: Mariano Kunzi <[email protected]>
Is it possible to get this in by 1.0? It is critical for getting this into production in infrastructure that requires signing certificates to be stored in a KMS. It is actively blocking my ability to run in production environments. I am concerned that there may be a long delay before a 1.1 release, which may throw off some of my timelines. :)
Or if 1.0 is a long way out, I'd be happy with a 0.12 point release. ;)
Hi @fkautz, after reevaluating the 1.0 release plan, we decided to have a 0.12 feature release, and this PR is planned to be included in that release.
Fantastic, thank you!
Hi @kunzimariano , thanks for sending this! The prospect of KMS support in SPIRE is already getting me excited.
There are quite a few changes here, so I'm still taking it all in. Full disclaimer: I didn't get a chance to review all the test code just yet, but overall I think this is off to a good start.
I left you a few comments about some implementation details. I had a couple of primary takeaways from the first review of the PR:
- We should see if we can be a little more economical in our call patterns of AWS APIs to avoid adding significant delays to SPIRE Server startup time and to prevent unnecessary API calls.
- I really appreciate all the effort you've gone to so far to add robust tests. I noticed the current coverage is about 77%, ideally we should try to get this to be as close to 100% as is feasible.
Signed-off-by: Mariano Kunzi <[email protected]>
Signed-off-by: Mariano Kunzi <[email protected]>
Thanks for your feedback @rturner3! I think I addressed almost all the comments. I will address the plugin init optimization plus the test improvements and coverage in the next commit.
- Fixed up gRPC service definition import for v0 KeyManager
- Removed error prefixing (which is handled universally by the v0 shim)

Signed-off-by: Andrew Harding <[email protected]>
Signed-off-by: Mariano Kunzi <[email protected]>
}
p.notifyDeleteDone()
backoff = min(backoff*2, backoffMax)
time.Sleep(backoff)
Using an actual sleep here is going to 1) keep this from responding to context cancellation in a timely manner, 2) make it harder to test...
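For example, something roughly like this would respond to cancellation; a minimal sketch where waitOrDone is a hypothetical helper, and in the plugin the timer could come from the clock abstraction (p.hooks.clk) to keep it testable:

```go
package awskms

import (
	"context"
	"time"
)

// waitOrDone sleeps for d, but returns early if ctx is cancelled.
func waitOrDone(ctx context.Context, d time.Duration) error {
	t := time.NewTimer(d)
	defer t.Stop()
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-t.C:
		return nil
	}
}
```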
The backoff as applied also ends up delaying all deletion operations, including those for unrelated key ARNs.
Thanks for hanging in there @kunzimariano, I know it's been tough getting this change through.
I see it's been reviewed several times already. It is also currently blocking the 0.12.2 release, which is already overdue. I'd really like to see it merged sooner rather than later so we can get this show on the road. I left many comments, but I think it will be best to address many of them in a follow-on PR. I can personally address the logging-related comments after this is merged. I mostly left these comments for everyone else's benefit, so folks can see the logic behind choosing a "good" log message.
Below are the items that I think most need attention in this PR. If an item doesn't appear in the list below, then I don't consider it blocking for this PR or for the release.
- Agreement on configurable name (see docs comment)
- Clarification on support for nested topologies, what are the interactions there (if any)
- Clarification on alias + key delete logic
- Threshold
- Alias/description matching and collision resistance
| access_key_id | string | see [AWS KMS Access](#aws-kms-access) | The Access Key ID used to authenticate to KMS | Value of AWS_ACCESS_KEY_ID environment variable |
| secret_access_key | string | see [AWS KMS Access](#aws-kms-access) | The Secret Access Key used to authenticate to KMS | Value of AWS_SECRET_ACCESS_KEY environment variable |
| region | string | yes | The region where the keys will be stored | |
| server_id_file_path | string | yes | A file path location where the server id will be persisted | |
I think we need a section in the docs describing what this is and what it's used for
| access_key_id | string | see [AWS KMS Access](#aws-kms-access) | The Access Key ID used to authenticate to KMS | Value of AWS_ACCESS_KEY_ID environment variable |
| secret_access_key | string | see [AWS KMS Access](#aws-kms-access) | The Secret Access Key used to authenticate to KMS | Value of AWS_SECRET_ACCESS_KEY environment variable |
| region | string | yes | The region where the keys will be stored | |
| server_id_file_path | string | yes | A file path location where the server id will be persisted | |
This suggested change is both technically accurate and less confusing when read in the context of a key manager configuration. It also gives the reader a hint at what it's used for and what might happen if it's lost.
When I read "location where the server id will be persisted", I wondered "what's a server id?", "can I not set it?", "what is it used for?", "why does it have to be a file?", etc. I think a change similar to the one below avoids a lot of that. The name of the configurable could probably also benefit from an update, e.g. key_metadata_file.
| server_id_file_path | string | yes | A file path location where the server id will be persisted | |
| server_id_file_path | string | yes | A file path location where information about generated keys will be persisted | |
key_metadata_file sounds much better to me. I also think we need to describe at a high level what it's used for, the implications of losing it, and how we recover and clean up our own mess. We should probably include a section in the documentation that elaborates on our purge strategy. These are the kinds of things we end up answering in Slack over and over.
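For context while we draft that section, the behavior is roughly this; a minimal sketch assuming the file simply persists a generated server ID (loadOrCreateServerID is a hypothetical helper, and the plugin's real metadata handling may differ):

```go
package awskms

import (
	"os"

	"github.com/google/uuid"
)

// loadOrCreateServerID reads the persisted server ID, or generates and
// persists a new one if the file does not exist yet. If the file is lost,
// the server gets a new ID and its previously created keys are only
// reclaimed by the stale-alias cleanup.
func loadOrCreateServerID(path string) (string, error) {
	data, err := os.ReadFile(path)
	switch {
	case err == nil:
		return string(data), nil
	case !os.IsNotExist(err):
		return "", err
	}
	id := uuid.New().String()
	if err := os.WriteFile(path, []byte(id), 0600); err != nil {
		return "", err
	}
	return id, nil
}
```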
now := p.hooks.clk.Now()
diff := now.Sub(*alias.LastUpdatedDate)
if diff < aliasThreshold {
continue
So if the alias hasn't been updated in the last 48 hours, we delete it and schedule deletion of the referenced key?
💯 But only if the alias belongs to the current trust domain, with the exception of the current server.
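To spell out the conditions in one place, a minimal sketch follows; shouldDispose is a hypothetical helper, and the actual logic in the plugin is spread across multiple functions:

```go
package awskms

import (
	"strings"
	"time"
)

// shouldDispose summarizes the dispose condition discussed above; it is not
// the PR's code.
func shouldDispose(aliasName string, lastUpdated, now time.Time, trustDomainPrefix, thisServerPrefix string, threshold time.Duration) bool {
	switch {
	case !strings.HasPrefix(aliasName, trustDomainPrefix):
		// Alias belongs to another trust domain: never touch it.
		return false
	case strings.HasPrefix(aliasName, thisServerPrefix):
		// Alias belongs to the current server: still in use.
		return false
	case now.Sub(lastUpdated) < threshold:
		// Updated recently enough: another server may still be using it.
		return false
	default:
		// Stale alias from this trust domain but another server: dispose.
		return true
	}
}
```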
case alias.AliasName == nil || alias.LastUpdatedDate == nil || alias.AliasArn == nil:
continue
// if alias does not belong to trust domain skip
case !strings.HasPrefix(*alias.AliasName, p.aliasPrefixForTrustDomain()):
Does this support the case of a nested deployment (single trust domain) in which multiple independent clusters are using KMS? Servers in "this" cluster will be deleting/managing keys in "that" cluster?
I don't think I understand the "this" and "that" analogy. If all those servers are running under the same trust domain, the same AWS account, and the same AWS region, then yes.
describeResp, err := p.kmsClient.DescribeKey(ctx, &kms.DescribeKeyInput{KeyId: alias.AliasArn})
switch {
case err != nil:
log.Error("Failed to describe key to dispose", reasonTag, err)
I think this error message, and most others in this package, would greatly benefit from more clarity. It is useful to think of what the operator experience would be when encountering such a message, and what information is useful.
In this case, I think the useful information is "we could not remove an old key that we wanted to remove". Along with that, it is also useful to know why, or the exact error encountered. This is the case for most of the errors we log in this function. Compare this to information like "failed to describe", "failed to fetch", "malformed", etc... those are more reasons, and are supplementary.
For example, this log could read "Failed to clean up old KMS keys", reason err.
The log below could similarly be "Failed to clean up old KMS keys", reason "missing data in AWS API describe response"
These messages are straight to the point - if I read "failed to clean up old KMS keys", I know exactly what that means and what impact it might have. I know to go check AWS. Compare to "failed to describe key to dispose" - were we still able to delete it? Why are we disposing of it? Etc.
for {
select {
case <-ctx.Done():
p.log.Debug("Stopping dispose keys task", reasonTag, ctx.Err())
These kinds of log messages feel a little over the top. I say that because they're useful to developers, but not very useful to an operator. If the context is cancelled, there will be lots of other stuff in the log, and what the dispose keys task is doing doesn't feel very important to know (even at the debug level).
AliasLastUpdatedDate: &unixEpoch,
},
{
AliasName: aws.String("alias/another_td/aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee/id_04"),
What exactly are we matching on when we dispose of keys? Disposing of a KMS key is of course a very sensitive and error-prone operation, we want to be extra sure that we're only deleting the ones we created and manage. I think we should add some test cases here, and perhaps some more verification in the plugin code.
I may have misread, but it seems like we're only looking for the trust domain name in the alias... and maybe a static alias/ prefix. The trust domain name is often something with meaning beyond just SPIFFE/SPIRE... for example, it may be the name of the company, e.g. acme.co. In that case, it's easy to imagine that we're not too far off from a name collision and possibly deleting keys that aren't ours.
That's a great observation. Aliases now look like this: alias/SPIRE_SERVER/{TRUST_DOMAIN}/{SERVER_ID}/{KEY_ID}. The prefix 'SPIRE_SERVER' was added with the intention of making collisions less likely.
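For concreteness, assembling such an alias could look like this; aliasName is a hypothetical helper and the example values in the comment are made up:

```go
package awskms

import "fmt"

// aliasName builds an alias following the layout described above:
// alias/SPIRE_SERVER/{TRUST_DOMAIN}/{SERVER_ID}/{KEY_ID}.
func aliasName(trustDomain, serverID, keyID string) string {
	// e.g. alias/SPIRE_SERVER/example.org/aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee/id_01
	return fmt.Sprintf("alias/SPIRE_SERVER/%s/%s/%s", trustDomain, serverID, keyID)
}
```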
Quick clarification: my comment about alias + key deletion, and threshold, is looking for clarification on exactly what conditions must be met before we decide that an alias + key should be deleted. I see some threshold logic in there. A code comment with this info might be a good idea, since the logic is spread across multiple functions.
I've just merged #2180. Consequently, you'll need to touch up the import for the keymanager definitions. Old import: If you'd rather I touch it up and push a commit, I'm happy to oblige. Just let me know.
Signed-off-by: Mariano Kunzi <[email protected]>
Signed-off-by: Mariano Kunzi <[email protected]>
Signed-off-by: Mariano Kunzi <[email protected]>
- Added documentation about the usage of aliases and the key metadata file
- Changed the alias threshold from 48 hours to two weeks
- Improved some logging of errors

Signed-off-by: Agustín Martínez Fayó <[email protected]>
@kunzimariano During the last maintainers call we discussed this PR and realized that the alias threshold of 48 hours would be too aggressive and that a time of around two weeks would be more appropriate. I pushed a commit to address that and a few other things:
Signed-off-by: Andrew Harding <[email protected]>
Thanks @amartinezfayo! All those changes seem reasonable.
Signed-off-by: Andrew Harding <[email protected]>
Signed-off-by: Andrew Harding <[email protected]>
I just pushed another commit this morning that:
Signed-off-by: Andrew Harding <[email protected]>
Big thanks to @kunzimariano and everyone else for this contribution!
Signed-off-by: Andrew Harding <[email protected]>
Changes were addressed in previous commits
Signed-off-by: Mariano Kunzi <[email protected]>
This PR introduces a new server keymanager plugin that uses AWS KMS. There is a high-level explanation of it in the referenced issue.
Some considerations:
- The key_prefix configuration option is a simple and quick way of preventing key overlapping when servers share the same region. I'm curious to hear if a more elaborate solution is desired.
- At this point, if key deletion scheduling fails, an error is logged but a retry won't be attempted. This doesn't affect the plugin functionality, but it leaves unused keys, which comes with a (low) cost in money. It's easy to find those keys and delete them outside the plugin, but I think a more robust approach could be better. I considered pushing the key to delete into a queue/channel and letting a long-running goroutine handle it with retries in case of failure (see the sketch at the end of this description). Since I'm not aware of any mechanism for signaling the plugins that the server is shutting down (with the intention of stopping the aforementioned goroutine), I'm reluctant to add this functionality. Any feedback on this topic is appreciated.
- Logs might be a little bit verbose at this point. Happy to dial it back.

Fixes #1921
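For reference, this is roughly the queue/goroutine idea mentioned above; a minimal sketch with placeholder names (not code in this PR), where context cancellation stands in for the server-shutdown signal that doesn't exist today:

```go
package awskms

import (
	"context"
	"time"
)

// runDeleteWorker is a hypothetical sketch: keys whose deletion scheduling
// failed are pushed into a channel and retried by a long-running goroutine,
// which stops when ctx is cancelled (i.e. when the server shuts down).
func runDeleteWorker(ctx context.Context, queue <-chan string, scheduleDelete func(context.Context, string) error) {
	const (
		maxRetries = 3
		retryDelay = time.Minute
	)
	for {
		select {
		case <-ctx.Done():
			return
		case keyArn := <-queue:
			for attempt := 0; attempt < maxRetries; attempt++ {
				if err := scheduleDelete(ctx, keyArn); err == nil {
					break
				}
				// Wait before retrying, but still honor cancellation.
				select {
				case <-ctx.Done():
					return
				case <-time.After(retryDelay):
				}
			}
		}
	}
}
```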