
AWS KMS Server Keymanager #2066

Merged: 33 commits merged into spiffe:master on Apr 5, 2021

Conversation

@kunzimariano (Member) commented Jan 22, 2021

This PR introduces a new server KeyManager plugin that uses AWS KMS. A high-level explanation is available in the linked issue.

Some considerations:

  • When it comes to retries and backoff, the plugin uses the default configuration offered by the AWS SDK. For the sake of simplicity this is not exposed through the plugin configuration, but it could be if necessary (see the sketch after this list). For more details, see Error retries and exponential backoff in AWS, and the SDK's default values.
  • The key_prefix configuration option is a simple and quick way of preventing key overlap when servers share the same region. I'm curious to hear whether a more elaborate solution is desired.
  • At this point, if key deletion scheduling fails, an error is logged but no retry is attempted. This doesn't affect the plugin's functionality, but it leaves unused keys behind, which comes with a (low) monetary cost. It's easy to find those keys and delete them outside the plugin, but I think a more robust approach could be better. I considered pushing the key to delete into a queue/channel and letting a long-running goroutine handle it, with retries in case of failure. Since I'm not aware of any mechanism for signaling plugins that the server is shutting down (with the intention of stopping the aforementioned goroutine), I'm reluctant to add this functionality. Any feedback on this topic is appreciated.
  • Logs might be a little bit verbose at this point. Happy to dial it back.
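
A minimal sketch of what exposing that retry configuration could look like, assuming the AWS SDK for Go v1; the constructor name and its wiring into the plugin are hypothetical, not part of this PR:

```go
import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/kms"
)

// newKMSClient shows how the SDK's retry cap could be surfaced through
// plugin configuration. The SDK's default retryer already performs
// exponential backoff; MaxRetries only caps the number of retried attempts.
func newKMSClient(region string, maxRetries int) (*kms.KMS, error) {
	sess, err := session.NewSession(&aws.Config{
		Region:     aws.String(region),
		MaxRetries: aws.Int(maxRetries),
	})
	if err != nil {
		return nil, err
	}
	return kms.New(sess), nil
}
```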

Fixes #1921

kunzimariano and others added 8 commits January 20, 2021 14:40
- adds alias usage
- adds error handling and logging
- major refactoring

Signed-off-by: Mariano Kunzi <[email protected]>
Signed-off-by: Andres Gomez Coronel <[email protected]>
- Tests converted into table driven
- Changes to use test suite assertions

Signed-off-by: Maximiliano Churichi <[email protected]>
- adds pagination to aliases listing
- adds optional key prefix to configuration
- makes access_key_id and secret_access_key optional
- fixes weird case of aliases without a key.
- more logging

Signed-off-by: Mariano Kunzi <[email protected]>
Signed-off-by: Maximiliano Churichi <[email protected]>
@kunzimariano changed the title from Awskms to AWS KMS Server Keymanager on Jan 22, 2021
@azdagron (Member)

Hi @kunzimariano! I apologize for the delay in reviewing this PR. We're heads down preparing for the 1.0 release, but we should be able to get to this in the next week or so.

@amartinezfayo (Member) left a comment

Thank you @kunzimariano! It's great to see this coming!
I have some initial comments.


if !hasOldEntry {
//create alias
_, err = p.kmsClient.CreateAliasWithContext(ctx, &kms.CreateAliasInput{
Member

I'm not sure if it's safe to assume that p.kmsClient will not be nil.

Member Author

I believe the call to BuiltIn() guarantees it won't be nil. If that isn't enough, would checking for nil and returning an error be sufficient?
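
For illustration, such a guard could look like the sketch below; the method name and error message are assumptions, not code from this PR:

```go
import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// checkConfigured is a hypothetical guard called at the start of each
// RPC that touches the client, assuming kmsClient is set in Configure.
func (p *Plugin) checkConfigured() error {
	if p.kmsClient == nil {
		return status.Error(codes.FailedPrecondition, "awskms: plugin is not configured")
	}
	return nil
}
```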

Signed-off-by: Mariano Kunzi <[email protected]>
@fkautz (Member) commented Feb 18, 2021

Is it possible to get this in by 1.0? It is critical for getting it into production on infrastructure that requires signing certificates to be stored in a KMS.

It is actively blocking my ability to run in production environments. I am concerned that there may be a long delay before a 1.1 release, which may throw off some of my timelines. :)

@fkautz (Member) commented Feb 18, 2021

Or if 1.0 is a long way out, I'd be happy with a 0.12 point release. ;)

@rturner3 self-assigned this Feb 18, 2021
@evan2645 removed their assignment Feb 18, 2021
@amartinezfayo (Member)

> Or if 1.0 is a long way out, I'd be happy with a 0.12 point release. ;)

Hi @fkautz, after reevaluating the 1.0 release plan, we decided to have a 0.12 feature release, and this PR is planned to be included in that release.

@fkautz (Member) commented Feb 18, 2021

Fantastic, thank you!

@rturner3 (Collaborator) left a comment

Hi @kunzimariano, thanks for sending this! The prospect of KMS support in SPIRE is already getting me excited.

There are quite a few changes here, so I'm still taking it all in. Full disclosure: I didn't get a chance to review all the test code just yet, but overall I think this is off to a good start.

I left you a few comments about some implementation details. I had a couple of primary takeaways from this first review of the PR:

  • We should see if we can be a little more economical in our call patterns of AWS APIs to avoid adding significant delays to SPIRE Server startup time and to prevent unnecessary API calls.
  • I really appreciate all the effort you've gone to so far to add robust tests. I noticed the current coverage is about 77%; ideally, we should try to get this as close to 100% as is feasible.

Signed-off-by: Mariano Kunzi <[email protected]>
Signed-off-by: Mariano Kunzi <[email protected]>
@kunzimariano (Member Author) left a comment

Thanks for your feedback, @rturner3! I think I addressed almost all of the comments. I will address the plugin init optimization, plus the test improvements and coverage, in the next commit.

- Fixed up gRPC service definition import for v0 KeyManager
- Removed error prefixing (which is handled universally by the v0 shim)

Signed-off-by: Andrew Harding <[email protected]>
@azdagron azdagron added this to the 0.12.2 milestone Mar 25, 2021
Signed-off-by: Mariano Kunzi <[email protected]>
}
p.notifyDeleteDone()
backoff = min(backoff*2, backoffMax)
time.Sleep(backoff)
Member

Using an actual sleep here is going to 1) keep this from responding to context cancellation in a timely manner, and 2) make it harder to test...

Member

The backoff as applied also ends up being applied to all deletion operations, including those for unrelated keyArns.
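
A minimal context-aware alternative to the bare sleep, sketched under the assumption that the backoff is tracked per key; the function name is illustrative:

```go
import (
	"context"
	"time"
)

// sleepWithBackoff waits for the current backoff interval but returns
// early if the context is cancelled, so shutdown isn't delayed and the
// wait can be driven deterministically in tests. It returns the next
// (doubled, capped) backoff to use.
func sleepWithBackoff(ctx context.Context, backoff, backoffMax time.Duration) (time.Duration, error) {
	select {
	case <-ctx.Done():
		return backoff, ctx.Err()
	case <-time.After(backoff):
	}
	next := backoff * 2
	if next > backoffMax {
		next = backoffMax
	}
	return next, nil
}
```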

@evan2645 (Member) left a comment

Thanks for hanging in there @kunzimariano, I know it's been tough getting this change through.

I see it's been reviewed several times already. It is also currently blocking the 0.12.2 release, which is already overdue. I'd really like to see it merged sooner rather than later so we can get this show on the road. I left many comments, but I think it will be best to address many of them in a follow-on PR. I can personally address the logging-related comments after this is merged; I mostly left those comments for everyone else's benefit, so folks can see the logic behind choosing a "good" log message.

Below are the items that I think most need attention in this PR. If an item doesn't appear in the list below, I don't consider it blocking for this PR or for the release.

  • Agreement on configurable name (see docs comment)
  • Clarification on support for nested topologies, what are the interactions there (if any)
  • Clarification on alias + key delete logic
    • Threshold
    • Alias/description matching and collision resistance

| access_key_id | string | see [AWS KMS Access](#aws-kms-access) | The Access Key Id used to authenticate to KMS | Value of AWS_ACCESS_KEY_ID environment variable |
| secret_access_key | string | see [AWS KMS Access](#aws-kms-access) | The Secret Access Key used to authenticate to KMS | Value of AWS_SECRET_ACCESS_KEY environment variable |
| region | string | yes | The region where the keys will be stored | |
| server_id_file_path | string | yes | A file path location where the server id will be persisted | |
Member

I think we need a section in the docs describing what this is and what it's used for

Member

This suggested change is both technically accurate and less confusing when read in the context of a key manager configuration. It also gives the reader a hint at what it's used for and what might happen if it's lost.

When I read "location where the server id will be persisted", I wondered "what's a server id?", "can I not set it?", "what is it used for?", "why does it have to be a file?", etc. I think a change similar to the one below avoids a lot of that. The name of the configurable could probably also benefit from an update, like key_metadata_file.

Suggested change
- | server_id_file_path | string | yes | A file path location where the server id will be persisted | |
+ | server_id_file_path | string | yes | A file path location where information about generated keys will be persisted | |

Member

key_metadata_file sounds much better to me. I also think we need to describe at a high level what it's used for, the implications of what happens when it is lost, and how we recover and clean up our own mess. We should probably include a section in the documentation that elaborates on our purge strategy. These are the kinds of things we end up answering in Slack over and over.

now := p.hooks.clk.Now()
diff := now.Sub(*alias.LastUpdatedDate)
if diff < aliasThreshold {
continue
Member

So if the alias hasn't been updated in the last 48 hours, we delete it and schedule deletion of the referenced key?

Member Author

💯 but only if the alias belongs to the current trust domain, with the exception of aliases owned by the current server.

case alias.AliasName == nil || alias.LastUpdatedDate == nil || alias.AliasArn == nil:
continue
// if alias does not belong to trust domain skip
case !strings.HasPrefix(*alias.AliasName, p.aliasPrefixForTrustDomain()):
Member

Does this support the case of a nested deployment (single trust domain) in which multiple independent clusters are using KMS? Servers in "this" cluster will be deleting/managing keys in "that" cluster?

Member Author

I don't think I understand the "this" and "that" analogy. If all those servers are running under the same trust domain, the same AWS account, and the same AWS region, then yes.
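
Putting this thread together with the threshold logic above, the disposal condition could be summarized in one place like the sketch below. aliasThreshold and aliasPrefixForTrustDomain appear in the diff; aliasPrefixForServer is an assumed helper for the "current server" exception:

```go
import (
	"strings"
	"time"
)

// isStaleAlias reports whether an alias is a disposal candidate: it
// belongs to this trust domain, is not owned by the current server,
// and hasn't been updated within the threshold.
func (p *Plugin) isStaleAlias(aliasName string, lastUpdated, now time.Time) bool {
	switch {
	case !strings.HasPrefix(aliasName, p.aliasPrefixForTrustDomain()):
		return false // not ours: never touch it
	case strings.HasPrefix(aliasName, p.aliasPrefixForServer()):
		return false // owned by the current server (hypothetical helper)
	default:
		return now.Sub(lastUpdated) >= aliasThreshold
	}
}
```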

describeResp, err := p.kmsClient.DescribeKey(ctx, &kms.DescribeKeyInput{KeyId: alias.AliasArn})
switch {
case err != nil:
log.Error("Failed to describe key to dispose", reasonTag, err)
Member

I think this error message, and most others in this package, would greatly benefit from more clarity. It is useful to think of what the operator experience would be when encountering such a message, and what information is useful.

In this case, I think the useful information is "we could not remove an old key that we wanted to remove". Along with that, it is also useful to know why, i.e. the exact error encountered. This is the case for most of the errors we log in this function. Compare this to information like "failed to describe", "failed to fetch", "malformed", etc.; those are reasons, and are supplementary.

For example, this log could read "Failed to clean up old KMS keys", reason err.

The log below could similarly be "Failed to clean up old KMS keys", reason "missing data in AWS API describe response"

These messages are straight to the point: if I read "failed to clean up old KMS keys", I know exactly what that means and what impact it might have. I know to go check AWS. Compare that to "failed to describe key to dispose": were we still able to delete it? Why are we disposing of it? Etc.
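
To make the suggestion concrete, here is the before/after using the logging call pattern from the diff; the "after" message strings are the ones proposed in this comment:

```go
// Before: names the failed API call, which says little about impact
log.Error("Failed to describe key to dispose", reasonTag, err)

// After: names the operator-visible outcome, with the cause as the reason
log.Error("Failed to clean up old KMS keys", reasonTag, err)

// ...and when the API response is well-formed but incomplete:
log.Error("Failed to clean up old KMS keys", reasonTag, "missing data in AWS API describe response")
```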

for {
select {
case <-ctx.Done():
p.log.Debug("Stopping dispose keys task", reasonTag, ctx.Err())
Member

These kinds of log messages feel a little over the top. I say that because they're useful to developers, but not very useful to an operator. If the context is cancelled, there will be lots of other stuff in the log, and what the dispose keys task is doing doesn't feel very important to know (even at debug level).

AliasLastUpdatedDate: &unixEpoch,
},
{
AliasName: aws.String("alias/another_td/aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee/id_04"),
Member

What exactly are we matching on when we dispose of keys? Disposing of a KMS key is of course a very sensitive and error-prone operation, we want to be extra sure that we're only deleting the ones we created and manage. I think we should add some test cases here, and perhaps some more verification in the plugin code.

I may have misread, but it seems like we're only looking for the trust domain name in the alias... and maybe a static alias/ prefix. A trust domain name often has meaning beyond just SPIFFE/SPIRE... for example, it may be the name of the company, e.g. acme.co. In that case, it's easy to imagine that we're not too far off from a name collision and possibly deleting keys that aren't ours.

Member Author

That's a great observation. Aliases now look like this: alias/SPIRE_SERVER/{TRUST_DOMAIN}/{SERVER_ID}/{KEY_ID}. The 'SPIRE_SERVER' prefix was added to make collisions less likely.
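
A sketch of that naming scheme as code; the helper names are hypothetical:

```go
import (
	"fmt"
	"strings"
)

const aliasPrefix = "alias/SPIRE_SERVER/"

// aliasForKey builds alias/SPIRE_SERVER/{TRUST_DOMAIN}/{SERVER_ID}/{KEY_ID}.
func aliasForKey(trustDomain, serverID, keyID string) string {
	return fmt.Sprintf("%s%s/%s/%s", aliasPrefix, trustDomain, serverID, keyID)
}

// belongsToTrustDomain is the ownership check made before destructive
// operations; the trailing slash guards against partial matches such
// as "acme.co" vs "acme.co.uk".
func belongsToTrustDomain(aliasName, trustDomain string) bool {
	return strings.HasPrefix(aliasName, aliasPrefix+trustDomain+"/")
}
```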

@evan2645 (Member)

Quick clarification:

My comment about alias + key deletion and the threshold is looking for clarification on exactly what conditions must be met before we decide that an alias + key should be deleted. I see some threshold logic in there. A code comment with this info might be a good idea, since the logic is spread across multiple functions.

@azdagron (Member)

I've just merged #2180. Consequently, you'll need to touch up the import for the keymanager definitions.

Old import:
keymanagerv0 "github.com/spiffe/spire/proto/spire/server/keymanager/v0"
New import:
keymanagerv0 "github.com/spiffe/spire/proto/spire/plugin/server/keymanager/v0"

If you'd rather me touch it up and push a commit, I'm happy to oblige. Just let me know.

kunzimariano and others added 4 commits March 31, 2021 21:26
Signed-off-by: Mariano Kunzi <[email protected]>
Signed-off-by: Mariano Kunzi <[email protected]>
- Added documentation about the usage of aliases and the key metadata file
- Changed the alias threshold from 48 hours to two weeks
- Improved some logging of errors

Signed-off-by: Agustín Martínez Fayó <[email protected]>
@amartinezfayo (Member)

@kunzimariano During the last maintainers call we discussed this PR and realized that the alias threshold of 48 hours would be too aggressive; a time of around two weeks would be more appropriate. I pushed a commit to address that and a few other things:

  • Changed the server_id_file_path setting to be key_metadata_file
  • Added documentation about the usage of aliases and the key metadata file
  • Changed the alias threshold from 48 hours to two weeks
  • Improved some logging of errors

@kunzimariano (Member Author)

Thanks @amartinezfayo! All those changes seem reasonable.

@azdagron (Member) commented Apr 2, 2021

I just pushed another commit this morning that:

  • does a little wordsmithing on the docs
  • removes some things from the docs that were out of scope for the plugin (server key type selection and configuration)
  • reverts the testkey changes and instead embeds testkey.Keys in the fake so keys can be reused across test cases
  • merges in master
  • cleans up imports

Signed-off-by: Andrew Harding <[email protected]>
evan2645 previously approved these changes Apr 2, 2021
@evan2645 (Member) left a comment

big thanks @kunzimariano and everyone else for this contribution!

@rturner3 dismissed their stale review April 5, 2021 21:27

Changes were addressed in previous commits

@azdagron merged commit f6c36f9 into spiffe:master Apr 5, 2021
azdagron pushed a commit to azdagron/spire that referenced this pull request Apr 6, 2021