AWS KMS Server Keymanager #2066

kunzimariano · 2021-01-22T14:12:21Z

This PR introduces a new server keymanager plugin that uses AWS KMS. There is a high level explanation of it in the next issue.

Some considerations:

When it comes to retries and backoff, the plugin uses the default configuration offered by the AWS SDK. For the sake of simplicity this is not exposed through the plugin configuration, but it could be if necessary. For more details Error retries and exponential backoff in AWS, and default values.
The key_prefix configuration option is a a simple and quick way of preventing key overlapping when servers share the same region. I'm curious to hear if a more elaborated solution is desired.
At this point if key deletion scheduling fails an error is logged, but a retry won't be attempted. This doesn't affect the plugin functionality, but it leaves unused keys which comes with a (low) cost in money. It's easy to find those keys, and delete them outside the plugin, but I think a more robust approach could be better. I considered pushing the key to delete into a queue/channel, and letting a long running goroutine handle it with retries in case of failure. Since I'm not aware of any mechanism for signaling the plugins, and letting them know that the server is shutting down (with the intention of stopping the before mentioned gorutine), I'm reluctant to add this functionality. Any feedback on this topic is appreciated.
~~Logs might be a little bit verbose at this point. Happy to dial it back.~~

Fixes #1921

Signed-off-by: Mariano Kunzi <[email protected]>

Signed-off-by: Andres Gomez Coronel <[email protected]>

- adds alias usage - adds error handling and logging - major refactoring Signed-off-by: Mariano Kunzi <[email protected]>

Signed-off-by: Andres Gomez Coronel <[email protected]>

- Tests converted into table driven - Changes to use test suite assertions Signed-off-by: Maximiliano Churichi <[email protected]>

- adds pagination to aliases listing - adds optional key prefix to configuration - makes access_key_id and secret_access_key optional - fixes weird case of aliases without a key. - more logging Signed-off-by: Mariano Kunzi <[email protected]>

Signed-off-by: Maximiliano Churichi <[email protected]>

Signed-off-by: Mariano Kunzi <[email protected]>

azdagron · 2021-01-27T17:30:33Z

Hi @kunzimariano ! I apologize for the latency reviewing this PR. We're heads down preparing for the 1.0 release but should be able to get to this in the next week or so.

amartinezfayo

Thank you @kunzimariano! It's great to see this coming!
I have some initial comments.

doc/plugin_server_keymanager_awskms.md

pkg/server/plugin/keymanager/awskms/awskms.go

doc/plugin_server_keymanager_awskms.md

pkg/server/plugin/keymanager/awskms/awskms.go

amartinezfayo · 2021-02-05T19:57:40Z

pkg/server/plugin/keymanager/awskms/awskms.go

+
+	if !hasOldEntry {
+		//create alias
+		_, err = p.kmsClient.CreateAliasWithContext(ctx, &kms.CreateAliasInput{


I'm not sure if it's safe to assume that p.kmsClient will not be nil.

I believe that the call to BuiltIn() should guarantee to not be nil. If that isn't enough, would checking for nil and returning an error be enough?

pkg/server/plugin/keymanager/awskms/awskms.go

Signed-off-by: Mariano Kunzi <[email protected]>

fkautz · 2021-02-18T03:37:09Z

Is it possible to get this in by 1.0? It is critical for getting it into production in infrastructure that require signing certificates to be stored in a KMS.

It is actively blocking my ability to run in production environments. I am concerned that there may be a long delay before a 1.1 release which may throw off some of my timelines. :)

fkautz · 2021-02-18T03:40:24Z

Or if 1.0 is a long way out, I'd be happy with a 0.12 point release. ;)

amartinezfayo · 2021-02-18T21:04:24Z

Or if 1.0 is a long way out, I'd be happy with a 0.12 point release. ;)

Hi @fkautz, after reevaluating the 1.0 release plan we decided to have a 0.12 feature release and this PR is planned to be included in that release.

fkautz · 2021-02-18T23:52:29Z

Fantastic, thank you!

rturner3

Hi @kunzimariano , thanks for sending this! The prospect of KMS support in SPIRE is already getting me excited.

There's quite a few changes here, so I'm still taking it all in. Full disclaimer, I didn't get a chance to review all the test code just yet, but overall I think this is off to a good start.

I left you a few comments about some implementation details. I had a couple primary takeaways from the first review of the PR:

We should see if we can be a little more economical in our call patterns of AWS APIs to avoid adding significant delays to SPIRE Server startup time and to prevent unnecessary API calls.
I really appreciate all the effort you've gone to so far to add robust tests. I noticed the current coverage is about 77%, ideally we should try to get this to be as close to 100% as is feasible.

pkg/server/plugin/keymanager/awskms/awskms.go

pkg/server/plugin/keymanager/awskms/client.go

pkg/server/plugin/keymanager/awskms/awskms.go

pkg/server/plugin/keymanager/awskms/awskms_test.go

Signed-off-by: Mariano Kunzi <[email protected]>

kunzimariano

Thanks for your feedback @rturner3! I think I addressed almost all the comments. I will address the plugin init optimization + test improvements and coverage on the next commit.

pkg/server/plugin/keymanager/awskms/client.go

pkg/server/plugin/keymanager/awskms/awskms.go

pkg/server/plugin/keymanager/awskms/awskms_test.go

go.mod

- Fixed up gRPC service definition import for v0 KeyManager - Removed error prefixing (which is handled universally by the v0 shim) Signed-off-by: Andrew Harding <[email protected]>

Signed-off-by: Mariano Kunzi <[email protected]>

azdagron · 2021-03-29T17:20:03Z

pkg/server/plugin/keymanager/awskms/awskms.go

+			}
+			p.notifyDeleteDone()
+			backoff = min(backoff*2, backoffMax)
+			time.Sleep(backoff)


Using an actual sleep here is going to 1) keep this from responding to context cancellation in a timely manner, 2) make it harder to test...

The backoff as applied also ends up applying the backoff to all deletion operations including those for unrelated keyArn's

evan2645

Thanks for hanging in there @kunzimariano, I know it's been tough getting this change through.

I see it's been reviewed several times already. It is also currently blocking 0.12.2 release which is already overdue. I'd really like to see it merged sooner rather than later so we can get this show on the road. I left many comments, but I think it will be best to address many of them in a follow-on PR. I can personally address the logging-related comments after this is merged. I mostly left these comments for everyone else's benefit, so folks can see the logic behind choosing a "good" log message.

Below are the items that I think are most important to get attention in this PR. If it doesn't appear in the list below, then I don't consider it blocking for this PR or for release.

Agreement on configurable name (see docs comment)
Clarification on support for nested topologies, what are the interactions there (if any)
Clarification on alias + key delete logic
- Threshold
- Alias/description matching and collision resistance

conf/server/server_full.conf

doc/spire_server.md

evan2645 · 2021-03-29T21:27:41Z

doc/plugin_server_keymanager_aws_kms.md

+| access_key_id       | string | see [AWS KMS Access](#aws-kms-access) | The Access Key Id used to authenticate to KMS           | Value of AWS_ACCESS_KEY_ID environment variable      |
+| secret_access_key   | string | see [AWS KMS Access](#aws-kms-access) | The Secret Access Key used to authenticate to KMS       | Value of AWS_SECRET_ACCESS_KEY environment variable  |
+| region              | string | yes                                   | The region where the keys will be stored                |                                                      |
+| server_id_file_path | string | yes                                   | A file path location where the server id will be persisted |                                                      |   


I think we need a section in the docs describing what this is and what it's used for

evan2645 · 2021-03-29T21:37:32Z

doc/plugin_server_keymanager_aws_kms.md

+| access_key_id       | string | see [AWS KMS Access](#aws-kms-access) | The Access Key Id used to authenticate to KMS           | Value of AWS_ACCESS_KEY_ID environment variable      |
+| secret_access_key   | string | see [AWS KMS Access](#aws-kms-access) | The Secret Access Key used to authenticate to KMS       | Value of AWS_SECRET_ACCESS_KEY environment variable  |
+| region              | string | yes                                   | The region where the keys will be stored                |                                                      |
+| server_id_file_path | string | yes                                   | A file path location where the server id will be persisted |                                                      |   


This suggested change is both technically accurate and less confusing when read in the context of a key manager configuration. It also gives the reader a hint at what it's used for and what might happen if it's lost.

When I read "location where the server id will be persisted", I wondered "what's a server id" , "can I not set it?", "what is it used for?", "why does it have to be a file", etc... I think a change similar to the one below avoids a lot of that. The name of the configurable too could probably benefit from an update like key_metadata_file

Suggested change

| server_id_file_path | string | yes | A file path location where the server id will be persisted | |

| server_id_file_path | string | yes | A file path location where information about generated keys will be persisted | |

key_metadata_file sounds much better to me. I also think we need to describe at a high level what its used for, the implications of what happens when it is lost, and how we recover, clean up our own mess. We should probably include a section in the documentation that elaborates on our purge strategy. These are the kinds of these we end up answering in slack over and over.

evan2645 · 2021-03-29T22:18:34Z

pkg/server/plugin/keymanager/awskms/awskms.go

+			now := p.hooks.clk.Now()
+			diff := now.Sub(*alias.LastUpdatedDate)
+			if diff < aliasThreshold {
+				continue


So if the alias hasn't been updated in the last 48 hours, we delete it and schedule deletion of the referenced key?

💯 but only if the alias belong to the current trust domain with exception of the current server.

evan2645 · 2021-03-30T17:18:24Z

pkg/server/plugin/keymanager/awskms/awskms.go

+			case alias.AliasName == nil || alias.LastUpdatedDate == nil || alias.AliasArn == nil:
+				continue
+				// if alias does not belong to trust domain skip
+			case !strings.HasPrefix(*alias.AliasName, p.aliasPrefixForTrustDomain()):


Does this support the case of a nested deployment (single trust domain) in which multiple independent clusters are using KMS? Servers in "this" cluster will be deleting/managing keys in "that" cluster?

I don't think I understand the "this" and "that" analogy. If all those servers are running under the same trust domain, the same AWS account, and the same AWS region then yes.

evan2645 · 2021-03-30T17:34:21Z

pkg/server/plugin/keymanager/awskms/awskms.go

+			describeResp, err := p.kmsClient.DescribeKey(ctx, &kms.DescribeKeyInput{KeyId: alias.AliasArn})
+			switch {
+			case err != nil:
+				log.Error("Failed to describe key to dispose", reasonTag, err)


I think this error message, and most others in this package, would greatly benefit from more clarity. It is useful to think of what the operator experience would be when encountering such a message, and what information is useful.

In this case, I think the useful information is "we could not remove an old key that we wanted to remove". Along with that, it is also useful to know why, or the exact error encountered. This is the case for most of the errors we log in this function. Compare this to information like "failed to describe", "failed to fetch", "malformed", etc... those are more reasons, and are supplementary.

For example, this log could read "Failed to clean up old KMS keys", reason err.

The log below could similarly be "Failed to clean up old KMS keys", reason "missing data in AWS API describe response"

These messages are straight to the point - if I read "failed to clean up old KMS keys", I know exactly what that means and what impact it might have. I know to go check AWS. Compare to "failed to describe key to dispose" - were we still able to delete it? Why are we disposing of it? Etc.

evan2645 · 2021-03-30T18:06:26Z

pkg/server/plugin/keymanager/awskms/awskms.go

+	for {
+		select {
+		case <-ctx.Done():
+			p.log.Debug("Stopping dispose keys task", reasonTag, ctx.Err())


These kinds of log messages feel a little over the top. I say that because they're useful to developers, but not very useful to an operator. If the context is cancelled, there will be lots of other stuff in the log, and what the dispose keys task is doing doesn't feel very important to know (even if it's debug log)

evan2645 · 2021-03-30T18:24:01Z

pkg/server/plugin/keymanager/awskms/awskms_test.go

+					AliasLastUpdatedDate: &unixEpoch,
+				},
+				{
+					AliasName:            aws.String("alias/another_td/aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee/id_04"),


What exactly are we matching on when we dispose of keys? Disposing of a KMS key is of course a very sensitive and error-prone operation, we want to be extra sure that we're only deleting the ones we created and manage. I think we should add some test cases here, and perhaps some more verification in the plugin code.

I may have misread, but it seems like we're only looking for trust domain name in the alias... and maybe a static alias/ prefix. Trust domain name is often something with meaning beyond just SPIFFE/SPIRE ... for example, the trust domain name may be the name of the company e.g. acme.co. In that case, it's easy to imagine that we're not too far off from a name collision and possibly deleting keys that aren't ours

That's a great observation. Aliases now look like this: alias/SPIRE_SERVER/{TRUST_DOMAIN}/{SERVER_ID}/{KEY_ID}. The prefix 'SPIRE_SERVER' was added with the intention of making collisions less likely.

evan2645 · 2021-03-30T21:05:50Z

Quick clarification:

My comment about alias + key deletion, and threshold, is looking for clarification on exactly what conditions must be met before we decide that an alias + key should be deleted. I see some threshold logic in there. Code comment with this info might be a good idea since the logic is spread across multiple functions

azdagron · 2021-03-31T00:10:02Z

I've just merged #2180. Consequently, you'll need to touch up the import for the keymanager definitions.

Old import:
keymanagerv0 "github.com/spiffe/spire/proto/spire/server/keymanager/v0"
New import:
keymanagerv0 "github.com/spiffe/spire/proto/spire/plugin/server/keymanager/v0"

If you'd rather me touch it up and push a commit, I'm happy to oblige. Just let me know.

Signed-off-by: Mariano Kunzi <[email protected]>

- Added documentation about the usage of aliases and the key metadata file - Changed the alias threshold from 48 hours to two weeks - Improved some logging of errors Signed-off-by: Agustín Martínez Fayó <[email protected]>

amartinezfayo · 2021-04-02T12:08:59Z

@kunzimariano During the last maintainers call we discussed this PR and realized that the alias threshold of 48 hours would be too aggressive and a time around two weeks would be more appropriate. I pushed a commit to address that and a few other things:

Changed the server_id_file_path setting to be key_metadata_file
Added documentation about the usage of aliases and the key metadata file
Changed the alias threshold from 48 hours to two weeks
Improved some logging of errors

Signed-off-by: Andrew Harding <[email protected]>

kunzimariano · 2021-04-02T14:10:59Z

Thanks @amartinezfayo! All those changes seem reasonable.

Signed-off-by: Andrew Harding <[email protected]>

azdagron · 2021-04-02T14:20:59Z

I just pushed another commit this morning that:

does a little wordsmithing on the docs
removes some stuff from the docs that were out of scope for the plugin (server key type selection and configuration)
reverted testkey changes and instead embedded the testkey.Keys in the fake so keys can be reused across test cases
merged in master
cleaned up imports

Signed-off-by: Andrew Harding <[email protected]>

evan2645

big thanks @kunzimariano and everyone else for this contribution!

pkg/server/plugin/keymanager/awskms/fetcher.go

Signed-off-by: Andrew Harding <[email protected]>

Changes were addressed in previous commits

Signed-off-by: Mariano Kunzi <[email protected]>

kunzimariano and others added 8 commits January 20, 2021 14:40

adds initial implementation for aws kms keymanager plugin

c14aae6

Signed-off-by: Mariano Kunzi <[email protected]>

Added kmsClientFake and initial test case.

18d1eeb

Signed-off-by: Andres Gomez Coronel <[email protected]>

awskms improvements:

81220b3

- adds alias usage - adds error handling and logging - major refactoring Signed-off-by: Mariano Kunzi <[email protected]>

adds unit tests for awskmw

7c4dd57

Signed-off-by: Andres Gomez Coronel <[email protected]>

- Fixes unit tests for awskms

782093b

- Tests converted into table driven - Changes to use test suite assertions Signed-off-by: Maximiliano Churichi <[email protected]>

- makes awskms plugin builtin

d045ef3

- adds pagination to aliases listing - adds optional key prefix to configuration - makes access_key_id and secret_access_key optional - fixes weird case of aliases without a key. - more logging Signed-off-by: Mariano Kunzi <[email protected]>

Adds AWS KMS KeyManager documentation

2fd9ca2

Signed-off-by: Maximiliano Churichi <[email protected]>

renames and updates plugin_server_keymanager_awskms.md

d938dfe

Signed-off-by: Mariano Kunzi <[email protected]>

kunzimariano requested review from amartinezfayo, APTy, azdagron, evan2645 and rturner3 as code owners January 22, 2021 14:12

kunzimariano changed the title ~~Awskms~~ AWS KMS Server Keymanager Jan 22, 2021

azdagron assigned azdagron and evan2645 Feb 2, 2021

amartinezfayo reviewed Feb 5, 2021

View reviewed changes

address comments

3f63a1f

Signed-off-by: Mariano Kunzi <[email protected]>

rturner3 self-assigned this Feb 18, 2021

evan2645 removed their assignment Feb 18, 2021

rturner3 previously requested changes Feb 20, 2021

View reviewed changes

kunzimariano added 2 commits February 22, 2021 21:42

updates aws-sdk-go to v2

09b28de

Signed-off-by: Mariano Kunzi <[email protected]>

address comments

e319826

Signed-off-by: Mariano Kunzi <[email protected]>

kunzimariano commented Feb 23, 2021

View reviewed changes

MarcosDY reviewed Feb 24, 2021

View reviewed changes

go.mod Show resolved Hide resolved

Apply KeyManager Facade related changes to AWS KMS KeyManager

adc723c

- Fixed up gRPC service definition import for v0 KeyManager - Removed error prefixing (which is handled universally by the v0 shim) Signed-off-by: Andrew Harding <[email protected]>

azdagron added this to the 0.12.2 milestone Mar 25, 2021

address comments

c586d94

Signed-off-by: Mariano Kunzi <[email protected]>

azdagron reviewed Mar 29, 2021

View reviewed changes

evan2645 reviewed Mar 30, 2021

View reviewed changes

kunzimariano and others added 4 commits March 31, 2021 21:26

tests improvements for tasks

089a6f6

Signed-off-by: Mariano Kunzi <[email protected]>

adds SPIRE_SERVER/ to alias prefix

c54858e

Signed-off-by: Mariano Kunzi <[email protected]>

adds description to tasks

8cea751

Signed-off-by: Mariano Kunzi <[email protected]>

Wordsmithing and testkey usage improvement

de0d98e

Signed-off-by: Andrew Harding <[email protected]>

azdagron added 2 commits April 2, 2021 08:11

Merge remote-tracking branch 'origin/master' into kunzimariano-awskms

49f5a97

Signed-off-by: Andrew Harding <[email protected]>

fix imports and tidy

aef3294

Signed-off-by: Andrew Harding <[email protected]>

fix linting

0af8f7e

Signed-off-by: Andrew Harding <[email protected]>

evan2645 previously approved these changes Apr 2, 2021

View reviewed changes

rturner3 reviewed Apr 2, 2021

View reviewed changes

pkg/server/plugin/keymanager/awskms/fetcher.go Outdated Show resolved Hide resolved

pkg/server/plugin/keymanager/awskms/fetcher.go Outdated Show resolved Hide resolved

pkg/server/plugin/keymanager/awskms/fetcher.go Outdated Show resolved Hide resolved

rturner3 reviewed Apr 3, 2021

View reviewed changes

pkg/server/plugin/keymanager/awskms/fetcher.go Show resolved Hide resolved

Only do deep validation on aliases we manage

d04eaec

Signed-off-by: Andrew Harding <[email protected]>

azdagron dismissed evan2645’s stale review via d04eaec April 5, 2021 21:08

evan2645 approved these changes Apr 5, 2021

View reviewed changes

Merge branch 'master' into awskms

b422c13

azdagron merged commit f6c36f9 into spiffe:master Apr 5, 2021

azdagron mentioned this pull request Apr 5, 2021

Update v0.12.2 release branch #2194

Merged

azdagron pushed a commit to azdagron/spire that referenced this pull request Apr 6, 2021

AWS KMS Server Keymanager (spiffe#2066)

0deb0f5

Signed-off-by: Mariano Kunzi <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AWS KMS Server Keymanager #2066

AWS KMS Server Keymanager #2066

kunzimariano commented Jan 22, 2021 •

edited

Loading

azdagron commented Jan 27, 2021

amartinezfayo left a comment

amartinezfayo Feb 5, 2021

kunzimariano Feb 10, 2021

fkautz commented Feb 18, 2021

fkautz commented Feb 18, 2021

amartinezfayo commented Feb 18, 2021

fkautz commented Feb 18, 2021

rturner3 left a comment

kunzimariano left a comment

azdagron Mar 29, 2021

azdagron Mar 29, 2021

evan2645 left a comment

evan2645 Mar 29, 2021

evan2645 Mar 29, 2021

azdagron Apr 1, 2021

evan2645 Mar 29, 2021

kunzimariano Apr 1, 2021

evan2645 Mar 30, 2021

kunzimariano Apr 1, 2021

evan2645 Mar 30, 2021

evan2645 Mar 30, 2021

evan2645 Mar 30, 2021

kunzimariano Apr 1, 2021

evan2645 commented Mar 30, 2021

azdagron commented Mar 31, 2021

amartinezfayo commented Apr 2, 2021

kunzimariano commented Apr 2, 2021

azdagron commented Apr 2, 2021

evan2645 left a comment

	\| server_id_file_path \| string \| yes \| A file path location where the server id will be persisted \| \|
	\| server_id_file_path \| string \| yes \| A file path location where information about generated keys will be persisted \| \|

AWS KMS Server Keymanager #2066

AWS KMS Server Keymanager #2066

Conversation

kunzimariano commented Jan 22, 2021 • edited Loading

azdagron commented Jan 27, 2021

amartinezfayo left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fkautz commented Feb 18, 2021

fkautz commented Feb 18, 2021

amartinezfayo commented Feb 18, 2021

fkautz commented Feb 18, 2021

rturner3 left a comment

Choose a reason for hiding this comment

kunzimariano left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

evan2645 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

evan2645 commented Mar 30, 2021

azdagron commented Mar 31, 2021

amartinezfayo commented Apr 2, 2021

kunzimariano commented Apr 2, 2021

azdagron commented Apr 2, 2021

evan2645 left a comment

Choose a reason for hiding this comment

kunzimariano commented Jan 22, 2021 •

edited

Loading