
1.6.0 to 1.6.1 upgrade breaking changes: AWS::KMS::Key used for secret encryption can not be replaced #644

Closed
bnaydenov opened this issue Apr 6, 2023 · 17 comments
Labels: bug

@bnaydenov (Contributor) commented Apr 6, 2023

Describe the bug

Having an existing EKS cluster provisioned with cdk blueprints 1.6.0 & aws cdk 2.66.1, using .useDefaultSecretEncryption(true) (which is the default if not set), cdk deploy fails after upgrading cdk blueprints to 1.6.1 & aws cdk 2.72.0 with the following error:

 UPDATE_FAILED        | Custom::AWSCDK-EKS-Cluster            |  awskeyresourceID
Received response status [FAILED] from custom resource. Message returned: Cannot update cluster encryption configuration

Logs: /aws/lambda/eks-cluster-name-awscdk-OnEventHandler42BEBAE0-tg5FN0dg7Szg

at ClusterResourceHandler.onUpdate (/var/task/cluster.js:1:2271)
at ClusterResourceHandler.onEvent (/var/task/common.js:1:680)
at Runtime.onEvent [as handler] (/var/task/index.js:1:1434)
at Runtime.handleOnceNonStreaming (/var/runtime/Runtime.js:74:25) (RequestId: 1fbd2f3d-2d02-405b-89f5-2eacac7bfffb)

The CloudWatch log /aws/lambda/eks-cluster-name-awscdk-OnEventHandler42BEBAE0-tg5FN0dg7Szg contains this error:

{
    "errorType": "Error",
    "errorMessage": "Cannot update cluster encryption configuration",
    "stack": [
        "Error: Cannot update cluster encryption configuration",
        "    at ClusterResourceHandler.onUpdate (/var/task/cluster.js:1:2271)",
        "    at ClusterResourceHandler.onEvent (/var/task/common.js:1:680)",
        "    at Runtime.onEvent [as handler] (/var/task/index.js:1:1434)",
        "    at Runtime.handleOnceNonStreaming (/var/runtime/Runtime.js:74:25)"
    ]
}

This is the output of cdk diff, which shows that the default AWS KMS key needs to be destroyed and recreated.

(screenshot: cdk diff output)

See this issue, which describes that the EKS cluster encryption config cannot be changed:
aws-cloudformation/cloudformation-coverage-roadmap#1234

Expected Behavior

Upgrading cdk blueprints from 1.6.0 & aws cdk 2.66.1 to 1.6.1 & aws cdk 2.72.0 should work without problems for an existing EKS cluster created with the default setting of .useDefaultSecretEncryption(true).

Current Behavior

Running cdk deploy fails with the same Cannot update cluster encryption configuration error and CloudWatch log shown above.

Reproduction Steps

Provision an EKS cluster using cdk blueprints 1.6.0 and aws cdk 2.66.1, then upgrade cdk blueprints to 1.6.1 and aws cdk to 2.72.0 and run cdk deploy. This reproduces the error.
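For illustration, a minimal sketch of such a blueprint (a sketch only: the stack id is a placeholder, and .useDefaultSecretEncryption(true) is the default when omitted):

import * as cdk from 'aws-cdk-lib';
import * as blueprints from '@aws-quickstart/eks-blueprints';

const app = new cdk.App();

blueprints.EksBlueprint.builder()
    .account(process.env.CDK_DEFAULT_ACCOUNT)
    .region(process.env.CDK_DEFAULT_REGION)
    // Default secrets encryption with a blueprint-managed KMS key (also the default if this call is omitted).
    .useDefaultSecretEncryption(true)
    .build(app, 'eks-blueprint-secret-encryption-repro');

Deploy this with blueprints 1.6.0 / cdk 2.66.1, then bump the dependencies to 1.6.1 / 2.72.0 and run cdk deploy again to hit the error.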

Possible Solution

No response

Additional Information/Context

No response

CDK CLI Version

2.72.0 (build 397e46c)

EKS Blueprints Version

1.6.1

Node.js Version

v16.17.0

Environment details (OS name and version, etc.)

Mac OS 12.6 Monterey - M1 Max

Other information

No response

@bnaydenov added the bug label on Apr 6, 2023
@bnaydenov (Contributor, Author)

@shapirov103 ping

@shapirov103 (Collaborator)

@bnaydenov thanks for bringing this to our attention. I was unfortunately out for some time; I will look into it and address it shortly.

@shapirov103 (Collaborator)

@aliaksei-ivanou - looks like the issue is caused by the refactoring of the kms key provider. Can you look into it for a workaround that enables continuity for clusters provisioned before 1.6.1?

@aliaksei-ivanou (Contributor)

Sorry for the issue. The associated code changes were meant to address the KMS key id confusion issue - you couldn't create more than one key in the same pattern because the key id was hardcoded. We also separated the create and lookup KMS key capabilities into two separate providers.

To solve the issue, in this PR I added an optional KMS keyId parameter that allows using an existing key.
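For illustration, a minimal sketch of the two separated providers (the constructor shapes shown here are taken from later comments in this thread, not from the PR itself, so treat them as assumptions):

import * as blueprints from '@aws-quickstart/eks-blueprints';

// Create a new secrets-encryption key (the default behaviour):
const createProvider = new blueprints.CreateKmsKeyProvider();

// ...or look up an existing key by alias (the alias name is a placeholder):
const lookupProvider = new blueprints.LookupKmsKeyProvider('alias/my-eks-secrets-key');

// Either provider is registered under the same global resource name:
blueprints.EksBlueprint.builder()
    .resourceProvider(blueprints.GlobalResources.KmsKey, createProvider);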

@Feder1co5oave (Contributor) commented Apr 11, 2023

I've had this same issue after upgrading. This is part of a broader issue with changing how logical ids are generated.

Any change to a construct's id across (blueprint) versions will have very serious consequences on existing deployments.

In this case, changing the key's id will render the existing (physical) KMS key orphaned (it will not be physically deleted, because the retention policy is RETAIN by default, but the CF stack won't have any knowledge of it anymore), a new key will be created, and associated with the cluster's secrets encryption. That last step will fail, because the encryption key of a cluster cannot be changed after creation.
You can get out of this situation with a workaround: you can provide the existing KMS key to the blueprint by passing a ResourceProvider that will look it up by ARN. For example:

import * as kms from 'aws-cdk-lib/aws-kms';
import { EksBlueprint, GlobalResources, ResourceContext, ResourceProvider } from '@aws-quickstart/eks-blueprints';

EksBlueprint.builder()
    .account(process.env.CDK_DEFAULT_ACCOUNT)
    .region(process.env.CDK_DEFAULT_REGION)
    .resourceProvider(GlobalResources.KmsKey, new class implements ResourceProvider<kms.IKey> {
        // Look up the pre-existing key by ARN instead of creating a new one.
        provide(context: ResourceContext): kms.IKey {
            return kms.Key.fromKeyArn(context.scope, 'eks-secrets-existing-encryption-key', EXISTING_KEY_ARN);
        }
    })

You can find the existing key in the KMS console, with the description Secrets Encryption Key for EKS Cluster '${CLUSTER_NAME}'.
You could also associate an alias with the key and use a new LookupKmsKeyProvider(EXISTING_KEY_ALIAS) instead; the code above works with just the ARN.
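A minimal sketch of that alias-based variant, assuming the registration pattern from the ARN example above (the alias name below is a placeholder):

import { EksBlueprint, GlobalResources, LookupKmsKeyProvider } from '@aws-quickstart/eks-blueprints';

EksBlueprint.builder()
    .account(process.env.CDK_DEFAULT_ACCOUNT)
    .region(process.env.CDK_DEFAULT_REGION)
    // 'alias/eks-secrets-existing-key' stands in for the alias you associated with the existing key.
    .resourceProvider(GlobalResources.KmsKey, new LookupKmsKeyProvider('alias/eks-secrets-existing-key'))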

However, once you run cdk deploy after upgrading to 1.6.1, the existing key will be orphaned, and if you delete it by accident you can lose access to all your cluster's secrets.
Also, since it is not part of any CF stack, it will not be deleted after you delete that stack.
To my knowledge, once a resource has been orphaned, it is either impossible or very difficult to re-import it into the stack.

My suggestion would be that future changes that involve changing how ids are generated are either rejected or treated as breaking changes, with an upgrade guide provided for people wanting to upgrade to the new version.
Otherwise, users will either:

  1. stay stuck at the current version, making it very difficult to keep the cluster updated, or
  2. bring their existing stacks into a FAILED state, possibly irreversibly.

Is there any way to test for this kind of breaking change in CI? For example, deploy a blueprint with an existing version, upgrade to the commit to be tested, and run cdk deploy to check for errors? /cc @shapirov103

TBH, it is not clear to me how the workaround suggested by @aliaksei-ivanou should be used.

@bnaydenov (Contributor, Author)

@Feder1co5oave I am not suggesting a workaround; you probably mean the workaround and PR from @aliaksei-ivanou.

@aliaksei-ivanou (Contributor)

(quoting @Feder1co5oave's comment above) TBH, it is not clear to me how the workaround suggested by @aliaksei-ivanou should be used.

That's right, you can use LookupKmsKeyProvider(EXISTING_KEY_ALIAS) to use the existing KMS key, or you could keep using the CreateKmsKeyProvider as discussed here.

@shapirov103 (Collaborator)

@Feder1co5oave I will add an item for backwards-compatibility tests to be included in the CI, at least as part of the release. It most likely won't happen until July. Such tests are a bit more expensive to run.

@aliaksei-ivanou (Contributor)

I believe this should fix it: 9fae44b

@Feder1co5oave (Contributor)

FYI, I had upgraded to 1.6.1 and deployed the stack on a non-production environment, making my cluster's KMS key an orphan.

After great effort, I was able to re-import my KMS key into the stack with the previous logical id, by using the experimental cdk import, and then roll back to 1.6.0.

My suggestion to all: don't do this. I had to jerry-rig a couple of code fixes in both eks-blueprints and the CDK CLI itself. Either roll back your stack right after the upgrade to 1.6.1 and recover your key (I'm not even sure that works?) or wait until the workaround is released.

@praveenperera (Contributor) commented Apr 25, 2023

This led me to a failed rollback; how did you proceed from that?

edit: I was able to continue the rollback while ignoring the cluster, then ran deploy again on 1.7.0; looks like it's working.

@shapirov103 (Collaborator)

Thanks, @praveenperera - I will keep the issue open for a bit to collect more feedback. Happy to apply additional changes if needed.

@yoshi23 commented Apr 27, 2023

I tried using both the LookupKmsKeyProvider and a new ResourceProvider, but both insert the existing key's ARN instead of a Fn::GetAtt function with the alias of the key. This is considered a change to the cluster encryption configuration and thus fails for clusters deployed before 1.6.1. I was trying this with [email protected] and [email protected]. The key-lookup code and a screenshot of a local cdk diff result are below.

Am I using them wrong? Thanks for the help.

Code lines:

const stack = blueprints.EksBlueprint.builder()

then

stack.resourceProvider(GlobalResources.KmsKey, new LookupKmsKeyProvider("alias/eks-kms-key-manual-alias"))

Or:

stack.resourceProvider(GlobalResources.KmsKey, new class implements ResourceProvider<kms.IKey> {
    provide(context: blueprints.ResourceContext): kms.IKey {
        return kms.Key.fromKeyArn(context.scope, 'eks-secrets-existing-encryption-key', eksKmsKeyArn)
    }
})

(screenshot: cdk diff result)

@aliaksei-ivanou (Contributor)

Hi @yoshi23!

Please try upgrading to 1.7.0. The backward compatibility for the KMS resource provider was added in 9fae44b.

@shapirov103 (Collaborator)

@yoshi23 the approach we used was to make the identifier of the created KMS key identical between the versions. So if you are migrating from 1.6.0 to 1.7.0 you don't need to register anything explicit as a resource provider; it should just work.

If you want to be explicit about it, it is equivalent to using new CreateKmsKeyProvider() with no args passed to the constructor, which is identical to what we did in 1.6.0.

Changing to lookup providers will result in failure, since the CFN stack expects the key to be part of its managed state (i.e. created by the stack).
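For reference, the explicit form would look something like this (a sketch based on the statement above; when migrating from 1.6.0 to 1.7.0 you can simply omit it):

import * as blueprints from '@aws-quickstart/eks-blueprints';

blueprints.EksBlueprint.builder()
    // Equivalent to the implicit default: the key is created by the stack, with the same identifier as in 1.6.0.
    .resourceProvider(blueprints.GlobalResources.KmsKey, new blueprints.CreateKmsKeyProvider());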

@yoshi23 commented Apr 28, 2023

Thank you for the quick reply @aliaksei-ivanou and @shapirov103! I confirm that I'm able to deploy clusters without specifying the key with [email protected] and [email protected].

The issue for me was that I had multiple clusters, some deployed before and some after 1.6.1. The error came up with one of them as I tried to run a CDK deploy from CI/CD and it failed with the encryption change error.

Finding this issue and fix here, I assumed that the solution would be to force-provide the existing KMS key everywhere, but that led to a lot of inconsistencies between the clusters, which I couldn't really untangle anymore. (I ended up deleting the clusters and re-deploying with 1.7.0; they were pre-production, so I found that easier.)

Thank you for your work on the Blueprint framework, it is a great tool!

@shapirov103 (Collaborator)

Assuming resolved, we can re-open if needed.
