fix(eks): version update completes prematurely #7526

eladb · 2020-04-22T20:58:24Z

Commit Message

fix(eks): version update completes prematurely (#7526)

The UpdateClusterVersion operation takes a while to begin and until then, the cluster's status is still ACTIVE instead of UPDATING as expected. This causes the isComplete handler, which is called immediately, to think that the operation is complete when it hasn't even begun.

Modify how IsComplete is implemented for cluster version (and config) updates. Extract the update ID and use DescribeUpdate to monitor the status of the update. This also allows us to fix a latent bug and fail the update in case the version update failed.

The update ID is returned from OnEvent via a custom field called EksUpdateId and passed on to the subsequent IsComplete invocation. This was already supported by the custom resource provider framework but not documented or officially tested, so we've added that here as well (docs + test).

TESTING: Added unit tests to verify the new type of update waiter and performed a manual upgrade test while examining the logs.

Fixes #7457

End Commit Message
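
A minimal sketch of the flow the commit message describes, assuming simplified response shapes: `onEvent` hands the update ID to `isComplete` through the `EksUpdateId` field, and `isComplete` decides based on the status reported by `DescribeUpdate`. The type and function names (`EksUpdate`, `onUpdateVersionStarted`, `isUpdateComplete`) are hypothetical and are not the names used in the CDK codebase; the status values are those of the EKS `DescribeUpdate` API.

```typescript
// Hypothetical, simplified shape of the `update` object returned by the EKS
// UpdateClusterVersion and DescribeUpdate calls.
type UpdateStatus = 'InProgress' | 'Failed' | 'Cancelled' | 'Successful';

interface EksUpdate {
  id: string;
  status: UpdateStatus;
  errors?: { errorCode: string; errorMessage: string }[];
}

// What onEvent could return right after UpdateClusterVersion: the custom
// EksUpdateId field is passed on to the subsequent isComplete invocation.
function onUpdateVersionStarted(update: EksUpdate): { EksUpdateId: string } {
  return { EksUpdateId: update.id };
}

// What isComplete could do with the DescribeUpdate response for that ID.
// A failed or cancelled update throws, so the resource update fails as well.
function isUpdateComplete(update: EksUpdate): { IsComplete: boolean } {
  switch (update.status) {
    case 'InProgress':
      return { IsComplete: false };
    case 'Successful':
      return { IsComplete: true };
    default: {
      const reasons = (update.errors || [])
        .map(e => e.errorCode + ': ' + e.errorMessage)
        .join(', ');
      throw new Error(
        'cluster update ' + update.id + ' ended with status ' + update.status +
        (reasons ? ' (' + reasons + ')' : ''));
    }
  }
}
```

Because the decision is keyed on the update ID rather than on the cluster status, it does not matter how long the cluster stays `ACTIVE` before the update begins.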

Manual test

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

The `UpdateClusterVersion` operation takes a while to begin and until then, the cluster's status is still `ACTIVE` instead of `UPDATING` as expected. This causes the `isComplete` handler, which is called immediately, to think that the operation is complete when it hasn't even begun.

Add logic to the cluster version update `onEvent` method to wait up to 5 minutes until the cluster status is no longer `ACTIVE`, so that the subsequent `isComplete` query will be based on the version update operation itself. Extended the timeout of `onEvent` to 15m to ensure it does not interrupt the operation.

TESTING: Updated unit tests to verify this retry behavior and performed a manual upgrade test while examining the logs.

Fixes #7457

aws-cdk-automation · 2020-04-22T21:10:56Z

AWS CodeBuild CI Report

CodeBuild project: AutoBuildProject6AEA49D1-qxepHUsryhcu
Commit ID: 97fe9d1
Result: FAILED
Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

aws-cdk-automation · 2020-04-23T06:42:28Z

AWS CodeBuild CI Report

CodeBuild project: AutoBuildProject6AEA49D1-qxepHUsryhcu
Commit ID: 258fbe7
Result: FAILED
Build Logs (available for 30 days)

aws-cdk-automation · 2020-04-23T06:47:35Z

AWS CodeBuild CI Report

CodeBuild project: AutoBuildProject6AEA49D1-qxepHUsryhcu
Commit ID: 1ddfb78
Result: FAILED
Build Logs (available for 30 days)

rix0rrr

sleep(5) does not seem like the best way to fix a race condition.

According to the API documentation, you are supposed to use a token from the return value of the UpdateClusterVersion function and poll DescribeUpdate with that.

If that solution doesn't work for some reason, your rationale for this change should describe why that is.

eladb · 2020-04-23T08:53:50Z

> sleep(5) does not seem like the best way to fix a race condition.

The sleep is not the fix for the race condition. It's basically a short backoff before querying the cluster's status again.

eladb · 2020-04-23T08:55:53Z

> According to the API documentation, you are supposed to use a token from the return value of the UpdateClusterVersion function and poll DescribeUpdate with that.

The reason I am looking at the cluster's status, which according to the documentation is expected to be UPDATING during the version update (and it is), is to simplify the isComplete handler. It always waits for the cluster to become ACTIVE.

rix0rrr · 2020-04-23T09:01:05Z

> is to simplify the isComplete handler. It always waits for the cluster to become ACTIVE.

That is the comment I'm looking for (in the codebase): why are we deviating from the expected/recommended pattern to start and wait for a version update to complete?

The reason the sleep-based pattern concerns me is that we could miss a version update that starts and completes between two calls of DescribeCluster, while you're waiting for the transition from ACTIVE -> UPDATING. Then you never see the cluster updating and it will be broken as well. I'd hate to trade one race condition for another, especially if there's a deterministic API available.

Now, granted... this may be unlikely because everything is probably slow as molasses. But what about a cluster without nodes in it? Won't that complete in a jiffy?

I'll ship it if you add the rationale to a comment in the codebase.
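
The race described in this comment can be sketched with a scripted timeline. This is only an illustration under stated assumptions: the cluster status is sampled once per simulated second, the poll interval of 5 seconds is taken from the `sleep(5)` mentioned above, and the timelines and the `observesUpdate` helper are made up for the example.

```typescript
type ClusterStatus = 'ACTIVE' | 'UPDATING';

// Returns true if any poll observes a status other than ACTIVE.
// `timeline[t]` is the (hypothetical) cluster status at second t.
function observesUpdate(timeline: ClusterStatus[], pollEverySec: number): boolean {
  for (let t = 0; t < timeline.length; t += pollEverySec) {
    if (timeline[t] !== 'ACTIVE') {
      return true;
    }
  }
  return false;
}

// Slow update: UPDATING from second 3 through second 19.
const slow: ClusterStatus[] = [];
for (let t = 0; t < 30; t++) {
  slow.push(t >= 3 && t < 20 ? 'UPDATING' : 'ACTIVE');
}

// Fast update: UPDATING only during seconds 6 through 8,
// which falls between the polls at t=5 and t=10.
const fast: ClusterStatus[] = [];
for (let t = 0; t < 30; t++) {
  fast.push(t >= 6 && t < 9 ? 'UPDATING' : 'ACTIVE');
}

const slowSeen = observesUpdate(slow, 5); // true: the poll at t=5 sees UPDATING
const fastSeen = observesUpdate(fast, 5); // false: every poll sees ACTIVE
```

In the second case a waiter that looks only at the cluster status would keep waiting for a transition that has already come and gone, whereas polling `DescribeUpdate` with the update ID reports the final status regardless of timing.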

aws-cdk-automation · 2020-04-23T12:46:00Z

AWS CodeBuild CI Report

CodeBuild project: AutoBuildProject6AEA49D1-qxepHUsryhcu
Commit ID: 4f18dc3
Result: SUCCEEDED
Build Logs (available for 30 days)

aws-cdk-automation · 2020-04-23T12:56:18Z

AWS CodeBuild CI Report

CodeBuild project: AutoBuildProject6AEA49D1-qxepHUsryhcu
Commit ID: 074a7d2
Result: SUCCEEDED
Build Logs (available for 30 days)

aws-cdk-automation · 2020-04-23T14:27:45Z

AWS CodeBuild CI Report

CodeBuild project: AutoBuildProject6AEA49D1-qxepHUsryhcu
Commit ID: 6463344
Result: SUCCEEDED
Build Logs (available for 30 days)

rix0rrr

Thanks for humoring me <3

aws-cdk-automation · 2020-04-23T15:47:50Z

AWS CodeBuild CI Report

CodeBuild project: AutoBuildProject6AEA49D1-qxepHUsryhcu
Commit ID: 2de3eff
Result: SUCCEEDED
Build Logs (available for 30 days)

aws-cdk-automation · 2020-04-23T17:39:46Z

AWS CodeBuild CI Report

CodeBuild project: AutoBuildProject6AEA49D1-qxepHUsryhcu
Commit ID: 1be46b6
Result: SUCCEEDED
Build Logs (available for 30 days)

mergify · 2020-04-23T17:40:42Z

Thank you for contributing! Your pull request will be updated from master and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork).

aws-cdk-automation · 2020-04-23T18:23:21Z

AWS CodeBuild CI Report

CodeBuild project: AutoBuildProject6AEA49D1-qxepHUsryhcu
Commit ID: d6a0d69
Result: SUCCEEDED
Build Logs (available for 30 days)

mergify · 2020-04-23T18:24:08Z

Thank you for contributing! Your pull request will be updated from master and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork).

…ersion Cluster version updates fail with `vendor response doesn't contain <ATTRIBUTE>` errors due to the fact that since #7526 the provider does not respond to `isComplete` with the `Data` field with resource attributes. The fix is that once the update is complete, we simply delegate to `isActive` which queries the cluster and returns the attributes. Fixes #7794
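
A rough sketch of the follow-up fix described here, assuming simplified shapes: once the update has finished, the completion check delegates to an `isActive`-style function that returns the resource attributes in `Data`. The attribute names (`Endpoint`, `Arn`), the `ClusterInfo` type, and `isUpdateCompleteWithData` are hypothetical and only illustrate the delegation.

```typescript
// Hypothetical subset of what a DescribeCluster call reports.
interface ClusterInfo {
  status: string;
  endpoint: string;
  arn: string;
}

interface IsCompleteResponse {
  IsComplete: boolean;
  Data?: { Endpoint: string; Arn: string };
}

// Reports completion together with the resource attributes once the
// cluster is ACTIVE.
function isActive(cluster: ClusterInfo): IsCompleteResponse {
  if (cluster.status !== 'ACTIVE') {
    return { IsComplete: false };
  }
  return { IsComplete: true, Data: { Endpoint: cluster.endpoint, Arn: cluster.arn } };
}

// While the update is in progress, report "not complete"; once it has
// succeeded, delegate to isActive so the response carries `Data`.
function isUpdateCompleteWithData(updateStatus: string, cluster: ClusterInfo): IsCompleteResponse {
  if (updateStatus === 'InProgress') {
    return { IsComplete: false };
  }
  if (updateStatus !== 'Successful') {
    throw new Error('cluster update ended with status ' + updateStatus);
  }
  return isActive(cluster);
}
```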

eladb requested a review from a team April 22, 2020 20:58

eladb self-assigned this Apr 22, 2020

eladb added the pr/do-not-merge This PR should not be merged at this time. label Apr 22, 2020

mergify bot added the contribution/core This is a PR that came from AWS. label Apr 22, 2020

update expectation

258fbe7

eladb removed the pr/do-not-merge This PR should not be merged at this time. label Apr 23, 2020

Merge branch 'master' into benisrae/eks-fix-version-update

1ddfb78

rix0rrr requested changes Apr 23, 2020

Elad Ben-Israel added 2 commits April 23, 2020 15:10

custom-resources: formalize the notion of passing arbitrary fields to…

aeeb384

… isComplete This was already supported, just add some docs and tests to make sure this continues to be supported.

use describeUpdate instead of cluster status

4f18dc3

eladb requested a review from rix0rrr April 23, 2020 12:11

eladb mentioned this pull request Apr 23, 2020

[aws-eks] AWSCDK-EKS-KubernetesResource started to early after AWSCDK-EKS-Cluster upgrade #7457

Closed

update timeout back to 1m

074a7d2

add DescribeUpdate permission

6463344

Merge branch 'master' into benisrae/eks-fix-version-update

2de3eff

rix0rrr approved these changes Apr 23, 2020

Merge branch 'master' into benisrae/eks-fix-version-update

1be46b6

Merge branch 'master' into benisrae/eks-fix-version-update

d6a0d69

mergify bot merged commit 307c8b0 into master Apr 23, 2020

mergify bot deleted the benisrae/eks-fix-version-update branch April 23, 2020 18:24

moatazelmasry2 mentioned this pull request May 5, 2020

[aws-eks] Stack breaks when upgrading an EKS Cluster #7794

Closed

eladb mentioned this pull request May 6, 2020

fix(eks): "vendor response doesn't contain attribute" when updating version #7830

Merged
