-
Notifications
You must be signed in to change notification settings - Fork 4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(eks): version update completes prematurely #7526
Conversation
The `UpdateClusterVersion` operation takes a while to begin and until then, the cluster's status is still `ACTIVE` instead `UPDATING` as expected. This causes the `isComplete` handler, which is called immediately, to think that the operation is complete, when it hasn't even began. Add logic to the cluster version update `onEvent` method to wait up to 5 minutes until the cluster status is no longer `ACTIVE`, so that the subsequent `isComplete` query will be based on the version update operation itself. Extended the timeout of `onEvent` to 15m to ensure it does not interrupt the operation. TESTING: Updated unit tests to verify this retry behavior and performed a manual upgrade tests while examining the logs. Fixes #7457
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sleep(5)
does not seem like the best way to fix a race condition.
According to the API documentation, you are supposed to use a token from the return value of the UpdateClusterVersion
function and poll DescribeUpdate
with that.
If that solution doesn't work for some reason, your rationale for this change should describe why that is.
The sleep is not the fix for the race condition. It's basically a short backoff before querying the cluster's status again. |
The reason I am looking at the cluster's status, which, according to the documentation is expected to be in |
That is the comment I'm looking for (in the codebase): why are we deviating from the expected/recommended pattern to start and wait for a version update to complete? The reason the sleep-based pattern concerns me is because we could be missing a version update that starts and completes between two calls of Now, granted... this may be unlikely because everything is probably slow as molasses. But what about a cluster without nodes in it? Won't that complete in a jiffy? I'll ship it if you add the rationale to a comment in the codebase. |
… isComplete This was already supported, just add some docs and tests to make sure this continues to be supported.
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for humoring me <3
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
Thank you for contributing! Your pull request will be updated from master and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork). |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
Thank you for contributing! Your pull request will be updated from master and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork). |
…ersion Cluster version updates fail with `vendor response doesn't contain <ATTRIBUTE>` errors due to the fact that since #7526 the provider does not respond to `isComplete` with the `Data` field with resource attributes. The fix is that once the update is complete, we simply delegate to `isActive` which queries the cluster and returns the attributes. Fixes #7794
…ersion (#7830) Cluster version updates fail with `vendor response doesn't contain <ATTRIBUTE>` errors due to the fact that since #7526 the provider does not respond to `isComplete` with the `Data` field with resource attributes. The fix is that once the update is complete, we simply delegate to `isActive` which queries the cluster and returns the attributes. Fixes #7794
…ersion (aws#7830) Cluster version updates fail with `vendor response doesn't contain <ATTRIBUTE>` errors due to the fact that since aws#7526 the provider does not respond to `isComplete` with the `Data` field with resource attributes. The fix is that once the update is complete, we simply delegate to `isActive` which queries the cluster and returns the attributes. Fixes aws#7794
Commit Message
fix(eks): version update completes prematurely (#7526)
The
UpdateClusterVersion
operation takes a while to begin and until then, the cluster's status is stillACTIVE
insteadUPDATING
as expected. This causes theisComplete
handler, which is called immediately, to think that the operation is complete, when it hasn't even began.Modify how
IsComplete
is implemented for cluster version (and config) updates. Extract the update ID and useDescribeUpdate
to monitor the status of the update. This also allows us to fix a latent bug and fail the update in case the version update failed.The update ID is returned from
OnEvent
via a custom fields calledEksUpdateId
and passed on to the subsequentIsComplete
invocation. This was already supported by the custom resource provider framework but not documented or officially tested, so we've added that here as well (docs + test).TESTING: Added unit tests to verify the new type of update waiter and performed a manual upgrade tests while examining the logs.
Fixes #7457
End Commit Message
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license