Modify AWSMachine reconciliation behavior to terminate and create instances without blocking #4092
Conversation
Hi @cnmcavoy. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed: "…2 termination or ec2 creation and wait for a future reconcile if necessary" (compare 30babb9 to b2963ff)
/ok-to-test
/test ?
@richardcase: The following commands are available to trigger required jobs:
The following commands are available to trigger optional jobs:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/test pull-cluster-api-provider-aws-e2e-blocking
@cnmcavoy AFAIR, the blocking call was used to ensure that instance termination completes successfully; since the operation takes some time to finish, the call blocks. What would happen if instance termination doesn't finish at that point and the instance later ends up in a failed state (for some reason), but we have already set the conditions as deleted? The instance would be left behind and might cause the delete operation to fail.
Thinking out loud here about the downside of removing the blocking calls.
Thanks for the additional context. If the intent is to ensure that the EC2 instance associated with the AWSMachine reaches a terminated state (or is absent entirely), we don't need to block: one option is to requeue the AWSMachine for a later reconciliation and repeat until the cloud-provider state matches. Blocking still doesn't seem necessary to satisfy that requirement. I can update this PR to cover that edge case and requeue the AWSMachine.
…their termination completes
This sounds good to me, I was trying to imply the same 😄 I will take a look at the PR update. Thanks @cnmcavoy
/test pull-cluster-api-provider-aws-e2e-blocking
/test pull-cluster-api-provider-aws-e2e
@cnmcavoy: The following test failed, say `/retest` to rerun all failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/lgtm
We can ignore that error, as it fails intermittently.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: Ankitasw. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
What type of PR is this?
/kind bug
What this PR does / why we need it:
The existing AWSMachine reconciliation loop blocks on terminating an EC2 instance, waiting for AWS to confirm the termination, and likewise blocks on EC2 instance creation for up to one minute, for no discernible reason I could detect. This PR removes the blocking behavior, proceeding with reconciliation immediately after the AWS SDK confirms that the API call succeeded.
As a result, this change greatly speeds up large machine deployment rollouts. Previously, very large Machine Deployments (> 100 machines) would take several hours for the machine sets to provision new nodes and tear down old machines. This was mostly a result of the reconciliation blocking on each instance creation and deletion; while blocked, a worker prevented additional concurrent work from proceeding, since there is a fixed number of concurrent AWSMachine reconcile workers (controlled by `--awsmachine-concurrency`).

We tested this change in one of our clusters, timing a rollout of 600 nodes split into 3 node groups (really, one node group duplicated across 3 AZs in us-east-2). We tested with t3.small instances, which have just enough CPU to run kubelet and the daemonset pods we deploy.
We captured metrics from both the CAPI and CAPA controllers, as well as metrics about the state of the CAPI resources in the Kubernetes cluster. The most meaningful thing to note is the time scale: the rollout took ~2 hours to complete on the mainline version and under 45 minutes on this branch. The CAPA unfinished-workqueue metric is also significantly different before and after this change.
These are the metrics without the change (running capi v1.3.2, capa v2.0.2 official release): [screenshot]
These are the metrics from the same cluster with this change (running capi v1.3.2, this PR branch): [screenshot]
Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged):
Fixes #
Special notes for your reviewer:
Split into two commits for review: the first commit contains the modified behavior; the second commit updates the e2e test sites for the new reconciliation behavior.
Checklist:
Release note: