Modify AWSMachine reconciliation behavior to terminate and create instances without blocking #4092

cnmcavoy · 2023-02-21T22:37:19Z

What type of PR is this?
/kind bug

What this PR does / why we need it:
The existing AWSMachine reconciliation loop blocks on terminating an ec2 instance and waits for it to confirm the termination with AWS, as well as blocking on ec2 instance creation and waits up to one minute, for no discernible reason I could detect. This PR removes the blocking behavior, proceeding with reconciliation immediately after the AWS sdk confirms success on the API call.

As a result, this change greatly speeds up large machine deployment rollouts. Previously, very large Machine Deployments (> 100 machines) would take several hours for the machine sets to provision new nodes and tear down machines. This was mostly a result of the reconciliation blocking for each instance creation and delete, and while blocked, prevented additional concurrent work from proceeding, as there is a fixed number of concurrent AWSMachine reconcile workers allowed (controlled by --awsmachine-concurrency).

We tested this change in one of our clusters - timing it with 600 nodes, split into 3 node groups (really, 1 node group, duplicated for 3 az's in us-east-2). We were testing with t3.small instances which have just enough CPU to run kubelet and the daemonset pods we deploy.

We captured metrics from both the CAPI and CAPA controllers as well as metrics about the state of the CAPI resources in the kubernetes cluster. The most meaningful thing to note in the metrics is the time scale. The rollout took ~2 hours to complete on the mainline version and < 45 minutes on this branch. The CAPA unfinished workqueue metric is also significantly different before and after this change.

This is metrics without the change (running on capi v1.3.2, capa v2.0.2 official release):

This is the metrics from the same cluster after this change (running on capi v1.3.2, this PR branch):

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:
Split into two commits for review, first commit is the modified behavior. The second commit is updating the e2e test sites for the new reconciliation behavior.

Checklist:

squashed commits
includes documentation
adds unit tests
adds or updates e2e tests

Release note:

Modifies the AWSMachine reconciliation to no longer block on ec2 termination or creation, resulting in faster rollouts for large Machine Deployments.

k8s-ci-robot · 2023-02-21T22:37:28Z

Hi @cnmcavoy. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

controllers/awsmachine_controller.go

…2 termination or ec2 creation and wait for a future reconcile if necessary

richardcase · 2023-02-23T20:12:23Z

/ok-to-test

richardcase · 2023-02-23T20:12:32Z

/test ?

k8s-ci-robot · 2023-02-23T20:12:34Z

@richardcase: The following commands are available to trigger required jobs:

/test pull-cluster-api-provider-aws-build
/test pull-cluster-api-provider-aws-test
/test pull-cluster-api-provider-aws-verify

The following commands are available to trigger optional jobs:

/test pull-cluster-api-provider-aws-apidiff-main
/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-blocking
/test pull-cluster-api-provider-aws-e2e-clusterclass
/test pull-cluster-api-provider-aws-e2e-conformance
/test pull-cluster-api-provider-aws-e2e-conformance-with-ci-artifacts
/test pull-cluster-api-provider-aws-e2e-eks
/test pull-cluster-api-provider-aws-e2e-eks-gc
/test pull-cluster-api-provider-aws-e2e-eks-testing

Use /test all to run the following jobs that were automatically triggered:

pull-cluster-api-provider-aws-apidiff-main
pull-cluster-api-provider-aws-build
pull-cluster-api-provider-aws-test
pull-cluster-api-provider-aws-verify

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

richardcase · 2023-02-23T20:12:47Z

/test pull-cluster-api-provider-aws-e2e-blocking

Ankitasw

@cnmcavoy AFAIR, the blocking call was used such that instance termination happens successfully. As the operation takes some time to finish, hence is the blocking call. What would happen if instance termination doesn't finish at that point, and happen to be at failed state in the later point of time(due to some xyz reason) but we already set the conditions as deleted, but the instance is left behind and may fail the delete operation?
Thinking out loud here, about the downside of removing blocking calls

cnmcavoy · 2023-02-27T21:52:01Z

@Ankitasw

Thanks for the additional context. If the intent is to ensure that the ec2 instance associated with the AWSMachine is in a terminated state (or absent entirely) we don't need to block, one option could be to requeue the AWSMachine to be re-reconciled in a later reconciliation and repeat this until the cloud provider state matches. Blocking still doesn't seem to be necessary to satisfy that requirement.

I can update this PR to include covering that edge case and requeue the AWSMachine after TerminateInstance until the lookup for the instance no longer succeeds.

…their termination completes

Ankitasw · 2023-02-28T05:27:19Z

I can update this PR to include covering that edge case and requeue the AWSMachine after TerminateInstance until the lookup for the instance no longer succeeds.

This sounds good to me, I was trying to implying the same 😄 I will take a look at the PR update. Thanks @cnmcavoy

Ankitasw · 2023-02-28T06:12:28Z

/test pull-cluster-api-provider-aws-e2e-blocking

Ankitasw · 2023-02-28T06:21:53Z

/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-eks

k8s-ci-robot · 2023-02-28T11:46:38Z

@cnmcavoy: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-cluster-api-provider-aws-e2e-eks	`f5d1d45`	link	false	`/test pull-cluster-api-provider-aws-e2e-eks`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Ankitasw · 2023-02-28T11:48:00Z

/lgtm
/approve

We can ignore that error, as its failing sometimes.

k8s-ci-robot · 2023-02-28T11:48:07Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Ankitasw

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [Ankitasw]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot requested review from pydctw and Skarlso February 21, 2023 22:37

k8s-ci-robot added needs-priority needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 21, 2023

k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 21, 2023

javanthropus reviewed Feb 21, 2023

View reviewed changes

controllers/awsmachine_controller.go Outdated Show resolved Hide resolved

cnmcavoy added 2 commits February 22, 2023 15:21

Change the AWSMachine controller behavior to not block and confirm ec…

8d4f47f

…2 termination or ec2 creation and wait for a future reconcile if necessary

Update tests to account for updated awsmachine reconciliation behavior

b2963ff

cnmcavoy force-pushed the cnmcavoy/speedy-provision-and-termination branch from 30babb9 to b2963ff Compare February 22, 2023 21:32

cnmcavoy changed the title ~~WIP: Modify AWSMachine reconciliation behavior to terminate and create instances without blocking~~ Modify AWSMachine reconciliation behavior to terminate and create instances without blocking Feb 22, 2023

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 22, 2023

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 23, 2023

Ankitasw reviewed Feb 27, 2023

View reviewed changes

Requeue terminating AWSMachines and don't remove the finalizer until …

f5d1d45

…their termination completes

Ankitasw added kind/bug Categorizes issue or PR as related to a bug. and removed kind/feature Categorizes issue or PR as related to a new feature. labels Feb 28, 2023

k8s-ci-robot assigned Ankitasw Feb 28, 2023

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 28, 2023

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 28, 2023

k8s-ci-robot merged commit 8d64bc6 into kubernetes-sigs:main Feb 28, 2023

This was referenced May 25, 2023

Self-hosted management cluster control plane upgrade takes more than 15min to finish one control plane #4293

Closed

Revert creating instance with wait blocking #4296

Closed

cnmcavoy mentioned this pull request Sep 21, 2023

REQUEST: New membership for @cnmcavoy kubernetes/org#4474

Closed

9 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Modify AWSMachine reconciliation behavior to terminate and create instances without blocking #4092

Modify AWSMachine reconciliation behavior to terminate and create instances without blocking #4092

cnmcavoy commented Feb 21, 2023 •

edited

Loading

k8s-ci-robot commented Feb 21, 2023

richardcase commented Feb 23, 2023

richardcase commented Feb 23, 2023

k8s-ci-robot commented Feb 23, 2023

richardcase commented Feb 23, 2023

Ankitasw left a comment

cnmcavoy commented Feb 27, 2023

Ankitasw commented Feb 28, 2023 •

edited

Loading

Ankitasw commented Feb 28, 2023

Ankitasw commented Feb 28, 2023

k8s-ci-robot commented Feb 28, 2023

Ankitasw commented Feb 28, 2023

k8s-ci-robot commented Feb 28, 2023

Modify AWSMachine reconciliation behavior to terminate and create instances without blocking #4092

Modify AWSMachine reconciliation behavior to terminate and create instances without blocking #4092

Conversation

cnmcavoy commented Feb 21, 2023 • edited Loading

k8s-ci-robot commented Feb 21, 2023

richardcase commented Feb 23, 2023

richardcase commented Feb 23, 2023

k8s-ci-robot commented Feb 23, 2023

richardcase commented Feb 23, 2023

Ankitasw left a comment

Choose a reason for hiding this comment

cnmcavoy commented Feb 27, 2023

Ankitasw commented Feb 28, 2023 • edited Loading

Ankitasw commented Feb 28, 2023

Ankitasw commented Feb 28, 2023

k8s-ci-robot commented Feb 28, 2023

Ankitasw commented Feb 28, 2023

k8s-ci-robot commented Feb 28, 2023

cnmcavoy commented Feb 21, 2023 •

edited

Loading

Ankitasw commented Feb 28, 2023 •

edited

Loading