Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modify AWSMachine reconciliation behavior to terminate and create instances without blocking #4092

Conversation

cnmcavoy
Copy link
Contributor

@cnmcavoy cnmcavoy commented Feb 21, 2023

What type of PR is this?
/kind bug

What this PR does / why we need it:
The existing AWSMachine reconciliation loop blocks on terminating an ec2 instance and waits for it to confirm the termination with AWS, as well as blocking on ec2 instance creation and waits up to one minute, for no discernible reason I could detect. This PR removes the blocking behavior, proceeding with reconciliation immediately after the AWS sdk confirms success on the API call.

As a result, this change greatly speeds up large machine deployment rollouts. Previously, very large Machine Deployments (> 100 machines) would take several hours for the machine sets to provision new nodes and tear down machines. This was mostly a result of the reconciliation blocking for each instance creation and delete, and while blocked, prevented additional concurrent work from proceeding, as there is a fixed number of concurrent AWSMachine reconcile workers allowed (controlled by --awsmachine-concurrency).

We tested this change in one of our clusters - timing it with 600 nodes, split into 3 node groups (really, 1 node group, duplicated for 3 az's in us-east-2). We were testing with t3.small instances which have just enough CPU to run kubelet and the daemonset pods we deploy.

We captured metrics from both the CAPI and CAPA controllers as well as metrics about the state of the CAPI resources in the kubernetes cluster. The most meaningful thing to note in the metrics is the time scale. The rollout took ~2 hours to complete on the mainline version and < 45 minutes on this branch. The CAPA unfinished workqueue metric is also significantly different before and after this change.

This is metrics without the change (running on capi v1.3.2, capa v2.0.2 official release):

Screenshot 2023-02-23 at 10-30-51 Kubernetes Cluster Autoscaling Datadog

This is the metrics from the same cluster after this change (running on capi v1.3.2, this PR branch):

Screenshot 2023-02-23 at 10-31-38 Kubernetes Cluster Autoscaling Datadog

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:
Split into two commits for review, first commit is the modified behavior. The second commit is updating the e2e test sites for the new reconciliation behavior.

Checklist:

  • squashed commits
  • includes documentation
  • adds unit tests
  • adds or updates e2e tests

Release note:

Modifies the AWSMachine reconciliation to no longer block on ec2 termination or creation, resulting in faster rollouts for large Machine Deployments.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 21, 2023
@k8s-ci-robot k8s-ci-robot added needs-priority needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 21, 2023
@k8s-ci-robot
Copy link
Contributor

Hi @cnmcavoy. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 21, 2023
@cnmcavoy cnmcavoy force-pushed the cnmcavoy/speedy-provision-and-termination branch from 30babb9 to b2963ff Compare February 22, 2023 21:32
@cnmcavoy cnmcavoy changed the title WIP: Modify AWSMachine reconciliation behavior to terminate and create instances without blocking Modify AWSMachine reconciliation behavior to terminate and create instances without blocking Feb 22, 2023
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 22, 2023
@richardcase
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 23, 2023
@richardcase
Copy link
Member

/test ?

@k8s-ci-robot
Copy link
Contributor

@richardcase: The following commands are available to trigger required jobs:

  • /test pull-cluster-api-provider-aws-build
  • /test pull-cluster-api-provider-aws-test
  • /test pull-cluster-api-provider-aws-verify

The following commands are available to trigger optional jobs:

  • /test pull-cluster-api-provider-aws-apidiff-main
  • /test pull-cluster-api-provider-aws-e2e
  • /test pull-cluster-api-provider-aws-e2e-blocking
  • /test pull-cluster-api-provider-aws-e2e-clusterclass
  • /test pull-cluster-api-provider-aws-e2e-conformance
  • /test pull-cluster-api-provider-aws-e2e-conformance-with-ci-artifacts
  • /test pull-cluster-api-provider-aws-e2e-eks
  • /test pull-cluster-api-provider-aws-e2e-eks-gc
  • /test pull-cluster-api-provider-aws-e2e-eks-testing

Use /test all to run the following jobs that were automatically triggered:

  • pull-cluster-api-provider-aws-apidiff-main
  • pull-cluster-api-provider-aws-build
  • pull-cluster-api-provider-aws-test
  • pull-cluster-api-provider-aws-verify

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@richardcase
Copy link
Member

/test pull-cluster-api-provider-aws-e2e-blocking

Copy link
Member

@Ankitasw Ankitasw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cnmcavoy AFAIR, the blocking call was used such that instance termination happens successfully. As the operation takes some time to finish, hence is the blocking call. What would happen if instance termination doesn't finish at that point, and happen to be at failed state in the later point of time(due to some xyz reason) but we already set the conditions as deleted, but the instance is left behind and may fail the delete operation?
Thinking out loud here, about the downside of removing blocking calls

@cnmcavoy
Copy link
Contributor Author

@Ankitasw

Thanks for the additional context. If the intent is to ensure that the ec2 instance associated with the AWSMachine is in a terminated state (or absent entirely) we don't need to block, one option could be to requeue the AWSMachine to be re-reconciled in a later reconciliation and repeat this until the cloud provider state matches. Blocking still doesn't seem to be necessary to satisfy that requirement.

I can update this PR to include covering that edge case and requeue the AWSMachine after TerminateInstance until the lookup for the instance no longer succeeds.

@Ankitasw
Copy link
Member

Ankitasw commented Feb 28, 2023

I can update this PR to include covering that edge case and requeue the AWSMachine after TerminateInstance until the lookup for the instance no longer succeeds.

This sounds good to me, I was trying to implying the same 😄 I will take a look at the PR update. Thanks @cnmcavoy

@Ankitasw
Copy link
Member

/test pull-cluster-api-provider-aws-e2e-blocking

@Ankitasw
Copy link
Member

/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-eks

@Ankitasw Ankitasw added kind/bug Categorizes issue or PR as related to a bug. and removed kind/feature Categorizes issue or PR as related to a new feature. labels Feb 28, 2023
@k8s-ci-robot
Copy link
Contributor

@cnmcavoy: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cluster-api-provider-aws-e2e-eks f5d1d45 link false /test pull-cluster-api-provider-aws-e2e-eks

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@Ankitasw
Copy link
Member

/lgtm
/approve

We can ignore that error, as its failing sometimes.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 28, 2023
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Ankitasw

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 28, 2023
@k8s-ci-robot k8s-ci-robot merged commit 8d64bc6 into kubernetes-sigs:main Feb 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants