feat: Azure Provider HasInstance implementation #6956

@Bryce-Soghigian Bryce-Soghigian (Member) commented Jun 21, 2024

What type of PR is this?

/kind bug
/kind regression

What this PR does / why we need it:

CA fails to scale up, or to cancel an in-progress scale-down, when there are unschedulable pods. Stealing this description from the AWS provider implementation.

I think the description of #5054 (comment) explains it well:
...original intent of determining the deleted nodes was incorrect, which led to the issues reported by other users. The nodes tainted with ToBeDeleted were misidentified as Deleted instead of Ready/Unready, which caused a miscalculation of the node being included as Upcoming. This caused problems described in #3949 and #4456.
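
To make the misclassification in the quote concrete, here is a minimal sketch. It is not the autoscaler's actual code: the `node` type and the `deletedByTaint` and `deletedByInstance` helpers are hypothetical stand-ins. It compares counting every ToBeDeleted-tainted node as deleted against counting only nodes whose instance is really gone.

```go
package main

import "fmt"

// node is a hypothetical, stripped-down stand-in for a Kubernetes node.
type node struct {
	name        string
	toBeDeleted bool // true when the node carries the ToBeDeleted taint
}

// deletedByTaint treats every tainted node as already deleted.
// This mirrors the misclassification described in the quote.
func deletedByTaint(nodes []node) []string {
	var deleted []string
	for _, n := range nodes {
		if n.toBeDeleted {
			deleted = append(deleted, n.name)
		}
	}
	return deleted
}

// deletedByInstance treats a node as deleted only when the (hypothetical)
// set of known cloud instances no longer contains it.
func deletedByInstance(nodes []node, instances map[string]bool) []string {
	var deleted []string
	for _, n := range nodes {
		if !instances[n.name] {
			deleted = append(deleted, n.name)
		}
	}
	return deleted
}

func main() {
	nodes := []node{
		{name: "vm-0"},
		{name: "vm-1", toBeDeleted: true}, // scale-down in progress, VM still exists
		{name: "vm-2", toBeDeleted: true}, // VM already gone
	}
	instances := map[string]bool{"vm-0": true, "vm-1": true}

	fmt.Println(deletedByTaint(nodes))               // [vm-1 vm-2]
	fmt.Println(deletedByInstance(nodes, instances)) // [vm-2]
}
```

In this sketch the taint-only check reports `vm-1` as deleted even though its VM still exists, which is the kind of miscount the quoted description attributes to the original logic.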

Which issue(s) this PR fixes:

Special notes for your reviewer:

This PR introduces the HasInstance method to the Azure provider for Cluster Autoscaler. The primary purpose of this method is to ascertain whether a given node has a corresponding instance in the Azure cloud provider. This implementation helps to prevent the undercount of existing VMs and addresses issues related to the taint-based overcount of deleted VMs.

• The HasInstance method ensures that if it is uncertain whether an instance exists, it returns an error instead of false, nil. This approach enforces a fallback to the taint-based determination method, providing a more reliable count of existing VMs.
• If the instance exists: return true, nil
• If the instance does not exist: return *, ErrNotImplemented (consider using a custom error for autoscaled nodes)
• For unimplemented cases: return *, ErrNotImplemented
• For any other errors: return *, error
• ErrNotImplemented is used for silent fallback, while any other errors will be logged for further investigation.

Does this PR introduce a user-facing change?


Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot (Contributor)

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress, kind/bug, kind/regression, and area/cluster-autoscaler labels Jun 21, 2024
@k8s-ci-robot k8s-ci-robot requested a review from jackfrancis June 21, 2024 16:43
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes label Jun 21, 2024
@k8s-ci-robot k8s-ci-robot requested a review from nilo19 June 21, 2024 16:43
@k8s-ci-robot k8s-ci-robot added the area/provider/azure and size/M labels Jun 21, 2024
@Bryce-Soghigian (Member, Author)

/test all

@Bryce-Soghigian Bryce-Soghigian marked this pull request as ready for review June 21, 2024 18:38
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress label Jun 21, 2024
@tallaxes tallaxes (Contributor) left a comment

Overall LGTM, with minor feedback and some comments on the implications of using the cache.
How was this tested? Can we add unit tests? E2E tests?

cluster-autoscaler/cloudprovider/azure/azure_cache.go (outdated, resolved)
cluster-autoscaler/cloudprovider/azure/azure_cache.go (outdated, resolved)
@Bryce-Soghigian Bryce-Soghigian force-pushed the bsoghigian/azure/has-instance-impl branch from f0d3407 to ea410de Compare July 12, 2024 23:37
@k8s-ci-robot k8s-ci-robot added the size/L label and removed the size/M label Jul 16, 2024
@Bryce-Soghigian Bryce-Soghigian force-pushed the bsoghigian/azure/has-instance-impl branch from ca72ce3 to 34a26ee Compare July 31, 2024 16:36
@k8s-ci-robot k8s-ci-robot removed the lgtm and needs-rebase labels Jul 31, 2024
@tallaxes (Contributor)

/lgtm

Labels
approved, area/cluster-autoscaler, area/provider/azure, cncf-cla: yes, kind/bug, kind/regression, lgtm, size/L
5 participants