
Azure: fix node removal race condition on VMSS deletion #95289

Merged

Conversation


@bpineau bpineau commented Oct 4, 2020

What type of PR is this?

/kind bug

What this PR does / why we need it:

When a VMSS is being deleted, its instances are removed first; the VMSS itself disappears once it is empty. That delay is generally enough for kube-controller-manager to delete the corresponding Kubernetes nodes, but it might not be when the controller is busy or throttled, for instance.

If Kubernetes nodes remain after their backing VMSS was removed, the Azure cloud provider fails to list that VMSS's VMs, and downstream callers (e.g. `InstanceExistsByProviderID`) don't interpret those errors as a missing instance. The nodes remain (still considered "existing"), and controller-manager retries the VMSS VMs listing indefinitely, draining API call quotas and potentially causing throttling.

In practice, a missing scale set implies that instances attributed to that VMSS don't exist either: `InstanceExistsByProviderID` (part of the general cloud provider interface) should return false in that case.
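For illustration, here is a minimal sketch of that contract, assuming a hypothetical `lookup` helper (this is not the PR's actual diff): the node lifecycle controller treats a `(false, nil)` result from `InstanceExistsByProviderID` as "node is gone, delete it", whereas any returned error is simply retried.

```go
package main

import (
	"errors"
	"fmt"

	cloudprovider "k8s.io/cloud-provider"
)

// existsByProviderID is a hypothetical stand-in for the real
// InstanceExistsByProviderID: a missing parent scale set must surface as
// (false, nil) rather than as an error, so kube-controller-manager deletes
// the node instead of retrying the VMSS VM listing forever.
func existsByProviderID(lookup func(providerID string) error, providerID string) (bool, error) {
	switch err := lookup(providerID); {
	case errors.Is(err, cloudprovider.InstanceNotFound):
		return false, nil // a missing VMSS implies its instances are gone too
	case err != nil:
		return false, err // transient failure: the caller will retry
	default:
		return true, nil
	}
}

func main() {
	// Simulate a lookup against a scale set that has already been deleted.
	gone := func(string) error { return cloudprovider.InstanceNotFound }
	exists, err := existsByProviderID(gone, "azure:///...") // hypothetical provider ID
	fmt.Println(exists, err)                                // false <nil> -> node gets deleted
}
```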

The graph below shows the impact of those retries on the ARM API call count, on a cluster with such a leaked node, before and after this patch was deployed:

[Graph: kube-controller-manager VMSS VM list calls]

Which issue(s) this PR fixes:

Fixes #95288

Special notes for your reviewer:

Could we consider backporting this fix to 1.19?

Does this PR introduce a user-facing change?:

Gracefully delete nodes when their parent scale set went missing

/assign @andyzhangx @feiskyer
/sig cloud-provider
/area provider/azure

@k8s-ci-robot k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Oct 4, 2020
@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. area/provider/azure Issues or PRs related to azure provider labels Oct 4, 2020
@k8s-ci-robot

@bpineau: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 4, 2020
@k8s-ci-robot

Hi @bpineau. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/cloudprovider labels Oct 4, 2020
@andyzhangx

/kind bug
/priority important-soon
/sig cloud-provider
/area provider/azure
/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Oct 4, 2020
@bpineau bpineau force-pushed the fix-instanceexists-on-deleted-vmss branch from 1ef5787 to 4ecce5e on October 4, 2020 12:46
```diff
@@ -697,6 +699,9 @@ func (ss *scaleSet) listScaleSetVMs(scaleSetName, resourceGroup string) ([]compute.VirtualMachineScaleSetVM, error) {
 	allVMs, rerr := ss.VirtualMachineScaleSetVMsClient.List(ctx, resourceGroup, scaleSetName, string(compute.InstanceView))
 	if rerr != nil {
 		klog.Errorf("VirtualMachineScaleSetVMsClient.List failed: %v", rerr)
+		if rerr.IsNotFound() {
+			return nil, ErrorVmssNotFound
+		}
```
@andyzhangx andyzhangx (Member) commented Oct 4, 2020

In this case, what about returning `cloudprovider.InstanceNotFound` directly? Then it would save quite a lot of code changes.

@bpineau bpineau (author)

updated accordingly
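Presumably the follow-up commit swaps the sentinel along these lines (a sketch inferred from the suggestion above, not quoted from the merged diff):

```diff
 		if rerr.IsNotFound() {
-			return nil, ErrorVmssNotFound
+			return nil, cloudprovider.InstanceNotFound
 		}
```

Returning the well-known `cloudprovider.InstanceNotFound` lets existing callers such as `InstanceExistsByProviderID` handle the error without any VMSS-specific plumbing.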

@andyzhangx andyzhangx (Member)

thanks for the contribution

@bpineau bpineau force-pushed the fix-instanceexists-on-deleted-vmss branch from 4ecce5e to ee7cd25 on October 4, 2020 16:07
@andyzhangx andyzhangx (Member) left a comment

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 5, 2020
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx, bpineau

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 5, 2020
@k8s-ci-robot k8s-ci-robot merged commit 086b65a into kubernetes:master Oct 5, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.20 milestone Oct 5, 2020

bpineau commented Oct 5, 2020

Thank you @andyzhangx !
Would it be possible to cherry-pick to 1.19?

@andyzhangx

> Thank you @andyzhangx !
> Would it be possible to cherry-pick to 1.19?

@bpineau yes, will you do that?


bpineau commented Oct 5, 2020

Yes, thanks, done in #95305

k8s-ci-robot added a commit that referenced this pull request Oct 11, 2020
…5289-upstream-release-1.18

Automated cherry pick of #95289: Azure: fix node removal race condition on VMSS deletion
k8s-ci-robot added a commit that referenced this pull request Oct 13, 2020
…9-upstream-release-1.19

Automated cherry pick of #95289 upstream release 1.19