
Azure: fix node removal race condition on VMSS deletion #95289

Merged

Conversation


@bpineau bpineau commented Oct 4, 2020

What type of PR is this?

/kind bug

What this PR does / why we need it:

When a VMSS is being deleted, its instances are removed first; the VMSS itself disappears once it is empty. That delay is generally enough for kube-controller-manager to delete the corresponding Kubernetes nodes, but it might not be when the controller is busy or throttled, for instance.

If Kubernetes nodes remain after their backing VMSS was removed, the Azure cloud provider fails to list that VMSS's VMs, and downstream callers (e.g. `InstanceExistsByProviderID`) don't interpret those errors as a missing instance. The nodes remain (still considered "existing"), and controller-manager retries the VMSS VMs listing indefinitely, draining API call quotas and potentially causing throttling.

In practice, a missing scale set implies that instances attributed to that VMSS don't exist either: `InstanceExistsByProviderID` (part of the general cloud provider interface) should return false in that case.
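For illustration, here is a minimal sketch of that contract, assuming a hypothetical `lookup` helper (this is not the PR's actual diff): the node lifecycle controller treats a `(false, nil)` result from `InstanceExistsByProviderID` as "node is gone, delete it", whereas any returned error is simply retried.

```go
package main

import (
	"errors"
	"fmt"

	cloudprovider "k8s.io/cloud-provider"
)

// existsByProviderID is a hypothetical stand-in for the real
// InstanceExistsByProviderID: a missing parent scale set must surface as
// (false, nil) rather than as an error, so kube-controller-manager deletes
// the node instead of retrying the VMSS VM listing forever.
func existsByProviderID(lookup func(providerID string) error, providerID string) (bool, error) {
	switch err := lookup(providerID); {
	case errors.Is(err, cloudprovider.InstanceNotFound):
		return false, nil // a missing VMSS implies its instances are gone too
	case err != nil:
		return false, err // transient failure: the caller will retry
	default:
		return true, nil
	}
}

func main() {
	// Simulate a lookup against a scale set that has already been deleted.
	gone := func(string) error { return cloudprovider.InstanceNotFound }
	exists, err := existsByProviderID(gone, "azure:///...") // hypothetical provider ID
	fmt.Println(exists, err)                                // false <nil> -> node gets deleted
}
```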

The graph below shows the impact of those retries on the ARM API call count, on a cluster with such a leaked node, before and after this patch was deployed:

[Graph: kube-controller-manager VMSS VM list calls]

Which issue(s) this PR fixes:

Fixes #95288

Special notes for your reviewer:

Could we consider backporting this fix to 1.19?

Does this PR introduce a user-facing change?:

Gracefully delete nodes when their parent scale set went missing

/assign @andyzhangx @feiskyer
/sig cloud-provider
/area provider/azure

@k8s-ci-robot k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Oct 4, 2020
@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. area/provider/azure Issues or PRs related to azure provider labels Oct 4, 2020
@k8s-ci-robot

@bpineau: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 4, 2020
@k8s-ci-robot

Hi @bpineau. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/cloudprovider labels Oct 4, 2020
@andyzhangx

/kind bug
/priority important-soon
/sig cloud-provider
/area provider/azure
/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Oct 4, 2020
@bpineau bpineau force-pushed the fix-instanceexists-on-deleted-vmss branch from 1ef5787 to 4ecce5e on October 4, 2020 12:46
```diff
@@ -697,6 +699,9 @@ func (ss *scaleSet) listScaleSetVMs(scaleSetName, resourceGroup string) ([]compute.VirtualMachineScaleSetVM, error) {
 	allVMs, rerr := ss.VirtualMachineScaleSetVMsClient.List(ctx, resourceGroup, scaleSetName, string(compute.InstanceView))
 	if rerr != nil {
 		klog.Errorf("VirtualMachineScaleSetVMsClient.List failed: %v", rerr)
+		if rerr.IsNotFound() {
+			return nil, ErrorVmssNotFound
+		}
```
@andyzhangx andyzhangx (Member) commented Oct 4, 2020

In this case, what about returning `cloudprovider.InstanceNotFound` directly? Then it would save quite a lot of code changes.

@bpineau bpineau (author)

updated accordingly
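Presumably the follow-up commit swaps the sentinel along these lines (a sketch inferred from the suggestion above, not quoted from the merged diff):

```diff
 		if rerr.IsNotFound() {
-			return nil, ErrorVmssNotFound
+			return nil, cloudprovider.InstanceNotFound
 		}
```

Returning the well-known `cloudprovider.InstanceNotFound` lets existing callers such as `InstanceExistsByProviderID` handle the error without any VMSS-specific plumbing.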

@andyzhangx andyzhangx (Member)

thanks for the contribution

@bpineau bpineau force-pushed the fix-instanceexists-on-deleted-vmss branch from 4ecce5e to ee7cd25 on October 4, 2020 16:07
@andyzhangx andyzhangx (Member) left a comment

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 5, 2020
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyzhangx, bpineau

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 5, 2020
@k8s-ci-robot k8s-ci-robot merged commit 086b65a into kubernetes:master Oct 5, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.20 milestone Oct 5, 2020

bpineau commented Oct 5, 2020

Thank you @andyzhangx !
Would it be possible to cherry-pick to 1.19?

@andyzhangx

> Thank you @andyzhangx !
> Would it be possible to cherry-pick to 1.19?

@bpineau yes, will you do that?


bpineau commented Oct 5, 2020

Yes, thanks, done in #95305

k8s-ci-robot added a commit that referenced this pull request Oct 11, 2020
…5289-upstream-release-1.18

Automated cherry pick of #95289: Azure: fix node removal race condition on VMSS deletion
k8s-ci-robot added a commit that referenced this pull request Oct 13, 2020
…9-upstream-release-1.19

Automated cherry pick of #95289 upstream release 1.19