Azure: Bugfix: Check PowerState before setting OutOfResources on instance #5767

domenicbozzuto · 2023-05-17T20:05:19Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

In #5548, I added support for faster backoff on a failed provisioning state. We recently found a bug where many simultaneous updates to the network profiles of running VMs in a VMSS could sometimes fail with a ProvisioningState/failed/NetworkingInternalOperationError error. This would be reflected in vm.ProvisioningState (which represents the status of the last provisioning operation on the VM), and would result in the VMSS size being decreased and the running VM instance being removed without first properly draining the running pods on the node.

This PR also considers the PowerState of the VM when determining whether an OutOfResources error should be created: the error is now set only if the provisioning state is failed AND the instance is not running. I extended the test case I added in #5548 with several more configurations to verify the power state behavior.
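
As a rough illustration of the rule described above, the sketch below decides what to report for a VM whose provisioning state is already known to be failed. The types `instanceState` and `instanceStatus` and the function names are hypothetical stand-ins rather than the real `cloudprovider` types; the error class and code strings are taken from the log output in this PR.

```go
package main

import "fmt"

// Hypothetical stand-ins for the cloudprovider types used in the PR.
type instanceState string

const (
	instanceRunning  instanceState = "InstanceRunning"
	instanceCreating instanceState = "InstanceCreating"
)

type instanceStatus struct {
	State      instanceState
	ErrorClass string // empty when no error is attached
	ErrorCode  string
}

// isRunning treats both "running" and "starting" as running, mirroring the
// helper shown later in the review.
func isRunning(powerState string) bool {
	return powerState == "PowerState/running" || powerState == "PowerState/starting"
}

// statusForFailedProvisioning is called only when the provisioning state is
// already known to be failed. A running VM keeps its running status; any
// other power state gets the OutOfResources error attached.
func statusForFailedProvisioning(powerState string) instanceStatus {
	if isRunning(powerState) {
		return instanceStatus{State: instanceRunning}
	}
	return instanceStatus{
		State:      instanceCreating,
		ErrorClass: "OutOfResources",
		ErrorCode:  "provisioning-state-failed",
	}
}

func main() {
	for _, ps := range []string{"PowerState/running", "PowerState/stopped"} {
		fmt.Printf("%s -> %+v\n", ps, statusForFailedProvisioning(ps))
	}
}
```

With this shape, a running VM that picks up a failed provisioning state after a network profile update is left alone, while a VM that never came up still triggers the fast backoff.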

Reproduction with Fix

I was able to reproduce the issue and captured a situation where one of the existing VMs in a VMSS was running and got updated to a ProvisioningFailed state at the same time as a ProvisioningFailure on a newly created VM. The existing VM is reported as cloudprovider.InstanceRunning but the new instance triggers the provisioning failure codepath that produces an OutOfResources error.

"2023-05-17T19:43:47.151Z","Disabling scale-up for node group <vmss> until 2023-05-17 19:48:46.903222688 +0000 UTC m=+6595.425006719; errorClass=OutOfResource; errorCode=provisioning-state-failed"
...
"2023-05-17T19:43:47.135Z","VM /subscriptions/<subscription>/resourceGroups/<resourceGroup>/providers/Microsoft.Compute/virtualMachineScaleSets/<vmss>/virtualMachines/1397 reports failed provisioning state with non-running power state: PowerState/stopped"
"2023-05-17T19:43:47.135Z","Getting vm instance provisioning state Failed for /subscriptions/<subscription>/resourceGroups/<resourceGroup>/providers/Microsoft.Compute/virtualMachineScaleSets/<vmss>/virtualMachines/1397"
"2023-05-17T19:43:47.135Z","Getting vm instance provisioning state Succeeded for /subscriptions/<subscription>/resourceGroups/<resourceGroup>/providers/Microsoft.Compute/virtualMachineScaleSets/<vmss>/virtualMachines/1390"
...
"2023-05-17T19:43:47.135Z","VM /subscriptions/<subscription>/resourceGroups/<resourceGroup>/providers/Microsoft.Compute/virtualMachineScaleSets/<vmss>/virtualMachines/1330 reports a failed provisioning state but is running (PowerState/running)"
"2023-05-17T19:43:47.135Z","Getting vm instance provisioning state Failed for /subscriptions/<subscription>/resourceGroups/<resourceGroup>/providers/Microsoft.Compute/virtualMachineScaleSets/<vmss>/virtualMachines/1330"

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

expand: "instanceView"

Setting the expand field of .virtualMachineScaleSetVMsClient.List to "instanceView" is required to retrieve the PowerState for the VMs. Eventually using instanceView appears to have been mentioned a few years ago, when the provisioning state logic was first introduced. I've been running a build of the CA with this fix (and expand=instanceView) in some of our larger Azure clusters (1k nodes) without issue for over a week, and the net number of API calls the CA makes has not increased.

If there's some hesitancy towards setting expand=instanceView by default, it could probably be made configurable via an environment variable flag. That flag would effectively act as a pseudo-feature-gate for the fast backoff: PowerState is assumed to be PowerState/running if vm.InstanceView == nil, so in the event of a provisioning failure we'd always still report cloudprovider.InstanceRunning, which was the behavior before #5548. I'm open to exploring this if it's preferred!

PowerState

The azure-sdk-for-go does not directly expose the enumerations for the PowerState, even though there are official values for these in other languages like Java. I saw the legacy Azure cloud provider also had logic to parse the power state from an instanceView.Statuses, so I largely based this fix on that.
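
A minimal sketch of that parsing approach follows, assuming the power state arrives as a status code with a `PowerState/` prefix. The types `instanceView` and `instanceViewStatus` are simplified, hypothetical mirrors of the SDK types, and `powerStateFrom` and `strPtr` are illustrative names. The nil-view fallback follows the description above; the fallback for a view without a power state is a choice made for this sketch (the review thread below discusses which value to use).

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical, trimmed-down mirrors of the SDK's instance view types.
type instanceViewStatus struct {
	Code *string
}

type instanceView struct {
	Statuses []instanceViewStatus
}

const (
	powerStatePrefix  = "PowerState/"
	powerStateRunning = "PowerState/running"
	powerStateUnknown = "PowerState/unknown"
)

// powerStateFrom returns the first status code carrying the PowerState/
// prefix. A nil view (instanceView not expanded) is assumed to be running;
// a view without any power state status falls back to unknown.
func powerStateFrom(view *instanceView) string {
	if view == nil {
		return powerStateRunning
	}
	for _, s := range view.Statuses {
		if s.Code != nil && strings.HasPrefix(*s.Code, powerStatePrefix) {
			return *s.Code
		}
	}
	return powerStateUnknown
}

// strPtr is a small helper for building example data.
func strPtr(s string) *string { return &s }

func main() {
	view := &instanceView{Statuses: []instanceViewStatus{
		{Code: strPtr("ProvisioningState/failed")},
		{Code: strPtr("PowerState/stopped")},
	}}
	fmt.Println(powerStateFrom(view))
	fmt.Println(powerStateFrom(nil))
	fmt.Println(powerStateFrom(&instanceView{}))
}
```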

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

The provisioning state reflects the status of the last provisioning action, which means the instance can enter a failed state after it's running. Protect against unnecessary scale-downs by checking the power state, so that running VMs are not scaled down.

domenicbozzuto · 2023-05-17T20:06:18Z

/area provider/azure

comtalyst · 2023-06-29T05:34:13Z

cluster-autoscaler/cloudprovider/azure/azure_scale_set.go

@@ -35,6 +35,18 @@ import (
 	"github.com/Azure/go-autorest/autorest/azure"
 )

+// PowerStates reflect the operational state of a VM
+// From https://learn.microsoft.com/en-us/java/api/com.microsoft.azure.management.compute.powerstate?view=azure-java-stable
+const (


Could we move this to azure_util.go?
That is where the potentially shareable constants are located.

Let's also keep the naming consistent and descriptive about power state being from VM, something like vmPowerStateStarting.

Sure thing, moved the constants and related helper functions to azure_util.go. I also adjusted the naming of the helper functions to better reference that it's a VM power state.

comtalyst · 2023-06-29T05:34:18Z

cluster-autoscaler/cloudprovider/azure/azure_scale_set.go

+			klog.V(4).Infof("VM %s reports failed provisioning state with non-running power state: %s", resourceId, powerState)
+			status.State = cloudprovider.InstanceCreating
+			status.ErrorInfo = &cloudprovider.InstanceErrorInfo{
+				ErrorClass:   cloudprovider.OutOfResourcesErrorClass,


I don't think we should assume all errors to be OutOfResourcesErrorClass. Cases with other errors, like the one in #5548, could happen again.

Right now the only two ErrorClass options are OutOfResourcesErrorClass and OtherErrorClass, and their value at this point is purely informational (there's no conditional logic anywhere in the cluster-autoscaler that actually acts on the value of ErrorClass).

I'm open to changing to OtherErrorClass, but IMO OutOfResourcesErrorClass makes a bit more sense to me given the cloud provider itself is reporting a provisioning issue. WDYT?

I see your point—let's keep it that way.

comtalyst · 2023-06-29T05:34:21Z

cluster-autoscaler/cloudprovider/azure/azure_scale_set.go

+
+	// PowerState is not set if the VM is still creating (or has failed creation),
+	// so the absence of a PowerState is treated the same as a VM that is stopped
+	return PowerStateStopped


I think we should let PowerStateUnknown be the default in this case. The assumption in the comment could change at any time with updates on their side.
And since there is currently no behavior difference between returning PowerStateUnknown and PowerStateStopped in this case, the former would be safer.

Agreed, I've updated to make vmPowerStateUnknown the default case -- thanks!

comtalyst · 2023-06-29T05:34:23Z

cluster-autoscaler/cloudprovider/azure/azure_scale_set.go

+	return powerState == PowerStateRunning || powerState == PowerStateStarting
+}
+
+func isKnownPowerState(powerState string) bool {


I think functions like this and isRunningPowerState() could be in azure_util.go too. That way we could manage the definitions of each power state in one place.

* renames all PowerState* consts to vmPowerState*
* moves vmPowerState* consts and helper functions to azure_util.go
* changes default vmPowerState to vmPowerStateUnknown instead of vmPowerStateStopped when a power state is not set.
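
For orientation, the renamed constants and helpers might look roughly like the sketch below. Only the `vmPowerState` prefix and the running/starting check come from this thread; the full constant list is taken from the Java `PowerState` values referenced in the diff, and the helper names `isRunningVMPowerState` and `isKnownVMPowerState`, along with the decision to count `PowerState/unknown` as a known value, are assumptions made for illustration.

```go
package main

import "fmt"

// Power state codes as listed for the Java SDK's PowerState type.
const (
	vmPowerStateStarting     = "PowerState/starting"
	vmPowerStateRunning      = "PowerState/running"
	vmPowerStateStopping     = "PowerState/stopping"
	vmPowerStateStopped      = "PowerState/stopped"
	vmPowerStateDeallocating = "PowerState/deallocating"
	vmPowerStateDeallocated  = "PowerState/deallocated"
	vmPowerStateUnknown      = "PowerState/unknown"
)

// isRunningVMPowerState reports whether the VM is running or about to be.
func isRunningVMPowerState(powerState string) bool {
	return powerState == vmPowerStateRunning || powerState == vmPowerStateStarting
}

// isKnownVMPowerState reports whether the string is one of the listed codes.
// Counting "unknown" as a known code is an assumption of this sketch.
func isKnownVMPowerState(powerState string) bool {
	switch powerState {
	case vmPowerStateStarting, vmPowerStateRunning, vmPowerStateStopping,
		vmPowerStateStopped, vmPowerStateDeallocating, vmPowerStateDeallocated,
		vmPowerStateUnknown:
		return true
	}
	return false
}

func main() {
	for _, ps := range []string{vmPowerStateStarting, vmPowerStateStopped, "PowerState/bogus"} {
		fmt.Printf("%s running=%t known=%t\n", ps, isRunningVMPowerState(ps), isKnownVMPowerState(ps))
	}
}
```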

comtalyst · 2023-07-05T17:45:30Z

/lgtm

k8s-ci-robot · 2023-07-05T17:45:35Z

@comtalyst: changing LGTM is restricted to collaborators

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

tallaxes · 2023-07-05T18:08:25Z

/lgtm

tallaxes · 2023-07-05T18:09:46Z

/approve

k8s-ci-robot · 2023-07-05T18:09:54Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: domenicbozzuto, tallaxes

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~cluster-autoscaler/cloudprovider/azure/OWNERS~~ [tallaxes]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/cluster-autoscaler labels May 17, 2023

k8s-ci-robot requested review from feiskyer and tallaxes May 17, 2023 20:05

k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 17, 2023

k8s-ci-robot added the area/provider/azure Issues or PRs related to azure provider label May 17, 2023

comtalyst reviewed Jun 29, 2023

domenicbozzuto force-pushed the bugfix-azure-prevent-unneeded-scaledown branch from bfda39b to faaf972 Compare July 5, 2023 13:41

domenicbozzuto force-pushed the bugfix-azure-prevent-unneeded-scaledown branch from faaf972 to dbff9be Compare July 5, 2023 13:45

k8s-ci-robot assigned tallaxes Jul 5, 2023

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 5, 2023

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 5, 2023

k8s-ci-robot merged commit b569db4 into kubernetes:master Jul 5, 2023
