Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Azure: Bugfix: Check PowerState before setting OutOfResources on instance #5767

Conversation

domenicbozzuto
Copy link
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it:

In #5548, I had added support for faster backoff on a failed provisioning state. We recently found a bug where a many simultaneous updates to the network profiles of a running VMs in a VMSS could sometimes fail with a ProvisioningState/failed/NetworkingInternalOperationError error. This would be reflected in the vm.ProvisioningState (which represents the status of the last provisioning operation on the VM), and would result in the VMSS size being decreased and the running VM instance being removed without first properly draining the running pods on the node.

This PR adds behavior to also consider the PowerState of the VM when determining if an OutOfResources error should be created -- it now only sets the OutOfResource error if the provisioning state is failed AND the instance is not running. I extended the test case I added in #5548 with several more configurations to ensure the power state behavior should be correct.

Reproduction with Fix

I was able to reproduce the issue and captured a situation where one of the existing VMs in a VMSS was running and got updated to a ProvisioningFailed state at the same time as a ProvisioningFailure on a newly created VM. The existing VM is reported as cloudprovider.InstanceRunning but the new instance triggers the provisioning failure codepath that produces an OutOfResources error.

"2023-05-17T19:43:47.151Z","Disabling scale-up for node group <vmss> until 2023-05-17 19:48:46.903222688 +0000 UTC m=+6595.425006719; errorClass=OutOfResource; errorCode=provisioning-state-failed"
...
"2023-05-17T19:43:47.135Z","VM /subscriptions/<subscription>/resourceGroups/<resourceGroup>/providers/Microsoft.Compute/virtualMachineScaleSets/<vmss>/virtualMachines/1397 reports failed provisioning state with non-running power state: PowerState/stopped"
"2023-05-17T19:43:47.135Z","Getting vm instance provisioning state Failed for /subscriptions/<subscription>/resourceGroups/<resourceGroup>/providers/Microsoft.Compute/virtualMachineScaleSets/<vmss>/virtualMachines/1397"
"2023-05-17T19:43:47.135Z","Getting vm instance provisioning state Succeeded for /subscriptions/<subscription>/resourceGroups/<resourceGroup>/providers/Microsoft.Compute/virtualMachineScaleSets/<vmss>/virtualMachines/1390"
...
"2023-05-17T19:43:47.135Z","VM /subscriptions/<subscription>/resourceGroups/<resourceGroup>/providers/Microsoft.Compute/virtualMachineScaleSets/<vmss>/virtualMachines/1330 reports a failed provisioning state but is running (PowerState/running)"
"2023-05-17T19:43:47.135Z","Getting vm instance provisioning state Failed for /subscriptions/<subscription>/resourceGroups/<resourceGroup>/providers/Microsoft.Compute/virtualMachineScaleSets/<vmss>/virtualMachines/1330"

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

expand: "instanceView"

Setting the expand field of .virtualMachineScaleSetVMsClient.List to "instanceView" is required to retrieve the PowerState for the VMs. It looks like eventually using instanceView was referenced a few years ago when the provisioning state logic was first introduced. I've been running a build of the CA with this fix (and expand=instanceView) in some of our larger Azure clusters (1k nodes) without issue for over a week, and the net number of API calls the CA makes has not increased.

If there's some hesitancy towards setting expand=instanceView by default, it's probably something that could be configured via an environment variable flag (and would probably act as a pseudo-feature-gate for the fast backoff, as PowerState is assumed to be PowerState/running if the vm.InstanceView == nil, so in the event of a provisioning failure we'd always still report cloudprovider.InstanceRunning, which was the behavior before #5548) -- I'm open to exploring this if it's preferred!

PowerState

The azure-sdk-for-go does not directly expose the enumerations for the PowerState, even though there are official values for these in other languages like Java. I saw the legacy Azure cloud provider also had logic to parse the power state from an instanceView.Statuses, so I largely based this fix on that.

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


The provisioning state reflects the status of the last provisioning
action, which means the instance can enter a failed state after it's
running. Protect against unnecessary scaledowns by checking the
power state to avoid scaling down running VMs
@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/cluster-autoscaler labels May 17, 2023
@k8s-ci-robot k8s-ci-robot requested review from feiskyer and tallaxes May 17, 2023 20:05
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 17, 2023
@domenicbozzuto
Copy link
Contributor Author

/area provider/azure

@k8s-ci-robot k8s-ci-robot added the area/provider/azure Issues or PRs related to azure provider label May 17, 2023
@@ -35,6 +35,18 @@ import (
"github.com/Azure/go-autorest/autorest/azure"
)

// PowerStates reflect the operational state of a VM
// From https://learn.microsoft.com/en-us/java/api/com.microsoft.azure.management.compute.powerstate?view=azure-java-stable
const (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we move this to azure_util.go?
Those are where the potentially sharable constants are located.

Let's also keep the naming consistent and descriptive about power state being from VM, something like vmPowerStateStarting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing, moved the constants and related helper functions to azure_util.go. I also adjusted the naming of the helper functions to better reference that it's a VM power state.

klog.V(4).Infof("VM %s reports failed provisioning state with non-running power state: %s", resourceId, powerState)
status.State = cloudprovider.InstanceCreating
status.ErrorInfo = &cloudprovider.InstanceErrorInfo{
ErrorClass: cloudprovider.OutOfResourcesErrorClass,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should assumed all errors to be OutOfResourcesErrorClass. The case where other errors like #5548 could happen again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now the only two ErrorClass options are OutOfResourcesErrorClass and OtherErrorClass, and their value at this point is purely informational (there's no conditional logic anywhere in the cluster-autoscaler that actually acts on the value of ErrorClass).

I'm open to changing to OtherErrorClass, but IMO OutOfResourcesErrorClass makes a bit more sense to me given the cloud provider itself is reporting a provisioning issue. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point—let's keep it that way.


// PowerState is not set if the VM is still creating (or has failed creation),
// so the absence of a PowerState is treated the same as a VM that is stopped
return PowerStateStopped
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should let PowerStateUnknown be the default in this case. The assumption in the comment could change anytime from the updates on their side.
And since this case is not having a behavior difference between returning PowerStateUnknown and PowerStateStopped for now, the earlier one would be safer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I've updated to make vmPowerStateUnknown the default case -- thanks!

return powerState == PowerStateRunning || powerState == PowerStateStarting
}

func isKnownPowerState(powerState string) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think functions like this and isRunningPowerState() could be in azure_util.go too. In this way we could manage the definitions of each power state from the same place.

@domenicbozzuto domenicbozzuto force-pushed the bugfix-azure-prevent-unneeded-scaledown branch from bfda39b to faaf972 Compare July 5, 2023 13:41
* renames all PowerState* consts to vmPowerState*
* moves vmPowerState* consts and helper functions to azure_util.go
* changes default vmPowerState to vmPowerStateUnknown instead of vmPowerStateStopped when a power state is not set.
@domenicbozzuto domenicbozzuto force-pushed the bugfix-azure-prevent-unneeded-scaledown branch from faaf972 to dbff9be Compare July 5, 2023 13:45
@comtalyst
Copy link
Contributor

/lgtm

@k8s-ci-robot
Copy link
Contributor

@comtalyst: changing LGTM is restricted to collaborators

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tallaxes
Copy link
Contributor

tallaxes commented Jul 5, 2023

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 5, 2023
@tallaxes
Copy link
Contributor

tallaxes commented Jul 5, 2023

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: domenicbozzuto, tallaxes

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 5, 2023
@k8s-ci-robot k8s-ci-robot merged commit b569db4 into kubernetes:master Jul 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cluster-autoscaler area/provider/azure Issues or PRs related to azure provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants