-
Notifications
You must be signed in to change notification settings - Fork 4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Azure: Bugfix: Check PowerState before setting OutOfResources on instance #5767
Azure: Bugfix: Check PowerState before setting OutOfResources on instance #5767
Conversation
The provisioning state reflects the status of the last provisioning action, which means the instance can enter a failed state after it's running. Protect against unnecessary scaledowns by checking the power state to avoid scaling down running VMs
/area provider/azure |
@@ -35,6 +35,18 @@ import ( | |||
"github.com/Azure/go-autorest/autorest/azure" | |||
) | |||
|
|||
// PowerStates reflect the operational state of a VM | |||
// From https://learn.microsoft.com/en-us/java/api/com.microsoft.azure.management.compute.powerstate?view=azure-java-stable | |||
const ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we move this to azure_util.go
?
Those are where the potentially sharable constants are located.
Let's also keep the naming consistent and descriptive about power state being from VM, something like vmPowerStateStarting
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure thing, moved the constants and related helper functions to azure_util.go
. I also adjusted the naming of the helper functions to better reference that it's a VM power state.
klog.V(4).Infof("VM %s reports failed provisioning state with non-running power state: %s", resourceId, powerState) | ||
status.State = cloudprovider.InstanceCreating | ||
status.ErrorInfo = &cloudprovider.InstanceErrorInfo{ | ||
ErrorClass: cloudprovider.OutOfResourcesErrorClass, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should assumed all errors to be OutOfResourcesErrorClass
. The case where other errors like #5548 could happen again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now the only two ErrorClass
options are OutOfResourcesErrorClass
and OtherErrorClass
, and their value at this point is purely informational (there's no conditional logic anywhere in the cluster-autoscaler that actually acts on the value of ErrorClass
).
I'm open to changing to OtherErrorClass
, but IMO OutOfResourcesErrorClass
makes a bit more sense to me given the cloud provider itself is reporting a provisioning issue. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point—let's keep it that way.
|
||
// PowerState is not set if the VM is still creating (or has failed creation), | ||
// so the absence of a PowerState is treated the same as a VM that is stopped | ||
return PowerStateStopped |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should let PowerStateUnknown
be the default in this case. The assumption in the comment could change anytime from the updates on their side.
And since this case is not having a behavior difference between returning PowerStateUnknown
and PowerStateStopped
for now, the earlier one would be safer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, I've updated to make vmPowerStateUnknown
the default case -- thanks!
return powerState == PowerStateRunning || powerState == PowerStateStarting | ||
} | ||
|
||
func isKnownPowerState(powerState string) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think functions like this and isRunningPowerState()
could be in azure_util.go
too. In this way we could manage the definitions of each power state from the same place.
bfda39b
to
faaf972
Compare
* renames all PowerState* consts to vmPowerState* * moves vmPowerState* consts and helper functions to azure_util.go * changes default vmPowerState to vmPowerStateUnknown instead of vmPowerStateStopped when a power state is not set.
faaf972
to
dbff9be
Compare
/lgtm |
@comtalyst: changing LGTM is restricted to collaborators In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/lgtm |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: domenicbozzuto, tallaxes The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What type of PR is this?
/kind bug
What this PR does / why we need it:
In #5548, I had added support for faster backoff on a failed provisioning state. We recently found a bug where a many simultaneous updates to the network profiles of a running VMs in a VMSS could sometimes fail with a
ProvisioningState/failed/NetworkingInternalOperationError
error. This would be reflected in thevm.ProvisioningState
(which represents the status of the last provisioning operation on the VM), and would result in the VMSS size being decreased and the running VM instance being removed without first properly draining the running pods on the node.This PR adds behavior to also consider the PowerState of the VM when determining if an OutOfResources error should be created -- it now only sets the OutOfResource error if the provisioning state is failed AND the instance is not running. I extended the test case I added in #5548 with several more configurations to ensure the power state behavior should be correct.
Reproduction with Fix
I was able to reproduce the issue and captured a situation where one of the existing VMs in a VMSS was running and got updated to a ProvisioningFailed state at the same time as a ProvisioningFailure on a newly created VM. The existing VM is reported as
cloudprovider.InstanceRunning
but the new instance triggers the provisioning failure codepath that produces anOutOfResources
error.Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
expand: "instanceView"
Setting the
expand
field of.virtualMachineScaleSetVMsClient.List
to"instanceView"
is required to retrieve the PowerState for the VMs. It looks like eventually usinginstanceView
was referenced a few years ago when the provisioning state logic was first introduced. I've been running a build of the CA with this fix (andexpand=instanceView
) in some of our larger Azure clusters (1k nodes) without issue for over a week, and the net number of API calls the CA makes has not increased.If there's some hesitancy towards setting
expand=instanceView
by default, it's probably something that could be configured via an environment variable flag (and would probably act as a pseudo-feature-gate for the fast backoff, asPowerState
is assumed to bePowerState/running
if thevm.InstanceView == nil
, so in the event of a provisioning failure we'd always still reportcloudprovider.InstanceRunning
, which was the behavior before #5548) -- I'm open to exploring this if it's preferred!PowerState
The azure-sdk-for-go does not directly expose the enumerations for the PowerState, even though there are official values for these in other languages like Java. I saw the legacy Azure cloud provider also had logic to parse the power state from an instanceView.Statuses, so I largely based this fix on that.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: