[BUG] Cluster autoscaler bug requires Azure specific implementation to resolve #4286
Labels
action-required
bug
cluster-autoscaler
Scale and Performance
Describe the bug
There is an issue in cluster-autoscaler described in kubernetes/autoscaler#4456
which was fixed for some cloud providers but which requires an implementation of the HasInstance method of the AzureCloudProvider to be fixed on AKS.
The gist of the issue: when cluster-autoscaler scales down a node, pods can prevent the node from being completely drained and removed (e.g. due to long termination grace periods). The node is then left in a state where cluster-autoscaler still counts it towards the number of available nodes and so does not scale up a new node, yet new pods cannot be scheduled on it because it is tainted with ToBeDeletedByClusterAutoscaler. As a result, pods get stuck in Pending, and cluster-autoscaler neither scales up a new node for them nor cancels the scale-down of the tainted node.
For some more background:
Fixes for this issue were attempted in kubernetes/autoscaler#4211 and kubernetes/autoscaler#4896, then reverted in kubernetes/autoscaler#5023, and the issue was fixed again in kubernetes/autoscaler#5054. However, this fix does not work for AKS, since it relies on cloud-provider-specific implementation details for cluster-autoscaler to know whether a node actually exists or not, based on this comment: kubernetes/autoscaler#5054 (comment)
In order for this fix to work correctly on AKS the following needs to be implemented:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/azure/azure_cloud_provider.go#L125
similar to how this was implemented on AWS here: kubernetes/autoscaler#5632
To Reproduce
See linked cluster-autoscaler issues
Expected behavior
Cluster autoscaler should be able to use the HasInstance method to determine if the node exists on AKS rather than falling back to the broken logic that relies on the ToBeDeletedByClusterAutoscaler taint
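The difference between the two code paths can be sketched roughly as follows. This is an illustration under assumptions, not the actual cluster-autoscaler logic. The names `instanceChecker`, `instanceExists`, `errNotImplemented` and the stub providers are hypothetical. The fallback is assumed to infer the instance state from the taint alone, which means it cannot tell a node that is still draining from one whose VM is already gone.

```go
package main

import (
	"errors"
	"fmt"
)

// Taint name taken from the issue text.
const toBeDeletedTaint = "ToBeDeletedByClusterAutoscaler"

// errNotImplemented is a hypothetical sentinel for providers that lack HasInstance.
var errNotImplemented = errors.New("not implemented")

// Node is a hypothetical stand-in for the Kubernetes Node object.
type Node struct {
	Name   string
	Taints []string
}

// instanceChecker is a hypothetical slice of the cloud provider interface.
type instanceChecker interface {
	HasInstance(node *Node) (bool, error)
}

// unimplementedProvider models a provider without a HasInstance implementation.
type unimplementedProvider struct{}

func (unimplementedProvider) HasInstance(*Node) (bool, error) {
	return false, errNotImplemented
}

// staticProvider models a provider that knows whether the VM really exists.
type staticProvider struct{ exists bool }

func (s staticProvider) HasInstance(*Node) (bool, error) { return s.exists, nil }

func hasTaint(node *Node, key string) bool {
	for _, t := range node.Taints {
		if t == key {
			return true
		}
	}
	return false
}

// instanceExists prefers the provider's answer. On any error it falls back
// to the taint, assuming (for this sketch) that a tainted node is gone.
func instanceExists(cp instanceChecker, node *Node) bool {
	exists, err := cp.HasInstance(node)
	if err == nil {
		return exists
	}
	return !hasTaint(node, toBeDeletedTaint)
}

func main() {
	// A node that is tainted but still draining, so its VM still exists.
	draining := &Node{Name: "node-0", Taints: []string{toBeDeletedTaint}}

	fmt.Println("taint fallback:", instanceExists(unimplementedProvider{}, draining))
	fmt.Println("provider check:", instanceExists(staticProvider{exists: true}, draining))
}
```

With a working HasInstance, the answer no longer depends on the taint, which is the behaviour this issue asks for on AKS.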
Environment (please complete the following information):
Affects all recent AKS versions as far as I am aware. We are seeing this on 1.27 specifically
Additional context
N/A