cluster-autoscaler still thinks node group has taints (after untaint) and refuses to scale back up from zero count #6452
/area provider/aws
Additional note on the workaround. We essentially restart the autoscaler deployment as the last step of our automation workflow.
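For anyone scripting the same workaround, a minimal sketch (the deployment name and namespace are placeholders; adjust them to whatever your helm release created):

# Restart the cluster-autoscaler deployment so it rebuilds its view of the node groups
kubectl -n kube-system rollout restart deployment/cluster-autoscaler
# Wait for the replacement pod to be ready before continuing the workflow
kubectl -n kube-system rollout status deployment/cluster-autoscaler --timeout=120s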
I was experiencing the same behavior when the Node Termination Handler marked the last node in the node pool with its taint and the ASG reached 0 size. After this, Cluster Autoscaler is not able to scale it up. A CA pod restart helps.
Confirming replication of this issue on a production cluster: helm chart 9.25.0, cluster-autoscaler 1.24.0.
I have the same issue if I create mixed instance types in the same node group. After keeping only one instance type per node group, the taint feature works when the node scales up from 0.
desiredCapacity: 0
minSize: 0
maxSize: 10
tags:
  k8s.io/cluster-autoscaler/enabled: "true"
  k8s.io/cluster-autoscaler/blocknode: "owned"
  k8s.io/cluster-autoscaler/node-template/taint/xxx.com/name: "true:NoSchedule"
taints:
  - key: xxx.com/name
    value: "true"
    effect: NoSchedule
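For scale-from-zero, the AWS provider reads the taint from the k8s.io/cluster-autoscaler/node-template/taint/... tag on the Auto Scaling group itself. If that tag did not make it onto the ASG, it can be applied and checked directly; a sketch, where my-nodegroup-asg is a placeholder and xxx.com/name is the taint key from the config above:

# Tell the autoscaler that nodes from this (currently empty) ASG will carry the taint
aws autoscaling create-or-update-tags --tags \
  ResourceId=my-nodegroup-asg,ResourceType=auto-scaling-group,PropagateAtLaunch=false,Key=k8s.io/cluster-autoscaler/node-template/taint/xxx.com/name,Value=true:NoSchedule
# Verify which node-template tags the ASG currently carries
aws autoscaling describe-tags --filters Name=auto-scaling-group,Values=my-nodegroup-asg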
Version 1.30.1 has the same issue.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
/lifecycle frozen
Isn't this fixed by this?
Hi @icelava @ivan-morhun, could you please check with CA v1.31? PR #6482 is merged in CA 1.31.0.
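For reference, one way to confirm which CA image a cluster is actually running (deployment name and namespace are placeholders; adjust to your helm release):

# Print the image tag of the running cluster-autoscaler deployment
kubectl -n kube-system get deployment/cluster-autoscaler \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'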
I currently have exactly 1.31 in one cluster and the issue still exists.
Thanks @ivan-morhun for the confirmation.
cc @drmorr0
We encountered pretty much this exact same issue running CA 1.28.6 with the ClusterAPI "cloud provider". We add a taint during node initialization between the node joining and CNI becoming ready. |
We managed to fix the issue with the ClusterAPI provider by setting the … There also is the … edit: the …
@phiphi282 Thanks for the advice. |
Which component are you using?: cluster-autoscaler
What version of the component are you using?: cluster-autoscaler
Component version: helm chart 9.26.0, cluster-autoscaler 1.28.2
What k8s version are you using (kubectl version)?: 1.28
What environment is this in?: AWS EKS; managed node groups
What did you expect to happen?: Untainted node group should be able to launch nodes and schedule pods again.
What happened instead?: The autoscaler still thinks the node group has the long-gone tainted node and thus won't launch another node instance, even though the taint has been removed from the node group.
How to reproduce it (as minimally and precisely as possible):
Taint the node group to evict pods:
aws eks update-nodegroup-config --cluster-name eks-cluster --nodegroup-name ZeroNodes --taints addOrUpdateTaints={key=cost,value=true,effect=NO_EXECUTE}
Pods get evicted and nodes eventually terminated to zero count.
Untaint the node group to re-host pods:
aws eks update-nodegroup-config --cluster-name eks-cluster --nodegroup-name ZeroNodes --taints removeTaints={key=cost,value=true,effect=NO_EXECUTE}
Pod remains in a perpetual Pending state, as per the events above, with the autoscaler thinking the old tainted node is still around and refusing to launch another node in its place (without the taint).
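A sanity check against the EKS API (not part of the original report) can confirm the untaint actually took effect on the managed node group, using the cluster and node group names from the commands above:

# Should show no remaining taints once the NO_EXECUTE taint has been removed
aws eks describe-nodegroup --cluster-name eks-cluster --nodegroup-name ZeroNodes \
  --query 'nodegroup.taints' --output json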
Anything else we need to know?:
Workaround is to deliberately kill the autoscaler pod, so the replacement autoscaler pod, with no memory of the past, can correctly auto-discover the node group and launch a node to host the pod, as per the events above.
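A rough sketch of that workaround (the label selector assumes the helm chart's default labels for the AWS provider; adjust to match your deployment):

# Delete the running autoscaler pod; the Deployment recreates it with a clean cache
kubectl -n kube-system delete pod -l app.kubernetes.io/name=aws-cluster-autoscaler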
It seems like the autoscaler hangs on to outdated historical data about terminated nodes in the node group ("1 node(s) had untolerated taint"). It should describe the node groups afresh to determine where to launch a new node.
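The easiest place to see that stale reasoning is in the pending pod's events and the autoscaler's own logs; a sketch, with a hypothetical pod name and the default deployment name assumed:

# Events on the pending pod include cluster-autoscaler's scale-up verdict,
# e.g. "pod didn't trigger scale-up: 1 node(s) had untolerated taint"
kubectl describe pod my-pending-pod
# Grep the autoscaler logs for the cached taint it is still scaling against
kubectl -n kube-system logs deployment/cluster-autoscaler --tail=500 | grep -i taint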