
cluster-autoscaler still thinks node group has taints (after untaint) and refuses to scale back up from zero count #6452

Open
icelava opened this issue Jan 17, 2024 · 18 comments
Labels
area/cluster-autoscaler area/provider/aws Issues or PRs related to aws provider kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@icelava

icelava commented Jan 17, 2024

Which component are you using?: cluster-autoscaler

What version of the component are you using?: cluster-autoscaler

Component version: Helm chart 9.26.0, cluster-autoscaler 1.28.2

What k8s version are you using (kubectl version)?: 1.28

kubectl version Output
Client Version: v1.28.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.5-eks-5e0fdde

What environment is this in?: AWS EKS; managed node groups

What did you expect to happen?: The untainted node group should be able to launch nodes and schedule pods again.

What happened instead?: The autoscaler still thinks the node group has the long-gone tainted node and therefore won't launch another node instance, even though the taint has been removed from the node group.

Events:
  Type     Reason             Age                     From                Message
  ----     ------             ----                    ----                -------
  Normal   NotTriggerScaleUp  22m (x385 over 19h)     cluster-autoscaler  pod didn't trigger scale-up: 3 max node group size reached, 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {cost: true}
  Normal   NotTriggerScaleUp  12m (x179 over 19h)     cluster-autoscaler  pod didn't trigger scale-up: 3 max node group size reached, 1 node(s) had untolerated taint {cost: true}, 1 node(s) didn't match Pod's node affinity/selector
  Normal   NotTriggerScaleUp  7m11s (x1152 over 19h)  cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) had untolerated taint {cost: true}, 1 node(s) didn't match Pod's node affinity/selector, 3 max node group size reached
  Warning  FailedScheduling   5m (x161 over 19h)      default-scheduler   0/8 nodes are available: 8 node(s) didn't match Pod's node affinity/selector. preemption: 0/8 nodes are available: 8 Preemption is not helpful for scheduling..
  Normal   NotTriggerScaleUp  2m10s (x2508 over 19h)  cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {cost: true}, 3 max node group size reached
  Normal   TriggeredScaleUp   50s                     cluster-autoscaler  pod triggered scale-up: [{eks-ZeroNodes-cec688d4-02dd-c7f5-6bd1-be1a14735f61 0->1 (max: 1)}]
  Normal   Scheduled          3s                      default-scheduler   Successfully assigned default/nginx-test to ip-10-0-76-17.ap-southeast-1.compute.internal
  Normal   Pulling            2s                      kubelet             Pulling image "nginx:latest"

How to reproduce it (as minimally and precisely as possible):

Taint node group to evict pods

aws eks update-nodegroup-config --cluster-name eks-cluster --nodegroup-name ZeroNodes --taints addOrUpdateTaints={key=cost,value=true,effect=NO_EXECUTE}

Pods get evicted and the nodes are eventually terminated down to a zero count.

Untaint node group to re-host pods.

aws eks update-nodegroup-config --cluster-name eks-cluster --nodegroup-name ZeroNodes --taints removeTaints={key=cost,value=true,effect=NO_EXECUTE}

The pod remains in a perpetual Pending state, as per the events above, with the autoscaler thinking the old tainted node is still around and refusing to launch another node in its place (without the taint).

Anything else we need to know?:
Workaround is to deliberately kill the autoscaler pod, so a replacement autoscaler pod with no memory of the past can correctly auto-discover the node group and launch a node to host the pod, as per the events above.

It seems like the autoscaler hangs on to outdated historical data about terminated nodes in the node group ("1 node(s) had untolerated taint"). It should describe the node groups afresh to determine where to launch a new node.
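For comparison, describing the node group directly shows the fresh, taint-free state the autoscaler should be working from; a minimal check, assuming the same cluster and node group names as in the reproduction steps below:

aws eks describe-nodegroup --cluster-name eks-cluster --nodegroup-name ZeroNodes --query 'nodegroup.taints'
# expected to print null (or an empty list) once the taint has been removed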

@icelava icelava added the kind/bug Categorizes issue or PR as related to a bug. label Jan 17, 2024
@Shubham82
Contributor

/area provider/aws

@k8s-ci-robot k8s-ci-robot added the area/provider/aws Issues or PRs related to aws provider label Jan 18, 2024
@icelava
Author

icelava commented Jan 22, 2024

Additional note on the workaround. We essentially restart the autoscaler deployment as the last step of our automation workflow.

kubectl rollout restart deployment/aws-cluster-autoscaler -n kube-system
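If this runs as the last step of an automated workflow, it may also help to wait for the restarted pod to become ready before depending on it; a minimal sketch, assuming the same deployment name and namespace as above:

kubectl rollout status deployment/aws-cluster-autoscaler -n kube-system --timeout=120s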

@ivan-morhun

I was experiencing the same behavior when the Node Termination Handler marked the last node in the node pool with its taint and the ASG reached size 0. After this the Cluster Autoscaler is not able to scale it up because of:

2024-01-24T08:58:11+03:00 I0124 05:58:11.046721       1 orchestrator.go:546] Pod gitlab-runner/runner-eucgy1fpg-project-517-concurrent-1-f41lk8jv can't be scheduled on ciq-ci-gitlab-agents2023101009545070710000000e, predicate checking error: node(s) had untolerated taint {aws-node-termination-handler/asg-lifecycle-termination: asg-lifecycle-term-65653465366439612d303362612d373737302d353735}; predicateName=TaintToleration; reasons: node(s) had untolerated taint {aws-node-termination-handler/asg-lifecycle-termination: asg-lifecycle-term-65653465366439612d303362612d373737302d353735}; debugInfo=taints on node: []v1.Taint{v1.Taint{Key:"gitlab-agent", Value:"true", Effect:"NoSchedule", TimeAdded:<nil>}, v1.Taint{Key:"aws-node-termination-handler/asg-lifecycle-termination", Value:"asg-lifecycle-term-65653465366439612d303362612d373737302d353735", Effect:"NoExecute", TimeAdded:<nil>}}

CA pod restart helps

@linxcat

linxcat commented Feb 28, 2024

Confirming replication of this issue on a production cluster: Helm chart 9.25.0, cluster-autoscaler 1.24.0.

@nooperpudd

nooperpudd commented Apr 9, 2024

I have the same issue if I create mixed instance types in the same node group. When I keep only one instance type per node group, the taint feature works when the node group scales up from 0.

desiredCapacity: 0
minSize: 0
maxSize: 10 
tags:
  k8s.io/cluster-autoscaler/enabled: "true"
  k8s.io/cluster-autoscaler/blocknode: "owned"
  k8s.io/cluster-autoscaler/node-template/taint/xxx.com/name: "true:NoSchedule"

taints:
  - key: xxx.com/name
    value: "true"
    effect: NoSchedule

@ivan-morhun

Version 1.30.1 has the same issue.
A node was terminated last night, the ASG has desired capacity 0, but CA still "sees" the node in the cluster and doesn't scale up the ASG.

{"ts":1719909365710.1602,"caller":"orchestrator/orchestrator.go:565","msg":"Pod jenkins-aqa/aqa-build-agent-235-qjzfw-cw6lm can't be scheduled on ciq-ci-jenkins-aqa-tests-agents20230808050032196800000003, predicate checking error: node(s) had untolerated taint {aws-node-termination-handler/asg-lifecycle-termination: asg-lifecycle-term-34363838643263332d303665352d326261662d333838}; predicateName=TaintToleration; reasons: node(s) had untolerated taint {aws-node-termination-handler/asg-lifecycle-termination: asg-lifecycle-term-34363838643263332d303665352d326261662d333838}; debugInfo=taints on node: []v1.Taint{v1.Taint{Key:\"aws-node-termination-handler/asg-lifecycle-termination\", Value:\"asg-lifecycle-term-34363838643263332d303665352d326261662d333838\", Effect:\"NoExecute\", TimeAdded:<nil>}}","v":2}

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 30, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 30, 2024
@ivan-morhun

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Oct 30, 2024
@Shubham82
Contributor

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Oct 30, 2024
@mseiwald

Isn't this fixed by this?

@Shubham82
Contributor

Isn't this fixed by this?

Hi @icelava @ivan-morhun, could you please check with CA v1.31, as PR #6482 was merged in CA 1.31.0?
Please see the Changelog: https://github.com/kubernetes/autoscaler/releases/tag/cluster-autoscaler-1.31.0

@ivan-morhun

ivan-morhun commented Nov 21, 2024

I currently have exactly 1.31 in one cluster and the issue still exists.
CA still sees the ghost "dead" node with the taint from the Node Termination Handler.
I restart the CA hourly to "reset" its state...
Thanks @Shubham82

@Shubham82
Contributor

I currently have exactly 1.31 in one cluster and the issue still exists. CA sees the ghost "dead" node with toleration from the NodeTermination Handler I restart the CA hourly to "reset" its state... Thanks @Shubham82

Thanks @ivan-morhun for the confirmation.

@Shubham82
Contributor

cc @drmorr0
could you please take a look?

@squ94wk

squ94wk commented Dec 12, 2024

We encountered pretty much this exact same issue running CA 1.28.6 with the ClusterAPI "cloud provider".
Our MachineDeployment has the expected node taints but also the respective capacity.cluster-autoscaler.kubernetes.io/taints annotation for scaling up from zero.

We add a taint during node initialization between the node joining and CNI becoming ready.
CA still thinks nodes resulting from the MD will have this taint.
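For context, a hedged sketch of what such an annotation looks like on a Cluster API MachineDeployment; the resource name and taint below are illustrative placeholders, not taken from this cluster:

apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: workers                                     # placeholder name
  annotations:
    # comma-separated key=value:Effect entries CA uses to build the node template when the group is at zero
    capacity.cluster-autoscaler.kubernetes.io/taints: "example.com/role=worker:NoSchedule"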

@phiphi282

phiphi282 commented Dec 20, 2024

We managed to fix the issue with the Cluster API provider by setting the --status-taint flag for the autoscaler.
This way our transient taint gets ignored by the autoscaler when creating the node template.

There is also the --node-info-cache-expire-time flag, which has a default expiry time of 10 years. By setting this to a lower value you don't have to restart the pod anymore.
This also helps if the node group gets changed while scaled to 0, which would otherwise not be noticed by the autoscaler.

edit: the --status-taint flag got introduced in CA 1.29 :)
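For anyone applying this through the Helm chart, a minimal sketch of passing both flags, assuming the chart's extraArgs mechanism; the taint key matches the Node Termination Handler taint discussed above and the expiry value is only an example:

extraArgs:
  # taint keys listed here are treated as status taints and ignored when templating nodes
  status-taint: "aws-node-termination-handler/asg-lifecycle-termination"
  # re-describe cached node group info after 1 hour instead of the 10-year default
  node-info-cache-expire-time: "1h"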

@ivan-morhun

@phiphi282 Thanks for the advice.
The --status-taint flag doesn't suit us, whereas --node-info-cache-expire-time looks much better than the Jenkins job that restarts CA.
