
cluster-autoscaler still thinks node group has taints (after untaint) and refuses to scale back up from zero count #6452

Open
icelava opened this issue Jan 17, 2024 · 18 comments
Labels
area/cluster-autoscaler area/provider/aws Issues or PRs related to aws provider kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@icelava

icelava commented Jan 17, 2024

Which component are you using?: cluster-autoscaler

What version of the component are you using?: cluster-autoscaler

Component version: Helm chart 9.26.0, cluster-autoscaler 1.28.2

What k8s version are you using (kubectl version)?: 1.28

kubectl version Output
Client Version: v1.28.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.5-eks-5e0fdde

What environment is this in?: AWS EKS; managed node groups

What did you expect to happen?: The untainted node group should be able to launch nodes and schedule pods again.

What happened instead?: The autoscaler still thinks the node group has the long-gone tainted node and therefore won't launch another node instance, even though the taint has been removed from the node group.

Events:
  Type     Reason             Age                     From                Message
  ----     ------             ----                    ----                -------
  Normal   NotTriggerScaleUp  22m (x385 over 19h)     cluster-autoscaler  pod didn't trigger scale-up: 3 max node group size reached, 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {cost: true}
  Normal   NotTriggerScaleUp  12m (x179 over 19h)     cluster-autoscaler  pod didn't trigger scale-up: 3 max node group size reached, 1 node(s) had untolerated taint {cost: true}, 1 node(s) didn't match Pod's node affinity/selector
  Normal   NotTriggerScaleUp  7m11s (x1152 over 19h)  cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) had untolerated taint {cost: true}, 1 node(s) didn't match Pod's node affinity/selector, 3 max node group size reached
  Warning  FailedScheduling   5m (x161 over 19h)      default-scheduler   0/8 nodes are available: 8 node(s) didn't match Pod's node affinity/selector. preemption: 0/8 nodes are available: 8 Preemption is not helpful for scheduling..
  Normal   NotTriggerScaleUp  2m10s (x2508 over 19h)  cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {cost: true}, 3 max node group size reached
  Normal   TriggeredScaleUp   50s                     cluster-autoscaler  pod triggered scale-up: [{eks-ZeroNodes-cec688d4-02dd-c7f5-6bd1-be1a14735f61 0->1 (max: 1)}]
  Normal   Scheduled          3s                      default-scheduler   Successfully assigned default/nginx-test to ip-10-0-76-17.ap-southeast-1.compute.internal
  Normal   Pulling            2s                      kubelet             Pulling image "nginx:latest"

How to reproduce it (as minimally and precisely as possible):

Taint node group to evict pods

aws eks update-nodegroup-config --cluster-name eks-cluster --nodegroup-name ZeroNodes --taints addOrUpdateTaints={key=cost,value=true,effect=NO_EXECUTE}

Pods get evicted and the nodes are eventually terminated down to a zero count.

Untaint node group to re-host pods.

aws eks update-nodegroup-config --cluster-name eks-cluster --nodegroup-name ZeroNodes --taints removeTaints={key=cost,value=true,effect=NO_EXECUTE}

The pod remains in a perpetual Pending state, as per the events above, with the autoscaler thinking the old tainted node is still around and refusing to launch another node in its place (without the taint).

Anything else we need to know?:
Workaround is to deliberately kill the autoscaler pod, so a replacement autoscaler pod with no memory of the past can correctly auto-discover the node group and launch a node to host the pod, as per the events above.

It seems like the autoscaler hangs on to outdated historical data about terminated nodes in the node group ("1 node(s) had untolerated taint"). It should describe the node groups afresh to determine where to launch a new node.
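For comparison, describing the node group directly shows the fresh, taint-free state the autoscaler should be working from; a minimal check, assuming the same cluster and node group names as in the reproduction steps below:

aws eks describe-nodegroup --cluster-name eks-cluster --nodegroup-name ZeroNodes --query 'nodegroup.taints'
# expected to print null (or an empty list) once the taint has been removed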

@icelava icelava added the kind/bug Categorizes issue or PR as related to a bug. label Jan 17, 2024
@Shubham82
Contributor

/area provider/aws

@k8s-ci-robot k8s-ci-robot added the area/provider/aws Issues or PRs related to aws provider label Jan 18, 2024
@icelava
Author

icelava commented Jan 22, 2024

Additional note on the workaround. We essentially restart the autoscaler deployment as the last step of our automation workflow.

kubectl rollout restart deployment/aws-cluster-autoscaler -n kube-system
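If this runs as the last step of an automated workflow, it may also help to wait for the restarted pod to become ready before depending on it; a minimal sketch, assuming the same deployment name and namespace as above:

kubectl rollout status deployment/aws-cluster-autoscaler -n kube-system --timeout=120s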

@ivan-morhun

I was experiencing the same behavior when the Node Termination Handler marked the last node in the node pool with its taint and the ASG reached size 0. After this the Cluster Autoscaler is not able to scale it up because of:

2024-01-24T08:58:11+03:00 I0124 05:58:11.046721       1 orchestrator.go:546] Pod gitlab-runner/runner-eucgy1fpg-project-517-concurrent-1-f41lk8jv can't be scheduled on ciq-ci-gitlab-agents2023101009545070710000000e, predicate checking error: node(s) had untolerated taint {aws-node-termination-handler/asg-lifecycle-termination: asg-lifecycle-term-65653465366439612d303362612d373737302d353735}; predicateName=TaintToleration; reasons: node(s) had untolerated taint {aws-node-termination-handler/asg-lifecycle-termination: asg-lifecycle-term-65653465366439612d303362612d373737302d353735}; debugInfo=taints on node: []v1.Taint{v1.Taint{Key:"gitlab-agent", Value:"true", Effect:"NoSchedule", TimeAdded:<nil>}, v1.Taint{Key:"aws-node-termination-handler/asg-lifecycle-termination", Value:"asg-lifecycle-term-65653465366439612d303362612d373737302d353735", Effect:"NoExecute", TimeAdded:<nil>}}

CA pod restart helps

@linxcat

linxcat commented Feb 28, 2024

Confirming replication of this issue on a production cluster: Helm chart 9.25.0, cluster-autoscaler 1.24.0.

@nooperpudd

nooperpudd commented Apr 9, 2024

I have the same issue if I create mixed instance types in the same node group. When I keep only one instance type per node group, the taint feature works when the node group scales up from 0.

desiredCapacity: 0
minSize: 0
maxSize: 10 
tags:
  k8s.io/cluster-autoscaler/enabled: "true"
  k8s.io/cluster-autoscaler/blocknode: "owned"
  k8s.io/cluster-autoscaler/node-template/taint/xxx.com/name: "true:NoSchedule"

taints:
  - key: xxx.com/name
    value: "true"
    effect: NoSchedule

@ivan-morhun

Version 1.30.1 has the same issue.
A node was terminated last night, the ASG has desired capacity 0, but CA still "sees" the node in the cluster and doesn't scale up the ASG.

{"ts":1719909365710.1602,"caller":"orchestrator/orchestrator.go:565","msg":"Pod jenkins-aqa/aqa-build-agent-235-qjzfw-cw6lm can't be scheduled on ciq-ci-jenkins-aqa-tests-agents20230808050032196800000003, predicate checking error: node(s) had untolerated taint {aws-node-termination-handler/asg-lifecycle-termination: asg-lifecycle-term-34363838643263332d303665352d326261662d333838}; predicateName=TaintToleration; reasons: node(s) had untolerated taint {aws-node-termination-handler/asg-lifecycle-termination: asg-lifecycle-term-34363838643263332d303665352d326261662d333838}; debugInfo=taints on node: []v1.Taint{v1.Taint{Key:\"aws-node-termination-handler/asg-lifecycle-termination\", Value:\"asg-lifecycle-term-34363838643263332d303665352d326261662d333838\", Effect:\"NoExecute\", TimeAdded:<nil>}}","v":2}

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 30, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 30, 2024
@ivan-morhun

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Oct 30, 2024
@Shubham82
Contributor

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Oct 30, 2024
@mseiwald

Isn't this fixed by this?

@Shubham82
Contributor

Isn't this fixed by this?

Hi @icelava @ivan-morhun, could you please check with CA v1.31, as PR #6482 was merged in CA 1.31.0?
Please see the Changelog: https://github.com/kubernetes/autoscaler/releases/tag/cluster-autoscaler-1.31.0

@ivan-morhun

ivan-morhun commented Nov 21, 2024

I currently have exactly 1.31 in one cluster and the issue still exists.
CA still sees the ghost "dead" node with the taint from the Node Termination Handler.
I restart the CA hourly to "reset" its state...
Thanks @Shubham82

@Shubham82
Contributor

I currently have exactly 1.31 in one cluster and the issue still exists. CA sees the ghost "dead" node with toleration from the NodeTermination Handler I restart the CA hourly to "reset" its state... Thanks @Shubham82

Thanks @ivan-morhun for the confirmation.

@Shubham82
Contributor

cc @drmorr0
could you please take a look?

@squ94wk

squ94wk commented Dec 12, 2024

We encountered pretty much this exact same issue running CA 1.28.6 with the ClusterAPI "cloud provider".
Our MachineDeployment has the expected node taints but also the respective capacity.cluster-autoscaler.kubernetes.io/taints annotation for scaling up from zero.

We add a taint during node initialization between the node joining and CNI becoming ready.
CA still thinks nodes resulting from the MD will have this taint.
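For context, a hedged sketch of what such an annotation looks like on a Cluster API MachineDeployment; the resource name and taint below are illustrative placeholders, not taken from this cluster:

apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: workers                                     # placeholder name
  annotations:
    # comma-separated key=value:Effect entries CA uses to build the node template when the group is at zero
    capacity.cluster-autoscaler.kubernetes.io/taints: "example.com/role=worker:NoSchedule"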

@phiphi282

phiphi282 commented Dec 20, 2024

We managed to fix the issue with the Cluster API provider by setting the --status-taint flag for the autoscaler.
This way our transient taint gets ignored by the autoscaler when creating the node template.

There is also the --node-info-cache-expire-time flag, which has a default expiry time of 10 years. By setting this to a lower value you don't have to restart the pod anymore.
This also helps if the node group gets changed while scaled to 0, which would otherwise not be noticed by the autoscaler.

edit: the --status-taint flag got introduced in CA 1.29 :)
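For anyone applying this through the Helm chart, a minimal sketch of passing both flags, assuming the chart's extraArgs mechanism; the taint key matches the Node Termination Handler taint discussed above and the expiry value is only an example:

extraArgs:
  # taint keys listed here are treated as status taints and ignored when templating nodes
  status-taint: "aws-node-termination-handler/asg-lifecycle-termination"
  # re-describe cached node group info after 1 hour instead of the 10-year default
  node-info-cache-expire-time: "1h"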

@ivan-morhun

@phiphi282 Thanks for the advice.
The --status-taint flag doesn't suit us, whereas --node-info-cache-expire-time looks much better than the Jenkins job that restarts CA.
