What happened:
Nodes are being marked as deletion candidates and the deletion candidate taints are later released, but the nodes keep a SchedulingDisabled status even after they are no longer being considered for deletion. This prevents pods from being scheduled onto those nodes and wastes operational cost.
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ip-10-2-34-6.ap-southeast-2.compute.internal Ready,SchedulingDisabled <none> 25m v1.14.9-eks-cc7316 10.2.34.6 <none> Amazon Linux 2 4.14.198-152.320.amzn2.x86_64 docker://19.3.6
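For reference, the listing above is plain kubectl output; the SchedulingDisabled status corresponds to spec.unschedulable being set on the node, which can be checked directly (a minimal check using the node name from this report):
# SchedulingDisabled shows up in the STATUS column for cordoned nodes
kubectl get nodes -o wide
# prints "true" while the node is cordoned
kubectl get node ip-10-2-34-6.ap-southeast-2.compute.internal -o jsonpath='{.spec.unschedulable}'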
Here are the cluster-autoscaler logs for this node, where we can see it being marked as a deletion candidate and then removed from consideration:
I1124 00:45:56.966552 1 node_tree.go:86] Added node "ip-10-2-34-6.ap-southeast-2.compute.internal" in group "ap-southeast-2:\x00:ap-southeast-2c" to NodeTree
I1124 00:45:59.581289 1 scale_down.go:462] Node ip-10-2-34-6.ap-southeast-2.compute.internal - cpu utilization 0.095855
I1124 00:45:59.582031 1 static_autoscaler.go:428] ip-10-2-34-6.ap-southeast-2.compute.internal is unneeded since 2020-11-24 00:45:59.579457067 +0000 UTC m=+84989.989842946 duration 0s
I1124 00:45:59.582134 1 scale_down.go:716] ip-10-2-34-6.ap-southeast-2.compute.internal was unneeded for 0s
I1124 00:45:59.596212 1 delete.go:102] Successfully added DeletionCandidateTaint on node ip-10-2-34-6.ap-southeast-2.compute.internal
I1124 00:46:09.621506 1 scale_down.go:462] Node ip-10-2-34-6.ap-southeast-2.compute.internal - cpu utilization 0.095855
I1124 00:46:09.622086 1 static_autoscaler.go:428] ip-10-2-34-6.ap-southeast-2.compute.internal is unneeded since 2020-11-24 00:45:59.579457067 +0000 UTC m=+84989.989842946 duration 10.040258663s
I1124 00:46:09.622164 1 scale_down.go:716] ip-10-2-34-6.ap-southeast-2.compute.internal was unneeded for 10.040258663s
I1124 00:46:19.637702 1 scale_down.go:462] Node ip-10-2-34-6.ap-southeast-2.compute.internal - cpu utilization 0.095855
I1124 00:46:19.638261 1 static_autoscaler.go:428] ip-10-2-34-6.ap-southeast-2.compute.internal is unneeded since 2020-11-24 00:45:59.579457067 +0000 UTC m=+84989.989842946 duration 20.056288029s
I1124 00:46:19.638410 1 scale_down.go:716] ip-10-2-34-6.ap-southeast-2.compute.internal was unneeded for 20.056288029s
I1124 00:46:29.654060 1 scale_down.go:462] Node ip-10-2-34-6.ap-southeast-2.compute.internal - cpu utilization 0.795337
I1124 00:46:29.654065 1 scale_down.go:466] Node ip-10-2-34-6.ap-southeast-2.compute.internal is not suitable for removal - cpu utilization too big (0.795337)
I1124 00:46:29.654674 1 delete.go:192] Releasing taint {Key:DeletionCandidateOfClusterAutoscaler Value:1606178759 Effect:PreferNoSchedule TimeAdded:<nil>} on node ip-10-2-34-6.ap-southeast-2.compute.internal
The affected nodes no longer have the DeletionCandidateOfClusterAutoscaler taint (my suspicion is that CA added the taint and then removed it, as in the log above). We can also see that some nodes keep the SchedulingDisabled status for up to 6 hours.
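The remaining taints can be read straight off the node object (plain kubectl, nothing cluster-autoscaler-specific):
# an empty result means no taints, i.e. DeletionCandidateOfClusterAutoscaler is gone
kubectl get node ip-10-2-34-6.ap-southeast-2.compute.internal -o jsonpath='{.spec.taints}'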
What you expected to happen:
I expect the nodes to either become schedulable again (plain Ready, without SchedulingDisabled) or be deleted.
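As a manual workaround, uncordoning the node clears the SchedulingDisabled status and lets pods schedule onto it again (assuming nothing re-cordons it afterwards); it does not address the underlying cause:
# clears spec.unschedulable on the node
kubectl uncordon ip-10-2-34-6.ap-southeast-2.compute.internal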
How to reproduce it (as minimally and precisely as possible):
On a Kubernetes 1.17 EKS cluster, launch a worker node using the amazon-eks-node-1.14-v20201007 AMI, with cluster-autoscaler running and the following user-data:
Anything else we need to know?:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 40m kubelet Starting kubelet.
Normal NodeHasSufficientMemory 40m (x2 over 40m) kubelet Node ip-10-2-34-6.ap-southeast-2.compute.internal status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 40m (x2 over 40m) kubelet Node ip-10-2-34-6.ap-southeast-2.compute.internal status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 40m (x2 over 40m) kubelet Node ip-10-2-34-6.ap-southeast-2.compute.internal status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 40m kubelet Updated Node Allocatable limit across pods
Normal Starting 39m kube-proxy Starting kube-proxy.
Normal NodeNotSchedulable 39m kubelet Node ip-10-2-34-6.ap-southeast-2.compute.internal status is now: NodeNotSchedulable
Normal NodeReady 39m kubelet Node ip-10-2-34-6.ap-southeast-2.compute.internal status is now: NodeReady
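The events above come from the node description, which also shows the Taints and Unschedulable fields discussed earlier (standard kubectl):
# Taints, Unschedulable and Events are all visible in the describe output
kubectl describe node ip-10-2-34-6.ap-southeast-2.compute.internal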
I am running cluster-autoscaler as a deployment with 2 replicas, using the following image:
k8s.gcr.io/autoscaling/cluster-autoscaler:v1.17.4
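The image actually running can be cross-checked against the deployment spec (a sketch; I am assuming the deployment is named cluster-autoscaler and lives in kube-system):
# prints the image of the first container in the deployment
kubectl -n kube-system get deployment cluster-autoscaler -o jsonpath='{.spec.template.spec.containers[0].image}'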
Contents of the cluster-autoscaler-status ConfigMap:
status: |+
  Cluster-autoscaler status at 2020-11-24 01:28:12.471105109 +0000 UTC:
  Cluster-wide:
    Health:    Healthy (ready=9 unready=0 notStarted=0 longNotStarted=0 registered=9 longUnregistered=0)
               LastProbeTime:      2020-11-24 01:28:12.467836437 +0000 UTC m=+87522.878222347
               LastTransitionTime: 2020-11-23 01:21:08.047368651 +0000 UTC m=+698.457754608
    ScaleUp:   NoActivity (ready=9 registered=9)
               LastProbeTime:      2020-11-24 01:28:12.467836437 +0000 UTC m=+87522.878222347
               LastTransitionTime: 2020-11-24 00:54:11.86357796 +0000 UTC m=+85482.273963885
    ScaleDown: NoCandidates (candidates=0)
               LastProbeTime:      2020-11-24 01:28:12.467836437 +0000 UTC m=+87522.878222347
               LastTransitionTime: 2020-11-24 00:54:11.86357796 +0000 UTC m=+85482.273963885

  NodeGroups:
    Name:      tf-asg-20200525114556707500000001
    Health:    Healthy (ready=9 unready=0 notStarted=0 longNotStarted=0 registered=9 longUnregistered=0 cloudProviderTarget=9 (minSize=2, maxSize=12))
               LastProbeTime:      2020-11-24 01:28:12.467836437 +0000 UTC m=+87522.878222347
               LastTransitionTime: 2020-11-23 01:21:08.047368651 +0000 UTC m=+698.457754608
    ScaleUp:   NoActivity (ready=9 cloudProviderTarget=9)
               LastProbeTime:      2020-11-24 01:28:12.467836437 +0000 UTC m=+87522.878222347
               LastTransitionTime: 2020-11-24 00:54:11.86357796 +0000 UTC m=+85482.273963885
    ScaleDown: NoCandidates (candidates=0)
               LastProbeTime:      2020-11-24 01:28:12.467836437 +0000 UTC m=+87522.878222347
               LastTransitionTime: 2020-11-24 00:54:11.86357796 +0000 UTC m=+85482.273963885
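The status above was pulled from the ConfigMap that cluster-autoscaler maintains (again assuming it runs in kube-system):
# the status text lives under the "status" key of the ConfigMap data
kubectl -n kube-system get configmap cluster-autoscaler-status -o jsonpath='{.data.status}'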
Environment: