
Taints on EKS Managed Node Groups scaled to zero not read correctly, causing scale-up of unsuitable nodes #6481

Closed
abstrask opened this issue Jan 30, 2024 · 4 comments · Fixed by #6482
Labels: area/cluster-autoscaler, area/provider/aws, kind/bug

Comments


abstrask commented Jan 30, 2024

Which component are you using?:

cluster-autoscaler

What version of the component are you using?:

1.27.5, but reproduced out-of-cluster off the main branch (as of 28 January 2024)

Component version:

What k8s version are you using (kubectl version)?:

$ kubectl version

Client Version: v1.29.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.4-eks-8cb36c9

What environment is this in?:

AWS EKS using multiple Managed Node Groups. Some of these are used for batch jobs, and will scale to zero during idle periods.

If/when we reach the node group's maximum size, it is also quite acceptable to us for pods to remain pending until previous pods have completed.

What did you expect to happen?:

When EKS Managed Node Groups are scaled to zero, so there is no Kubernetes node to read taints from, taints should be read correctly from the Managed Node Group itself.

This avoids scaling up nodes that pending pods cannot be scheduled on anyway.

This should be supported as of this commit: [AWS EKS - Scale-to-0] Add Managed Nodegroup Cache.

In the Cluster Autoscaler log, we should see "predicate checking error: node(s) had untolerated taint" for each node group with untolerated taints.

What happened instead?:

In the Cluster Autoscaler log, we see no "predicate checking error: node(s) had untolerated taint" entries (for node groups with untolerated taints), but we do see "predicate checking error: node(s) didn't match Pod's node affinity/selector", so node labels are read correctly.

When evaluating node groups to scale up, and there are no running nodes to query, taints are read from the Auto-Scaling Group and the Managed Node Group (if applicable).

Kubernetes operates with the taint effects "NoSchedule", "PreferNoSchedule" and "NoExecute" (as per Kubernetes' NodeSpec documentation).

However, the DescribeNodegroup call returns taint effects as "NO_SCHEDULE", "NO_EXECUTE" and "PREFER_NO_SCHEDULE" (as per AWS' EKS API documentation).

This is also evident when describing MNGs with the AWS CLI, for example:

aws eks describe-nodegroup --cluster-name $cluster --nodegroup-name $ng --query "nodegroup.taints" --output json
[
    {
        "key": "veo.co/nodegroup-instance-spot",
        "value": "true",
        "effect": "NO_SCHEDULE"
    }
]
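
For illustration, here is a rough Go sketch of what reading the MNG taints via DescribeNodegroup and normalising the EKS effect strings to the Kubernetes format could look like. The function and mapping table below are assumptions made for this sketch, not the actual Cluster Autoscaler code:

// Sketch only: fetch taints from an EKS Managed Node Group and normalise
// the EKS effect strings ("NO_SCHEDULE", ...) to the Kubernetes format.
// Helper names are hypothetical.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/eks"
	apiv1 "k8s.io/api/core/v1"
)

// eksEffectToTaintEffect maps EKS API effect strings to Kubernetes taint effects.
var eksEffectToTaintEffect = map[string]apiv1.TaintEffect{
	"NO_SCHEDULE":        apiv1.TaintEffectNoSchedule,
	"NO_EXECUTE":         apiv1.TaintEffectNoExecute,
	"PREFER_NO_SCHEDULE": apiv1.TaintEffectPreferNoSchedule,
}

// nodegroupTaints reads the taints configured on a Managed Node Group and
// returns them with the effect translated to the Kubernetes format.
func nodegroupTaints(client *eks.EKS, cluster, nodegroup string) ([]apiv1.Taint, error) {
	out, err := client.DescribeNodegroup(&eks.DescribeNodegroupInput{
		ClusterName:   aws.String(cluster),
		NodegroupName: aws.String(nodegroup),
	})
	if err != nil {
		return nil, err
	}
	var taints []apiv1.Taint
	for _, t := range out.Nodegroup.Taints {
		taints = append(taints, apiv1.Taint{
			Key:    aws.StringValue(t.Key),
			Value:  aws.StringValue(t.Value),
			Effect: eksEffectToTaintEffect[aws.StringValue(t.Effect)],
		})
	}
	return taints, nil
}

func main() {
	sess := session.Must(session.NewSession())
	taints, err := nodegroupTaints(eks.New(sess), "my-cluster", "test2")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%#v\n", taints)
}

This kind of translation is what PR #6482 (linked below) adds on the AWS provider side; without it, the raw EKS effect strings end up on the node template.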

When we propagate the taints as tags on Auto-Scaling Groups, Cluster Autoscaler correctly reads the taints, and does not scale-up nodegroups with untolerated taints.

But in the debug log we then see this gem: two distinct taints with the same key but with the effect in different formats (one from the MNG taint, one from the ASG tag), only one of which is correct and works as intended (the one from the ASG):

debugInfo=taints on node:
v1.Taint{Key:"veo.co/nodegroup-instance-spot", Value:"true", Effect:"NO_SCHEDULE", TimeAdded:<nil>}, <-- from MNG taint
v1.Taint{Key:"veo.co/nodegroup-instance-spot", Value:"true", Effect:"NoSchedule", TimeAdded:<nil>},  <-- from ASG tag

This is particularly problematic for us when we reach the maximum size of the node groups our pods should (eventually) be scheduled on. In that case we saw Cluster Autoscaler scale up seemingly random node groups that match the node selector but carry untolerated taints. As the pods remained pending, this happened repeatedly.

How to reproduce it (as minimally and precisely as possible):

Preparation:

  1. Create two new EKS Managed Node Groups (see script below)
    1. Set a node label we'll target later (e.g. veo.co/test=true)
    2. Set maximum size of both to 1
    3. Add a taint to one node group that we'll not tolerate (e.g. veo.co/active=false)
    4. Instance type t3.medium should do
  2. Set Cluster Autoscaler logging level to 5 (--v=5)
  3. Add a dummy deployment (see manifest below)
    1. Target the node label from above (veo.co/test=true)
    2. Do not tolerate any taints
    3. Set memory request so we don't have to scale to many replicas (e.g. 1Gi)

This assumes there's already a suitable IAM role for the EKS node group. If not, create one as described in Creating the Amazon EKS node IAM role.

Observe undesirable behaviour:

  1. Scale the deployment to 2 replicas which need to span two nodes
  2. Wait 30 seconds, grep Cluster Autoscaler logs and observe:
    1. There are no "predicate checking error: node(s) had untolerated taint" entries (at least before the group has scaled up once, and its actual taints are cached)
    2. Node group test2 will scale up, though its taint is not tolerated
    3. Cluster Autoscaler thinks the pod can be moved to the upcoming node
$ kubectl logs -n kube-system -l app=cluster-autoscaler --since=5m --tail=-1 | grep "test2"

orchestrator.go:310] Final scale-up plan: [{eks-test2-SOME_GUID 0->1 (max: 1)}]
...
klogx.go:87] Pod default/http-https-echo-6d9c4b769f-rmfb4 can be moved to template-node-for-eks-test2-SOME_GUID-upcoming-0

Observe desirable behaviour:

  1. Scale deployment and test node groups to zero
  2. Propagate the node group taint to the ASG, which shouldn't be necessary (see script below)
  3. Restart Cluster Autoscaler
  4. Scale the deployment to 2 replicas which need to span two nodes
  5. Wait 30 seconds, grep Cluster Autoscaler logs and observe:
    1. The test2 node group is recognised as having untolerated taints
    2. debugInfo reveals taints are read from both the MNG and the ASG, though only the ASG one has the correct format of "effect"
    3. The correct node group, test1, is scaled up
    4. As test1 reaches its max size, options are exhausted and Cluster Autoscaler does not spin up some other node group with untolerated taints
$ kubectl logs -n kube-system -l app=cluster-autoscaler --since=5m --tail=-1 | grep "test"

orchestrator.go:466] Pod http-https-echo-6d9c4b769f-rz65v can't be scheduled on eks-test2-SOME_GUID, predicate checking error: node(s) had untolerated taint {veo.co/active: false}; predicateName=TaintToleration; reasons: node(s) had untolerated taint {veo.co/active: false}; debugInfo=taints on node: []v1.Taint{v1.Taint{Key:"veo.co/active", Value:"false", Effect:"NoSchedule", TimeAdded:<nil>}, v1.Taint{Key:"veo.co/active", Value:"false", Effect:"NO_SCHEDULE", TimeAdded:<nil>}}
...
orchestrator.go:310] Final scale-up plan: [{eks-test1-SOME_GUID 0->1 (max: 1)}]
...
klogx.go:87] Pod default/http-https-echo-6d9c4b769f-6bbb8 can be moved to template-node-for-eks-test1-SOME_GUID-upcoming-0
...
orchestrator.go:142] Skipping node group eks-test1-SOME_GUID - max size reached

Create EKS Managed Node Groups:

cluster=<CLUSTER_NAME>
role=<ROLE_ARN>
subnets=subnet-xxxxxxxxxxxxxxxxx

aws eks create-nodegroup --cluster-name $cluster --nodegroup-name test1 --node-role $role --subnets $subnets \
    --scaling-config minSize=0,desiredSize=0,maxSize=1 \
    --labels "veo.co/test=true"

aws eks create-nodegroup --cluster-name $cluster --nodegroup-name test2 --node-role $role --subnets $subnets \
    --scaling-config minSize=0,desiredSize=0,maxSize=1 \
    --labels "veo.co/test=true" --taints "key=veo.co/active,value=false,effect=NO_SCHEDULE"

Dummy deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: http-https-echo
  name: http-https-echo
  namespace: default
spec:
  replicas: 0
  selector:
    matchLabels:
      app: http-https-echo
  template:
    metadata:
      labels:
        app: http-https-echo
    spec:
      nodeSelector:
        veo.co/test: "true"
      containers:
        - image: mendhak/http-https-echo
          name: http-https-echo
          resources:
            requests:
              memory: 2Gi

Set taints as Auto-Scaling Group tags:

asg=$(aws autoscaling describe-auto-scaling-groups --filters "Name=tag:eks:cluster-name,Values=$cluster" "Name=tag:eks:nodegroup-name,Values=test2" --query 'AutoScalingGroups[].AutoScalingGroupName' --output text)
aws autoscaling create-or-update-tags \
    --tags ResourceId=$asg,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/node-template/taint/veo.co/active,Value=false:NoSchedule,PropagateAtLaunch=false
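
For reference, the workaround above works because the autoscaler builds scale-from-zero node templates from ASG tags with the k8s.io/cluster-autoscaler/node-template/taint/<taint-key> prefix, whose value encodes <value>:<effect> with the effect already in Kubernetes format. Below is a minimal Go sketch of that kind of tag parsing, assuming only the key/value layout shown in the command above (the function name is hypothetical, not the autoscaler's actual code):

// Sketch only: turn an ASG node-template taint tag into a Kubernetes taint.
// Key format:   k8s.io/cluster-autoscaler/node-template/taint/<taint-key>
// Value format: <taint-value>:<effect>, e.g. "false:NoSchedule"
package main

import (
	"fmt"
	"strings"

	apiv1 "k8s.io/api/core/v1"
)

const taintTagPrefix = "k8s.io/cluster-autoscaler/node-template/taint/"

// taintFromTag parses a node-template taint tag; the bool reports whether
// the tag actually described a taint in the expected format.
func taintFromTag(tagKey, tagValue string) (apiv1.Taint, bool) {
	if !strings.HasPrefix(tagKey, taintTagPrefix) {
		return apiv1.Taint{}, false
	}
	value, effect, found := strings.Cut(tagValue, ":")
	if !found {
		return apiv1.Taint{}, false
	}
	return apiv1.Taint{
		Key:    strings.TrimPrefix(tagKey, taintTagPrefix),
		Value:  value,
		Effect: apiv1.TaintEffect(effect),
	}, true
}

func main() {
	t, ok := taintFromTag(
		"k8s.io/cluster-autoscaler/node-template/taint/veo.co/active",
		"false:NoSchedule",
	)
	fmt.Println(ok, t.Key, t.Value, t.Effect) // true veo.co/active false NoSchedule
}

Because the effect in the tag value is already "NoSchedule", no translation is needed here, which is why the ASG-derived taint behaves correctly while the MNG-derived one does not.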

Anything else we need to know?:

Similar issues:

Huge shout-out to @wcarlsen for invaluable help in finding the culprit! He's also opened PR #6482 to translate EKS taint effects to Kubernetes taint effects, which in our tests resolves this issue.

@Shubham82
Contributor

/area provider/aws

@k8s-ci-robot added the area/provider/aws label Jan 31, 2024
@abstrask
Author

Proposed fix: #6482

@mwoodson-cb

This seems to be affecting us as well. We have placed taints on our AWS EKS Managed Node Groups, but CA doesn't seem to take those taints into consideration when it spins up new nodes; nothing can land on those new nodes, so it spins up yet more nodes in other node groups. It's a vicious cycle!

I haven't tested the proposed fix, but it seems like it would potentially solve the issue we are facing.

@abstrask
Author

The fix (#6482) is included in release 1.31.0 🥳
