
Taints on EKS Managed Node Groups scaled to zero not read correctly, causing scale-up of unsuitable nodes #6481

Closed
abstrask opened this issue Jan 30, 2024 · 4 comments · Fixed by #6482
Labels: area/cluster-autoscaler, area/provider/aws, kind/bug

Comments


abstrask commented Jan 30, 2024

Which component are you using?:

cluster-autoscaler

What version of the component are you using?:

1.27.5, but reproduced out-of-cluster off the main branch (as of 28 January 2024)

Component version:

What k8s version are you using (kubectl version)?:

$ kubectl version

Client Version: v1.29.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.4-eks-8cb36c9

What environment is this in?:

AWS EKS using multiple Managed Node Groups. Some of these are used for batch jobs, and will scale to zero during idle periods.

If/when we reach the node group's maximum size, it is also quite acceptable to us for pods to remain pending until previous pods have completed.

What did you expect to happen?:

When EKS Managed Node Groups are scaled to zero, so there is no Kubernetes node to read taints from, taints should be read correctly from the Managed Node Group itself.

This avoids scaling up nodes that pending pods cannot be scheduled on anyway.

This should be supported as of this commit: [AWS EKS - Scale-to-0] Add Managed Nodegroup Cache.

In the Cluster Autoscaler log, we should see "predicate checking error: node(s) had untolerated taint" for each node group with untolerated taints.

What happened instead?:

In the Cluster Autoscaler log, we see no "predicate checking error: node(s) had untolerated taint" entries (for node groups with untolerated taints), but we do see "predicate checking error: node(s) didn't match Pod's node affinity/selector", so node labels are read correctly.

When evaluating node groups to scale up, and there are no running nodes to query, taints are read from the Auto-Scaling Group and the Managed Node Group (if applicable).

Kubernetes operates with the taint effects "NoSchedule", "PreferNoSchedule" and "NoExecute" (as per Kubernetes' NodeSpec documentation).

However, the DescribeNodegroup call returns taint effects as "NO_SCHEDULE", "NO_EXECUTE" and "PREFER_NO_SCHEDULE" (as per AWS' EKS API documentation).

This is also evident when describing MNGs with the AWS CLI, for example:

aws eks describe-nodegroup --cluster-name $cluster --nodegroup-name $ng --query "nodegroup.taints" --output json
[
    {
        "key": "veo.co/nodegroup-instance-spot",
        "value": "true",
        "effect": "NO_SCHEDULE"
    }
]
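
For illustration, here is a rough Go sketch of what reading the MNG taints via DescribeNodegroup and normalising the EKS effect strings to the Kubernetes format could look like. The function and mapping table below are assumptions made for this sketch, not the actual Cluster Autoscaler code:

// Sketch only: fetch taints from an EKS Managed Node Group and normalise
// the EKS effect strings ("NO_SCHEDULE", ...) to the Kubernetes format.
// Helper names are hypothetical.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/eks"
	apiv1 "k8s.io/api/core/v1"
)

// eksEffectToTaintEffect maps EKS API effect strings to Kubernetes taint effects.
var eksEffectToTaintEffect = map[string]apiv1.TaintEffect{
	"NO_SCHEDULE":        apiv1.TaintEffectNoSchedule,
	"NO_EXECUTE":         apiv1.TaintEffectNoExecute,
	"PREFER_NO_SCHEDULE": apiv1.TaintEffectPreferNoSchedule,
}

// nodegroupTaints reads the taints configured on a Managed Node Group and
// returns them with the effect translated to the Kubernetes format.
func nodegroupTaints(client *eks.EKS, cluster, nodegroup string) ([]apiv1.Taint, error) {
	out, err := client.DescribeNodegroup(&eks.DescribeNodegroupInput{
		ClusterName:   aws.String(cluster),
		NodegroupName: aws.String(nodegroup),
	})
	if err != nil {
		return nil, err
	}
	var taints []apiv1.Taint
	for _, t := range out.Nodegroup.Taints {
		taints = append(taints, apiv1.Taint{
			Key:    aws.StringValue(t.Key),
			Value:  aws.StringValue(t.Value),
			Effect: eksEffectToTaintEffect[aws.StringValue(t.Effect)],
		})
	}
	return taints, nil
}

func main() {
	sess := session.Must(session.NewSession())
	taints, err := nodegroupTaints(eks.New(sess), "my-cluster", "test2")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%#v\n", taints)
}

This kind of translation is what PR #6482 (linked below) adds on the AWS provider side; without it, the raw EKS effect strings end up on the node template.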

When we propagate the taints as tags on Auto-Scaling Groups, Cluster Autoscaler correctly reads the taints, and does not scale-up nodegroups with untolerated taints.

But in the debug log we then see this gem: two distinct taints with the same key but with the effect in different formats (one from the MNG taint, one from the ASG tag), only one of which is correct and works as intended (the one from the ASG):

debugInfo=taints on node:
v1.Taint{Key:"veo.co/nodegroup-instance-spot", Value:"true", Effect:"NO_SCHEDULE", TimeAdded:<nil>}, <-- from MNG taint
v1.Taint{Key:"veo.co/nodegroup-instance-spot", Value:"true", Effect:"NoSchedule", TimeAdded:<nil>},  <-- from ASG tag

This is particularly problematic for us when we reach the maximum size of the node groups our pods should (eventually) be scheduled on. In that case we saw Cluster Autoscaler scale up seemingly random node groups that match the node selector but carry untolerated taints. As the pods remained pending, this happened repeatedly.

How to reproduce it (as minimally and precisely as possible):

Preparation:

  1. Create two new EKS Managed Node Groups (see script below)
    1. Set a node label we'll target later (e.g. veo.co/test=true)
    2. Set maximum size of both to 1
    3. Add a taint to one node group that we'll not tolerate (e.g. veo.co/active=false)
    4. Instance type t3.medium should do
  2. Set Cluster Autoscaler logging level to 5 (--v=5)
  3. Add a dummy deployment (see manifest below)
    1. Target the node label from above (veo.co/test=true)
    2. Do not tolerate any taints
    3. Set memory request so we don't have to scale to many replicas (e.g. 1Gi)

This assumes there's already a suitable IAM role for the EKS node group. If not, create one as described in Creating the Amazon EKS node IAM role.

Observe undesirable behaviour:

  1. Scale the deployment to 2 replicas which need to span two nodes
  2. Wait 30 seconds, grep Cluster Autoscaler logs and observe:
    1. There are no "predicate checking error: node(s) had untolerated taint" entries (at least before the group has scaled up once, and its actual taints are cached)
    2. Node group test2 will scale up, though its taint is not tolerated
    3. Cluster Autoscaler thinks the pod can be moved to the upcoming node
$ kubectl logs -n kube-system -l app=cluster-autoscaler --since=5m --tail=-1 | grep "test2"

orchestrator.go:310] Final scale-up plan: [{eks-test2-SOME_GUID 0->1 (max: 1)}]
...
klogx.go:87] Pod default/http-https-echo-6d9c4b769f-rmfb4 can be moved to template-node-for-eks-test2-SOME_GUID-upcoming-0

Observe desirable behaviour:

  1. Scale deployment and test node groups to zero
  2. Propagate the node group taint to the ASG, which shouldn't be necessary (see script below)
  3. Restart Cluster Autoscaler
  4. Scale the deployment to 2 replicas which need to span two nodes
  5. Wait 30 seconds, grep Cluster Autoscaler logs and observe:
    1. The test2 node group is recognised as having untolerated taints
    2. debugInfo reveals taints are read from both the MNG and the ASG, though only the ASG one has the correct format of "effect"
    3. The correct node group, test1, is scaled up
    4. As test1 reaches its max size, options are exhausted and Cluster Autoscaler does not spin up some other node group with untolerated taints
$ kubectl logs -n kube-system -l app=cluster-autoscaler --since=5m --tail=-1 | grep "test"

orchestrator.go:466] Pod http-https-echo-6d9c4b769f-rz65v can't be scheduled on eks-test2-SOME_GUID, predicate checking error: node(s) had untolerated taint {veo.co/active: false}; predicateName=TaintToleration; reasons: node(s) had untolerated taint {veo.co/active: false}; debugInfo=taints on node: []v1.Taint{v1.Taint{Key:"veo.co/active", Value:"false", Effect:"NoSchedule", TimeAdded:<nil>}, v1.Taint{Key:"veo.co/active", Value:"false", Effect:"NO_SCHEDULE", TimeAdded:<nil>}}
...
orchestrator.go:310] Final scale-up plan: [{eks-test1-SOME_GUID 0->1 (max: 1)}]
...
klogx.go:87] Pod default/http-https-echo-6d9c4b769f-6bbb8 can be moved to template-node-for-eks-test1-SOME_GUID-upcoming-0
...
orchestrator.go:142] Skipping node group eks-test1-SOME_GUID - max size reached

Create EKS Managed Node Groups:

cluster=<CLUSTER_NAME>
role=<ROLE_ARN>
subnets=subnet-xxxxxxxxxxxxxxxxx

aws eks create-nodegroup --cluster-name $cluster --nodegroup-name test1 --node-role $role --subnets $subnets \
    --scaling-config minSize=0,desiredSize=0,maxSize=1 \
    --labels "veo.co/test=true"

aws eks create-nodegroup --cluster-name $cluster --nodegroup-name test2 --node-role $role --subnets $subnets \
    --scaling-config minSize=0,desiredSize=0,maxSize=1 \
    --labels "veo.co/test=true" --taints "key=veo.co/active,value=false,effect=NO_SCHEDULE"

Dummy deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: http-https-echo
  name: http-https-echo
  namespace: default
spec:
  replicas: 0
  selector:
    matchLabels:
      app: http-https-echo
  template:
    metadata:
      labels:
        app: http-https-echo
    spec:
      nodeSelector:
        veo.co/test: "true"
      containers:
        - image: mendhak/http-https-echo
          name: http-https-echo
          resources:
            requests:
              memory: 2Gi

Set taints as Auto-Scaling Group tags:

asg=$(aws autoscaling describe-auto-scaling-groups --filters "Name=tag:eks:cluster-name,Values=$cluster" "Name=tag:eks:nodegroup-name,Values=test2" --query 'AutoScalingGroups[].AutoScalingGroupName' --output text)
aws autoscaling create-or-update-tags \
    --tags ResourceId=$asg,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/node-template/taint/veo.co/active,Value=false:NoSchedule,PropagateAtLaunch=false
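
For reference, the workaround above works because the autoscaler builds scale-from-zero node templates from ASG tags with the k8s.io/cluster-autoscaler/node-template/taint/<taint-key> prefix, whose value encodes <value>:<effect> with the effect already in Kubernetes format. Below is a minimal Go sketch of that kind of tag parsing, assuming only the key/value layout shown in the command above (the function name is hypothetical, not the autoscaler's actual code):

// Sketch only: turn an ASG node-template taint tag into a Kubernetes taint.
// Key format:   k8s.io/cluster-autoscaler/node-template/taint/<taint-key>
// Value format: <taint-value>:<effect>, e.g. "false:NoSchedule"
package main

import (
	"fmt"
	"strings"

	apiv1 "k8s.io/api/core/v1"
)

const taintTagPrefix = "k8s.io/cluster-autoscaler/node-template/taint/"

// taintFromTag parses a node-template taint tag; the bool reports whether
// the tag actually described a taint in the expected format.
func taintFromTag(tagKey, tagValue string) (apiv1.Taint, bool) {
	if !strings.HasPrefix(tagKey, taintTagPrefix) {
		return apiv1.Taint{}, false
	}
	value, effect, found := strings.Cut(tagValue, ":")
	if !found {
		return apiv1.Taint{}, false
	}
	return apiv1.Taint{
		Key:    strings.TrimPrefix(tagKey, taintTagPrefix),
		Value:  value,
		Effect: apiv1.TaintEffect(effect),
	}, true
}

func main() {
	t, ok := taintFromTag(
		"k8s.io/cluster-autoscaler/node-template/taint/veo.co/active",
		"false:NoSchedule",
	)
	fmt.Println(ok, t.Key, t.Value, t.Effect) // true veo.co/active false NoSchedule
}

Because the effect in the tag value is already "NoSchedule", no translation is needed here, which is why the ASG-derived taint behaves correctly while the MNG-derived one does not.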

Anything else we need to know?:

Similar issues:

Huge shout-out to @wcarlsen for invaluable help in finding the culprit! He's also opened PR #6482 to translate EKS taint effects to Kubernetes taint effects, which in our tests resolves this issue.

@Shubham82
Contributor

/area provider/aws

@k8s-ci-robot added the area/provider/aws label Jan 31, 2024
@abstrask
Author

Proposed fix: #6482

@mwoodson-cb

This seems to be affecting us as well. We have placed taints on our AWS EKS Managed Node Groups, but CA doesn't seem to take those taints into consideration when it spins up new nodes; nothing can land on those new nodes, so it spins up yet more nodes in other node groups. It's a vicious cycle!

I haven't tested the proposed fix, but it seems like it would potentially solve the issue we are facing.

@abstrask
Author

The fix (#6482) is included in release 1.31.0 🥳
