Taints on EKS Managed Node Groups scaled to zero not read correctly, causing scale-up of unsuitable nodes #6481
Labels:
area/cluster-autoscaler
area/provider/aws
kind/bug
Which component are you using?:
cluster-autoscaler
What version of the component are you using?:
Component version: 1.27.5, but also reproduced out-of-cluster off the main branch (as of 28 January 2024)
What k8s version are you using (kubectl version)?:
What environment is this in?:
AWS EKS using multiple Managed Node Groups. Some of these are used for batch jobs, and will scale to zero during idle periods.
It's also quite acceptable to us for the pods to be pending until previous pods have completed, if/when we reach the node group's maximum size.
What did you expect to happen?:
When EKS Managed Node Groups are scaled to zero, so there's no Kubernetes node to read taints from, taints should be read correctly from the Managed Node Group.
This avoids scaling up nodes that pending pods cannot be scheduled on anyway.
This should be supported as of this commit: [AWS EKS - Scale-to-0] Add Managed Nodegroup Cache.
In Cluster Autoscaler log, we should see "predicate checking error: node(s) had untolerated taint" for each node group with untolerated taints.
What happened instead?:
In Cluster Autoscaler log, we see no "predicate checking error: node(s) had untolerated taint" (for node groups with untolerated taints), but we do see "predicate checking error: node(s) didn't match Pod's node affinity/selector", so node labels are read correctly.
When evaluating node groups to scale up, and there are no running nodes to query, taints are read from the Auto-Scaling Group and the Managed Node Group (if applicable).
Kubernetes operates with the taint effects "NoSchedule", "PreferNoSchedule" and "NoExecute" (as per Kubernetes' NodeSpec documentation).
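For reference, this is what the camel-case form looks like when read off a running node (using the taint from the reproduction steps below; the node name is a placeholder):

```
$ kubectl get node <node-name> -o jsonpath='{.spec.taints}'
[{"effect":"NoSchedule","key":"veo.co/active","value":"false"}]
```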
However, the DescribeNodegroup call returns taint effects as "NO_SCHEDULE", "NO_EXECUTE" and "PREFER_NO_SCHEDULE" (as per AWS' EKS API documentation). This is also evident when describing MNGs with the AWS CLI, for example:
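Note the upper-snake-case effect in the response (the cluster name is a placeholder; the node group is the tainted one from the reproduction steps below):

```
$ aws eks describe-nodegroup --cluster-name <cluster> --nodegroup-name test2 \
    --query 'nodegroup.taints'
[
    {
        "key": "veo.co/active",
        "value": "false",
        "effect": "NO_SCHEDULE"
    }
]
```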
When we propagate the taints as tags on Auto-Scaling Groups, Cluster Autoscaler correctly reads the taints and does not scale up node groups with untolerated taints.
But in the debug log we then see this gem: two distinct taints with the same key, but with the effect in different formats (one from the MNG taint, one from the ASG tag), only one of which is correct and works as intended (the one from the ASG):
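Schematically (not a verbatim log line; we are reconstructing from memory), the template node's taint list contained both variants of the same taint:

```
taints: [{veo.co/active false NO_SCHEDULE <nil>} {veo.co/active false NoSchedule <nil>}]
```

Only the second entry, with effect "NoSchedule" from the ASG tag, is a valid Kubernetes taint effect that the scheduling predicates act on.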
This is particularly problematic for us when we reach the maximum size of the node groups our pods should (eventually) be scheduled on. In this case we saw Cluster Autoscaler scale up seemingly random node groups that match the node selector, but with untolerated taints. As the pods remained pending, this happened repeatedly.
How to reproduce it (as minimally and precisely as possible):
Preparation:

- Create two EKS Managed Node Groups, both scaled to zero (see commands below):
  - test1, labelled (veo.co/test=true)
  - test2, labelled (veo.co/test=true) and tainted (veo.co/active=false)
  - any instance type; t3.medium should do
- Run Cluster Autoscaler with debug logging (-v=5)
- Create a dummy deployment (see manifest below) with:
  - a node selector matching the node groups' label (veo.co/test=true)
  - a memory request large enough to trigger a scale-up (1Gi)

This assumes there's already a suitable IAM role for the EKS node group. If not, create one as described in Creating the Amazon EKS node IAM role.
Observe undesirable behaviour:
- test2 will scale up, though its taint is not tolerated

Observe desirable behaviour (after setting the taints as ASG tags, see below):

- the test2 node group is recognised as having untolerated taints
- debugInfo reveals taints are read from both MNG and ASG, albeit only the one from the ASG has the correct format of "effect"
- the untainted node group, test1, is scaled up
- once test1 reaches its max size, options are exhausted and Cluster Autoscaler does not spin up some other node group with untolerated taints

Create EKS Managed Node Groups:
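For example, roughly as follows (the cluster name, node role ARN and subnet IDs are placeholders):

```
aws eks create-nodegroup --cluster-name <cluster> --nodegroup-name test1 \
  --scaling-config minSize=0,maxSize=2,desiredSize=0 \
  --instance-types t3.medium \
  --labels veo.co/test=true \
  --node-role arn:aws:iam::<account-id>:role/<eks-node-role> \
  --subnets <subnet-1> <subnet-2>

aws eks create-nodegroup --cluster-name <cluster> --nodegroup-name test2 \
  --scaling-config minSize=0,maxSize=2,desiredSize=0 \
  --instance-types t3.medium \
  --labels veo.co/test=true \
  --taints key=veo.co/active,value=false,effect=NO_SCHEDULE \
  --node-role arn:aws:iam::<account-id>:role/<eks-node-role> \
  --subnets <subnet-1> <subnet-2>
```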
Dummy deployment:
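A minimal stand-in for the original manifest (replica count and image are arbitrary; the essential parts are the node selector and the memory request):

```
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: taint-test
spec:
  replicas: 4
  selector:
    matchLabels:
      app: taint-test
  template:
    metadata:
      labels:
        app: taint-test
    spec:
      # Matches the label on both node groups; there is no toleration for
      # veo.co/active=false, so only test1 is actually schedulable.
      nodeSelector:
        veo.co/test: "true"
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              memory: 1Gi
EOF
```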
Set taints as Auto-Scaling Group tags:
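Using the node-template tag convention that Cluster Autoscaler reads from ASGs (the ASG name is a placeholder; EKS generates it for the managed node group):

```
aws autoscaling create-or-update-tags \
  --tags "ResourceId=<asg-name-of-test2>,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/node-template/taint/veo.co/active,Value=false:NoSchedule,PropagateAtLaunch=false"
```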
Anything else we need to know?:
Similar issues:
Huge shout-out to @wcarlsen for invaluable help in finding the culprit! He's also opened PR #6482 to translate EKS taint effects to Kubernetes taint effects, which in our tests resolves this issue.