Cluster-autoscaler not balancing similar node groups on AWS #3515
Comments
I'm assuming you're using `balance-similar-node-groups` since you rely on EBS? Otherwise you can just put multiple AZs in one ASG and let the ASG do the balancing. As I understand it, #3124 should have resolved your issue.
That's correct, we use one group per AZ because we have some deployments that require persistent storage (EBS). I also thought #3124 would have done the trick but no such luck it seems. For reference, here are memory specs on some nodes in a cluster that isn't balancing properly:
Which translates to roughly 1.1%, which should be within the new 1.5% limit.
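For anyone following along, here is a tiny sketch of how that percentage falls out, using made-up capacity numbers (the real figures were in the output above and aren't reproduced here). The comparison is essentially the difference divided by the larger value; the exact code in cluster-autoscaler may differ in detail.

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// Hypothetical per-node memory capacities (bytes) for two of the groups.
	capA, capB := 16_151_384_000.0, 15_973_864_000.0
	larger, smaller := math.Max(capA, capB), math.Min(capA, capB)
	diffRatio := (larger - smaller) / larger
	fmt.Printf("difference: %.1f%%\n", diffRatio*100) // ~1.1%, inside a 1.5% tolerance
}
```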
If I'm reading the source for 1.19.0 correctly, the […]. It's the […]. So perhaps check […]. Also, it shouldn't make a difference, but […]
Ah right, here's capacity on my test cluster. Also roughly 1.1%:
Yeah, I wasn't able to trace […]
This could potentially be an issue with daemonsets or system pods. Can you check the values for all 3 node groups?
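One way to pull those numbers, assuming kubectl access to the cluster (the node name below is just a placeholder):

```sh
# Capacity and allocatable straight from the node status:
kubectl get node ip-10-0-1-23.ec2.internal \
  -o jsonpath='{.status.capacity}{"\n"}{.status.allocatable}{"\n"}'

# Requested resources (summed pod requests, daemonsets and system pods included)
# appear under "Allocated resources" in the describe output:
kubectl describe node ip-10-0-1-23.ec2.internal
```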
OK, so I dug into what "Free" means in this context, because it's not a value you can get directly from the node info (correct me if I'm wrong anywhere; this is my first time really reading through this). The autoscaler checks that the "free" resources on a node are within 5% as part of its similarity checks:
autoscaler/cluster-autoscaler/processors/nodegroupset/compare_nodegroups.go Lines 31 to 33 in 67dce2e
It does so by taking the Allocatable value and then subtracting the resources Requested: autoscaler/cluster-autoscaler/processors/nodegroupset/compare_nodegroups.go Lines 125 to 128 in 67dce2e
The type definition for Requested says it accounts for all pods on the node. This would maybe make sense if it were only accounting for daemonsets, but accounting for ALL pods on a running node means you will often not be within 5%. I'm not sure this would ever work for clusters actually in use unless they had very uniform deployments. Maybe I'm reading this wrong? What's the point of comparing Free resources to determine similarity of node groups?
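To make the check being described concrete, here is a simplified, self-contained sketch of that logic (not the actual cluster-autoscaler code): free = allocatable - requested, and two groups only count as similar if free differs by at most 5% of the larger value.

```go
package main

import "fmt"

// nodeResources holds memory values in bytes for a representative node of a group.
type nodeResources struct {
	allocatable int64 // Node.Status.Allocatable
	requested   int64 // sum of requests of the pods on the node, as Requested is described
}

const maxFreeDifferenceRatio = 0.05 // the 5% limit discussed above

// freeWithinRatio mirrors the comparison described above: compute free = allocatable - requested
// for each node and require the difference to be within 5% of the larger free value.
func freeWithinRatio(a, b nodeResources) bool {
	freeA := float64(a.allocatable - a.requested)
	freeB := float64(b.allocatable - b.requested)
	larger := freeA
	if freeB > larger {
		larger = freeB
	}
	if larger == 0 {
		return freeA == freeB
	}
	diff := freeA - freeB
	if diff < 0 {
		diff = -diff
	}
	return diff <= maxFreeDifferenceRatio*larger
}

func main() {
	// Two nodes with identical allocatable memory, but different workloads running on them:
	a := nodeResources{allocatable: 16_000_000_000, requested: 4_000_000_000}
	b := nodeResources{allocatable: 16_000_000_000, requested: 9_000_000_000}
	fmt.Println(freeWithinRatio(a, b)) // false - real pods skew "free" even though the hardware matches
}
```

With every running pod counted in requested, two groups with identical hardware easily fall outside the 5% band, which is exactly the concern raised above.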
Nice digging! If true, this sounds like a bug to me.
I used the […]
I had been wondering about the 'Free' check. Under what circumstances is it useful, even if it worked as described? I'd think that if it were trying to detect "some DaemonSet is only present on some of these nodegroups", then there would be a label or something that the DaemonSet uses to distinguish the nodes to run on, and that would already mark the groups dissimilar.

Hmm. If you do have DaemonSets that have Pods on some of your nodegroups but not others, then this is working as intended. I assume that's not the problem here, but it's maybe worth checking? Inside the fold, I go looking for a place to fix bad 'Free' calculations, and it turns out the code's there.

I guess if a DaemonSet were using a nodeSelector for one of the ignored labels, e.g. you were running a DaemonSet in one AZ but not another, then this check is relevant, if you expect that difference to mark the node groups dissimilar. I'd expect a difference like that to be handled elsewhere, when CA is determining how many nodes to scale up, because I assume it's still using the individual node-group details when calculating "available resources for pods", rather than averaging the "available resources for pods" across the balanced node groups. Which suggests to me that there is code in the codebase somewhere to calculate the "Free" value we actually need here.
And that last edit points us at a fix for this issue, if it's been diagnosed correctly: we can use the NodeInfo Pods list, which in both the "scale from zero" and "copy an existing node in the group" cases only contains the DaemonSets. It's also possible that the bug is that, after building the node list, neither of these code paths appears to recalculate […].

Edit: Although […]

And from the hidden details, one question does appear: is there any chance some of your existing nodes have Static or Mirror pods (annotation […])? I suspect at this point a controlled reproduction case with […] would help.
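A minimal sketch of the proposed alternative, assuming "Requested" only counted DaemonSet-owned pods (illustrative only; the pod and owner types here are simplified stand-ins, not the autoscaler's real types):

```go
package main

import "fmt"

// pod is a simplified stand-in for a scheduled pod with its memory request in bytes.
type pod struct {
	ownerKind     string // e.g. "DaemonSet", "ReplicaSet", "StatefulSet"
	memoryRequest int64
}

// daemonSetRequested sums requests for DaemonSet-owned pods only, which is what the
// comment above suggests "Requested" should mean for the similarity check.
func daemonSetRequested(pods []pod) int64 {
	var total int64
	for _, p := range pods {
		if p.ownerKind == "DaemonSet" {
			total += p.memoryRequest
		}
	}
	return total
}

func main() {
	allocatable := int64(16_000_000_000)
	pods := []pod{
		{ownerKind: "DaemonSet", memoryRequest: 200_000_000},    // e.g. log collector
		{ownerKind: "DaemonSet", memoryRequest: 100_000_000},    // e.g. CNI agent
		{ownerKind: "ReplicaSet", memoryRequest: 6_000_000_000}, // application pod - ignored here
	}
	free := allocatable - daemonSetRequested(pods)
	fmt.Println(free) // "free" now depends only on daemonsets, so identical groups stay "similar"
}
```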
Yeah this seems odd to me. If capacity and allocatable are similar, then the node groups are similar. Anything else running on them that would make them different is a concern outside of cluster-autoscaler in my opinion, since it's an infrastructure-level construct and shouldn't be concerned about system/application-level things like pods (other than "I need to make more room for these pods"). For use cases that I would assume are less common, where you would consider groups to be different based on certain pods running on them, you would just use labels to indicate them being different.
Confirmed, all of the groups in my scenario are the same and have the same daemonsets running on them.
We don't configure any, unless EKS / EBS CSI driver / something else internal to the system is doing it. I don't see any mirror pods at least.
If someone can build such a reproduction (or really just make the proposed changes on a branch somewhere; I can probably figure out how to build it), I'm happy to test it.
@ellistarn - Any update on this? Thanks!
Balancing similar nodegroups has 2 distinct checks:
tl;dr - Balancing will not happen in a scale-up where at least one of the pending pods uses EBS.

edit: To provide some more details - the CA algorithm can only do binpacking for a single nodegroup. It can't calculate a mixed scale-up (i.e. a scale-up that would resize multiple nodegroups). Balancing is just a heuristic that tries to work around this by applying the result of binpacking done on a single nodepool to multiple nodepools, which is where its limitations come from. This is also where the limitations in comparing nodegroups come from - it would clearly be better for many use cases to balance nodegroups that differ more. However, the code just assumes the result of binpacking is exactly the same, and so the nodegroups must be as close as possible from a scheduling perspective.
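A rough, hypothetical illustration of the heuristic as described (not the real balancing code): binpacking decides how many new nodes of one group's shape are needed, and that count is then spread as evenly as possible across the groups judged similar.

```go
package main

import (
	"fmt"
	"sort"
)

// groupSize tracks the current node count of a nodegroup considered "similar".
type groupSize struct {
	name string
	size int
}

// balance spreads newNodes across the similar groups, always growing the currently
// smallest group first, which is roughly the behaviour described above.
func balance(groups []groupSize, newNodes int) []groupSize {
	for i := 0; i < newNodes; i++ {
		sort.Slice(groups, func(a, b int) bool { return groups[a].size < groups[b].size })
		groups[0].size++
	}
	return groups
}

func main() {
	// Binpacking on one group decided 3 extra nodes are needed; spread them over 3 similar groups.
	groups := []groupSize{{"asg-az-a", 2}, {"asg-az-b", 2}, {"asg-az-c", 2}}
	for _, g := range balance(groups, 3) {
		fmt.Printf("%s -> %d nodes\n", g.name, g.size)
	}
}
```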
Assuming the original bug report was using this example […]. Perhaps @binman-docker, you could post the Pod details from your test, to check that there's nothing else in the Pod affinity etc. that would be failing the second step that @MaciekPytel has described.

Edit: Poking at the 1.19.0 code, if the second step (which I believe is a call to […]) […]. So I don't think we're hitting this problem; it still seems like the comparator is reporting the groups as not similar.
Yeah there are no EBS volumes used in this test, here's the yaml:
Debug hacks to help diagnose kubernetes#3515
I've gone ahead and created a branch with logging for all 'not-similar' rejection paths (at […]). I've only build-tested this, and it's not intended for submission, since it would generate way too much noise on a production cluster. It should be pretty easy to produce a container image for this. Assuming you've followed the Getting the Code instructions, but with my branch checked out, it's a one-liner:
Thank you! I will test this week and report back. |
Thanks again for those edits. It seems like the […]. See the first comment for the configuration. Confirmed from the logs:
Group similarity was rejected thanks to the […]:
And again thanks to the label […]:
There's no documentation for the --balancing-ignore-label flag, but I'm pretty sure I've formatted it correctly; it's a multiStringFlag, just like […]
Okay, from that clue, I now know what's going on. It turns out […] is a single string-value option. Similarly, […] is a single string-value option. That's what the […] shows.

So it's user error due to a total lack of documentation. The […]. Try:

[…]

although, as noted, the last two should not be necessary.

This would have been much easier to diagnose if, for example, […]. I wonder if we could also improve this by having the […]
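For anyone hitting the same formatting trap: if I'm reading the resolution right, each occurrence of the flag takes exactly one label name, so the working form is to repeat the flag once per label. The label names below are only examples taken from later in this thread; substitute whatever labels your nodes actually carry.

```
--balancing-ignore-label=eks.amazonaws.com/nodegroup \
--balancing-ignore-label=beta.kubernetes.io/instance-type \
--balancing-ignore-label=node.kubernetes.io/instance-type
```

By contrast, passing a quoted, bracketed list as a single value makes the whole string one (nonexistent) label name, which appears to be what happened here.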
Yeah that bracketed list being quoted seemed off to me, which is why I tried to double-check with a similar flag. Let me confirm it works with the hopefully correct formatting, and then I'll open a PR to update the docs. |
Yep that's better:
Ideally I would have liked to see it split across all three groups, but I assume it's trying to scale evenly (2 and 2) rather than unevenly (1 and 1 and 2), and that would be a separate issue. I'll open a PR today to update the docs. Thank you to everyone who helped diagnose!
When using a custom launch template, these labels are added:
I guess these should also be ignored for this to work? Because the launch template and version might be different on similar pools.
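If the labels in question are the EKS launch-template ones (an assumption on my part: eks.amazonaws.com/sourceLaunchTemplateId and eks.amazonaws.com/sourceLaunchTemplateVersion), the same repeated-flag pattern would apply:

```
--balancing-ignore-label=eks.amazonaws.com/sourceLaunchTemplateId \
--balancing-ignore-label=eks.amazonaws.com/sourceLaunchTemplateVersion
```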
@artificial-aidan did you have to add "eks.amazonaws.com/nodegroup", "beta.kubernetes.io/instance-type", and "node.kubernetes.io/instance-type" to --balancing-ignore-label?
I'm using these in autoscaler config:
The default topology labels are already ignored by cluster-autoscaler, and if you are using only 1 instance type per node pool you should not need to add […].

That makes me wonder how people using multiple instance types per ASG are using balance-similar-node-groups. Which label is it based on? I mean, it works with CPU/RAM resources, but when doing scale to and from 0, I guess it needs a selector or some indication of which ASG to scale.
Looks like the reverse idea is much better: in my case I have 2 spot nodepools with 3 similar instance types from the same family. Thanks @ArchiFleKs!
Environment configuration:
Action:
Run a deployment with enough replicas and CPU/Mem reservation to require several new nodes.
Expected outcome:
Cluster-autoscaler sees the three groups as "similar" and increases capacity in each by roughly equal amounts.
Actual outcome:
The cluster-autoscaler only selects a single group to increase by 3, instead of increasing by 1 in each of the 3 groups.
What I've tried:
Setting the expander to random, thinking maybe it was picking one group because it would be slightly less waste. No change in behavior.

Unfortunately I have nothing else to go on; even the debug logs don't have any info on why the groups are not considered similar or why the autoscaler isn't balancing across them. Any advice appreciated, thanks!