Uneven scale-up of AWS ASG's #2020
Comments
/assign |
/sig aws |
The default expander is "random". If that is what you are using, the scenario you described does not seem highly improbable. Does it happen each time? |
We're currently using the least-waste expander, but in this case the calculation for each of the ASGs is identical, so shouldn't it then choose the scale-up that keeps the ASGs balanced? |
None of the existing expanders cares about zone balancing, so it shouldn't matter which one you use. CA has a separate mechanism for balancing: it finds 'similar' NodeGroups (ASGs) and splits any scale-up between them. This happens after the expander makes its decision and shouldn't depend on the expander at all. |
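A minimal sketch of what "splitting a scale-up between similar groups" could look like, under stated assumptions: the `group` type and the `balance` function are hypothetical names for illustration and are not cluster-autoscaler code. New nodes are handed out one at a time to the smallest group that still has headroom.

```go
package main

import "fmt"

// group is a hypothetical stand-in for a node group (ASG); it is not the
// cluster-autoscaler NodeGroup interface.
type group struct {
	name string
	size int
	max  int
}

// balance hands out newNodes one at a time, always to the smallest group
// that still has headroom, and returns the resulting sizes by group name.
func balance(groups []group, newNodes int) map[string]int {
	sizes := make(map[string]int, len(groups))
	for _, g := range groups {
		sizes[g.name] = g.size
	}
	for i := 0; i < newNodes; i++ {
		best := -1
		for j, g := range groups {
			if sizes[g.name] >= g.max {
				continue // group is already at its maximum
			}
			if best == -1 || sizes[g.name] < sizes[groups[best].name] {
				best = j
			}
		}
		if best == -1 {
			break // no group can grow any further
		}
		sizes[groups[best].name]++
	}
	return sizes
}

func main() {
	// The 3/3/2 layout from the report, with max 6 per group as in the log.
	groups := []group{{"a", 3, 6}, {"b", 3, 6}, {"c", 2, 6}}
	fmt.Println(balance(groups, 1))
}
```

If all three groups were treated as similar, one extra node would go to zone c and give 3/3/3. The 4/3/2 result in the report suggests the groups were not being recognised as similar in the first place.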
All of the nodes have identical labels except zone and hostname. We used Terraform to create and tag the ASGs they reside in. `kubectl get nodes -l environment=playground --show-labels ip-10-0-187-220.eu-west-1.compute.internal Ready 7m59s v1.12.7 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t3.medium,beta.kubernetes.io/os=linux,cluster=zeus,environment=playground,failure-domain.beta.kubernetes.io/region=eu-west-1,failure-domain.beta.kubernetes.io/zone=eu-west-1b,kubernetes.io/hostname=ip-10-0-187-220.domain.com,nodegroup=scalemultiaz,workload=scalemultiaz` |
for autoscaler/cluster-autoscaler/expander/waste/waste.go Lines 33 to 35 in cb4e60f
In your case, I think all 3 node groups qualify, so the random strategy kicks in and picks one of them. |
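A rough sketch of the tie described here, not the code in waste.go: the `option` type and both functions are hypothetical, and the waste fractions are copied from the log in the report. The "Blended" figure in that log is consistent with a simple mean of the CPU and memory waste, which is the assumption used below.

```go
package main

import (
	"fmt"
	"math/rand"
)

// option is a hypothetical scale-up candidate; the waste fractions are
// copied from the log in the report, not computed from real nodes.
type option struct {
	group        string
	wastedCPU    float64
	wastedMemory float64
}

// score blends the two waste fractions with equal weight (an assumption
// that matches the 48.64% "Blended" value in the log).
func score(o option) float64 {
	return (o.wastedCPU + o.wastedMemory) / 2
}

// leastWaste returns every option that shares the lowest blended score.
func leastWaste(options []option) []option {
	var best []option
	for _, o := range options {
		switch {
		case len(best) == 0 || score(o) < score(best[0]):
			best = []option{o}
		case score(o) == score(best[0]):
			best = append(best, o)
		}
	}
	return best
}

func main() {
	options := []option{
		{"scalemultiaz-a", 0.50, 0.4728},
		{"scalemultiaz-b", 0.50, 0.4728},
		{"scalemultiaz-c", 0.50, 0.4728},
	}
	tied := leastWaste(options)
	// With a three-way tie the pick is arbitrary; zone balance plays no part here.
	pick := tied[rand.Intn(len(tied))]
	fmt.Printf("%d options tied at %.2f%% blended, picked %s\n", len(tied), score(pick)*100, pick.group)
}
```

Because the three scores are identical, any of the groups can win at this stage, so balancing has to come from the separate similar-node-group step.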
We’re hitting this issue in our clusters too. I think I know why. We have 3 ASGs with c5.2xlarge spread across 3 AZs. It looks like when Amazon creates an EC2 instance, either 15835076Ki or 15835084Ki of total memory is provisioned for the VM. This is what is reported in the node. When the cluster autoscaler attempts to discover similar node groups, it requires an exact match in memory capacity here: autoscaler/cluster-autoscaler/processors/nodegroupset/compare_nodegroups.go Lines 79 to 81 in 0968736
It also seems that node groups that are scaled to zero get a different memory capacity again, maybe this?
Do we need some sort of tolerance in the capacity comparison, similar to the allocatable and free comparisons? |
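To make the proposal concrete, here is a small sketch that contrasts an exact-match check with a tolerant one, using the two capacities quoted above. The function names and the 1024Ki tolerance are assumptions chosen for illustration; they are not taken from compare_nodegroups.go.

```go
package main

import "fmt"

// exactMatch mirrors the strict behaviour described above: any difference
// in memory capacity makes the two groups dissimilar.
func exactMatch(aKi, bKi int64) bool {
	return aKi == bKi
}

// withinTolerance accepts capacities that differ by at most toleranceKi.
func withinTolerance(aKi, bKi, toleranceKi int64) bool {
	diff := aKi - bKi
	if diff < 0 {
		diff = -diff
	}
	return diff <= toleranceKi
}

func main() {
	// The two c5.2xlarge capacities reported in this thread, in Ki.
	const a, b int64 = 15835076, 15835084
	// Hypothetical tolerance, picked only for this example.
	const toleranceKi int64 = 1024

	fmt.Println("exact match:", exactMatch(a, b))
	fmt.Println("within tolerance:", withinTolerance(a, b, toleranceKi))
}
```

The two values differ by only 8Ki, so the strict check rejects them while any modest tolerance accepts them.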
@meringu If that's the case, we have some options
|
Regarding point 2: the comparison logic lives in a processor, i.e. it's hidden behind an interface specifically to allow adding custom implementations without touching common code. If required, you should be able to customize the logic by adding your own implementation of https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/processors/nodegroupset/nodegroup_set_processor.go. You may be able to reuse most of the implementation of the default balancing processor and change only the comparison logic. |
Thanks for the guidance, @MaciekPytel! |
@meringu I will try to reproduce this issue on my end and check with the EC2 team at the same time. It will take some time, so I will come back to you later. |
Any update on the issue @Jeffwan? As I understand, this issue would be causing |
@meringu Sorry, I was on call for the past two weeks. I will have some time this week to check this issue. |
Thanks @Jeffwan. Let me know if there is anything I can help with. |
I just ran into a similar thing with a kops-built cluster, and I know exactly why it is ignoring the balance-similar-node-groups flag. The problem, at least in my case, and it sounds similar to the above description, is that kops is adding its own labelling to the node, i.e.: One possible fix, specific to kops clusters, would be to add the Not sure if other tools (e.g. for EKS) have well-known labels like this that could also be added, or if the better solution would be to add a flag to let additional ignore labels be specified at runtime. Thoughts? |
That's not the case for our EKS cluster. We do have some control over the labels. The only node labels that differ on our clusters are in the ignoredLabels list. It is the capacity check in our case: the logs show it sometimes finds a similar node group or two, depending on whether the node group has any instances, and it is a bit random because AWS doesn't give the exact same amount of memory every time. |
Hi @Jeffwan, did you get a chance to look at this last week? Is there anything I can help with? We are happy to contribute engineer time if that would be helpful. |
@meringu Sorry for the late response. It would be helpful if you could help identify the problem. I am trying to see if this can be easily reproduced. In the CA log @danmcnulty provided, it failed on the node selector. Could you share your logs and confirm that it failed on a memory mismatch?
I did some searching and didn't find any sign of a memory mismatch for one instance. Could you file a support ticket with the AWS EC2 team? (I don't have all your details.) |
Unfortunately there are no logs for the comparisons, you can see in my above link for the Summarised: c5.2xlarge: I'll raise a support ticket asking them about the difference in memory from instance to instance, and why the advertised memory is different again. |
@meringu this sounds like a similar problem to what I'm facing. In the event it is relevant, you may wish to skim aleksandra-malinowska's comment. Here's a snippet:
|
@Jeffwan Just a quick note on the logs I provided - the ASG with prefix "eks-zeus-autoscaler" was used for something unrelated by us, and so it was expected that this group didn't match the node selector. The question I have is about the 3 similar groups with prefixes: |
Hey team, I've heard back from AWS about the memory capacity disparity. For the difference reported on the running hosts, I was using two different AMIs with different kernel versions. After updating all the instances to the newer AMI, they all report the same memory capacity. With regard to the difference between what is reported in the API and what is reported on the node, I got this response:
This explains why two different AMIs can have different memory capacity, as they can have different kernel versions etc. This also means the capacity in the pricing API will never match the reported capacity from the node. So now looking at this comment in the code: autoscaler/cluster-autoscaler/processors/nodegroupset/compare_nodegroups.go Lines 79 to 81 in 0968736
|
Or I could disable scaling to zero for my ASGs. This means extra overhead however. |
Isn't it possible to simplify the "similar check" to comparing just the instance type? |
@ewoutp You could have multiple node pools of the same instance type that you want scaled separately for one reason or another. The main issue is where you have pools that are the same for all intents across differing zones, and some tag/label is different between the zones and getting in the way. See this PR for what I am talking about: #2207 |
Not for the complete check, but what about the CPU and especially the memory part (which seems to be a little off for now)? |
@jhohertz That is true. However, I suspect there are many cases where comparison on instanceType alone is sufficient (and works very well). |
I have 3 ASGs with c5.9xlarge spread across 3 AZs. However, when CA decides how to split scale-up nodes between the node groups, most of the time it splits them across only 2 AZs. Why do we want to check the 5% free resources? Aren't labels, capacity and allocatable sufficient to compare node group similarity? |
I'm seeing this issue with m5.xlarge.
Updating MaxMemoryDifferenceInKiloBytes to 256000 fixes the issue. https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/processors/nodegroupset/compare_nodegroups.go#L36 For those using the AWS CNI with custom networking, the label "k8s.amazonaws.com/eniConfig" needs to be whitelisted as well. |
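As an illustration of why an extra per-zone label has to be whitelisted, here is a sketch of a label comparison that skips a set of ignored keys. The zone and hostname keys appear on the node shown earlier in the thread and the eniConfig key is the one named above; the `ignored` set, the function names and the sample eniConfig values are assumptions, not the autoscaler's own list.

```go
package main

import "fmt"

// ignored lists label keys that are skipped during comparison. The exact
// contents of this set are an assumption made for this example.
var ignored = map[string]bool{
	"failure-domain.beta.kubernetes.io/zone": true,
	"kubernetes.io/hostname":                 true,
	"k8s.amazonaws.com/eniConfig":            true,
}

// stripIgnored returns a copy of labels without the ignored keys.
func stripIgnored(labels map[string]string) map[string]string {
	out := make(map[string]string, len(labels))
	for k, v := range labels {
		if !ignored[k] {
			out[k] = v
		}
	}
	return out
}

// similarLabels reports whether a and b agree on every non-ignored label.
func similarLabels(a, b map[string]string) bool {
	fa, fb := stripIgnored(a), stripIgnored(b)
	if len(fa) != len(fb) {
		return false
	}
	for k, v := range fa {
		if w, ok := fb[k]; !ok || w != v {
			return false
		}
	}
	return true
}

func main() {
	zoneA := map[string]string{
		"nodegroup":                              "scalemultiaz",
		"failure-domain.beta.kubernetes.io/zone": "eu-west-1a",
		"k8s.amazonaws.com/eniConfig":            "eu-west-1a", // hypothetical value
	}
	zoneB := map[string]string{
		"nodegroup":                              "scalemultiaz",
		"failure-domain.beta.kubernetes.io/zone": "eu-west-1b",
		"k8s.amazonaws.com/eniConfig":            "eu-west-1b", // hypothetical value
	}
	fmt.Println("similar:", similarLabels(zoneA, zoneB))
}
```

Removing the eniConfig entry from `ignored` makes the same two nodes compare as dissimilar, which matches the behaviour reported for clusters using custom networking.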
Seen here on m5.large |
We're hitting this same issue. We can see from the logs (by looking at the "Splitting..." log messages) that CA decided only 2 of our 3 node groups were similar, despite all non-ignored labels being identical. These were m5s, so it's possible our problem is fixed by #2462. In general, though, it would be very nice to have some logging in the comparator that says why it decided 2 node groups were dissimilar; that would have made this much easier to diagnose. |
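One possible shape for the logging requested here is a comparator that returns a reason instead of a bare boolean. This is only a sketch: the `template` type, `dissimilarReason` and its messages are hypothetical, and ignored labels are left out for brevity.

```go
package main

import (
	"fmt"
	"sort"
)

// template is a hypothetical summary of a node group's node; it is not a
// cluster-autoscaler type.
type template struct {
	memoryKi int64
	labels   map[string]string
}

// dissimilarReason returns "" when a and b look similar, otherwise a
// human-readable reason suitable for a log line.
func dissimilarReason(a, b template, toleranceKi int64) string {
	diff := a.memoryKi - b.memoryKi
	if diff < 0 {
		diff = -diff
	}
	if diff > toleranceKi {
		return fmt.Sprintf("memory capacity differs by %dKi (tolerance %dKi)", diff, toleranceKi)
	}
	// Gather keys from both sides and sort them so the reported label is stable.
	seen := map[string]bool{}
	var keys []string
	for _, m := range []map[string]string{a.labels, b.labels} {
		for k := range m {
			if !seen[k] {
				seen[k] = true
				keys = append(keys, k)
			}
		}
	}
	sort.Strings(keys)
	for _, k := range keys {
		av, aok := a.labels[k]
		bv, bok := b.labels[k]
		if aok != bok || av != bv {
			return fmt.Sprintf("label %q differs: %q vs %q", k, av, bv)
		}
	}
	return ""
}

func main() {
	a := template{memoryKi: 15835076, labels: map[string]string{"nodegroup": "scalemultiaz"}}
	b := template{memoryKi: 15835084, labels: map[string]string{"nodegroup": "scalemultiaz"}}
	if reason := dissimilarReason(a, b, 0); reason != "" {
		fmt.Println("node groups are not similar:", reason)
	}
}
```

Returning the first failing check keeps the log to one line per pair of groups, which is usually enough to tell a capacity mismatch from a label mismatch.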
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/remove-lifecycle stale See #1676 (comment) for list of some of the reasons this might be happening. |
I was looking into this issue last week and came up with a PR which you may all be interested in: #3124. In my debugging, I was seeing a difference of 172032Ki across availability zones for m5.xlarge instances in us-east-2. I will add that I was using the cluster-api provider rather than the AWS provider, so this bug affects multiple providers. My finding was that the value of To avoid confusion, I've reworked this difference calculation to use Kubernetes Quantities and prevent conversions to integers where possible, to reduce the likelihood of mistakes being made in the future. So now the Before this patch I added a bunch of debug logic to see exactly the values that the code was receiving, which led me to this discovery. I was able to consistently reproduce the problem and now, with this fix, I can see that the values coming through are being compared properly to the 256Mi tolerance. What I would like to understand is why there are differences across instances (I've seen them across AZs and within AZs). In my testing, all machines booted from the same configuration and the same AMI, so there should be no difference in the kernel version as reported earlier. Also, I would like to understand which instance types are affected by this memory difference. Is it all Nitro-based EC2 instances? I've seen mention of C5 and M5; are R5 and T3 instances affected as well? |
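The arithmetic behind the numbers quoted here can be checked with a few lines. This sketch uses plain integers with explicit unit helpers rather than Kubernetes Quantities, and the helper names are made up for the example; the only inputs taken from the thread are the 172032Ki difference and the 256Mi tolerance.

```go
package main

import "fmt"

// kiToBytes converts a value in Ki (1024 bytes) to bytes.
func kiToBytes(ki int64) int64 { return ki * 1024 }

// miToBytes converts a value in Mi (1024 Ki) to bytes.
func miToBytes(mi int64) int64 { return mi * 1024 * 1024 }

func main() {
	diff := kiToBytes(172032)   // difference observed across AZs
	tolerance := miToBytes(256) // tolerance mentioned for the fix

	// Both sides are in bytes, so the comparison is unit-safe.
	fmt.Printf("difference: %d bytes (%dMi)\n", diff, diff/miToBytes(1))
	fmt.Printf("tolerance:  %d bytes\n", tolerance)
	fmt.Println("within tolerance:", diff <= tolerance)
}
```

172032Ki is exactly 168Mi, so once both sides are expressed in the same unit the observed difference sits inside the 256Mi tolerance.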
I've been doing some more testing of this today and have come up with some more results. Managed to find differences on the following instance types:
The large differences have only been seen on Nitro instances, but there are some differences on older instances as well. The larger differences are about 1% (1.14%, 1.05%, 0.7%). Perhaps we should just allow a small relative difference, as the allocatable and free comparisons do? Are there any problems or implications with that, that anyone can think of? |
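A relative tolerance of the kind suggested here could be sketched as follows. The function name, the 1.5% threshold and the two sample capacities are all made up for illustration (the capacities are chosen to differ by 1.14%, one of the figures quoted above).

```go
package main

import "fmt"

// withinRatio reports whether a and b differ by no more than maxRatio of
// the larger value.
func withinRatio(a, b int64, maxRatio float64) bool {
	larger, smaller := a, b
	if smaller > larger {
		larger, smaller = smaller, larger
	}
	if larger == 0 {
		return true // both values are zero
	}
	return float64(larger-smaller)/float64(larger) <= maxRatio
}

func main() {
	// Hypothetical capacities in Ki that differ by 1.14%.
	const a, b int64 = 16000000, 15817600

	fmt.Println("within 1.5%:", withinRatio(a, b, 0.015))
	fmt.Println("within 1.0%:", withinRatio(a, b, 0.010))
}
```

A ratio scales with instance size, so the same setting would cover both small and large instance types, at the cost of tolerating larger absolute differences on big nodes.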
When I first noticed this, the differences I observed were way, way (way!) smaller than 128KiB. Sometimes just 8KiB. I just scaled this up as a first pass in the absence of any hard numbers, and also so as not to dramatically perturb the existing behaviour. |
That makes sense! Thanks for chipping in. Do you happen to remember if you ever tested with 5th generation instances? I've just had an update from an AWS contact who suggested it may be due to different CPUs backing the same instance type. E.g. you may get two instances of the same type, but one may run on Skylake and the other on Cascade Lake. Perhaps there are also subtle differences in memory capacity between these two generations. |
My guess at the time was "video memory". And I think I concluded on that for the m4 instance types we were using by looking at the dmesg output. I'm sure BIOS revisions could also play a part here. But dmesg should help show what's what. |
I don't know if this is relevant here, but spot-enabled auto scaling groups typically have multiple instance types attached to them, and one of them will be selected based on spot market costs (I guess) at the time of creation. |
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
@fejta-bot: Closing this issue. In response to this:
Hi everyone, Over the past few days, I’ve been analyzing the AWS EC2 nodes currently running across multiple Kubernetes clusters in different regions, with various instance types. Here are the memory capacity differences I’ve found:
In summary, the differences range from 2% to 7%, which seems quite significant and inconsistent. I retrieved the memory capacity from the For additional context, we are using Karpenter with the Nitro hypervisor. Does anyone have any insights on why these discrepancies occur, and what percentage difference should be considered acceptable when calculating general memory capacity? I need to establish a baseline for memory capacity calculations. Thank you! |
Hi,
I'm testing the cluster autoscaler on our AWS EKS 1.12 cluster.
I created 3 identical ASGs in zones a/b/c, and created a test deployment using a basic nginx pod, which I scale up with commands like:
kubectl -n playground scale --replicas=4 deployment nginx-scaleout
I've sized the pods so that 2 will fit on each node.
I started with 3 nodes, one per AZ, and began scaling up the deployment. I saw it add nodes evenly at first so that each zone had 2 nodes. I then scaled up further until I had 3/3/2 nodes across the zones (so far so good). The next time it scaled up, however, it added a fourth node in zone A, so I had 4/3/2. I'm unsure why it did this instead of adding a new node in zone C.
The relevant part of the log is this:
I0515 10:17:35.100431 1 static_autoscaler.go:121] Starting main loop
I0515 10:17:35.133540 1 leaderelection.go:227] successfully renewed lease kube-system/cluster-autoscaler
I0515 10:17:35.743084 1 auto_scaling_groups.go:320] Regenerating instance to ASG map for ASGs: [eks-zeus-autoscaler20190501200602699100000004 eks-zeus-scalemultiaz-a-20190514154009657700000006 eks-zeus-scalemultiaz-b-20190514154009657700000004 eks-zeus-scalemultiaz-c-20190514154009657700000005]
I0515 10:17:35.845006 1 aws_manager.go:157] Refreshed ASG list, next refresh after 2019-05-15 10:17:45.84499724 +0000 UTC m=+45941.859559752
I0515 10:17:35.845350 1 utils.go:552] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
I0515 10:17:35.845378 1 static_autoscaler.go:252] Filtering out schedulables
I0515 10:17:35.942508 1 static_autoscaler.go:262] No schedulable pods
I0515 10:17:35.942538 1 scale_up.go:263] Pod playground/nginx-scaleout-85cf87558d-z2pmd is unschedulable
I0515 10:17:35.942546 1 scale_up.go:263] Pod playground/nginx-scaleout-85cf87558d-w2h2j is unschedulable
I0515 10:17:35.942667 1 scale_up.go:300] Upcoming 0 nodes
I0515 10:17:35.942721 1 utils.go:208] Pod nginx-scaleout-85cf87558d-z2pmd can't be scheduled on eks-zeus-autoscaler20190501200602699100000004, predicate failed: GeneralPredicates predicate mismatch, reason: node(s) didn't match node selector
I0515 10:17:35.942853 1 utils.go:198] Pod nginx-scaleout-85cf87558d-w2h2j can't be scheduled on eks-zeus-autoscaler20190501200602699100000004. Used cached predicate check results
I0515 10:17:35.942869 1 scale_up.go:406] No pod can fit to eks-zeus-autoscaler20190501200602699100000004
I0515 10:17:36.042655 1 waste.go:57] Expanding Node Group eks-zeus-scalemultiaz-a-20190514154009657700000006 would waste 50.00% CPU, 47.28% Memory, 48.64% Blended
I0515 10:17:36.042693 1 waste.go:57] Expanding Node Group eks-zeus-scalemultiaz-b-20190514154009657700000004 would waste 50.00% CPU, 47.28% Memory, 48.64% Blended
I0515 10:17:36.042705 1 waste.go:57] Expanding Node Group eks-zeus-scalemultiaz-c-20190514154009657700000005 would waste 50.00% CPU, 47.28% Memory, 48.64% Blended
I0515 10:17:36.042721 1 scale_up.go:418] Best option to resize: eks-zeus-scalemultiaz-a-20190514154009657700000006
I0515 10:17:36.042732 1 scale_up.go:422] Estimated 1 nodes needed in eks-zeus-scalemultiaz-a-20190514154009657700000006
I0515 10:17:36.042894 1 scale_up.go:501] Final scale-up plan: [{eks-zeus-scalemultiaz-a-20190514154009657700000006 3->4 (max: 6)}]
I0515 10:17:36.042918 1 scale_up.go:579] Scale-up: setting group eks-zeus-scalemultiaz-a-20190514154009657700000006 size to 4
I0515 10:17:36.042963 1 auto_scaling_groups.go:203] Setting asg eks-zeus-scalemultiaz-a-20190514154009657700000006 size to 4
And my configuration looks like this:
`spec:
  containers:
  - image: k8s.gcr.io/cluster-autoscaler:v1.12.3`
Any help would be appreciated, Thanks.