
Cluster-autoscaler not balancing similar node groups on AWS #3515

Closed
binman-docker opened this issue Sep 15, 2020 · 25 comments · Fixed by #3596

@binman-docker
Contributor

Environment configuration:

  • Three ASGs, one in each of three zones in the same region (this is recommended by AWS and Autoscaler)
  • Same instance type (m5.xlarge), same AMI
  • autoscaler v1.19.0
  • Configuration options:
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --expander=least-waste
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/mycluster
            - --namespace=infra-system
            - --cluster-name=mycluster
            - --write-status-configmap=true
            - --balance-similar-node-groups
            - --skip-nodes-with-system-pods=false
            - --balancing-ignore-label=ec2-tag.docker.com/aws-autoscaling-groupName,ec2-tag.docker.com/main-stack-name,ec2-tag.docker.com/role,topology.ebs.csi.aws.com/zone,failure-domain.beta.kubernetes.io/zone,kubernetes.io/hostname
  • Example of labels on a node in the cluster:
"labels": {
    "beta.kubernetes.io/arch": "amd64",
    "beta.kubernetes.io/instance-type": "m5.xlarge",
    "beta.kubernetes.io/os": "linux",
    "ec2-tag.docker.com/aws-autoscaling-groupName": "mycluster-us-east-1a",
    "ec2-tag.docker.com/main-stack-name": "staging",
    "ec2-tag.docker.com/role": "eks",
    "failure-domain.beta.kubernetes.io/region": "us-east-1",
    "failure-domain.beta.kubernetes.io/zone": "us-east-1a",
    "kubernetes.io/arch": "amd64",
    "kubernetes.io/hostname": "ip-10-128-1-1.docker-staging.io",
    "kubernetes.io/os": "linux",
    "topology.ebs.csi.aws.com/zone": "us-east-1a"
}

Action:
Run a deployment with enough replicas and CPU/Mem reservation to require several new nodes.

Expected outcome:
Cluster-autoscaler sees the three groups as "similar" and increases capacity in each by roughly equal amounts.

Actual outcome:
The cluster-autoscaler only selects a single group to increase by 3, instead of increasing by 1 in each of the 3 groups.

I0909 23:25:57.010598       1 filter_out_schedulable.go:170] 7 pods were kept as unschedulable based on caching
I0909 23:25:57.010612       1 filter_out_schedulable.go:171] 0 pods marked as unschedulable can be scheduled.
I0909 23:25:57.010626       1 filter_out_schedulable.go:82] No schedulable pods
I0909 23:25:57.010651       1 klogx.go:86] Pod example/nginx-deployment-c95d44984-gvxqx is unschedulable
I0909 23:25:57.010662       1 klogx.go:86] Pod example/nginx-deployment-c95d44984-5qw4c is unschedulable
I0909 23:25:57.010667       1 klogx.go:86] Pod example/nginx-deployment-c95d44984-lhz72 is unschedulable
I0909 23:25:57.010671       1 klogx.go:86] Pod example/nginx-deployment-c95d44984-fplpj is unschedulable
I0909 23:25:57.010676       1 klogx.go:86] Pod example/nginx-deployment-c95d44984-wcg84 is unschedulable
I0909 23:25:57.010681       1 klogx.go:86] Pod example/nginx-deployment-c95d44984-d7t4x is unschedulable
I0909 23:25:57.010686       1 klogx.go:86] Pod example/nginx-deployment-c95d44984-rkvx6 is unschedulable
I0909 23:25:57.010691       1 klogx.go:86] Pod example/nginx-deployment-c95d44984-6hldq is unschedulable
I0909 23:25:57.010749       1 scale_up.go:364] Upcoming 0 nodes
I0909 23:25:57.012026       1 waste.go:57] Expanding Node Group mycluster-us-east-1c would waste 33.33% CPU, 94.86% Memory, 64.10% Blended
I0909 23:25:57.012045       1 waste.go:57] Expanding Node Group mycluster-us-east-1a would waste 33.33% CPU, 94.86% Memory, 64.10% Blended
I0909 23:25:57.012068       1 waste.go:57] Expanding Node Group mycluster-us-east-1b would waste 33.33% CPU, 94.92% Memory, 64.13% Blended
I0909 23:25:57.012084       1 scale_up.go:456] Best option to resize: mycluster-us-east-1c
I0909 23:25:57.012097       1 scale_up.go:460] Estimated 3 nodes needed in mycluster-us-east-1c
I0909 23:25:57.012175       1 scale_up.go:574] Final scale-up plan: [{mycluster-us-east-1c 1->4 (max: 10)}]

What I've tried:

Unfortunately I have nothing else to go on; even the debug logs don't have any info on why the groups are not considered similar or why the autoscaler isn't balancing across them. Any advice is appreciated, thanks!

@ellistarn
Contributor

I'm assuming you're using balance-similar-node-groups since you rely on EBS? Otherwise you can just put multiple AZs in one ASG and let ASG do the balancing. As I understand it #3124 should have resolved your issue.

@binman-docker
Contributor Author

That's correct, we use one group per AZ because we have some deployments that require persistent storage (EBS).

I also thought #3124 would have done the trick but no such luck it seems. For reference, here are memory specs on some nodes in a cluster that isn't balancing properly:

$ kubectl get nodes -o jsonpath='{range .items[*]}{@.status.allocatable.memory}{" "}{@.metadata.labels.failure\-domain\.beta\.kubernetes\.io\/zone}{"\n"}{end}'
13359464Ki us-east-1a
13204840Ki us-east-1a
13359464Ki us-east-1a
13359464Ki us-east-1b
13204840Ki us-east-1b
13204840Ki us-east-1b
13204840Ki us-east-1b
13204840Ki us-east-1c
13204840Ki us-east-1c
13204840Ki us-east-1c

Which translates to roughly 1.1%, which should be within the new 1.5% limit.
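
For reference, the largest spread in the allocatable values above works out to (13359464 - 13204840) / 13359464 ≈ 1.16%, assuming the difference is measured relative to the larger value the way a max/min ratio check would do it.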

@TBBle

TBBle commented Sep 16, 2020

If I'm reading the source for 1.19.0 correctly, the status.allocatable.* difference cap is 5%, so 1.1% is well within the range.

It's the status.capacity.memory that has the 1.5% difference allowance introduced in #3124.

So perhaps check status.capacity.memory for the nodes, in case there's some larger variation being seen here, and the allowance needs to be bumped again.

Also, it shouldn't make a difference, but failure-domain.beta.kubernetes.io/zone and kubernetes.io/hostname are already in the BasicIgnoredLabels list, so you shouldn't need to specify them.

@binman-docker
Contributor Author

Ah right, here's capacity on my test cluster. Also roughly 1.1%:

$ kubectl get nodes -o jsonpath='{range .items[*]}{@.status.capacity.memory}{" "}{@.metadata.labels.failure\-domain\.beta\.kubernetes\.io\/zone}{"\n"}{end}'
15950184Ki us-east-1b
15950184Ki us-east-1c
16122216Ki us-east-1c
15950184Ki us-east-1a

Also, it shouldn't make a difference, but failure-domain.beta.kubernetes.io/zone and kubernetes.io/hostname are already in the BasicIgnoredLabels list, so you shouldn't need to specify them.

Yeah I wasn't able to trace apiv1.LabelZoneFailureDomain back to an actual list somewhere; I figured they would be included in that list wherever it was, but also figured it didn't hurt to rule it out.

@ellistarn
Contributor

ellistarn commented Sep 16, 2020

This could potentially be an issue with daemonsets or system pods. Can you check the values for all 3 groups?

	// MaxAllocatableDifferenceRatio describes how Node.Status.Allocatable can differ between
	// groups in the same NodeGroupSet
	MaxAllocatableDifferenceRatio = 0.05
	// MaxFreeDifferenceRatio describes how free resources (allocatable - daemon and system pods)
	// can differ between groups in the same NodeGroupSet
	MaxFreeDifferenceRatio = 0.05
	// MaxCapacityMemoryDifferenceRatio describes how Node.Status.Capacity.Memory can differ between
	// groups in the same NodeGroupSet
	MaxCapacityMemoryDifferenceRatio = 0.015
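
For context, here's a minimal sketch of the kind of max/min ratio check these constants feed into (an approximation for illustration, not the exact cluster-autoscaler code):

package nodegroupset

import (
	"math"

	"k8s.io/apimachinery/pkg/api/resource"
)

// withinTolerance treats two quantities as "similar" when their difference is
// no more than maxDifferenceRatio of the larger one.
func withinTolerance(a, b resource.Quantity, maxDifferenceRatio float64) bool {
	larger := math.Max(float64(a.MilliValue()), float64(b.MilliValue()))
	smaller := math.Min(float64(a.MilliValue()), float64(b.MilliValue()))
	return larger-smaller <= larger*maxDifferenceRatio
}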

@binman-docker
Contributor Author

OK, so I dug into what "Free" means in this context, because it's not a value you can get directly from the node info (correct me if I'm wrong anywhere; this is my first time really reading through this code):

The autoscaler looks to make sure "free" resources on a node are within 5% as part of similarity checks:

if !resourceMapsWithinTolerance(free, MaxFreeDifferenceRatio) {

// MaxFreeDifferenceRatio describes how free resources (allocatable - daemon and system pods)
// can differ between groups in the same NodeGroupSet
MaxFreeDifferenceRatio = 0.05

It does so by taking the Allocatable value and then subtracting the resources Requested:

for res, quantity := range node.Requested.ResourceList() {
	freeRes := node.Node().Status.Allocatable[res].DeepCopy()
	freeRes.Sub(quantity)
	free[res] = append(free[res], freeRes)
}

The type definition for Requested says it accounts for all pods on the node:

https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/vendor/k8s.io/kubernetes/pkg/scheduler/framework/v1alpha1/types.go#L184-L186

This would maybe make sense if it was only accounting for daemonsets, but accounting for ALL pods on a running node means you will often not be within 5%. I'm not sure this would ever work for systems actually in use unless they had very uniform deployments.

Maybe I'm reading this wrong? What's the point in comparing Free resources to determine similarity of node groups?

@ellistarn
Contributor

Nice digging! If true, this sounds like a bug to me.

@TBBle

TBBle commented Sep 17, 2020

I used the k8s.io/api/core/v1 docs to turn the constants into text, when verifying the existing basic label list.
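
For reference, the relevant constant values as documented in k8s.io/api/core/v1 at the time (copied here for convenience):

const (
	LabelHostname          = "kubernetes.io/hostname"
	LabelZoneFailureDomain = "failure-domain.beta.kubernetes.io/zone"
)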

@TBBle

TBBle commented Sep 17, 2020

I had been wondering about the 'Free' check. Under what circumstances is that useful, even if it worked as-described?

I'd think that if it's trying to detect "some DaemonSet is only present on some of these nodegroups", then there would be a label or something that the DaemonSet uses to distinguish the nodes it runs on, and that label would already mark the groups as dissimilar.

Hmm. If you do have DaemonSets that have Pods on some of your nodegroups but not others, then this is working-as-intended. I assume it's not the problem here, but maybe worth checking?

Inside the fold, I go looking for a place to fix bad 'Free' calculations, and it turns out the code's there.

I guess if the DaemonSet was using a nodeSelector for one of the ignored labels, e.g., you were running a DaemonSet only in one AZ, but not another, then this check is relevant, if you expect that difference to mark the node-groups dissimilar.

I'd expect a difference like that would be handled elsewhere, when CA is determining how many nodes to scale up, because I assume it's still using the individual node-group details when calculating "available resources for pods", rather than averaging the "available resources for pods" across the balanced node-groups.

Which suggests to me that there is code in the codebase somewhere to calculate the "Free" value we actually need here.

The 'Free' check is also not going to work when there are scale-from-zero node groups at play, as there will be no resource allocations for DaemonSets visible there. If I recall correctly, the Capacity and Allocatable values for scale-from-zero node groups are hard-coded in the Cloud Provider code, so it's only Free that won't have any values. And if only one of a set of three otherwise-similar node-groups is scaled to zero (the other two presumably being at 1 node each), then the 0-node node group will show totally different Free from the others, and will be incorrectly treated as dissimilar. Edit: ScaleUp knows what DaemonSets are on the system, and runs the predicates to determine DaemonSet resource costs when scaling a node from 0.

And that last edit points us at a fix for this issue, if it's been diagnosed correctly: we can use the NodeInfo Pods list, which in both the "scale from zero" and "copy an existing node in the group" cases only contains the DaemonSets.

It's also possible that the bug is that after building the node list, neither of these code-paths appears to recalculate Requested after trimming/creating the pod list. So that might be a simpler solution. I'd have kind-of expected that to be done by k8s.io/kubernetes/pkg/scheduler/framework/v1alpha1.NewNodeInfo, so possibly this is a bug introduced by changes in the scheduler framework in #3083, for 1.19.

Edit: NewNodeInfo does appear to calculate resources from the provided pods, so I'm back to square one.


Although Requested states it accounts for all pods on the node, at this point in the code flow, the NodeInfo is a template NodeInfo for the to-be-newly-created nodes, whose pod-list only contains daemonsets (and mirror pods if using a live node rather than scale-from-zero template), so the current calculation appears correct.

And from the hidden details, one question does appear: Is there any chance some of your existing nodes have Static or Mirror pods (annotation kubernetes.io/config.mirror) on them? They will be taken into account when calculating 'Free' resources, but are not visible in the 'Scale-from-zero' case, which may cause failed similarity checks, but probably isn't the problem here, as all your node groups were already extant.

I suspect at this point, a controlled reproduction case with IsCloudProviderNodeInfoSimilar modified to dump NodeInfo for each comparison might be needed to diagnose this, since it must be in the modified NodeInfo structure, rather than the state of the real nodes in the system.
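
For anyone attempting that, here's a rough sketch of the kind of dump I mean; the helper name is hypothetical and the field access assumes the 1.19 vendored scheduler framework types:

package nodegroupset

import (
	schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework/v1alpha1"
	"k8s.io/klog/v2"
)

// debugDumpNodeInfo logs the parts of a NodeInfo that feed the similarity
// checks; it would be called for both sides of each comparison.
func debugDumpNodeInfo(tag string, ni *schedulerframework.NodeInfo) {
	node := ni.Node()
	klog.V(4).Infof("%s: node=%s labels=%v", tag, node.Name, node.Labels)
	klog.V(4).Infof("%s: allocatable=%v requested=%v pods=%d",
		tag, node.Status.Allocatable, ni.Requested.ResourceList(), len(ni.Pods))
}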

@binman-docker
Contributor Author

Under what circumstances is that useful, even if it worked as-described?

Yeah this seems odd to me. If capacity and allocatable are similar, then the node groups are similar. Anything else running on them that would make them different is a concern outside of cluster-autoscaler in my opinion, since it's an infrastructure-level construct and shouldn't be concerned about system/application-level things like pods (other than "I need to make more room for these pods"). For use cases that I would assume are less common, where you would consider groups to be different based on certain pods running on them, you would just use labels to indicate them being different.

If you do have DaemonSets that have Pods on some of your nodegroups but not others, then this is working-as-intended. I assume it's not the problem here, but maybe worth checking?

Confirmed, all of the groups in my scenario are the same and have the same daemonsets running on them.

Is there any chance some of your existing nodes have Static or Mirror pods (annotation kubernetes.io/config.mirror) on them?

We don't configure any, unless EKS / EBS CSI driver / something else internal to the system is doing it. I don't see any mirror pods at least.

I suspect at this point, a controlled reproduction case with IsCloudProviderNodeInfoSimilar modified to dump NodeInfo for each comparison might be needed to diagnose this, since it must be in the modified NodeInfo structure, rather than the state of the real nodes in the system.

If someone can build such (or really just make the proposed changes on a branch somewhere, I can probably figure out how to build it), I'm happy to test it.

@tsunny

tsunny commented Oct 1, 2020

@ellistarn - Any update on this? Thanks!

@MaciekPytel
Contributor

MaciekPytel commented Oct 2, 2020

Balancing similar nodegroups involves 2 distinct checks:

  • Whether node groups are similar (ie. what is discussed in comments above).
  • Whether the same set of pending pods can go to all those nodegroups. Even if nodegroups are similar for the given definition, some pods may only be able to run on a specific one (most common reasons are when some pods need to run in a specific zone because of storage or zonal topology key pod (anti-)affinity). With zone balancing CA does binpacking for just one nodegroup (ie. simulates scheduling all those pods in the same zone). For other zones it just checks which pods would be able to run on a completely empty template node. If the result of this check is the same for all zones CA assumes it doesn't matter how nodes are spread across zones and it balances them. However, if the result is not the same we can't just assume it doesn't matter how we spread nodes across zones and so CA creates all nodes in a single zone (the one used in binpacking).

tl;dr - Balancing will not happen in a scale-up where at least one of pending pods uses EBS.

edit: To provide some more details - the CA algorithm can only do binpacking for a single nodegroup. It can't calculate a mixed scale-up (ie. a scale-up that would resize multiple nodegroups). Balancing is just a heuristic that tries to work around this by applying the result of binpacking done on a single nodepool to multiple nodepools, which is where its limitations come from. This is also where the limitations in comparing nodegroups come from - it would clearly be better for many use-cases to balance nodegroups that differ more. However, the code just assumes the result of binpacking is exactly the same, and so the nodegroups must be as close as possible from a scheduling perspective.
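
To illustrate that second check with a toy example (everything here is a simplified assumption for illustration, not the actual CA code), a single zone-pinned pending pod is enough to stop balancing:

package main

import "fmt"

// Toy model of the "can the same pending pods go to every similar group?" check.
type pod struct{ requiredZone string } // "" means the pod can run in any zone
type nodeGroup struct{ name, zone string }

// fitsAll mimics checking the pending pods against an empty template node of the group.
func fitsAll(pending []pod, g nodeGroup) bool {
	for _, p := range pending {
		if p.requiredZone != "" && p.requiredZone != g.zone {
			return false
		}
	}
	return true
}

func main() {
	// One pod pinned to us-east-1c, e.g. by a pre-existing zonal EBS volume.
	pending := []pod{{""}, {""}, {"us-east-1c"}}
	chosen := nodeGroup{"mycluster-us-east-1c", "us-east-1c"}
	others := []nodeGroup{
		{"mycluster-us-east-1a", "us-east-1a"},
		{"mycluster-us-east-1b", "us-east-1b"},
	}
	balanced := []nodeGroup{chosen}
	for _, g := range others {
		if fitsAll(pending, g) { // fails for 1a and 1b because of the pinned pod
			balanced = append(balanced, g)
		}
	}
	fmt.Printf("scale-up applied to %d group(s)\n", len(balanced)) // prints 1
}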

@TBBle

TBBle commented Oct 2, 2020

Assuming the original bug report was using this example nginx-deployment and not just a coincidental name, they shouldn't have been using EBS (or any PVC in fact), so that shouldn't be a problem here.

Perhaps @binman-docker, you could post the Pod details from your test, to check that there's nothing else in the Pod affinity etc. that would be failing the second step that @MaciekPytel has described.

Edit: Poking at the 1.19.0 code, if the second step (which I believe is a call to filterNodeGroupsByPods) is rejecting a node group, then we should see an Info-level message in the logs when a group is rejected because it cannot accept all the pods that the chosen group could accept.

So I don't think we're hitting this problem, it still seems like the comparator is reporting false incorrectly/unexpectedly.

@binman-docker
Contributor Author

Yeah there are no EBS volumes used in this test, here's the yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  namespace: test
  labels:
    app: nginx
spec:
  replicas: 20
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
        resources:
          limits:
            cpu: 1
            memory: 300Mi
          requests:
            cpu: 1
            memory: 300Mi

TBBle added a commit to TBBle/autoscaler that referenced this issue Oct 4, 2020
TBBle pushed a commit to TBBle/autoscaler that referenced this issue Oct 4, 2020
Debug hacks to help diagnose kubernetes#3515

Signed-off-by: Paul "Hampy" Hampson <[email protected]>
@TBBle

TBBle commented Oct 4, 2020

I've gone ahead and created a branch with logging for all 'not-similar' rejection paths (at --v=4 per the original repro case): https://github.com/TBBle/autoscaler/tree/cluster-autoscaler-1.19.0-noisy-IsCloudProviderNodeInfoSimilar

I've only build-tested this, and it's not intended for submission since it would generate way too much noise on a production cluster.

It should be pretty easy to produce a container image for this. Assuming you've followed the Getting the Code instructions, but with my branch checked out, it's a one-liner:

make -C cluster-autoscaler dev-release REGISTRY="your_container_registry"

@binman-docker
Contributor Author

Thank you! I will test this week and report back.

@binman-docker
Contributor Author

Thanks again for those edits. It seems like the balancing-ignore-label flag isn't working as expected. The cluster-autoscaler logs show the groups were considered not similar because of some labels, even though they are in the list.

See first comment for the configuration. Confirmed from the logs:

I1008 23:21:35.180197       1 flags.go:52] FLAG: --balancing-ignore-label="[ec2-tag.docker.com/aws-autoscaling-groupName,ec2-tag.docker.com/main-stack-name,ec2-tag.docker.com/role,topology.ebs.csi.aws.com/zone,failure-domain.beta.kubernetes.io/zone,kubernetes.io/hostname]"

Group similarity rejected thanks to the topology.ebs.csi.aws.com/zone label:

I1009 00:01:26.273760       1 compare_nodegroups.go:89] compareLabels: mismatched label topology.ebs.csi.aws.com/zone: [us-east-1b us-east-1a]
I1009 00:01:26.273775       1 balancing_processor.go:66] FindSimilarNodeGroups(mycluster-us-east-1b): mycluster-us-east-1a rejected as not-similar

And again thanks to the label ec2-tag.docker.com/aws-autoscaling-groupName (also in the ignore list):

I1009 00:01:26.273802       1 compare_nodegroups.go:89] compareLabels: mismatched label ec2-tag.docker.com/aws-autoscaling-groupName: [mycluster-us-east-1b mycluster-us-east-1c]
I1009 00:01:26.273812       1 balancing_processor.go:66] FindSimilarNodeGroups(mycluster-us-east-1b): mycluster-us-east-1c rejected as not-similar

There's no documentation for the balancing-ignore-label parameter in the FAQ: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-are-the-parameters-to-ca

But I'm pretty sure I've formatted it correctly; it's a multiStringFlag just like node-group-auto-discovery, which shows similar usage here for example: https://github.com/mumoshu/autoscaler/blob/3e8cc02243550fdde8f47d15fd7358310c1a6800/cluster-autoscaler/cloudprovider/aws/README.md

@TBBle

TBBle commented Oct 9, 2020

Okay, from that clue, I now know what's going on. It turns out that treating balancing-ignore-label like node-group-auto-discovery is a false equivalence.

- --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/mycluster

is a single string-value option asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/mycluster, which is carried through to parseASGAutoDiscoverySpec and broken up therein.

Similarly,

- --balancing-ignore-label=ec2-tag.docker.com/aws-autoscaling-groupName,ec2-tag.docker.com/main-stack-name,ec2-tag.docker.com/role,topology.ebs.csi.aws.com/zone,failure-domain.beta.kubernetes.io/zone,kubernetes.io/hostname

is a single string-value option ec2-tag.docker.com/aws-autoscaling-groupName,ec2-tag.docker.com/main-stack-name,ec2-tag.docker.com/role,topology.ebs.csi.aws.com/zone,failure-domain.beta.kubernetes.io/zone,kubernetes.io/hostname which is carried through to CreateAwsNodeInfoComparator, and taken as a single label to ignore.

That's what the FLAG: log in your last comment is telling us: it shows that the value was an array containing a single string, not the array of separate strings we needed.

So it's user error due to a total lack of documentation. The multiStringFlag handler treats multiple occurrences of the flag as a collection of multiple strings; it does not handle comma-separated values in a single flag.
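
For illustration, a multi-string flag in Go typically looks roughly like this sketch (an approximation of the pattern, not the exact cluster-autoscaler code); note that Set appends whatever string it is given, commas and all:

package main

import (
	"flag"
	"fmt"
	"strings"
)

// multiStringFlag collects one element per occurrence of the flag on the
// command line; it does no splitting on commas.
type multiStringFlag []string

func (f *multiStringFlag) String() string { return strings.Join(*f, ",") }

func (f *multiStringFlag) Set(value string) error {
	*f = append(*f, value)
	return nil
}

func main() {
	var labels multiStringFlag
	flag.Var(&labels, "balancing-ignore-label", "label to ignore when comparing node groups")
	flag.Parse()
	// --balancing-ignore-label=a,b                            => ["a,b"]
	// --balancing-ignore-label=a --balancing-ignore-label=b   => ["a" "b"]
	fmt.Printf("%q\n", []string(labels))
}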

Try:

- --balancing-ignore-label=ec2-tag.docker.com/aws-autoscaling-groupName
- --balancing-ignore-label=ec2-tag.docker.com/main-stack-name
- --balancing-ignore-label=ec2-tag.docker.com/role
- --balancing-ignore-label=topology.ebs.csi.aws.com/zone
- --balancing-ignore-label=failure-domain.beta.kubernetes.io/zone
- --balancing-ignore-label=kubernetes.io/hostname

although, as noted, the last two should not be necessary.

This would have been much easier to diagnose if, for example, CreateAwsNodeInfoComparator logged the resulting list of ignored labels, as we'd have immediately seen the "single long label with commas" issue, which wasn't super-obvious from the FLAG log message until we knew that a node-label mismatch was the failure reason.

I wonder if we could also improve this by having --balancing-ignore-label fail when given something that is not a valid k8s label name, e.g., a value with commas like this one would then fail at startup.
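
A minimal sketch of that validation idea, assuming apimachinery's qualified-name helper (this is a suggestion, not existing behaviour):

package main

import (
	"fmt"
	"strings"

	"k8s.io/apimachinery/pkg/util/validation"
)

// validateIgnoredLabels rejects anything that isn't a valid label key, so a
// comma-joined value like the one above would fail at startup.
func validateIgnoredLabels(labels []string) error {
	for _, l := range labels {
		if errs := validation.IsQualifiedName(l); len(errs) > 0 {
			return fmt.Errorf("invalid --balancing-ignore-label %q: %s", l, strings.Join(errs, "; "))
		}
	}
	return nil
}

func main() {
	fmt.Println(validateIgnoredLabels([]string{"kubernetes.io/hostname"}))               // <nil>
	fmt.Println(validateIgnoredLabels([]string{"a.example.com/x,b.example.com/y"}))      // error
}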

@binman-docker
Contributor Author

Yeah that bracketed list being quoted seemed off to me, which is why I tried to double-check with a similar flag.

Let me confirm it works with the hopefully correct formatting, and then I'll open a PR to update the docs.

@binman-docker
Contributor Author

Yep that's better:

I1009 18:39:23.482255       1 klogx.go:86] Pod hub/nginx-deployment-c95d44984-qd8gz is unschedulable
I1009 18:39:23.482264       1 klogx.go:86] Pod hub/nginx-deployment-c95d44984-kzjj5 is unschedulable
I1009 18:39:23.482268       1 klogx.go:86] Pod hub/nginx-deployment-c95d44984-dvmb8 is unschedulable
I1009 18:39:23.482272       1 klogx.go:86] Pod hub/nginx-deployment-c95d44984-n4tlw is unschedulable
I1009 18:39:23.482276       1 klogx.go:86] Pod hub/nginx-deployment-c95d44984-lqs8c is unschedulable
I1009 18:39:23.482281       1 klogx.go:86] Pod hub/nginx-deployment-c95d44984-fvqf2 is unschedulable
I1009 18:39:23.482285       1 klogx.go:86] Pod hub/nginx-deployment-c95d44984-6mgzh is unschedulable
I1009 18:39:23.482290       1 klogx.go:86] Pod hub/nginx-deployment-c95d44984-42mtk is unschedulable
I1009 18:39:23.482295       1 klogx.go:86] Pod hub/nginx-deployment-c95d44984-mpn4q is unschedulable
I1009 18:39:23.482300       1 klogx.go:86] Pod hub/nginx-deployment-c95d44984-pdcrh is unschedulable
I1009 18:39:23.482304       1 klogx.go:86] Pod hub/nginx-deployment-c95d44984-fdv9t is unschedulable
I1009 18:39:23.482339       1 scale_up.go:364] Upcoming 0 nodes
I1009 18:39:23.483962       1 scale_up.go:456] Best option to resize: mycluster-us-east-1c
I1009 18:39:23.483979       1 scale_up.go:460] Estimated 4 nodes needed in mycluster-us-east-1c
I1009 18:39:23.484034       1 scale_up.go:566] Splitting scale-up between 3 similar node groups: {mycluster-us-east-1c, mycluster-us-east-1a, mycluster-us-east-1b}
I1009 18:39:23.484052       1 scale_up.go:574] Final scale-up plan: [{mycluster-us-east-1a 1->3 (max: 10)} {mycluster-us-east-1b 1->3 (max: 10)}]

Ideally I would have liked to see it split across all three groups, but I assume it's trying to scale evenly (2 and 2) rather than unevenly (1 and 1 and 2), and that would be a separate issue.
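
For what it's worth, a toy sketch of that "even resulting sizes" reading (the starting size of us-east-1c below is an assumption chosen to reproduce the 2-and-2 split; this is not the real CA code):

package main

import (
	"fmt"
	"sort"
)

// Toy model: hand each new node to the currently smallest group, so the
// resulting totals end up as even as possible.
type group struct {
	name string
	size int
}

func balance(newNodes int, groups []group) map[string]int {
	plan := map[string]int{}
	for i := 0; i < newNodes; i++ {
		sort.Slice(groups, func(a, b int) bool { return groups[a].size < groups[b].size })
		groups[0].size++
		plan[groups[0].name]++
	}
	return plan
}

func main() {
	// 1a and 1b start at 1 node as in the log; 3 nodes for 1c is an assumption.
	groups := []group{{"us-east-1a", 1}, {"us-east-1b", 1}, {"us-east-1c", 3}}
	fmt.Println(balance(4, groups)) // map[us-east-1a:2 us-east-1b:2]
}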

I'll open a PR today to update the docs. Thank you to everyone who helped diagnose!

@ArchiFleKs

When using a custom launch template, these labels are added:

  • eks.amazonaws.com/sourceLaunchTemplateId=lt-0d60b9506f387d2d6
  • eks.amazonaws.com/sourceLaunchTemplateVersion=1

I guess these also need to be ignored for this to work, because the launch template ID and version might differ between similar pools?

@artificial-aidan

eks.amazonaws.com/nodegroup-image is also added. If your nodegroups have different images, this could be an issue.

@thiagosantosleite

@artificial-aidan do you have to add "eks.amazonaws.com/nodegroup", "beta.kubernetes.io/instance-type", and "node.kubernetes.io/instance-type" to --balancing-ignore-label?

@ArchiFleKs

ArchiFleKs commented Jan 25, 2022

I'm using these in autoscaler config:

  balancing-ignore-label_1: topology.ebs.csi.aws.com/zone
  balancing-ignore-label_2: eks.amazonaws.com/nodegroup
  balancing-ignore-label_3: eks.amazonaws.com/nodegroup-image
  balancing-ignore-label_4: eks.amazonaws.com/sourceLaunchTemplateId
  balancing-ignore-label_5: eks.amazonaws.com/sourceLaunchTemplateVersion

The default topology labels are already ignored by cluster-autoscaler, and if you are using only one instance type per node pool you should not need to add the instance-type labels.

That makes me wonder how people using multiple instance types per ASG use balance-similar-node-groups. Which labels is it based on? I mean, it works with CPU/RAM resources, but when scaling to and from 0 I guess it needs a selector or some indication of which ASG to scale.

@thiagosantosleite

Looks like the reverse idea is much better:
#4174

In my case I have 2 spot nodepools with 3 similar instance types from the same family.

Thanks @ArchiFleKs
