Increase or Allow controlling the value of MaxCapacityMemoryDifferenceRatio when comparing Node Groups #5381

Closed
CCOLLOT opened this issue Dec 22, 2022 · 2 comments · Fixed by #5402


CCOLLOT commented Dec 22, 2022

Which component are you using?:
I'm using cluster-autoscaler's balance-similar-node-groups feature (on EKS-based clusters).

Is your feature request designed to solve a problem? If so, describe the problem this feature should solve.
I am facing the same issue described here.

  • We use AWS Spot Node Groups with many instance types.
  • For a cluster, we have 3 node groups (1 per Availability Zone), and I get an uneven balance of nodes across AZs because some nodes are not considered similar.
    • This happens because the difference in capacity.memory is too high. I understand why MaxCapacityMemoryDifferenceRatio should not be too large, but in this case we want to run spot instances with as many instance types in the mix as possible, to avoid running out of options (and having the cluster shrink when the cloud provider reclaims the instances).
      In my case, I end up with nodes of relatively similar capacity (the m5d.xlarge and t3.xlarge instance types are both supposed to have 4 vCPUs and 16 GiB of memory) whose capacity.memory difference ratio is higher than the 1.5% MaxCapacityMemoryDifferenceRatio currently implemented in this part of the code, as the example below shows.

Example with an m5d.xlarge node vs. a t3.xlarge node:
m5d.xlarge:

allocatable:
  attachable-volumes-aws-ebs: "25"
  cpu: 3920m
  ephemeral-storage: "95551679124"
  hugepages-1Gi: "0"
  hugepages-2Mi: "0"
  memory: 14902308Ki
  pods: "58"
capacity:
  attachable-volumes-aws-ebs: "25"
  cpu: "4"
  ephemeral-storage: 104845292Ki
  hugepages-1Gi: "0"
  hugepages-2Mi: "0"
  memory: 15919140Ki
  pods: "58"

t3.xlarge:

allocatable:
  attachable-volumes-aws-ebs: "25"
  cpu: 3920m
  ephemeral-storage: "95551679124"
  hugepages-1Gi: "0"
  hugepages-2Mi: "0"
  memory: 15186976Ki
  pods: "58"
capacity:
  attachable-volumes-aws-ebs: "25"
  cpu: "4"
  ephemeral-storage: 104845292Ki
  hugepages-1Gi: "0"
  hugepages-2Mi: "0"
  memory: 16203808Ki
  pods: "58"

Describe the solution you'd like:
Allow changing this maxDifferenceRatio to fit cloud/user-specific use cases in a cloud-independent way.
I suggest making this value configurable through an optional flag specified when starting cluster-autoscaler. For example:

spec:
  containers:
  - command:
    - ./cluster-autoscaler
    - --v=4
    - ...
    - --balance-similar-node-groups
    - --balance-similar-node-groups-max-capacity-memory-difference-ratio=0.020
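
A rough sketch of how such a flag could be wired in (the flag name is the one proposed above and does not exist today; the wiring is illustrative only, not the implementation from the linked fix):

package main

import (
	"flag"
	"fmt"
)

// Illustrative only: this flag does not exist yet. The name matches the
// proposal above and the default mirrors today's hard-coded 0.015 constant.
var maxCapacityMemoryDifferenceRatio = flag.Float64(
	"balance-similar-node-groups-max-capacity-memory-difference-ratio",
	0.015,
	"maximum relative difference in capacity.memory for node groups to be considered similar")

func main() {
	flag.Parse()
	// The parsed value would replace the MaxCapacityMemoryDifferenceRatio
	// constant when building the node group comparator.
	fmt.Printf("using max capacity.memory difference ratio: %v\n", *maxCapacityMemoryDifferenceRatio)
}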

Describe any alternative solutions you've considered:

Optionally, create a new NodeInfoComparator that trusts the end user's judgment.
For instance, we could always consider two node groups to be similar if their nodes have a specific label in common.
Example:

labels:
  cluster-autoscaler-similar-node-group-class: foo
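
A minimal sketch of what such a label-based comparator could look like (the label key and function below are hypothetical, not part of cluster-autoscaler):

package main

import "fmt"

// similarityLabel is a hypothetical opt-in label end users would set on their
// node groups; it is not an existing cluster-autoscaler convention.
const similarityLabel = "cluster-autoscaler-similar-node-group-class"

// sameSimilarityClass treats two nodes as similar whenever both carry the
// opt-in label with the same value, regardless of capacity differences.
func sameSimilarityClass(labels1, labels2 map[string]string) bool {
	v1, ok1 := labels1[similarityLabel]
	v2, ok2 := labels2[similarityLabel]
	return ok1 && ok2 && v1 == v2
}

func main() {
	a := map[string]string{similarityLabel: "foo"}
	b := map[string]string{similarityLabel: "foo"}
	fmt.Println(sameSimilarityClass(a, b)) // true -> treated as similar node groups
}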
CCOLLOT added the kind/feature label on Dec 22, 2022
Bryce-Soghigian (Member) commented:

/assign Bryce-Soghigian

Bryce-Soghigian (Member) commented Jan 8, 2023

Added the difference ratios to Cluster Autoscaler flags.
