
AWS EKS cluster autoscaler only using ASGs that have at least one node #1676

Closed
leonsodhi-lf opened this issue Feb 12, 2019 · 14 comments
Labels
area/cluster-autoscaler, area/provider/aws, lifecycle/rotten

Comments

leonsodhi-lf commented Feb 12, 2019

CloudProvider: AWS EKS
Kubernetes version: 1.11.5
Cluster Autoscaler version: 1.3.6

With the autoscaler options shown below and multiple ASGs that are identical apart from their AZ, new nodes always end up being created in an ASG that already has at least one node. Looking at the logs, the autoscaler considers those ASGs to have the least memory waste:

waste.go:57] Expanding Node Group m5_large_az1 would waste 95.00% CPU, 99.09% Memory, 97.04% Blended
waste.go:57] Expanding Node Group m5_large_az2 would waste 95.00% CPU, 99.15% Memory, 97.07% Blended
waste.go:57] Expanding Node Group m5_large_az3 would waste 95.00% CPU, 99.15% Memory, 97.07% Blended
waste.go:57] Expanding Node Group m5_large_az4 would waste 95.00% CPU, 99.15% Memory, 97.07% Blended

If I manually set the desired count on the m5_large_az2 ASG to 1, I then see:

I0212 14:25:26.668623 1 waste.go:57] Expanding Node Group m5_large_az1 would waste 95.00% CPU, 99.09% Memory, 97.04% Blended
I0212 14:25:26.668656 1 waste.go:57] Expanding Node Group m5_large_az2 would waste 95.00% CPU, 99.09% Memory, 97.04% Blended
I0212 14:25:26.668668 1 waste.go:57] Expanding Node Group m5_large_az3 would waste 95.00% CPU, 99.15% Memory, 97.07% Blended
I0212 14:25:26.668678 1 waste.go:57] Expanding Node Group m5_large_az4 would waste 95.00% CPU, 99.15% Memory, 97.07% Blended

I believe this started happening after I added taints to the nodes, and tolerations and nodeSelectors to the pods.
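
(As a side note on those log lines: the blended figure appears to be roughly the average of the CPU and memory waste, e.g. (95.00 + 99.09) / 2 ≈ 97.04 and (95.00 + 99.15) / 2 ≈ 97.07, so even a 0.06 percentage-point difference in memory is enough for the least-waste expander to pick m5_large_az1 every time.)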

Autoscaler options:

- --v=4
- --stderrthreshold=info
- --cloud-provider=aws
- --skip-nodes-with-local-storage=false
- --skip-nodes-with-system-pods=false
- --balance-similar-node-groups=true
- --expander=least-waste
@aleksandra-malinowska
Contributor

That's probably because an actual node's memory is slightly different from the memory predicted from its machine type (due to kernel reservation, which is specific to a given OS/machine combination). It follows that the least-waste expander prefers existing groups, because it believes they have less memory and therefore waste less of it. #1643 will somewhat improve the prediction by caching node templates, but it's not a complete solution: if there has been any node in a group since the last autoscaler restart, resources from that node will be used in simulations.

Some options to fix this:

  1. Use a different expander, or modify the least-waste expander to ignore negligible differences (e.g. less than 1%?); see the sketch below.
  2. Follow up on #1656 (Adding ability to override allocatable resources via ASG tags) and add overriding memory via tags, with the tags set based on actual resources (this makes configuration harder).
  3. Rely on caching in future versions or, better yet, implement caching of resources by machine type in the cloud provider module (still not a complete solution, but it would work across groups, and we only really care about the case of empty and non-empty node groups with the same machine type).

I would probably go with (1); the others are more complicated.
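
A minimal, standalone sketch of option (1), just to illustrate the idea (the Option type, its fields, and the 1% tolerance are assumptions for this example, not the actual cluster-autoscaler expander API):

```go
package main

import "fmt"

// Option is a simplified stand-in for an expansion candidate; Score is the
// blended waste fraction (0.0 = nothing wasted, 1.0 = everything wasted).
type Option struct {
    NodeGroup string
    Score     float64
}

// bestOptions returns every candidate whose waste score is within tolerance
// of the minimum, instead of only the single minimum. The caller can then
// pick among them (e.g. randomly), so a group whose real node reports
// slightly less memory no longer wins every scale-up.
func bestOptions(options []Option, tolerance float64) []Option {
    if len(options) == 0 {
        return nil
    }
    best := options[0].Score
    for _, o := range options[1:] {
        if o.Score < best {
            best = o.Score
        }
    }
    var result []Option
    for _, o := range options {
        if o.Score-best <= tolerance {
            result = append(result, o)
        }
    }
    return result
}

func main() {
    opts := []Option{
        {"m5_large_az1", 0.9704},
        {"m5_large_az2", 0.9707},
        {"m5_large_az3", 0.9707},
        {"m5_large_az4", 0.9707},
    }
    // With a 1% tolerance all four groups tie, so scale-ups could be
    // spread across them instead of always landing in az1.
    for _, o := range bestOptions(opts, 0.01) {
        fmt.Println(o.NodeGroup)
    }
}
```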

@aleksandra-malinowska aleksandra-malinowska added area/cluster-autoscaler area/provider/aws Issues or PRs related to aws provider labels Feb 12, 2019
@leonsodhi-lf
Author

Thanks @aleksandra-malinowska. Focusing on (1), running another test with a pod that requests most of a node's memory seems to produce a larger difference between actual and predicted:

I0212 15:19:29.299066 1 waste.go:57] Expanding Node Group m5_large_az1 would waste 95.00% CPU, 30.25% Memory, 62.62% Blended
I0212 15:19:29.299080 1 waste.go:57] Expanding Node Group m5_large_az2 would waste 95.00% CPU, 25.63% Memory, 60.31% Blended
I0212 15:19:29.299091 1 waste.go:57] Expanding Node Group m5_large_az3 would waste 95.00% CPU, 30.25% Memory, 62.62% Blended
I0212 15:19:29.299098 1 waste.go:57] Expanding Node Group m5_large_az4 would waste 95.00% CPU, 30.25% Memory, 62.62% Blended

The rest of the suggested options sound promising, though. I'll think on it.
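
For illustration (these numbers are assumptions, not taken from the logs): if the template for an m5.large reports 8 GiB of memory while the real node reports roughly 7.5 GiB allocatable, a pod requesting about 5.6 GiB wastes (8 - 5.6) / 8 ≈ 30% of the templated node but only (7.5 - 5.6) / 7.5 ≈ 25% of the real one, which is about the gap shown above. A small pod barely moves either percentage, so in that case the gap stays at a fraction of a percentage point.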

@leonsodhi-lf
Author

@aleksandra-malinowska, in revisiting this I came up with an option (4) that, although not perfect, seems simple to implement and configure, and reliable.

The autoscaler could compare a label value on each node (e.g. cluster-autoscaler.kubernetes.io/hardware-configuration-id) and skip the resource check if the values match. A similar approach is already used for GKE on 1.14.x.
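
A rough sketch of what I have in mind (the label key and function are hypothetical; nothing like this exists in the autoscaler today):

```go
package main

import "fmt"

// hardwareConfigLabel is a hypothetical label key used only for this example.
const hardwareConfigLabel = "cluster-autoscaler.kubernetes.io/hardware-configuration-id"

// sameHardwareConfig reports whether two node label sets declare the same
// hardware configuration. When it returns true, the similarity check could
// skip the exact capacity comparison that currently trips over small
// kernel-reservation differences.
func sameHardwareConfig(a, b map[string]string) bool {
    av, aok := a[hardwareConfigLabel]
    bv, bok := b[hardwareConfigLabel]
    return aok && bok && av == bv
}

func main() {
    realNode := map[string]string{hardwareConfigLabel: "m5-large-al2"}
    templateNode := map[string]string{hardwareConfigLabel: "m5-large-al2"}
    fmt.Println(sameHardwareConfig(realNode, templateNode)) // true: skip the resource check
}
```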

Does this seem reasonable?

@aleksandra-malinowska
Contributor

Does this seem reasonable?

Seems so, @MaciekPytel WDYT?

Although I wonder if it could be made more generic than hard-coding a constant for each cloud provider. For example, why not make the key used for node group comparison configurable?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 24, 2019
@leonsodhi-lf
Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 24, 2019
@leonsodhi-lf
Author

@MaciekPytel would you mind having a look at this issue? I've forked and applied the fix I proposed, so let me know if a PR would be preferable.

@JacobHenner

This issue extends beyond the gap between detected memory capacity and template-estimated memory capacity. The problems I've found so far that prevent even scale-ups from 0 include:

  1. The template-based memory capacity is greater than the kubelet-reported memory capacity, and on larger instances this difference exceeds the tolerance used when comparing memory amounts. (For example, an m5.4xlarge instance nominally has 64 GiB, but the OS presents approximately 2.5 GiB less.)
  2. The k8s API reports capacity information that isn't generated when the autoscaler simulates a templated node (e.g. hugepages). These capacity items are compared for equality to determine whether ASGs are similar, so the similarity test always fails.
  3. Conversely, the templated node reports items that are not present in the k8s API, e.g. the nvidia.com/gpu resource.
  4. Some capacity resources exist in both places but are simulated with constants that don't match reality (e.g. https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/aws_manager.go#L362); there's even a comment stating that value is contrived.
  5. Some capacity items aren't simulated at all and need to be set using ASG tags, specifically ephemeral-storage. This isn't documented in the general or AWS FAQ (see the example below).
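
For reference on item 5, the ASG tag I mean looks roughly like this (the value is only an example): k8s.io/cluster-autoscaler/node-template/resources/ephemeral-storage = 100Gi.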

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 20, 2020
@JacobHenner

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 8, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 7, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 6, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

yaroslava-serdiuk pushed a commit to yaroslava-serdiuk/autoscaler that referenced this issue Feb 22, 2024
…etes#1676)

* Wait for webhooks server using probes

* Delete KueueReadyForTesting

* revert the setting of healthz

* Add a comment about the readyz probe