priority expander not working as expected for AWS Spot failure #4958

Closed
SahilKhullar11 opened this issue Jun 10, 2022 · 2 comments
Labels
area/cluster-autoscaler area/provider/aws Issues or PRs related to aws provider

Comments

@SahilKhullar11

We use the configuration below for the priority expander ConfigMap and have set --max-node-provision-time=5m0s.
priorities: |-
  30:
    - .*
  40:
    - .*OnDemand0.*
    - .*OnDemand1.*
    - .*OnDemand2.*
  50:
    - .*Spot0.*
    - .*Spot1.*
    - .*Spot2.*

CAS tries MediumDevNodeSpot1AutoScalingGroup and, after 5 minutes, marks it as not ready for scale-up.
It then tries MediumDevNodeSpot0AutoScalingGroup and, after 5 minutes, marks that as not ready for scale-up too.
It then tries MediumDevNodeSpot1AutoScalingGroup again instead of moving on to the next group. This also indicates that CAS marks a group as healthy again after 5 minutes and retries it.

Issue: As a result, CAS is stuck in a loop over Spot0, Spot1, and Spot2 and never moves on to the OnDemand ASGs.

Questions:

  1. Is there a way to keep them marked as "not ready for scale up" for more than 5 minutes?
  2. If no Spot capacity is available, can we move to the next priority level after 30 seconds or 1 minute instead of waiting 5 minutes (max-node-provision-time)?

Logs:
I0610 07:00:18.588135 1 priority.go:114] Successfully loaded priority configuration from configmap.
I0610 07:00:18.588420 1 priority.go:163] priority expander: MediumDevNodeSpot0AutoScalingGroup chosen as the highest available
I0610 07:00:18.588447 1 priority.go:163] priority expander: MediumDevNodeSpot1AutoScalingGroup chosen as the highest available
I0610 07:00:18.588458 1 priority.go:163] priority expander: MediumDevNodeSpot2AutoScalingGroup chosen as the highest available
I0610 07:00:18.588477 1 scale_up.go:468] Best option to resize: MediumDevNodeSpot1AutoScalingGroup
....
W0610 07:05:34.183561 1 scale_up.go:399] Node group MediumDevNodeSpot1AutoScalingGroup is not ready for scaleup - backoff
I0610 07:06:05.083339 1 priority.go:114] Successfully loaded priority configuration from configmap.
I0610 07:06:05.083595 1 priority.go:163] priority expander: MediumDevNodeSpot2AutoScalingGroup chosen as the highest available
I0610 07:06:05.083612 1 priority.go:163] priority expander: MediumDevNodeSpot0AutoScalingGroup chosen as the highest available
I0610 07:06:05.083630 1 scale_up.go:468] Best option to resize: MediumDevNodeSpot0AutoScalingGroup
....
W0610 07:10:19.431745 1 scale_up.go:399] Node group MediumDevNodeSpot1AutoScalingGroup is not ready for scaleup - backoff
W0610 07:11:20.558658 1 scale_up.go:399] Node group MediumDevNodeSpot0AutoScalingGroup is not ready for scaleup - backoff
....
I0610 07:11:20.582503 1 priority.go:114] Successfully loaded priority configuration from configmap.
I0610 07:11:20.582670 1 priority.go:163] priority expander: MediumDevNodeSpot1AutoScalingGroup chosen as the highest available
I0610 07:11:20.582688 1 priority.go:163] priority expander: MediumDevNodeSpot2AutoScalingGroup chosen as the highest available
I0610 07:11:20.582706 1 scale_up.go:468] Best option to resize: MediumDevNodeSpot1AutoScalingGroup

@gjtempleton gjtempleton added area/cluster-autoscaler area/provider/aws Issues or PRs related to aws provider labels Jun 14, 2022

SahilKhullar11 commented Jun 16, 2022

Found #4489. After switching to the latest image and updating the permissions on the CAS IAM role, it started working. Closing this issue.

@mohitreddy1996

@SahilKhullar11

Please correct me if I am wrong, but even with #4489, wouldn't it still skip the OnDemand ASGs since their priority is lower? AFAIU, with the priority expander, only the node groups matching the highest priority are ever used - https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/expander/priority/readme.md#configuration ("The priority should be a positive value. The highest value wins.").

The code also appears to work this way - https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/expander/priority/priority.go#L142
