Add support for all-or-nothing scale-up strategy #6821
Is this needed? Will it not be checked in 155-156 and ComputeExpansionOption?
In 155-156 we only exclude cases where there's at least one pod that can't be scheduled on nodes belonging to this group at all. If all pods are compatible with that node group, we'll just get a higher node count (possibly higher than the group's max size, but we can still balance across similar groups later). In computeExpansionOption we only cap the node count for a special case (groups that scale from 0 to max). We don't check the group max size in all cases in either of those places because of balancing across similar groups, which we only do just before executing the scale-up.
So without this check, for a scale-up not affected by the above two cases, we could create a node group and then decide not to scale it up.
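For illustration, a minimal Go sketch of the check being discussed, with hypothetical names (nodeGroup, shouldCreateGroup) rather than the real cluster-autoscaler types: if an all-or-nothing request can't fit into a single group of this shape, group creation is skipped upfront.

```go
// Illustrative only: before creating an autoprovisioned node group for an
// all-or-nothing scale-up, make sure the estimated node count fits within
// the group's max size, so we don't create a group and then refuse to
// scale it up.
package main

import "fmt"

// nodeGroup is a stand-in for a cloud-provider node group.
type nodeGroup struct {
	id      string
	maxSize int
}

func (g nodeGroup) MaxSize() int { return g.maxSize }

// shouldCreateGroup returns false when an all-or-nothing request cannot fit
// into a single group of this shape, so group creation is skipped upfront.
func shouldCreateGroup(g nodeGroup, newNodes int, allOrNothing bool) bool {
	if allOrNothing && newNodes > g.MaxSize() {
		return false
	}
	return true
}

func main() {
	g := nodeGroup{id: "autoprovisioned-pool", maxSize: 10}
	fmt.Println(shouldCreateGroup(g, 15, true))  // false: would exceed max size
	fmt.Println(shouldCreateGroup(g, 15, false)) // true: balancing may spread the excess
}
```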
I don't think MaxSize() is checked before balancing node groups below, because the amount over max can be distributed to the similar node groups. But why check newNodes against the max size of just one group, if we're balancing between groups below (and verifying that the total capacity is big enough) anyway?
This is checked only for an autoprovisioned group, just before it's created. I think we currently don't have information about how many similar groups there will be. I suspect estimating this may require changes on the cloud provider side (for example, if creating a group in one zone implicitly creates groups in another zone with identical taints/labels/resources).
We agreed with @kisieland offline that it's acceptable to initially limit autoprovisioning to the max size of one group and leave an open issue for handling this better in cases where an autoprovisioned group has non-obvious similar groups.
Do I understand correctly that the final total capacity check would still catch this (because the balancing processor also caps all the requests to MaxSize()), but this lets us avoid an unnecessary node group creation in that case?
In any case, this seems okay. As a nit, maybe leave a TODO for the issue when you have it.
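A rough sketch of that interplay, with made-up names (capAndCheck) and none of the real balancing-processor code: per-group requests are capped at each group's remaining capacity, and the total is compared against what was asked for, which is how an all-or-nothing request would still get caught at the end.

```go
// Illustrative only: cap each group's share at its remaining capacity and
// report whether the full all-or-nothing request could be satisfied.
package main

import "fmt"

type group struct {
	name    string
	maxSize int
	current int
}

// capAndCheck distributes the requested node count across similar groups,
// capping each share at the group's remaining capacity, and reports whether
// the full request could be satisfied.
func capAndCheck(groups []group, requested int) (map[string]int, bool) {
	plan := map[string]int{}
	remaining := requested
	for _, g := range groups {
		room := g.maxSize - g.current
		if room <= 0 {
			continue
		}
		take := room
		if take > remaining {
			take = remaining
		}
		plan[g.name] = take
		remaining -= take
		if remaining == 0 {
			break
		}
	}
	return plan, remaining == 0
}

func main() {
	groups := []group{{"pool-a", 5, 3}, {"pool-b", 5, 1}}
	plan, ok := capAndCheck(groups, 7)
	fmt.Println(plan, ok) // map[pool-a:2 pool-b:4] false -> all-or-nothing would bail out
}
```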
Yes, that's exactly the case.
Will do.
Please add some test cases
All of my new test cases are in the other PR: #6824
I can add some here as well to have more granularity.
Added a test case which fails if allOrNothing is false. More extensive test cases remain in #6824.
I think for podsOrchestrator allOrNothing=false
It is false and set in static_autoscaler.go; I'm just passing the variable here.
I wanted this to be set to whatever the autoscaler's default is, and the custom orchestrators to override it only as needed (like the atomic scale-up for ProvisioningRequests will).
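A small illustrative sketch of that pattern, using made-up names rather than the real orchestrator interfaces: the default path takes allOrNothing from the autoscaler-wide setting, while a ProvisioningRequest-style orchestrator forces it to true.

```go
// Illustrative only: the default orchestrator receives allOrNothing from the
// autoscaler's configuration, while a custom orchestrator (e.g. one handling
// ProvisioningRequests) always requests an atomic scale-up.
package main

import "fmt"

// scaleUpFunc is a stand-in for the orchestrator's scale-up entry point.
type scaleUpFunc func(pods int, allOrNothing bool)

func defaultScaleUp(pods int, allOrNothing bool) {
	fmt.Printf("default orchestrator: %d pods, allOrNothing=%v\n", pods, allOrNothing)
}

// provisioningRequestScaleUp wraps the default path but always requests an
// atomic scale-up, regardless of the autoscaler-wide default.
func provisioningRequestScaleUp(pods int, _ bool) {
	defaultScaleUp(pods, true)
}

func main() {
	const autoscalerDefault = false // set once, autoscaler-wide, in the real code
	var orchestrate scaleUpFunc = defaultScaleUp
	orchestrate(3, autoscalerDefault)

	orchestrate = provisioningRequestScaleUp
	orchestrate(3, autoscalerDefault) // overridden to allOrNothing=true
}
```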