Design proposal for parallel drain #4766
Conversation
Force-pushed from 8a95da6 to 681e130.
I left some comments. Most of them are just asking for clarification of some details that weren't obvious to me. Overall the proposal looks good to me. I know it was brought up at the SIG meeting, but given the size of the proposed changes I'd like to give one more shout-out to other owners. Any thoughts on this proposal @feiskyer @towca @mwielgus @gjtempleton?
Thanks for reviewing! I don't see the comments though, did you forget to send them?
Sorry, forgot to submit my review with all the comments
> In each iteration the contents of candidate_set will be updated. Nodes can be added, removed, or stay in candidate_set. For each node in candidate_set we will keep track of how long it has been there. If a node is removed and then re-added to candidate_set, the timer is reset.
>
> To trigger node deletion, we will use the already present ScaleDownUnneededTime and scaleDownUnreadyTime parameters. If in a given CA loop iteration there are nodes which have been in candidate_set for more than ScaleDownUnneededTime (or are not ready and have been in candidate_set for more than scaleDownUnreadyTime), the actuation of scale-down is triggered. The number of nodes which are actively scaled down will be limited by the MaxDrainParallelism and MaxScaleDownParallelism configuration parameters. We will configure separate limits for scaling down empty and non-empty nodes to allow quick scale-down of empty nodes even if lengthy drains of non-empty nodes are in progress.
If I understand this correctly, the max-empty-bulk-delete flag will not be used by the new algorithm and its role will effectively be filled by --max-scale-down-parallelism? If so, is that something that should be stated explicitly? Would it make sense to default the --max-scale-down-parallelism value to --max-empty-bulk-delete while we deprecate max-empty-bulk-delete?
Good point, mentioned this in the doc.
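To make the candidate_set timer bookkeeping concrete, here is a minimal Go sketch under assumed names (candidateSet, update, and readyForScaleDown are illustrative, not actual cluster-autoscaler code); it keeps a per-node timestamp that resets whenever a node leaves and later re-enters the set:

```go
package main

import (
	"fmt"
	"time"
)

// candidateSet is an illustrative sketch of the proposal's candidate_set
// bookkeeping; the type and method names are assumptions, not CA code.
type candidateSet struct {
	since map[string]time.Time // node name -> time it (re)entered the set
}

// update replaces the set contents; a node that was removed and re-added
// gets its timer reset, as described in the quoted excerpt.
func (c *candidateSet) update(current []string, now time.Time) {
	next := make(map[string]time.Time, len(current))
	for _, node := range current {
		if t, ok := c.since[node]; ok {
			next[node] = t // still a candidate: keep the original timestamp
		} else {
			next[node] = now // newly added or re-added: timer starts fresh
		}
	}
	c.since = next
}

// readyForScaleDown returns nodes that have been candidates for longer than
// scaleDownUnneededTime (an unready variant would use scaleDownUnreadyTime).
func (c *candidateSet) readyForScaleDown(now time.Time, scaleDownUnneededTime time.Duration) []string {
	var ready []string
	for node, t := range c.since {
		if now.Sub(t) >= scaleDownUnneededTime {
			ready = append(ready, node)
		}
	}
	return ready
}

func main() {
	cs := &candidateSet{since: map[string]time.Time{}}
	t0 := time.Now()
	cs.update([]string{"node-a", "node-b"}, t0)
	cs.update([]string{"node-a"}, t0.Add(5*time.Minute))                      // node-b dropped out of the set
	fmt.Println(cs.readyForScaleDown(t0.Add(11*time.Minute), 10*time.Minute)) // [node-a]
}
```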
> The existing scale-down code lives mostly in the ScaleDown object, spanning the scale_down.go file with 1.5k lines of code. As a part of this effort, the ScaleDown object will undergo refactoring, extracting utils to a separate file. ScaleDown itself will become an interface with two implementations (both relying on common utils): the existing version and the new algorithm described below. This will allow easy switching between algorithms with a flag: different flag values will pick different ScaleDown interface implementations.
>
> As a part of the refactoring, we will combine `FastGetPodsToMove` and `DetailedGetPodsForMove` into a single `PodsToMove` function. The `checkReferences` flag will be dropped and we will always do a detailed check. We are now using listers, so doing a detailed check does not add extra API calls and should not add too much to execution time.
+1 for doing that. I wonder though: we currently do another simulation pass in TryToScaleDown. Is that needed after this change? It seems unclear how it could work with the new algorithm (presumably we'd need to repeat the entire binpacking?) and it also feels pointless if we use the same PodsToMove implementation.
Am I correct in guessing you're planning to delete this part? If so, maybe it's worth mentioning it more explicitly.
Yes, in the new implementation we'll only need one simulation pass. Updated to state this explicitly.
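A minimal Go sketch of the flag-switched interface idea from the quoted excerpt; the type names and the flag name are assumptions for illustration, not the real cluster-autoscaler identifiers:

```go
package main

import (
	"flag"
	"fmt"
)

// ScaleDown is a stand-in for the proposed interface; both implementations
// below are illustrative placeholders, not the actual CA code.
type ScaleDown interface {
	RunOnce() error // one scale-down iteration
}

type legacyScaleDown struct{}   // existing one-node-at-a-time drain logic
type parallelScaleDown struct{} // new algorithm described in the proposal

func (legacyScaleDown) RunOnce() error   { fmt.Println("legacy scale-down"); return nil }
func (parallelScaleDown) RunOnce() error { fmt.Println("parallel-drain scale-down"); return nil }

// newScaleDown picks an implementation based on the flag value.
func newScaleDown(maxDrainParallelism int) ScaleDown {
	if maxDrainParallelism > 1 {
		return parallelScaleDown{}
	}
	return legacyScaleDown{}
}

func main() {
	maxDrain := flag.Int("max-drain-parallelism", 1, "illustrative flag name, not the real CA flag")
	flag.Parse()
	_ = newScaleDown(*maxDrain).RunOnce()
}
```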
> * List of NodeInfo computed at the beginning of previous algorithm iteration.
> * pdbs_remaining_disruptions - bookkeeping for graceful handling of PDBs ([details](#pdbs-checking)).
> * recent_evictions: set of pods that were recently successfully evicted from a node
Is this pods for which the eviction request has been successful, or just the ones where the pod is already gone? I'm wondering about the case of a pod with long graceful termination - would it show up on the list? If not, should it also be included in the "Verify the pods from nodes currently being scaled down can still be rescheduled" step below?
Pods are only added to recent_evictions once they are no longer running on a node. The two verification steps below are:
- checking if pods that are still running on a (currently drained) node can be scheduled somewhere
- checking if pods that are no longer running because they were evicted can be scheduled somewhere
recent_evictions exists only to check the second step. If a pod takes a lot of time to terminate, it will be verified in the first one.
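To make the distinction concrete, here is a small sketch (all names assumed, not actual CA code) of the two verification passes: one over pods still running on drained nodes, which covers long graceful terminations, and one over recent_evictions:

```go
package main

import "fmt"

type pod struct{ Name string }

// canSchedule stands in for a predicate-checker call against the current
// simulated cluster state; names here are assumptions for illustration.
func canSchedule(p pod) bool { return true }

// verifyDrainedNodePods covers pods still running on nodes being drained,
// including pods in long graceful termination (never in recent_evictions).
func verifyDrainedNodePods(podsOnDrainedNodes []pod) bool {
	for _, p := range podsOnDrainedNodes {
		if !canSchedule(p) {
			return false
		}
	}
	return true
}

// verifyRecentEvictions covers pods that have already left their node and
// are therefore tracked only via recent_evictions.
func verifyRecentEvictions(recentlyEvicted []pod) bool {
	for _, p := range recentlyEvicted {
		if !canSchedule(p) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(verifyDrainedNodePods([]pod{{Name: "slow-terminating"}}))
	fmt.Println(verifyRecentEvictions([]pod{{Name: "already-evicted"}}))
}
```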
> * pod_destination_hints:
>   * stores destination nodes computed during draining simulation. For each pod from nodes in candidate_set and deleted_set it keeps the destination node assigned during simulation.
>   * map: source node names -> pod UID -> destination node names
> * previous_node_infos
Why do we need it? The doc mentions updating this variable, but I can't find any reference to actually using it.
It is used to return unneeded nodes for deletion. The candidate_set is only a list of identifiers (node names).
So we only really use it later in the same loop (we may technically keep it until the next loop, but only because there is no obvious place to clear it and not much point in doing so)? Or do we actually use it in the next loop in any way? I don't particularly like the idea of operating on old and potentially outdated nodeInfos, and I don't see anything in the proposed algorithm that would require us to do so.
Yeah, it was meant as metadata for subsequent fetching of unneeded nodes / candidate_set. Removed, since it is both not the greatest idea and too low-level to include here.
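For reference, a tiny Go sketch of the nested-map shape described in the excerpt above for pod_destination_hints; the type and helper names are illustrative assumptions:

```go
package main

import "fmt"

// podDestinationHints maps source node name -> pod UID -> destination node
// chosen during drain simulation, as described in the proposal excerpt.
type podDestinationHints map[string]map[string]string

func (h podDestinationHints) set(sourceNode, podUID, destNode string) {
	if h[sourceNode] == nil {
		h[sourceNode] = map[string]string{}
	}
	h[sourceNode][podUID] = destNode
}

func (h podDestinationHints) get(sourceNode, podUID string) (string, bool) {
	dest, ok := h[sourceNode][podUID]
	return dest, ok
}

func main() {
	hints := podDestinationHints{}
	hints.set("node-a", "pod-uid-1", "node-b")
	fmt.Println(hints.get("node-a", "pod-uid-1")) // node-b true
}
```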
> Method steps in pseudocode:
>
> * Delete N empty nodes, up to MaxScaleDownParallelism.
Shouldn't this say 'up to MaxScaleDownParallelism - len(deleted_set)'? Or do we intend to just delete up to MaxScaleDownParallelism nodes each loop? If so I'd rename the parameter.
This is meant to include node deletions that started in previous iterations. Rephrased to make it clear.
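A tiny sketch of the clarified budgeting: new empty-node deletions started in a loop iteration are capped by the overall limit minus deletions already in flight from earlier iterations. The function and parameter names are assumptions:

```go
package main

import "fmt"

// emptyNodeDeletionBudget returns how many new empty-node deletions can be
// started now, given the parallelism limit, deletions still in flight from
// previous iterations, and the number of empty candidates available.
func emptyNodeDeletionBudget(maxScaleDownParallelism, inFlightDeletions, emptyCandidates int) int {
	budget := maxScaleDownParallelism - inFlightDeletions
	if budget < 0 {
		budget = 0
	}
	if emptyCandidates < budget {
		budget = emptyCandidates
	}
	return budget
}

func main() {
	// 10 total slots, 4 deletions still running, 8 empty candidates -> start 6 now.
	fmt.Println(emptyNodeDeletionBudget(10, 4, 8))
}
```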
> target_node_pointer.
>
> In the initial version, vanilla SchedulerBasedPredicateChecker will be used. Implementing support for target_node_pointer is essentially an optimization
As I said in our offline discussion, I'd only implement it if we actually see performance issues. Personally, I'm not convinced that we'll see performance issues (the binpacking isn't all that different from what we do in FilterOutSchedulable) and, even if we do, I'm not convinced this optimization will help all that much. I can see how it can limit the number of Filters() calls, but it doesn't change how many PreFilters() we need to run, and I wouldn't be surprised if it turns out that PreFilters() are what really matters (especially with #4745).
Of course this is just guessing on my side and I may be proven completely wrong :)
That being said - a relatively large part of the algorithm description covers updates of target_node_pointer. I'm not going to block this PR on it, but I'd consider removing that part. It complicates the description a lot, compared to just saying "we'd run FitsAnyNode/FitsAnyNodeMatching" and then having this section say "if we have performance problems we can run predicate checking from a pointer that moves cyclically over the list of nodes. This way we decrease the expected number of Filters() calls needed for each pod."
Fair enough, I dropped most of the target_node_pointer description and mentioned it only as a potential future optimization.
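For completeness, a rough sketch of what such an optimization could look like, under the assumption that target_node_pointer boils down to starting predicate checks from a cyclically advancing index; all names here are illustrative and this is not the actual CA implementation:

```go
package main

import "fmt"

// cyclicFitChecker keeps a pointer into the node list so that repeated
// FitsAnyNode-style checks don't always start from index 0, reducing the
// expected number of per-node Filter calls. Names are assumptions.
type cyclicFitChecker struct {
	nodes   []string
	pointer int
}

// fits stands in for running scheduler predicates for a pod against one node.
func fits(pod, node string) bool { return true }

// findNodeFor returns the first node, starting at the pointer, that fits the
// pod, and advances the pointer past it so the next search continues there.
func (c *cyclicFitChecker) findNodeFor(pod string) (string, bool) {
	for i := 0; i < len(c.nodes); i++ {
		idx := (c.pointer + i) % len(c.nodes)
		if fits(pod, c.nodes[idx]) {
			c.pointer = (idx + 1) % len(c.nodes)
			return c.nodes[idx], true
		}
	}
	return "", false
}

func main() {
	c := &cyclicFitChecker{nodes: []string{"node-a", "node-b", "node-c"}}
	fmt.Println(c.findNodeFor("pod-1")) // node-a true
	fmt.Println(c.findNodeFor("pod-2")) // node-b true: search resumed after the last hit
}
```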
> represents a state in which CA decides to skip scale down logic due to ongoing deletion of a single non-empty node. In the new algorithm, this status will only be emitted when max parallelism for both empty and non-empty nodes was reached and hence no new deletion can happen.
cc: @towca
This makes sense to me, but I think you have more expertise here than I do.
Generally looks good and I'm excited to see this. Some general comments/queries from a first pass through, though I need to spend some more time on the algorithm and target_node_pointer.
> ## Rollout
>
> Flag controlled: when --max-parallel-drain > 0, the new logic is enabled. Old
Would we expect to see a significant difference in behaviour between this algorithm and the existing behaviour if setting --max-parallel-drain=1 once the old logic has been dropped?
Hopefully not, unless there is a bug. I think the main difference would be that bulk deletion of empty nodes will start happening in the same loop: currently if CA removes any empty nodes, it won't attempt a drain. However, this shouldn't really affect users in any visible way.
Btw, updated the flag name & condition, I guess drain parallelism == 1 is what should be used for the old logic initially.
> Flag controlled: when --max-parallel-drain > 0, the new logic is enabled. Old logic will stay for a while to make rollback possible. The flag will be introduced in version 1.X (probably 1.24), default it to non-0 in 1.(X+1) and
Is it reasonable to aim for 1.24 here given that upstream has already hit code freeze?
Good point, 1.24 doesn't seem reasonable anymore, updated.
> ## Monitoring
>
> The existing set of metrics will suffice for the sake of monitoring performance
Do we have any metrics we want to benchmark against to validate our intended improvements to the problem called out in the Background section?
I think the rate of scaled_down_nodes_total should visibly improve once the new algorithm is implemented, especially in clusters that don't have too many empty nodes (just underutilized ones).
> ### Changes to ScaleDownStatus
>
> ScaleDownStatus does not play very well with the concept of multiple nodes being scaled down. The existing ScaleDownInProgress value in the ScaleDownResult enum
Would we be better off renaming ScaleDownInProgress or introducing a new field? If we're repurposing this existing status for our new purposes, I don't think the name is accurate, potentially leading to confusion.
ScaleDownInProgress - the scale down wasn't attempted, because a previous scale-down was still in progress.
I think technically this is still going to be correct, but maybe ScaleDownMaxParallelismReached would be more accurate. @towca @MaciekPytel - any thoughts?
I'm ok either way - I think the condition is analogous to the current one, especially if the current condition can show up when scale-down of empty nodes is rate limited (not sure if it can?). But ScaleDownMaxParallelismReached is indeed much clearer and I see no reason not to add it.
Actually, empty node deletions can currently pile up - if they take a lot of time, there will be no error and CA will keep on scheduling more node deletions. That probably should result in some scale-down error though.
Rate limiting is a slightly different mechanism than the one I'm proposing here. To avoid naming the status after the implementation, I decided to use ScaleDownThrottled instead, so we're free to change the throttling mechanism in the future without affecting the status type.
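A sketch of how the new result value discussed here could sit next to the existing one; the enum shape and comments are illustrative, not the actual ScaleDownResult definition in cluster-autoscaler:

```go
package main

import "fmt"

// ScaleDownResult mirrors the idea of the status enum discussed above;
// other existing values are elided and the names here are illustrative.
type ScaleDownResult int

const (
	ScaleDownInProgress ScaleDownResult = iota // legacy meaning: a previous single-node drain is still running
	ScaleDownThrottled                         // proposed: parallelism limits for empty and non-empty nodes are both reached
)

func describe(r ScaleDownResult) string {
	if r == ScaleDownThrottled {
		return "no new deletion started: max scale-down parallelism reached"
	}
	return "other scale-down result"
}

func main() {
	fmt.Println(describe(ScaleDownThrottled))
}
```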
thanks for the hard work here @x13n, this is very thorough and reads well to me. i think this is a good feature to add and i like the approach of creating an interface to allow for easy switching between the implementations. i do have a couple questions:
- what are your thoughts about the candidate_set with respect to an unexpected restart of the cluster autoscaler? (eg what type of effect would losing the cached data have)
- from the monitoring perspective, does this have any effect on the status configmap where the autoscaler records recent activity?
Thanks @elmiko for reviewing!
As far as I can tell, the only scaledown-related information passed to the status configmap is about scale down candidates (updated via
thanks for the answers @x13n, i'm generally +1 on this
> * Iterate over all the pods in recent_evictions
> * If a pod was added to the list N or more seconds ago, remove it from recent_evictions
> * If a pod has a known owner reference, check if parent object has created
We could maybe further improve this by grouping by owner ref and checking if (target_replicas - current_replicas) > recent_evictions(controller). For example, if we've evicted 4 pods from a given deployment but the deployment is only missing 1 replica, we could keep just one of those on the list.
This is just pointing out a possible future improvement. I don't think this is required from initial implementation and this comment doesn't need to be addressed in any way in this design.
Yup, good point. Updated the doc so this idea doesn't get lost.
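A sketch of that possible future improvement: group recent_evictions entries by owner reference and keep only as many per owner as the controller is still missing replicas for. Types and names are assumed for illustration:

```go
package main

import "fmt"

type evictedPod struct {
	Name     string
	OwnerUID string
}

type ownerStatus struct {
	TargetReplicas  int
	CurrentReplicas int
}

// trimRecentEvictions keeps, per owner, at most (target - current) evicted
// pods; pods the controller has already replaced are dropped from the list.
func trimRecentEvictions(evicted []evictedPod, owners map[string]ownerStatus) []evictedPod {
	remaining := map[string]int{}
	for uid, s := range owners {
		if missing := s.TargetReplicas - s.CurrentReplicas; missing > 0 {
			remaining[uid] = missing
		}
	}
	var kept []evictedPod
	for _, p := range evicted {
		if remaining[p.OwnerUID] > 0 {
			kept = append(kept, p)
			remaining[p.OwnerUID]--
		}
	}
	return kept
}

func main() {
	evicted := []evictedPod{{"web-1", "dep-1"}, {"web-2", "dep-1"}, {"web-3", "dep-1"}, {"web-4", "dep-1"}}
	owners := map[string]ownerStatus{"dep-1": {TargetReplicas: 4, CurrentReplicas: 3}} // only 1 replica missing
	fmt.Println(len(trimRecentEvictions(evicted, owners)))                             // 1
}
```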
/lgtm @gjtempleton Are you satisfied with the current state of this PR or do you think it needs more iteration?
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: MaciekPytel, x13n. The full list of commands accepted by this bot can be found here. The pull request process is described here.
/lgtm
Since there were no more comments for over a week, I'll just go ahead and cancel the hold. I'm happy to address further comments in separate PRs if needed. /hold cancel
How or where do we keep track of the progress of the parallel drain feature? This is very exciting. I'd love to contribute to testing.
Ah, good point, there was no tracking issue, so I just created one: #5079. I'll update the issue once there is something that can be tested.
Which component this PR applies to?
cluster-autoscaler
What type of PR is this?
/kind feature
What this PR does / why we need it:
This is a proposal for draining multiple nodes in parallel, which will improve scale down performance in large clusters.
Which issue(s) this PR fixes:
n/a
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: