Design proposal for parallel drain #4766
Conversation
Force-pushed from 8a95da6 to 681e130.
I left some comments. Most of them are just asking for clarification of some details that weren't obvious to me. Overall the proposal looks good to me. I know it was brought up at the SIG meeting, but given the size of the proposed changes I'd like to give one more shout-out to other owners. Any thoughts on this proposal @feiskyer @towca @mwielgus @gjtempleton?
Thanks for reviewing! I don't see the comments though, did you forget to send them?
Sorry, forgot to submit my review with all the comments
> In each iteration the contents of candidate_set will be updated. Nodes can be added, removed, or stay in candidate_set. For each node in candidate_set we will keep track of how long it has been there. If a node is removed and then re-added to candidate_set, the timer is reset.
>
> To trigger node deletion, we will use the already present ScaleDownUnneededTime and scaleDownUnreadyTime parameters. If in a given CA loop iteration there are nodes which have been in candidate_set for more than ScaleDownUnneededTime (or are not ready and have been in candidate_set for more than scaleDownUnreadyTime), the actuation of scale-down is triggered. The number of nodes which are actively scaled down will be limited by the MaxDrainParallelism and MaxScaleDownParallelism configuration parameters. We will configure separate limits for scaling down empty and non-empty nodes to allow quick scale-down of empty nodes even if lengthy drains of non-empty nodes are in progress.
If I understand this correctly, the max-empty-bulk-delete flag will not be used by the new algorithm and its role will effectively be filled by --max-scale-down-parallelism? If so, is that something that should be stated explicitly? Would it make sense to default the --max-scale-down-parallelism value to --max-empty-bulk-delete while we deprecate max-empty-bulk-delete?
Good point, mentioned this in the doc.
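To make the candidate_set timer bookkeeping concrete, here is a minimal Go sketch under assumed names (candidateSet, update, and readyForScaleDown are illustrative, not actual cluster-autoscaler code); it keeps a per-node timestamp that resets whenever a node leaves and later re-enters the set:

```go
package main

import (
	"fmt"
	"time"
)

// candidateSet is an illustrative sketch of the proposal's candidate_set
// bookkeeping; the type and method names are assumptions, not CA code.
type candidateSet struct {
	since map[string]time.Time // node name -> time it (re)entered the set
}

// update replaces the set contents; a node that was removed and re-added
// gets its timer reset, as described in the quoted excerpt.
func (c *candidateSet) update(current []string, now time.Time) {
	next := make(map[string]time.Time, len(current))
	for _, node := range current {
		if t, ok := c.since[node]; ok {
			next[node] = t // still a candidate: keep the original timestamp
		} else {
			next[node] = now // newly added or re-added: timer starts fresh
		}
	}
	c.since = next
}

// readyForScaleDown returns nodes that have been candidates for longer than
// scaleDownUnneededTime (an unready variant would use scaleDownUnreadyTime).
func (c *candidateSet) readyForScaleDown(now time.Time, scaleDownUnneededTime time.Duration) []string {
	var ready []string
	for node, t := range c.since {
		if now.Sub(t) >= scaleDownUnneededTime {
			ready = append(ready, node)
		}
	}
	return ready
}

func main() {
	cs := &candidateSet{since: map[string]time.Time{}}
	t0 := time.Now()
	cs.update([]string{"node-a", "node-b"}, t0)
	cs.update([]string{"node-a"}, t0.Add(5*time.Minute))                      // node-b dropped out of the set
	fmt.Println(cs.readyForScaleDown(t0.Add(11*time.Minute), 10*time.Minute)) // [node-a]
}
```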
> The existing scale-down code lives mostly in the ScaleDown object, spanning the scale_down.go file with 1.5k lines of code. As a part of this effort, the ScaleDown object will undergo refactoring, extracting utils to a separate file. ScaleDown itself will become an interface with two implementations (both relying on common utils): the existing version and the new algorithm described below. This will allow easy switching between algorithms with a flag: different flag values will pick different ScaleDown interface implementations.
>
> As a part of the refactoring, we will combine `FastGetPodsToMove` and `DetailedGetPodsForMove` into a single `PodsToMove` function. The `checkReferences` flag will be dropped and we will always do a detailed check. We are now using listers, so doing a detailed check does not add extra API calls and should not add too much to execution time.
+1 for doing that. I wonder though: we currently do another simulation pass in TryToScaleDown. Is that needed after this change? It seems unclear how it could work with the new algorithm (presumably we'd need to repeat the entire binpacking?) and it also feels pointless if we use the same PodsToMove implementation.
Am I correct in guessing you're planning to delete this part? If so, maybe it's worth mentioning it more explicitly.
Yes, in the new implementation we'll only need one simulation pass. Updated to state this explicitly.
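A minimal Go sketch of the flag-switched interface idea from the quoted excerpt; the type names and the flag name are assumptions for illustration, not the real cluster-autoscaler identifiers:

```go
package main

import (
	"flag"
	"fmt"
)

// ScaleDown is a stand-in for the proposed interface; both implementations
// below are illustrative placeholders, not the actual CA code.
type ScaleDown interface {
	RunOnce() error // one scale-down iteration
}

type legacyScaleDown struct{}   // existing one-node-at-a-time drain logic
type parallelScaleDown struct{} // new algorithm described in the proposal

func (legacyScaleDown) RunOnce() error   { fmt.Println("legacy scale-down"); return nil }
func (parallelScaleDown) RunOnce() error { fmt.Println("parallel-drain scale-down"); return nil }

// newScaleDown picks an implementation based on the flag value.
func newScaleDown(maxDrainParallelism int) ScaleDown {
	if maxDrainParallelism > 1 {
		return parallelScaleDown{}
	}
	return legacyScaleDown{}
}

func main() {
	maxDrain := flag.Int("max-drain-parallelism", 1, "illustrative flag name, not the real CA flag")
	flag.Parse()
	_ = newScaleDown(*maxDrain).RunOnce()
}
```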
> * List of NodeInfo computed at the beginning of previous algorithm iteration.
> * pdbs_remaining_disruptions - bookkeeping for graceful handling of PDBs ([details](#pdbs-checking)).
> * recent_evictions: set of pods that were recently successfully evicted from a node
Is this pods for which the eviction request has been successful, or just the ones where the pod is already gone? I'm wondering about the case of a pod with long graceful termination - would it show up on the list? If not, should it also be included in the "Verify the pods from nodes currently being scaled down can still be rescheduled" step below?
Pods are only added to recent_evictions once they are no longer running on a node. The two verification steps below are:
- checking if pods that are still running on a (currently drained) node can be scheduled somewhere
- checking if pods that are no longer running because they were evicted can be scheduled somewhere
recent_evictions exists only to check the second step. If a pod takes a lot of time to terminate, it will be verified in the first one.
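To make the distinction concrete, here is a small sketch (all names assumed, not actual CA code) of the two verification passes: one over pods still running on drained nodes, which covers long graceful terminations, and one over recent_evictions:

```go
package main

import "fmt"

type pod struct{ Name string }

// canSchedule stands in for a predicate-checker call against the current
// simulated cluster state; names here are assumptions for illustration.
func canSchedule(p pod) bool { return true }

// verifyDrainedNodePods covers pods still running on nodes being drained,
// including pods in long graceful termination (never in recent_evictions).
func verifyDrainedNodePods(podsOnDrainedNodes []pod) bool {
	for _, p := range podsOnDrainedNodes {
		if !canSchedule(p) {
			return false
		}
	}
	return true
}

// verifyRecentEvictions covers pods that have already left their node and
// are therefore tracked only via recent_evictions.
func verifyRecentEvictions(recentlyEvicted []pod) bool {
	for _, p := range recentlyEvicted {
		if !canSchedule(p) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(verifyDrainedNodePods([]pod{{Name: "slow-terminating"}}))
	fmt.Println(verifyRecentEvictions([]pod{{Name: "already-evicted"}}))
}
```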
> * pod_destination_hints:
>   * stores destination nodes computed during draining simulation. For each pod from nodes in candidate_set and deleted_set it keeps the destination node assigned during simulation.
>   * map: source node names -> pod UID -> destination node names
> * previous_node_infos
Why do we need it? The doc mentions updating this variable, but I can't find any reference to actually using it.
It is used to return unneeded nodes for deletion. The candidate_set is only a list of identifiers (node names).
So we only really use it later in the same loop (we may technically keep it until the next loop, but only because there is no obvious place to clear it and not much point in doing so)? Or do we actually use it in the next loop in any way? I don't particularly like the idea of operating on old and potentially outdated nodeInfos, and I don't see anything in the proposed algorithm that would require us to do so.
Yeah, it was meant as metadata for subsequent fetching of unneeded nodes / candidate_set. Removed, since it is both not the greatest idea and too low-level to include here.
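For reference, a tiny Go sketch of the nested-map shape described in the excerpt above for pod_destination_hints; the type and helper names are illustrative assumptions:

```go
package main

import "fmt"

// podDestinationHints maps source node name -> pod UID -> destination node
// chosen during drain simulation, as described in the proposal excerpt.
type podDestinationHints map[string]map[string]string

func (h podDestinationHints) set(sourceNode, podUID, destNode string) {
	if h[sourceNode] == nil {
		h[sourceNode] = map[string]string{}
	}
	h[sourceNode][podUID] = destNode
}

func (h podDestinationHints) get(sourceNode, podUID string) (string, bool) {
	dest, ok := h[sourceNode][podUID]
	return dest, ok
}

func main() {
	hints := podDestinationHints{}
	hints.set("node-a", "pod-uid-1", "node-b")
	fmt.Println(hints.get("node-a", "pod-uid-1")) // node-b true
}
```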
> Method steps in pseudocode:
>
> * Delete N empty nodes, up to MaxScaleDownParallelism.
Shouldn't this say 'up to MaxScaleDownParallelism - len(deleted_set)'? Or do we intend to just delete up to MaxScaleDownParallelism nodes each loop? If so I'd rename the parameter.
This is meant to include node deletions that started in previous iterations. Rephrased to make it clear.
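A tiny sketch of the clarified budgeting: new empty-node deletions started in a loop iteration are capped by the overall limit minus deletions already in flight from earlier iterations. The function and parameter names are assumptions:

```go
package main

import "fmt"

// emptyNodeDeletionBudget returns how many new empty-node deletions can be
// started now, given the parallelism limit, deletions still in flight from
// previous iterations, and the number of empty candidates available.
func emptyNodeDeletionBudget(maxScaleDownParallelism, inFlightDeletions, emptyCandidates int) int {
	budget := maxScaleDownParallelism - inFlightDeletions
	if budget < 0 {
		budget = 0
	}
	if emptyCandidates < budget {
		budget = emptyCandidates
	}
	return budget
}

func main() {
	// 10 total slots, 4 deletions still running, 8 empty candidates -> start 6 now.
	fmt.Println(emptyNodeDeletionBudget(10, 4, 8))
}
```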
> target_node_pointer.
>
> In the initial version, vanilla SchedulerBasedPredicateChecker will be used. Implementing support for target_node_pointer is essentially an optimization
As I said in our offline discussion, I'd only implement it if we actually see performance issues. Personally, I'm not convinced that we'll see performance issues (the binpacking isn't all that different from what we do in FilterOutSchedulable) and, even if we do, I'm not convinced this optimization will help all that much. I can see how it can limit the number of Filters() calls, but it doesn't change how many PreFilters() we need to run, and I wouldn't be surprised if it turns out that PreFilters() are what really matters (especially with #4745).
Of course this is just guessing on my side and I may be proven completely wrong :)
That being said - a relatively large part of the algorithm description covers updates of target_node_pointer. I'm not going to block this PR on it, but I'd consider removing that part. It complicates the description a lot, compared to just saying "we'd run FitsAnyNode/FitsAnyNodeMatching" and then having this section say "if we have performance problems we can run predicate checking from a pointer that moves cyclically over the list of nodes. This way we decrease the expected number of Filters() calls needed for each pod."
Fair enough, I dropped most of the target_node_pointer description and mentioned it only as a potential future optimization.
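For completeness, a rough sketch of what such an optimization could look like, under the assumption that target_node_pointer boils down to starting predicate checks from a cyclically advancing index; all names here are illustrative and this is not the actual CA implementation:

```go
package main

import "fmt"

// cyclicFitChecker keeps a pointer into the node list so that repeated
// FitsAnyNode-style checks don't always start from index 0, reducing the
// expected number of per-node Filter calls. Names are assumptions.
type cyclicFitChecker struct {
	nodes   []string
	pointer int
}

// fits stands in for running scheduler predicates for a pod against one node.
func fits(pod, node string) bool { return true }

// findNodeFor returns the first node, starting at the pointer, that fits the
// pod, and advances the pointer past it so the next search continues there.
func (c *cyclicFitChecker) findNodeFor(pod string) (string, bool) {
	for i := 0; i < len(c.nodes); i++ {
		idx := (c.pointer + i) % len(c.nodes)
		if fits(pod, c.nodes[idx]) {
			c.pointer = (idx + 1) % len(c.nodes)
			return c.nodes[idx], true
		}
	}
	return "", false
}

func main() {
	c := &cyclicFitChecker{nodes: []string{"node-a", "node-b", "node-c"}}
	fmt.Println(c.findNodeFor("pod-1")) // node-a true
	fmt.Println(c.findNodeFor("pod-2")) // node-b true: search resumed after the last hit
}
```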
> represents a state in which CA decides to skip scale down logic due to ongoing deletion of a single non-empty node. In the new algorithm, this status will only be emitted when max parallelism for both empty and non-empty nodes was reached and hence no new deletion can happen.
cc: @towca
This makes sense to me, but I think you have more expertise here than I do.
Generally looks good and I'm excited to see this. Some general comments/queries from a first pass through, though I need to spend some more time on the algorithm and target_node_pointer.
> ## Rollout
>
> Flag controlled: when --max-parallel-drain > 0, the new logic is enabled. Old
Would we expect to see a significant difference in behaviour between this algorithm and the existing behaviour if setting --max-parallel-drain=1 once the old logic has been dropped?
Hopefully not, unless there is a bug. I think the main difference would be that bulk deletion of empty nodes will start happening in the same loop: currently if CA removes any empty nodes, it won't attempt a drain. However, this shouldn't really affect users in any visible way.
Btw, updated the flag name & condition, I guess drain parallelism == 1 is what should be used for the old logic initially.
> Flag controlled: when --max-parallel-drain > 0, the new logic is enabled. Old logic will stay for a while to make rollback possible. The flag will be introduced in version 1.X (probably 1.24), default it to non-0 in 1.(X+1) and
Is it reasonable to aim for 1.24 here given that upstream has already hit code freeze?
Good point, 1.24 doesn't seem reasonable anymore, updated.
> ## Monitoring
>
> The existing set of metrics will suffice for the sake of monitoring performance
Do we have any metrics we want to benchmark against to validate our intended improvements to the problem called out in the Background section?
I think the rate of scaled_down_nodes_total should visibly improve once the new algorithm is implemented, especially in clusters that don't have too many empty nodes (just underutilized ones).
> ### Changes to ScaleDownStatus
>
> ScaleDownStatus does not play very well with the concept of multiple nodes being scaled down. The existing ScaleDownInProgress value in the ScaleDownResult enum
Would we be better off renaming ScaleDownInProgress or introducing a new field? If we're repurposing this existing status for our new purposes, I don't think the name is accurate, potentially leading to confusion.
ScaleDownInProgress - the scale down wasn't attempted, because a previous scale-down was still in progress.
I think technically this is still going to be correct, but maybe ScaleDownMaxParallelismReached would be more accurate. @towca @MaciekPytel - any thoughts?
I'm ok either way - I think the condition is analogous to the current one, especially if the current condition can show up when scale-down of empty nodes is rate limited (not sure if it can?). But ScaleDownMaxParallelismReached is indeed much clearer and I see no reason not to add it.
Actually, empty node deletions can currently pile up - if they take a lot of time, there will be no error and CA will keep on scheduling more node deletions. That probably should result in some scale-down error though.
Rate limiting is a slightly different mechanism than the one I'm proposing here. To avoid naming the status after the implementation, I decided to use ScaleDownThrottled instead, so we're free to change the throttling mechanism in the future without affecting the status type.
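A sketch of how the new result value discussed here could sit next to the existing one; the enum shape and comments are illustrative, not the actual ScaleDownResult definition in cluster-autoscaler:

```go
package main

import "fmt"

// ScaleDownResult mirrors the idea of the status enum discussed above;
// other existing values are elided and the names here are illustrative.
type ScaleDownResult int

const (
	ScaleDownInProgress ScaleDownResult = iota // legacy meaning: a previous single-node drain is still running
	ScaleDownThrottled                         // proposed: parallelism limits for empty and non-empty nodes are both reached
)

func describe(r ScaleDownResult) string {
	if r == ScaleDownThrottled {
		return "no new deletion started: max scale-down parallelism reached"
	}
	return "other scale-down result"
}

func main() {
	fmt.Println(describe(ScaleDownThrottled))
}
```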
thanks for the hard work here @x13n, this is very thorough and reads well to me. i think this is a good feature to add and i like the approach of creating an interface to allow for easy switching between the implementations. i do have a couple questions:
- what are your thoughts about the candidate_set with respect to an unexpected restart of the cluster autoscaler? (eg what type of effect would losing the cached data have)
- from the monitoring perspective, does this have any effect on the status configmap where the autoscaler records recent activity?
Thanks @elmiko for reviewing!
As far as I can tell, the only scaledown-related information passed to the status configmap is about scale down candidates (updated via
thanks for the answers @x13n, i'm generally +1 on this
> * Iterate over all the pods in recent_evictions
> * If a pod was added to the list N or more seconds ago, remove it from recent_evictions
> * If a pod has a known owner reference, check if parent object has created
We could maybe further improve this by grouping by owner ref and checking if (target_replicas - current_replicas) > recent_evictions(controller). For example, if we've evicted 4 pods from a given deployment but the deployment is only missing 1 replica, we could keep just one of those on the list.
This is just pointing out a possible future improvement. I don't think this is required from initial implementation and this comment doesn't need to be addressed in any way in this design.
Yup, good point. Updated the doc so this idea doesn't get lost.
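A sketch of that possible future improvement: group recent_evictions entries by owner reference and keep only as many per owner as the controller is still missing replicas for. Types and names are assumed for illustration:

```go
package main

import "fmt"

type evictedPod struct {
	Name     string
	OwnerUID string
}

type ownerStatus struct {
	TargetReplicas  int
	CurrentReplicas int
}

// trimRecentEvictions keeps, per owner, at most (target - current) evicted
// pods; pods the controller has already replaced are dropped from the list.
func trimRecentEvictions(evicted []evictedPod, owners map[string]ownerStatus) []evictedPod {
	remaining := map[string]int{}
	for uid, s := range owners {
		if missing := s.TargetReplicas - s.CurrentReplicas; missing > 0 {
			remaining[uid] = missing
		}
	}
	var kept []evictedPod
	for _, p := range evicted {
		if remaining[p.OwnerUID] > 0 {
			kept = append(kept, p)
			remaining[p.OwnerUID]--
		}
	}
	return kept
}

func main() {
	evicted := []evictedPod{{"web-1", "dep-1"}, {"web-2", "dep-1"}, {"web-3", "dep-1"}, {"web-4", "dep-1"}}
	owners := map[string]ownerStatus{"dep-1": {TargetReplicas: 4, CurrentReplicas: 3}} // only 1 replica missing
	fmt.Println(len(trimRecentEvictions(evicted, owners)))                             // 1
}
```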
/lgtm @gjtempleton Are you satisfied with the current state of this PR or do you think it needs more iteration?
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: MaciekPytel, x13n. The full list of commands accepted by this bot can be found here. The pull request process is described here.
/lgtm
Since there were no more comments for over a week, I'll just go ahead and cancel the hold. I'm happy to address further comments in separate PRs if needed. /hold cancel
How or where do we keep track of the progress of the parallel drain feature? This is very exciting. I'd love to contribute to testing.
Ah, good point, there was no tracking issue, so I just created one: #5079. I'll update the issue once there is something that can be tested.
Which component this PR applies to?
cluster-autoscaler
What type of PR is this?
/kind feature
What this PR does / why we need it:
This is a proposal for draining multiple nodes in parallel, which will improve scale down performance in large clusters.
Which issue(s) this PR fixes:
n/a
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: