
Design proposal for parallel drain #4766

Merged · 1 commit · Apr 21, 2022

Conversation

@x13n (Member) commented Mar 25, 2022

Which component does this PR apply to?

cluster-autoscaler

What type of PR is this?

/kind feature

What this PR does / why we need it:

This is a proposal for draining multiple nodes in parallel, which will improve scale down performance in large clusters.

Which issue(s) this PR fixes:

n/a

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot added the kind/feature, cncf-cla: yes, and size/L labels on Mar 25, 2022
@x13n force-pushed the scaledown branch 2 times, most recently from 8a95da6 to 681e130 on March 28, 2022 08:17
@MaciekPytel (Contributor)

I left some comments. Most of them are just asking for clarification of some details that weren't obvious to me.

Overall the proposal looks good to me. I know it was brought up on sig-meeting, but given the size of proposed changes I'd like to give one more shout out to other owners. Any thoughts on this proposal @feiskyer @towca @mwielgus @gjtempleton ?

@x13n (Member, Author) commented Apr 1, 2022

Thanks for reviewing! I don't see the comments though, did you forget to send them?

@MaciekPytel (Contributor) left a comment:

Sorry, forgot to submit my review with all the comments

cluster-autoscaler/proposals/parallel_drain.md (two outdated review threads, resolved)

In each iteration the contents of candidate\_set will be updated. Nodes can be added to, removed from, or stay in candidate\_set. For each node in candidate\_set we will keep track of how long it has been there. If a node is removed and then re-added to candidate\_set, the timer is reset.

To trigger node deletion, we will use the already present ScaleDownUnneededTime and ScaleDownUnreadyTime parameters. If in a given CA loop iteration there are nodes which have been in candidate\_set for more than ScaleDownUnneededTime (or are not ready and have been in candidate\_set for more than ScaleDownUnreadyTime), the actuation of scale down is triggered. The number of nodes which are actively scaled down will be limited by the MaxDrainParallelism and MaxScaleDownParallelism configuration parameters. We will configure separate limits for scaling down empty and non-empty nodes to allow quick scale-down of empty nodes even if lengthy drains of non-empty nodes are in progress.
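
For illustration, here is a minimal Go sketch of the candidate timing described in the excerpt above. All names (candidateTracker, readyForDeletion, the threshold parameter) are hypothetical and only show the bookkeeping, not the actual cluster-autoscaler code.

```go
package scaledownsketch // hypothetical package; illustration only

import "time"

type candidateTracker struct {
	since map[string]time.Time // node name -> when it entered candidate_set
}

// update starts timers for newly added candidates and drops timers for nodes
// that left candidate_set, so a re-added node starts counting from zero again.
func (t *candidateTracker) update(candidates map[string]bool, now time.Time) {
	for name := range t.since {
		if !candidates[name] {
			delete(t.since, name)
		}
	}
	for name := range candidates {
		if _, ok := t.since[name]; !ok {
			t.since[name] = now
		}
	}
}

// readyForDeletion returns candidates that have been unneeded for longer than
// the configured threshold (ScaleDownUnneededTime or ScaleDownUnreadyTime).
func (t *candidateTracker) readyForDeletion(threshold time.Duration, now time.Time) []string {
	var ready []string
	for name, since := range t.since {
		if now.Sub(since) >= threshold {
			ready = append(ready, name)
		}
	}
	return ready
}
```
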
Contributor:

If I understand this correctly, the max-empty-bulk-delete flag will not be used by the new algorithm and its role will effectively be filled by --max-scale-down-parallelism? If so, is that something that should be stated explicitly? Would it make sense to default the --max-scale-down-parallelism value to --max-empty-bulk-delete while we deprecate max-empty-bulk-delete?

Member Author:

Good point, mentioned this in the doc.


The existing scale down code lives mostly in the ScaleDown object, spanning the scale\_down.go file with 1.5k lines of code. As a part of this effort, the ScaleDown object will undergo refactoring, extracting utils to a separate file. ScaleDown itself will become an interface with two implementations (both relying on common utils): the existing version and the new algorithm described below. This will allow easy switching between algorithms with a flag: different flag values will pick different ScaleDown interface implementations.
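
As a rough illustration of the flag-based switch described above, the interface split could look like the following sketch. The interface methods, type names, and constructor are hypothetical, not the final API.

```go
// ScaleDown is the interface both implementations would satisfy (names illustrative).
type ScaleDown interface {
	// UpdateUnneededNodes refreshes the set of scale-down candidates.
	UpdateUnneededNodes(nodeNames []string) error
	// TryToScaleDown attempts node deletions/drains within configured limits.
	TryToScaleDown() error
}

type legacyScaleDown struct{}   // existing algorithm
type parallelScaleDown struct{} // new algorithm described in this proposal

func (s *legacyScaleDown) UpdateUnneededNodes(nodeNames []string) error   { return nil }
func (s *legacyScaleDown) TryToScaleDown() error                          { return nil }
func (s *parallelScaleDown) UpdateUnneededNodes(nodeNames []string) error { return nil }
func (s *parallelScaleDown) TryToScaleDown() error                        { return nil }

// newScaleDown picks an implementation based on the flag value.
func newScaleDown(useParallelDrain bool) ScaleDown {
	if useParallelDrain {
		return &parallelScaleDown{}
	}
	return &legacyScaleDown{}
}
```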

As a part of the refactoring, we will combine `FastGetPodsToMove` and `DetailedGetPodsForMove` into a single `PodsToMove` function. The `checkReferences` flag will be dropped and we will always do a detailed check. We are now using listers so doing a detailed check does not add extra API calls and should not add too much to execution time.
Contributor:

+1 for doing that. I wonder though: we currently do another simulation pass in TryToScaleDown. Is that needed after this change? It seems unclear how it could work with the new algorithm (presumably we'd need to repeat the entire binpacking?) and it also feels pointless if we use the same PodsToMove implementation.

Am I correct in guessing you're planning to delete this part? If so maybe it's worth mentioning this more explicitly.

Member Author:

Yes, in the new implementation we'll only need one simulation pass. Updated to state this explicitly.
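
A minimal sketch of what the consolidated PodsToMove function from the excerpt above could look like, assuming the detailed ownership check is always performed (backed by listers, so no extra API calls). The signature and the hasHealthyOwner helper are illustrative, not the real cluster-autoscaler API.

```go
// PodsToMove returns the pods that would have to be rescheduled if the node
// were drained, or ok=false if any pod blocks the drain (e.g. it has no
// controller that would recreate it elsewhere).
func PodsToMove(pods []string, hasHealthyOwner func(pod string) bool) (toMove []string, ok bool) {
	for _, p := range pods {
		if !hasHealthyOwner(p) {
			// The pod would not be recreated elsewhere, so the node cannot be drained.
			return nil, false
		}
		toMove = append(toMove, p)
	}
	return toMove, true
}
```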

cluster-autoscaler/proposals/parallel_drain.md (outdated review thread, resolved)
* List of NodeInfo computed at the beginning of previous algorithm iteration.
* pdbs\_remaining\_disruptions - bookkeeping for graceful handling of PDBs
([details](#pdbs-checking)).
* recent\_evictions: set of pods that were recently successfully evicted from a node
Contributor:

Is this pods for which the eviction request has been successful, or just the ones where the pod is already gone? I'm wondering about the case of a pod with long graceful termination - would it show up on the list? If not - should it also be included in the "Verify the pods from nodes currently being scaled down can still be rescheduled" step below?

Member Author:

Pods are only added to recent_evictions once they are no longer running on a node. The two verification steps below are:

  • checking if pods that are still running on a (currently drained) node can be scheduled somewhere
  • checking if pods that are no longer running because they were evicted can be scheduled somewhere

recent_evictions exists only to check the second step. If a pod takes a lot of time to terminate, it will be verified in the first one.

* pod\_destination\_hints:
* stores destination nodes computed during draining simulation. For each pod from nodes in candidate\_set and deleted\_set it keeps the destination node assigned during simulation.
* map: source node names -> pod UID -> destination node names
* previous\_node\_infos
Contributor:

Why do we need it? The doc mentions updating this variable, but I can't find any reference to actually using it.

Member Author:

It is used to return unneeded nodes for deletion. The candidate_set is only a list of identifiers (node names).

Contributor:

So we only really use it later in the same loop (we may technically keep it to next loop, but only because there is no obvious place to clear it and not much point doing so)? Or do we actually use it in the next loop in any way? I don't particularly like the idea of operating on old and potentially outdated nodeInfos and I don't see anything in the proposed algorithm that would require us to do so.

Member Author:

Yeah, it was meant as metadata for subsequent fetching of unneeded nodes / candidate_set. Removed, since it is both not the greatest idea and too low-level to include here.
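
As a side note, the pod\_destination\_hints structure quoted earlier in this thread is a nested map. A hypothetical Go shape for it (the type name is illustrative; types.UID is the standard Kubernetes UID type) could be:

```go
package scaledownsketch // hypothetical package; illustration only

import "k8s.io/apimachinery/pkg/types" // provides the UID type used for pod UIDs

// pod_destination_hints, as described in the excerpt:
// source node name -> pod UID -> destination node name.
type podDestinationHints map[string]map[types.UID]string
```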


Method steps in pseudocode:

* Delete N empty nodes, up to MaxScaleDownParallelism.
Contributor:

Shouldn't this say 'up to MaxScaleDownParallelism - len(deleted_set)'? Or do we intend to just delete up to MaxScaleDownParallelism nodes each loop? If so I'd rename the parameter.

Member Author:

This is meant to include node deletions that started in previous iterations. Rephrased to make it clear.
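
A sketch of the budget computation implied by the reply above, with hypothetical names: deletions started in previous iterations count against the limit, so only the remaining budget is available for new deletions in the current loop.

```go
// emptyDeletionBudget returns how many new empty-node deletions can start in
// this iteration, given deletions already in flight from previous iterations.
func emptyDeletionBudget(maxScaleDownParallelism int, deletedSet map[string]bool) int {
	budget := maxScaleDownParallelism - len(deletedSet)
	if budget < 0 {
		return 0
	}
	return budget
}
```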

target\_node\_pointer.

In the initial version, vanilla SchedulerBasedPredicateChecker will be used.
Implementing support for target\_node\_pointer is essentially an optimization
Contributor:

As I said in our offline discussion, I'd only implement it if we actually see performance issues. Personally, I'm not convinced that we'll see performance issues (the binpacking isn't all that different from what we do in FilterOutSchedulable) and, even if we do, I'm not convinced this optimization will help all that much. I can see how it can limit the number of Filters() calls, but it doesn't change how many PreFilters() we need to run, and I wouldn't be surprised if it turns out that PreFilters() are what really matters (especially with #4745).
Of course this is just guessing on my side and I may be proven completely wrong :)

That being said - a relatively large part of the algorithm description covers updates of target_node_pointer. I'm not going to block this PR on it, but I'd consider removing that part. It complicates the description a lot, compared to just saying "we'd run FitsAnyNode/FitsAnyNodeMatching" and then having this section say "if we have performance problems we can run predicate checking from a pointer that moves cyclically over the list of nodes. This way we decrease the expected number of Filters() calls needed for each pod."

Member Author:

Fair enough, I dropped most of the target_node_pointer description and mentioned it only as a potential future optimization.
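
For reference, the optimization kept as potential future work could be sketched as below: predicate checks start from a pointer that moves cyclically over the node list, which reduces the expected number of Filters() calls per pod. The function name and the fits() predicate are illustrative.

```go
// fitsAnyNodeFrom scans the node list starting at `start`, wrapping around,
// and returns the index of the first node the pod fits on. The returned index
// can be reused as the starting pointer for the next pod.
func fitsAnyNodeFrom(nodeNames []string, start int, fits func(node string) bool) (nextStart int, found string, ok bool) {
	n := len(nodeNames)
	for i := 0; i < n; i++ {
		idx := (start + i) % n
		if fits(nodeNames[idx]) {
			return idx, nodeNames[idx], true
		}
	}
	return start, "", false
}
```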

represents a state in which CA decides to skip scale down logic due to ongoing
deletion of a single non-empty node. In the new algorithm, this status will only
be emitted when max parallelism for both empty and non-empty nodes was reached
and hence no new deletion can happen.
Contributor:

cc: @towca
This makes sense to me, but I think you have more expertise here than I do.

@gjtempleton (Member) left a comment:

Generally looks good and I'm excited to see this. Some general comments/queries from a first pass through, though I need to spend some more time on the algorithm and target_node_pointer.


## Rollout

Flag controlled: when --max-parallel-drain > 0, the new logic is enabled. Old
Member:

Would we expect to see a significant difference in behaviour between this algorithm and the existing behaviour when setting --max-parallel-drain=1, once the old logic has been dropped?

Member Author:

Hopefully not, unless there is a bug. I think the main difference would be that bulk deletion of empty nodes will start happening in the same loop: currently if CA removes any empty nodes, it won't attempt a drain. However, this shouldn't really affect users in any visible way.
Btw, updated the flag name & condition, I guess drain parallelism == 1 is what should be used for the old logic initially.


Flag controlled: when --max-parallel-drain > 0, the new logic is enabled. Old
logic will stay for a while to make rollback possible. The flag will be
introduced in version 1.X (probably 1.24), default it to non-0 in 1.(X+1) and
Member:

Is it reasonable to aim for 1.24 here given that upstream has already hit code freeze?

Member Author:

Good point, 1.24 doesn't seem reasonable anymore, updated.


## Monitoring

The existing set of metrics will suffice for the sake of monitoring performance
Member:

Do we have any metrics we want to benchmark against to validate our intended improvements to the problem called out in the Background section?

Member Author:

I think the rate of scaled_down_nodes_total should visibly improve once the new algorithm is implemented, especially in clusters that don't have too many empty nodes (just underutilized ones).

### Changes to ScaleDownStatus

ScaleDownStatus does not play very well with the concept of multiple nodes being
scaled down. The existing ScaleDownInProgress value in the ScaleDownResult enum
Member:

Would we be better off renaming ScaleDownInProgress or introducing a new field? If we're repurposing this existing status for our new purposes, I don't think the name is accurate, potentially leading to confusion.

Member Author:

ScaleDownInProgress - the scale down wasn't attempted, because a previous scale-down was still in progress.

I think technically this is going to still be correct, but maybe ScaleDownMaxParallelismReached would be more accurate. @towca @MaciekPytel - any thoughts?

Contributor:

I'm ok either way - I think the condition is analogous to the current one, especially if the current condition can show up when scale-down of empty nodes is rate limited (not sure if it can?). But ScaleDownMaxParallelismReached is indeed much clearer and I see no reason not to add it.

Member Author:

Actually, empty node deletions can currently pile up - if they take a lot of time, there will be no error and CA will keep on scheduling more node deletions. That probably should result in some scale down error though.

Rate limiting is a slightly different mechanism than the one I'm proposing here. To avoid naming the status after implementation I decided to use ScaleDownThrottled instead, so we're free to change the throttling mechanism in the future without affecting the status type.

@k8s-ci-robot added the size/XXL label and removed the size/L label on Apr 5, 2022
@k8s-ci-robot added the size/L label and removed the size/XXL label on Apr 5, 2022
@elmiko (Contributor) left a comment:

thanks for the hard work here @x13n, this is very thorough and reads well to me. i think this is a good feature to add and i like the approach of creating an interface to allow for easy switching between the implementations. i do have a couple questions:

what are your thoughts about the candidate_set with respect to an unexpected restart of the cluster autoscaler? (eg what type of effect would losing the cached data have)

from the monitoring perspective, does this have any effect on the status configmap where the autoscaler records recent activity?

@x13n (Member, Author) commented Apr 8, 2022

Thanks @elmiko for reviewing!

what are your thoughts about the candidate_set with respect to an unexpected restart of the cluster autoscaler? (eg what type of effect would losing the cached data have)

candidate_set & non_candidate_set will be used for soft tainting. The soft tainting mechanism itself will stay the same: CA will remove the taints on startup and then rebuild a new candidate_set.

from the monitoring perspective, does this have any effect on the status configmap where the autoscaler records recent activity?

As far as I can tell, the only scaledown-related information passed to the status configmap is about scale down candidates (updated via the UpdateScaleDownCandidates() call). This will remain unchanged; the new implementation will just dump the entire candidate_set there.

@elmiko (Contributor) commented Apr 8, 2022

thanks for the answers @x13n , i'm generally +1 on this

* Iterate over all the pods in recent\_evictions
* If a pod was added to the list N or more seconds ago, remove it from
recent\_evictions
* If a pod has a known owner reference, check if parent object has created
Contributor:

We could maybe further improve this by grouping by owner ref and checking if (target_replicas - current_replicas) > recent_evictions(controller). For example, if we've evicted 4 pods from a given deployment but the deployment is only missing 1 replica, we could keep just one of those on the list.

This is just pointing out a possible future improvement. I don't think this is required from initial implementation and this comment doesn't need to be addressed in any way in this design.

Member Author:

Yup, good point. Updated the doc so this idea doesn't get lost.
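
A sketch of that possible future improvement, with all names hypothetical: group recently evicted pods by owner reference and keep only as many entries as the controller is actually missing replicas.

```go
// ownerRef identifies a controller (e.g. a Deployment) that owns evicted pods.
type ownerRef struct{ Kind, Namespace, Name string }

// trimRecentEvictions keeps at most missingReplicas[owner] recently evicted
// pods per controller, dropping the rest from the bookkeeping.
func trimRecentEvictions(evicted map[ownerRef][]string, missingReplicas map[ownerRef]int) map[ownerRef][]string {
	trimmed := make(map[ownerRef][]string)
	for owner, pods := range evicted {
		keep := missingReplicas[owner]
		if keep > len(pods) {
			keep = len(pods)
		}
		if keep > 0 {
			trimmed[owner] = pods[:keep]
		}
	}
	return trimmed
}
```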

@MaciekPytel (Contributor)

/lgtm
/approve

@gjtempleton Are you satisfied with the current state of this PR or do you think it needs more iteration?
/hold

@k8s-ci-robot added the do-not-merge/hold label on Apr 12, 2022
@k8s-ci-robot added the lgtm label on Apr 12, 2022
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MaciekPytel, x13n

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Apr 12, 2022
@k8s-ci-robot removed the lgtm label on Apr 12, 2022
@MaciekPytel (Contributor)

/lgtm

@k8s-ci-robot added the lgtm label on Apr 13, 2022
@x13n (Member, Author) commented Apr 21, 2022

Since there were no more comments for over a week, I'll just go ahead and cancel the hold. I'm happy to address further comments in separate PRs if needed.

/hold cancel

@k8s-ci-robot removed the do-not-merge/hold label on Apr 21, 2022
@k8s-ci-robot merged commit b848aa3 into kubernetes:master on Apr 21, 2022
@bbhenry commented Aug 8, 2022

How or where do we keep track of the progress of the parallel drain feature? This is very exciting. I'd love to contribute to testing.

@x13n (Member, Author) commented Aug 9, 2022

Ah, good point, there was no tracking issue, so I just created one: #5079

I'll update the issue once there is something that can be tested.
