Refactor scale-down for better integration with drainability rules #6135

Merged

Conversation

@artemvmin artemvmin commented Sep 23, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:

This is one of several CLs to move all drain conditions to drainability rules. Once complete, clients implementing custom drainability rules will have full control over the scale-down of nodes.

Notable changes:

  • Split out drainability rules into separate packages. simulation/drainability:Rule.Drainable() function now takes a DrainContext. This function can assume that DrainContext is not nil.
  • Refactor NodeDeleteOptions for reuse in the drainability package. simulation:NodeDeleteOptions has been split into two structs: simulation/options:NodeDeleteOptions and drainability/rules:Rules. Consumers of this struct have been updated accordingly.
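
To illustrate the rule shape described in the first bullet, here is a minimal sketch. All types below (`Pod`, `DrainContext`, `Status`, `protectedLabelRule`, `FirstBlocking`) are simplified, hypothetical stand-ins and not the actual cluster-autoscaler definitions; the only point is that `Drainable` receives a context it may assume is non-nil.

```go
package main

import "fmt"

// Pod is a simplified stand-in for a Kubernetes pod (hypothetical).
type Pod struct {
	Name   string
	Labels map[string]string
}

// DrainContext is a hypothetical, trimmed-down context. The real struct
// may carry different fields.
type DrainContext struct {
	// ProtectedLabel marks pods that must not be evicted (illustrative only).
	ProtectedLabel string
}

// Status is a hypothetical result type returned by a rule.
type Status struct {
	Blocked bool
	Reason  string
}

// Rule mirrors the shape described above: Drainable takes a DrainContext.
type Rule interface {
	Drainable(ctx *DrainContext, pod *Pod) Status
}

// protectedLabelRule is an example custom rule.
type protectedLabelRule struct{}

func (protectedLabelRule) Drainable(ctx *DrainContext, pod *Pod) Status {
	// No nil check on ctx: the caller guarantees it is set.
	if _, ok := pod.Labels[ctx.ProtectedLabel]; ok {
		return Status{Blocked: true, Reason: "pod carries the protected label"}
	}
	return Status{}
}

// FirstBlocking applies every rule to every pod and returns the first pod
// that some rule refuses to drain, or nil if all pods are drainable.
func FirstBlocking(ctx *DrainContext, rules []Rule, pods []*Pod) (*Pod, Status) {
	for _, p := range pods {
		for _, r := range rules {
			if s := r.Drainable(ctx, p); s.Blocked {
				return p, s
			}
		}
	}
	return nil, Status{}
}

func main() {
	ctx := &DrainContext{ProtectedLabel: "example.com/do-not-drain"}
	pods := []*Pod{
		{Name: "web"},
		{Name: "db", Labels: map[string]string{"example.com/do-not-drain": "true"}},
	}
	if pod, status := FirstBlocking(ctx, []Rule{protectedLabelRule{}}, pods); pod != nil {
		fmt.Printf("blocked by %s: %s\n", pod.Name, status.Reason)
	}
}
```

A client with its own drain conditions would add further `Rule` implementations to the slice rather than changing the scale-down code itself.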

Does this PR introduce a user-facing change?

None

@k8s-ci-robot added the kind/feature and cncf-cla: yes labels on Sep 23, 2023
@k8s-ci-robot added the size/L label on Sep 23, 2023
@x13n
Member

x13n commented Sep 26, 2023

/assign

@artemvmin force-pushed the pdb-drainability-rule-dynamic branch 7 times, most recently from fe5cf63 to aba1391, on September 27, 2023 06:54
@k8s-ci-robot added the size/XL label and removed the size/L label on Sep 27, 2023
@artemvmin force-pushed the pdb-drainability-rule-dynamic branch 2 times, most recently from c162820 to e53147e, on September 27, 2023 07:35
@x13n
Member

x13n commented Sep 27, 2023

/cc @olagacek

@k8s-ci-robot
Contributor

@x13n: GitHub didn't allow me to request PR reviews from the following users: olagacek.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

cluster-autoscaler/simulator/drain.go (outdated, resolved)
@@ -106,39 +87,6 @@ func GetPodsToMove(nodeInfo *schedulerframework.NodeInfo, deleteOptions NodeDele
if err != nil {
return pods, daemonSetPods, blockingPod, err
}
if pdbBlockingPod, err := checkPdbs(pods, pdbs); err != nil {
Member

By moving this check before GetPodsForDeletionOnNodeDrain, you're changing the logic - it will now operate on a different set of pods. In particular, it will start to check PDBs for DS pods, which I think doesn't make sense - we don't want to block node removal on this. I think rewriting GetPodsForDeletionOnNodeDrain into drainability rules first (as mentioned in TODO above) would be a safer approach, since then you could preserve the ordering.
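
The ordering concern above can be sketched as follows. This is a toy model under stated assumptions: `Pod`, `PDB`, `podsToEvict`, and `blockedByPDB` are hypothetical simplifications (a PDB here matches a single `App` value and holds a fixed budget), not the autoscaler's real functions. It shows how running the PDB check before DaemonSet pods are filtered out can block a node that would otherwise be removable.

```go
package main

import "fmt"

// Pod is a simplified stand-in for a Kubernetes pod (hypothetical).
type Pod struct {
	Name      string
	App       string
	DaemonSet bool
}

// PDB is a simplified disruption budget: it covers pods with a given App
// value and allows a fixed number of disruptions (hypothetical model).
type PDB struct {
	App                string
	DisruptionsAllowed int
}

// podsToEvict drops DaemonSet pods, which should not block node removal.
func podsToEvict(pods []Pod) []Pod {
	var out []Pod
	for _, p := range pods {
		if !p.DaemonSet {
			out = append(out, p)
		}
	}
	return out
}

// blockedByPDB returns the first pod whose eviction would exceed a budget.
func blockedByPDB(pods []Pod, pdbs []PDB) (string, bool) {
	remaining := make(map[string]int, len(pdbs))
	for _, pdb := range pdbs {
		remaining[pdb.App] = pdb.DisruptionsAllowed
	}
	for _, p := range pods {
		left, covered := remaining[p.App]
		if !covered {
			continue
		}
		if left == 0 {
			return p.Name, true
		}
		remaining[p.App] = left - 1
	}
	return "", false
}

func main() {
	pods := []Pod{
		{Name: "web-1", App: "web"},
		{Name: "log-agent", App: "logging", DaemonSet: true},
	}
	pdbs := []PDB{{App: "logging", DisruptionsAllowed: 0}}

	// PDB check first: the DaemonSet pod is still in the set and blocks.
	name, blocked := blockedByPDB(pods, pdbs)
	fmt.Println("check before filtering:", name, blocked)

	// Filter first, then check: the DaemonSet pod is gone, nothing blocks.
	name, blocked = blockedByPDB(podsToEvict(pods), pdbs)
	fmt.Println("check after filtering:", name, blocked)
}
```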

Contributor Author

I became aware of this ordering issue early on and decided to turn this PR into the full story. I reordered the commits and updated the title. I'll ping for a follow-up review when the remainder is implemented.

// require adding information to the DrainContext, such as the slice of pods
// and a flag to prevent duplicate checks.
for _, pdb := range drainCtx.Pdbs {
selector, err := metav1.LabelSelectorAsSelector(pdb.Spec.Selector)
Member

With this implementation we're doing the same conversion O(N*M) times instead of O(N) as before. (Where N is the number of PDBs and M is the number of pods.) Could we keep selectors, rather than just raw pdbs, in the context? Or - even better - reuse RemainingPdbTracker which already operates in that way?
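
A rough sketch of the cost difference, assuming simplified stand-ins: `compile` plays the role of `metav1.LabelSelectorAsSelector` and counts how often it is called, while `Pod`, `PDB`, `Selector`, `matchesPerPod`, and `matchesPrecompiled` are hypothetical names used only for illustration.

```go
package main

import "fmt"

// Pod and PDB are simplified stand-ins (hypothetical).
type Pod struct{ Labels map[string]string }
type PDB struct{ MatchLabels map[string]string }

// Selector is a compiled matcher, standing in for a label selector.
type Selector func(labels map[string]string) bool

// conversions counts how many times a raw selector was compiled.
var conversions int

// compile stands in for the selector conversion and records each call.
func compile(match map[string]string) Selector {
	conversions++
	return func(labels map[string]string) bool {
		for k, v := range match {
			if labels[k] != v {
				return false
			}
		}
		return true
	}
}

// matchesPerPod converts every PDB selector again for every pod,
// which costs N*M conversions for N PDBs and M pods.
func matchesPerPod(pods []Pod, pdbs []PDB) int {
	n := 0
	for _, p := range pods {
		for _, pdb := range pdbs {
			if compile(pdb.MatchLabels)(p.Labels) {
				n++
			}
		}
	}
	return n
}

// matchesPrecompiled converts each PDB selector once (N conversions)
// and reuses the compiled selectors for all pods.
func matchesPrecompiled(pods []Pod, pdbs []PDB) int {
	selectors := make([]Selector, 0, len(pdbs))
	for _, pdb := range pdbs {
		selectors = append(selectors, compile(pdb.MatchLabels))
	}
	n := 0
	for _, p := range pods {
		for _, s := range selectors {
			if s(p.Labels) {
				n++
			}
		}
	}
	return n
}

func main() {
	pods := []Pod{
		{Labels: map[string]string{"app": "a"}},
		{Labels: map[string]string{"app": "a"}},
		{Labels: map[string]string{"app": "b"}},
	}
	pdbs := []PDB{
		{MatchLabels: map[string]string{"app": "a"}},
		{MatchLabels: map[string]string{"app": "b"}},
	}

	conversions = 0
	matchesPerPod(pods, pdbs)
	fmt.Println("conversions per pod:", conversions)

	conversions = 0
	matchesPrecompiled(pods, pdbs)
	fmt.Println("conversions precompiled:", conversions)
}
```

Both functions find the same matches; only the number of conversions differs, which is the reason for keeping compiled selectors (or a tracker that already holds them) in the context.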

Contributor Author

@artemvmin artemvmin Sep 28, 2023

I had a separate draft that passed RemainingPdbTracker directly, but without using DrainCtx. Combining the two ideas seems to be the sweet spot. I didn't follow through with it initially because I was worried about the goroutines that call the NodesToRemove function concurrently (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/scaledown/actuation/actuator.go#L219) and the lack of thread safety in the RemainingPdbTracker object. Do you think this is an issue?

I added the refactor to this PR.

Member

Ah, good question. As far as I can tell though, you're using a new instance of the tracker in each such goroutine. This is perhaps suboptimal, but should be safe.
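
The safety argument can be sketched like this. The `tracker` type and the `newTracker`, `tryRemove`, and `simulate` functions are hypothetical and much simpler than RemainingPdbTracker; the sketch only shows that a non-thread-safe object needs no locking when every goroutine builds and uses its own instance.

```go
package main

import (
	"fmt"
	"sync"
)

// tracker is a hypothetical, non-thread-safe budget tracker.
type tracker struct {
	remaining map[string]int
}

// newTracker copies the budgets so each instance owns its own map.
func newTracker(budgets map[string]int) *tracker {
	r := make(map[string]int, len(budgets))
	for k, v := range budgets {
		r[k] = v
	}
	return &tracker{remaining: r}
}

// tryRemove consumes one disruption for app, if the budget allows it.
// Apps without a budget are always removable.
func (t *tracker) tryRemove(app string) bool {
	left, covered := t.remaining[app]
	if !covered {
		return true
	}
	if left == 0 {
		return false
	}
	t.remaining[app] = left - 1
	return true
}

// simulate checks each node in its own goroutine. Every goroutine creates
// a private tracker, so no mutable state is shared between goroutines.
func simulate(nodes [][]string, budgets map[string]int) []bool {
	results := make([]bool, len(nodes))
	var wg sync.WaitGroup
	for i, apps := range nodes {
		wg.Add(1)
		go func(i int, apps []string) {
			defer wg.Done()
			t := newTracker(budgets) // budgets is only read, never written
			ok := true
			for _, a := range apps {
				if !t.tryRemove(a) {
					ok = false
					break
				}
			}
			results[i] = ok // each goroutine writes a distinct index
		}(i, apps)
	}
	wg.Wait()
	return results
}

func main() {
	nodes := [][]string{{"web", "web"}, {"web"}}
	fmt.Println(simulate(nodes, map[string]int{"web": 1}))
}
```

The cost of this approach is that each goroutine rebuilds the tracker from scratch, and the instances do not see each other's state.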

cluster-autoscaler/main.go (outdated, resolved)
@artemvmin force-pushed the pdb-drainability-rule-dynamic branch 2 times, most recently from 84ad488 to 98effe0, on September 27, 2023 18:23
@k8s-ci-robot added the size/L label and removed the size/XL label on Sep 27, 2023
@artemvmin changed the title from "Convert scale-down pdb check to drainability rule" to "Convert scale-down checks to drainability rules" on Sep 27, 2023
@x13n
Member

x13n commented Sep 28, 2023

The current refactor looks good to me now. Are you sure you want to rewrite all the checks in one humongous PR? You may start getting into merge conflicts, so I'd suggest following up in a separate PR - WDYT?

Btw, in the release notes you're adding some actions required - while this is true for downstream/forked code, it may be confusing for OSS CA release notes (which is the intended audience for these). I don't think this change should have any user-visible changes.

@artemvmin changed the title from "Convert scale-down checks to drainability rules" to "Refactor scale-down for better integration with drainability rules" on Sep 28, 2023
@artemvmin
Contributor Author

artemvmin commented Sep 28, 2023

> The current refactor looks good to me now. Are you sure you want to rewrite all the checks in one humongous PR? You may start getting into merge conflicts, so I'd suggest following up in a separate PR - WDYT?

Sounds good. Thanks for the tip. I updated the PR title.

> Btw, in the release notes you're adding some actions required - while this is true for downstream/forked code, it may be confusing for OSS CA release notes (which is the intended audience for these). I don't think this change should have any user-visible changes.

That makes sense. Updated.

Please review.

@artemvmin force-pushed the pdb-drainability-rule-dynamic branch from 98effe0 to 8f2532a on September 28, 2023 23:48
@k8s-ci-robot added the size/XL label and removed the size/L label on Sep 28, 2023
@artemvmin force-pushed the pdb-drainability-rule-dynamic branch from 8f2532a to 5a2f46e on September 28, 2023 23:52
@@ -106,7 +90,7 @@ func GetPodsToMove(nodeInfo *schedulerframework.NodeInfo, deleteOptions NodeDele
 	if err != nil {
 		return pods, daemonSetPods, blockingPod, err
 	}
-	if pdbBlockingPod, err := checkPdbs(pods, pdbs); err != nil {
+	if pdbBlockingPod, err := checkPdbs(pods, remainingPdbTracker.GetPdbs()); err != nil {
Member

It seems you could already delete checkPdbs function now and just use RemainingPdbTracker.CanRemovePods().

Contributor Author

Done. I wasn't sure what the "legacy scale-down" was referring to in the comment below so I left it alone. The logic looks identical if the parallel return value is ignored.

@x13n
Member

x13n commented Sep 29, 2023

/lgtm
/approve
/hold

I only really have one minor comment, but it doesn't block this PR, so feel free to cancel the hold if you don't want to address it now.

@k8s-ci-robot added the do-not-merge/hold and lgtm labels on Sep 29, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: artemvmin, x13n

@k8s-ci-robot added the approved label and removed the lgtm label on Sep 29, 2023
@artemvmin
Contributor Author

Comments addressed.

/unhold

@k8s-ci-robot removed the do-not-merge/hold label on Sep 29, 2023
@x13n
Member

x13n commented Sep 29, 2023

Thanks!

/lgtm

@k8s-ci-robot added the lgtm label on Sep 29, 2023
@artemvmin force-pushed the pdb-drainability-rule-dynamic branch from 5892f72 to 9ea5a36 on September 29, 2023 17:55
@k8s-ci-robot removed the lgtm label on Sep 29, 2023
@x13n
Member

x13n commented Sep 29, 2023

/lgtm

@k8s-ci-robot added the lgtm label on Sep 29, 2023
@k8s-ci-robot merged commit 2f7c61e into kubernetes:master on Sep 29, 2023
@artemvmin deleted the pdb-drainability-rule-dynamic branch on September 29, 2023 18:49
Labels
approved, area/cluster-autoscaler, cncf-cla: yes, kind/feature, lgtm, size/XL