Per nodegroup scale-down config #3789
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: MaciekPytel. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/hold
Force-pushed from b8257de to bca80ad
/hold cancel
MaxEmptyBulkDelete int
// NodeGroupAutoscalingOptions contain various options to customize how autoscaling of
// a given NodeGroup works. Different options can be used for each NodeGroup.
type NodeGroupAutoscalingOptions struct {
I may have a dumb question, but why only these 4 scale-down parameters? Are ScaleDownDelayAfterAdd, ScaleDownDelayAfterDelete and so on not applicable to this new configuration method?
Those 4 values were the low-hanging fruit - they required relatively small changes in the code and they were by far the most commonly requested values to control per nodegroup.
I think ultimately it makes sense to have ScaleDownDelayAfter* as part of NodeGroupAutoscalingOptions, but it is much more tricky to do (details below) and I didn't want to block this PR on it. I'd rather take an iterative approach of gradually moving more options to NodeGroupAutoscalingOptions in separate PRs (though I don't want to make any promises on when I will be able to actually implement it for ScaleDownDelay flags; I'm also happy to just review if someone else feels like taking a shot at it).
Details:
ScaleDownDelayAfter* are currently implemented as a global check that results in completely skipping the scale-down logic, not as a per-node check like the 4 values in the new config. Having them per-nodegroup would require completely re-implementing them and figuring out some non-obvious questions about semantics: how should a per-nodegroup ScaleDownDelayAfterDelete work? It obviously blocks one nodegroup from scaling down, but should it be triggered only when a node from that nodegroup is deleted, or when any node in the cluster is deleted? The same goes for ScaleDownDelayAfterAdd. It seems to me that depending on the use-case you may want either behavior: block on any node if your pods can schedule on multiple nodegroups and you just want CA to pick the best one, or block only on that nodegroup if you have a dedicated nodegroup per workload and each pod can generally only go to one nodegroup.
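For reference, a rough sketch of what the per-NodeGroup options struct with those 4 values might look like - a sketch only, the field names are assumed to mirror the corresponding global flags, and the authoritative definition is in the cluster-autoscaler config package:

package config

import "time"

// NodeGroupAutoscalingOptions contain various options to customize how autoscaling of
// a given NodeGroup works. Different options can be used for each NodeGroup.
type NodeGroupAutoscalingOptions struct {
	// ScaleDownUtilizationThreshold is the utilization below which a node is considered for scale down.
	ScaleDownUtilizationThreshold float64
	// ScaleDownGpuUtilizationThreshold is the GPU utilization below which a GPU node is considered for scale down.
	ScaleDownGpuUtilizationThreshold float64
	// ScaleDownUnneededTime is how long a node has to be unneeded before it is eligible for scale down.
	ScaleDownUnneededTime time.Duration
	// ScaleDownUnreadyTime is how long an unready node has to be unneeded before it is eligible for scale down.
	ScaleDownUnreadyTime time.Duration
}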
OK I see, it's rather clear to me that CA should first be updated with the "simple" modifications.
As for ScaleDownDelayAfter*, I understand the impact, the amount of source code changes behind it, and the semantics questions. I may have a "dumb" solution, but what about adding a ScaleDownAfterBehaviour option with a global or local effect, or something like that?
Anyway, thanks for the hard work. I really appreciate the effort to add some per node-group config so quickly; it will really boost our autoscaling configuration for our managed solution 🚀
I don't think it's a dumb solution. I was considering a pretty much equivalent solution - keep the current scale-down-delay-after-* flags and add scale-down-delay-after-node-group-add/delete/etc (the name could use some work).
This gives maximum flexibility at the cost of adding even more complexity for the user. I guess it may not be a problem if you have a hosted solution, but configuring CA is already pretty hard, partly due to the huge number of flags we have. Still, it may be the best solution.
Whatever solution we decide on I'd rather do it in a separate PR though. I think this one is complex enough as is.
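To make that alternative concrete, something along these lines - a hypothetical sketch only, the new flag name is just the one floated above and nothing here is implemented or decided:

package main

import (
	"flag"
	"time"
)

var (
	// Existing global flag: after any scale-up, scale-down evaluation is paused for the whole cluster.
	scaleDownDelayAfterAdd = flag.Duration("scale-down-delay-after-add", 10*time.Minute,
		"How long after scale up that scale down evaluation resumes")

	// Hypothetical per-nodegroup variant: only the nodegroup that was scaled up would be blocked from scaling down.
	scaleDownDelayAfterNodeGroupAdd = flag.Duration("scale-down-delay-after-node-group-add", 0,
		"Hypothetical flag: how long a nodegroup is blocked from scale down after a node is added to it")
)

func main() {
	flag.Parse()
	_ = scaleDownDelayAfterAdd
	_ = scaleDownDelayAfterNodeGroupAdd
}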
Whatever solution we decide on I'd rather do it in a separate PR though. I think this one is complex enough as is.
And I agree :)
@towca Addressed your comment. I put the changes in a separate commit, since the fixes apply to different commits and amending them all would get tricky.
Thanks for addressing the comments! LGTM, but could you take a look at the second part of the remaining comment? If it's ok, feel free to remove the hold. /lgtm
This is the first step of implementing kubernetes#3583 (comment). A new method was added to the cloudprovider interface. All existing providers were updated with a no-op stub implementation that will result in no behavior change. The config values specified per NodeGroup are not yet applied.
This is the implementation of kubernetes#3583 (comment).
Force-pushed from f05df85 to c8e527b
/lgtm
Force-pushed from c8e527b to 65b3c8d
/lgtm
This PR implements #3583 (comment). It allows NodeGroups to override ScaleDownUnneededTime, ScaleDownUnreadyTime and utilization thresholds for nodes in that NodeGroup.
A new GetOptions method is added to the cloudprovider interface. Its implementation is optional: returning ErrNotImplemented will result in the current behavior of using the global config for all NodeGroups. A stub implementation returning ErrNotImplemented was added to all providers.
Note that this PR is only a framework for implementing the actual feature - there is no implementation for how the user would specify the options. Any provider is free to implement GetOptions (and the way to configure it) as they like.
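For illustration, a no-op stub along these lines is all a provider needs in order to keep the current behavior - a sketch, assuming the GetOptions signature shown below; see the cloudprovider interface change in this PR for the exact definition:

package exampleprovider

import (
	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
	"k8s.io/autoscaler/cluster-autoscaler/config"
)

// exampleNodeGroup stands in for a provider's NodeGroup implementation; only the new method is shown here.
type exampleNodeGroup struct{}

// GetOptions returns the NodeGroupAutoscalingOptions to use for this NodeGroup.
// Returning cloudprovider.ErrNotImplemented keeps the current behavior of using the global config.
func (ng *exampleNodeGroup) GetOptions(defaults config.NodeGroupAutoscalingOptions) (*config.NodeGroupAutoscalingOptions, error) {
	return nil, cloudprovider.ErrNotImplemented
}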