
cluster-autoscaler: override by annotation node scale down time #2989

Closed

Conversation

@JulienBalestra (Contributor) commented Mar 27, 2020

This PR allows using specific scale-down thresholds per node, making it easier to manage a heterogeneous set of nodegroups.

metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/scale-down-unneeded-time: "1h"
    cluster-autoscaler.kubernetes.io/scale-down-unready-time: "2h"

For example, when using the storage-local-static-provisioner and StatefulSets, you don't want a node that is missing its stateful pods to be scaled down after the same time as the stateless ones.
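
For illustration, the annotation could also be applied to an existing node directly with kubectl (the node name below is a placeholder):

kubectl annotate node <node-name> cluster-autoscaler.kubernetes.io/scale-down-unneeded-time=1h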

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 27, 2020
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign vivekbagade
You can assign the PR to them by writing /assign @vivekbagade in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@JulienBalestra force-pushed the jb/unneeded-annotation branch from d7c482f to 89e8447 on March 27, 2020 13:24
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 25, 2020
@JulienBalestra (Contributor, Author)

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 25, 2020
@JulienBalestra (Contributor, Author)

@feiskyer @losipiuk would it be possible to get feedback on this PR / use case?

@MaciekPytel (Contributor)

Sorry, I missed that one earlier.

As a go-to solution I would prefer allowing these and other settings to be overridden on a per-nodegroup basis where it makes sense. I was thinking about a similar use case recently, and my idea was to extend the NodeGroup interface with a method returning a struct with values such as scaleDownUnneededTime or scaleDownUtilizationThreshold (using the top-level settings as a fallback when not available or not implemented).
The benefit of that solution is that I think it would be more convenient for the user:

  1. It would only need to be configured once per nodegroup, rather than requiring an annotation on each node.
  2. I believe that various versions of nodegroup autodiscovery are the most popular way of configuring CA. With the override being done via the cloudprovider module, it would be easy to extend each provider's implementation of autodiscovery to include those additional parameters and keep all the config related to a node group in one place.
    • Technically speaking the same applies to providers using the --nodes flag to configure node groups - the syntax of this flag is already provider-specific and can be extended to cover this case.

The downside would be that making it provider specific would result in inconsistencies between providers, but I feel it's too late for that concern - configuring CA already wildly differs between providers.

Would my solution also cover your use-case? Or do you actually want to be able to configure it on a per-node basis?


Somewhat unrelated to the discussion on the API - a possible problem with having different scaleDownUnneededTime for different nodes is that CA only keeps a certain number of nodes as 'scale-down candidates'. Without any preference for those with shorter scaleDownUnneededTime there are edge cases where long-waiting candidates could effectively block scale-down. It would be worth trying to think through other edge cases that may arise from making various configs per-nodepool, but I don't think it should block adding the configuration knobs.
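
For illustration, a minimal Go sketch of that per-NodeGroup override idea (the type and method names here are hypothetical, not the project's actual API; a nil result would mean "fall back to the global flags"):

package cloudprovider

import "time"

// NodeGroupAutoscalingOptions is a hypothetical struct holding the
// per-nodegroup overrides discussed above; its fields mirror the existing
// global flags.
type NodeGroupAutoscalingOptions struct {
	ScaleDownUnneededTime         time.Duration
	ScaleDownUnreadyTime          time.Duration
	ScaleDownUtilizationThreshold float64
}

// NodeGroupWithOptions sketches how the NodeGroup interface could be extended:
// a node group returns its overrides, or nil to use the top-level settings.
type NodeGroupWithOptions interface {
	GetOptions(defaults NodeGroupAutoscalingOptions) (*NodeGroupAutoscalingOptions, error)
}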

@JulienBalestra (Contributor, Author) commented Sep 14, 2020

@maciaszczykm thank you for the answer and your detailed proposal.

On our end, we prefer to keep this logic on Kubernetes objects like nodes, so that the user experience stays customisable (which is what this PR does).

I see the following advantages to going with the original proposal:

  • a node annotation is consistent with the existing cluster-autoscaler.kubernetes.io/scale-down-disabled=true
  • the logic could be handled by dedicated in-cluster controllers scoped to nodes, instead of having to be implemented several times across different cloud providers
  • thresholds could change dynamically at per-node granularity instead of per nodegroup - this could help reduce our number of ASGs/MIGs/VMSSes in the future
  • thresholds are easier to understand when visible inside the Kubernetes cluster
  • the code base refactoring is minimal - however this is not a strong argument imo

As both make sense, could we potentially implement both, or would that be too misleading?

@marwanad (Member)

@MaciekPytel @JulienBalestra I was thinking about a similar use case for scaleDownUtilizationThreshold and came across this thread. For the configuration side of things, is there any concern about bringing the dynamic config back? I understand that autodiscovery "won" for configuration of nodegroups, but I feel that if we're extending the nodegroup flags to include things beyond name, min and max, it's a much cleaner interface and UX to have those configured via a configmap in a similar fashion to:

{
  "nodeGroups": [
    {
      "minSize": 1,
      "maxSize": 100,
      "name": "pool-name",
      "scaleDownUtilizationThreshold": 0.7
    }
  ],
  "autoScalerSettings": {
    "scan-interval": "60s"
  }
}

than a provider-specific interpretation of autodiscovery. I believe that not all cloud-providers have implementations of autodiscovery.
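
For illustration, such a config could be shipped as a plain ConfigMap and mounted into the cluster-autoscaler pod (a hypothetical sketch; the ConfigMap name and key are illustrative, not an existing contract):

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-nodegroup-config   # illustrative name
  namespace: kube-system
data:
  config.json: |
    {
      "nodeGroups": [
        { "minSize": 1, "maxSize": 100, "name": "pool-name", "scaleDownUtilizationThreshold": 0.7 }
      ],
      "autoScalerSettings": { "scan-interval": "60s" }
    }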

@MaciekPytel (Contributor) commented Oct 19, 2020

Sorry, I've missed your replies. I don't particularly like annotations for the following reasons:

  1. You can't use the feature without writing some software (a reconciling controller or a mutating admission webhook?). In an autoscaled cluster you'd expect those nodes to have been created by the autoscaler, and there is just nothing that would automatically set those annotations. I appreciate that it may be the best solution for your particular use case, but I don't think there will be many other users who'd be able to use it.
  2. There are a lot of values that are currently global but are checked for each node - the scale-down thresholds mentioned by @marwanad are a good example. This means we'd need a lot of those annotations, which I think is problematic:
  • The logic gets spread all over the place (we'd need to check for each annotation in the place where we use it - see the sketch after this list). With one source of config you can keep all the logic encapsulated.
  • Annotations are ~impossible to deprecate. With a central config you can at least crash early with some sort of error.
  3. "threshold could dynamically change per node granularity instead of per nodegroup" - this is debatable, but I actually see it as a downside. NodeGroup is an abstraction for a group of identical nodes. Allowing mixed config for some settings when we can't support it for others complicates things.
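
To make the "spread all over the place" point concrete, here is a hypothetical per-call-site helper of the kind every such check would need (the annotation key is the one proposed in this PR; the helper itself is illustrative, not existing code):

package nodeconfig

import (
	"time"

	apiv1 "k8s.io/api/core/v1"
)

// scaleDownUnneededTimeFor returns the per-node override from the proposed
// annotation, falling back to the global flag value when the annotation is
// absent or malformed.
func scaleDownUnneededTimeFor(node *apiv1.Node, globalDefault time.Duration) time.Duration {
	if raw, ok := node.Annotations["cluster-autoscaler.kubernetes.io/scale-down-unneeded-time"]; ok {
		if d, err := time.ParseDuration(raw); err == nil {
			return d
		}
	}
	return globalDefault
}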

Regarding configmap: You make good points about having a consistent implementation between providers and the fact that some don't have autodiscovery. The problems with this approach that I see:

  • The user would need to know the id of each NodeGroup to write config.
    • I feel this is problematic for clusters that are automatically created by some sort of tooling. It's not difficult to include the autodiscovery tags when you create your ASG/MIG/etc. using something like a Terraform template; generating a configmap automatically seems much more complicated.
    • How a cloudprovider chooses to assign ids to NodeGroups is an implementation detail. With a configmap it becomes an implicit interface.
    • There is no requirement for NodeGroups to match 1-1 with any underlying cloudprovider abstraction. For example, a multi-zonal GKE nodepool is implemented as multiple separate NodeGroups. And for an on-prem implementation you may not have any underlying object that maps to a NodeGroup - it could just be an internal grouping of similar VMs.

All those problems apply to any in-cluster config and I'm not sure how we can easily solve them. Maybe there is some sort of a middle ground solution here where we can have a cloudprovider specific mechanism, but provide a default implementation using configmap? Though it seems a bit ugly.

edit: On the implementation side - I don't believe the old implementation of dynamic config ever worked. It was contributed by an external contributor in the early days of the project and was never maintained; IIRC it had some breaking bugs that never got fixed. I don't believe CA can handle recreating StaticAutoscaler (probably even more so now that we encourage people to use stateful processors in forks), so even if we decide to go with a configmap I think it's going to be much easier and safer to implement it by updating the settings inside RunOnce (similarly to how we update settings in cloudprovider.Refresh()).

@mwielgus (Contributor) commented Dec 7, 2020

Closing due to inactivity. Feel free to reopen.

@mwielgus mwielgus closed this Dec 7, 2020
@ghost commented Nov 8, 2021

Is there a way to achieve this (different scale-down times per nodegroup) currently?

@MaciekPytel (Contributor)

Depends on where you're running your cluster. The core autoscaler logic supports it, but not all providers have integrated with this logic, and many only added integration recently (e.g. GCE; Azure support is coming in 1.23 I think; #4238 adding AWS support is pending).
