
min capacity draft #62

Closed · wants to merge 1 commit

Conversation

@grosser commented Nov 9, 2022

feat: allow setting minimum capacity per provisioner
fix: #749

Description
A rough draft of how I think a minimum capacity setting could work.

... alternatively this could be done via a "min nodes" setting, computing that times the highest (or average) CPU of the matching instance types, but that means more indirection.
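For illustration only, here is a rough sketch of what such a per-provisioner floor could look like on the v1alpha5 Provisioner spec. The `minCapacity` field name and shape are hypothetical (not an existing Karpenter field, and not necessarily what this draft implements); the existing `limits` ceiling is shown for contrast.

```yaml
# Hypothetical sketch only: "minCapacity" is not an existing Karpenter field;
# it illustrates the per-provisioner floor this draft is aiming for.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["m5.4xlarge", "m5.8xlarge"]
  limits:
    resources:
      cpu: "1000"       # existing ceiling
  minCapacity:          # hypothetical floor, mirroring the shape of limits
    resources:
      cpu: "50"
```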

How was this change tested?
Not tested (this is a rough draft).

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@mattsre commented Nov 9, 2022

If I understand it correctly, this would only provision capacity up to whatever minimum was defined? If I have an existing provisioner that is already managing 20cpu of compute and I set minCpu to 4cpu, it would effectively do nothing, right? I think this proposal is a bit too restrictive and wouldn't easily support some of the ideas from #749, like "prewarming" (I'm of course biased, because my need wouldn't be solved by this proposal either 😅).

If we implement headroom using similar code to what you have here and reuse my example from above: with cpuHeadroom set to 4cpu and an existing capacity of 20cpu, we could check how much of the 20cpu we manage is reserved. Say 18cpu of the 20cpu capacity is reserved; from there we can determine we need to provision a new node with 2cpu, bringing total capacity to 22cpu and covering the 4cpu headroom. Something like this could also handle your minCpu use case: if the provisioner is currently managing 0cpu and we set headroom to 10cpu, it's functionally equivalent to minCpu. We'd identify the difference and provision new nodes until we had at least 10cpu of headroom. I will admit a fully-baked headroom implementation is more complicated and performance-intensive to implement than something like the minCpu strategy.

@grosser (Author) commented Nov 9, 2022

yeah, our need was to have a min capacity, not a buffer
... so unless we find some clever in-between thing we might need both ...

@ellistarn (Contributor):

Thoughts on building this capability via a separate CRD? I see this use case as fundamentally different: static capacity vs. dynamic capacity.

@grosser (Author) commented Nov 10, 2022

I'd rather not create a new CRD, since:

  • this seems like an option for the provisioner
  • the sheer overhead of managing a new CRD and getting it into the codebase seems overwhelming to me
  • I want to mirror the previous ASG workload as much as possible, so having it behave in a similar way would be nice (that applies to the min capacity, not to the overhead capacity)

@ellistarn (Contributor):

This is an interesting idea as a mechanism similar to "limits". Essentially, we have a ceiling today, but not a floor. What type of capacity would you expect for the floor? How big should the nodes be? Which zones should they be launched in?

We'd love to collaborate on this. Can you ping me on slack and we can discuss next steps?

@grosser (Author) commented Nov 10, 2022

  • type: we'd set this based on CPU cores, but it might be simpler to say "number of nodes", because then we don't need to do all the cpu/memory math
  • nodes: same as specified in the provider (if nothing is specified, the biggest instance type) ... we use a list of instance types and would want the first (preferred) type

... so maybe doing minNodes is simpler: it would create a fake pod the size of the preferred node (minus room for daemonsets etc.), as in the sketch below.
That would mirror the ASG behavior nicely.
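For illustration only, a sketch of what such a synthesized placeholder pod could look like if the preferred node were an m5.4xlarge (16 vCPU / 64 GiB). The name, image, and numbers are made up, and in practice Karpenter could keep this pod purely in-memory for its scheduling simulation rather than actually creating it.

```yaml
# Illustration only: a placeholder ("fake") pod sized to roughly fill the
# preferred instance type after subtracting daemonset/system overhead.
apiVersion: v1
kind: Pod
metadata:
  name: min-capacity-placeholder-0   # hypothetical name
spec:
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.9
      resources:
        requests:
          cpu: "15"        # 16 vCPU minus ~1 vCPU of daemonsets
          memory: 58Gi     # 64 GiB minus room for daemonsets/kube-reserved
```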

@ellistarn (Contributor):

How does this interact with consolidation / expiry?

@grosser (Author) commented Nov 10, 2022

The extra nodes would expire the same as all others.
For consolidation, it would need to use the same "fake pods" logic so it does not decide to remove the extra nodes.

@ellistarn (Contributor):

One thing we discussed in working group was using negative preemption pods to solve this problem. These pods would be included in scale out calculations, but wouldn't themselves trigger scale out unless there were positive preemption pods. Similarly, we wouldn't consolidate nodes with negative preemption pods, but we would terminate a node if it only had negative preemption pods.
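As a rough illustration of the building blocks available today for this pattern (a negative-value PriorityClass with `preemptionPolicy: Never`, plus pause pods holding the reserved requests): the names and sizes below are placeholders, and the "don't trigger scale-out on their own" behavior described above would still need Karpenter-side logic.

```yaml
# Sketch of the building blocks: placeholder pods in a negative-priority class
# that never preempt real workloads and are preempted first when real pods arrive.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: capacity-placeholder        # hypothetical name
value: -10                          # below the default priority of 0
preemptionPolicy: Never             # these pods never push out real workloads
globalDefault: false
description: "Pods that hold capacity but yield it to real workloads."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-placeholder
spec:
  replicas: 2                       # how much capacity to hold
  selector:
    matchLabels:
      app: capacity-placeholder
  template:
    metadata:
      labels:
        app: capacity-placeholder
    spec:
      priorityClassName: capacity-placeholder
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
```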

@grosser (Author) commented Nov 11, 2022

That sounds good to me.
Do you have a preferred direction in terms of how many/how big these pods should be?

@solidnerd:

I also thought about this problem yesterday, since we want something like pre-warming nodes on a schedule, so I can say Mon-Fri minSize=2 and Sat-Sun minSize=0.

My initial idea was to use a Kubernetes CronJob that places a pause container with the fitting CPU and memory requests for the workload, with a configurable shutdown of the pause container (this is similar to the preemption-pods idea).

To get a perfect fit, we would need the option to specify the CPU and memory for the fake pods.
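A minimal sketch of that idea, assuming a weekday schedule and made-up names and sizes; `activeDeadlineSeconds` provides the configurable shutdown.

```yaml
# Sketch of the CronJob idea: pre-warm capacity on weekday mornings by running
# pause pods with explicit requests; activeDeadlineSeconds gives the
# configurable shutdown. Schedule, sizes, and names are placeholders.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: prewarm-capacity
spec:
  schedule: "0 6 * * 1-5"            # 06:00 Monday through Friday
  jobTemplate:
    spec:
      parallelism: 2                 # number of placeholder pods to run
      activeDeadlineSeconds: 43200   # tear the pods down after 12 hours
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: pause
              image: registry.k8s.io/pause:3.9
              resources:
                requests:
                  cpu: "8"
                  memory: 16Gi
```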

@grosser (Author) commented Nov 11, 2022

Karpenter knows how much CPU/memory daemonsets use, so it could schedule up to the biggest possible pods that fit onto the biggest supported instance type.
(so 50 CPUs + a 16xlarge -> just 1 pod)

@ellistarn (Contributor):

Daemonsets are definitely an interesting way to do node overhead! @tzneal

@skillcoder commented Dec 27, 2022

It is not enough to have ReservedCPU (MinCpu); we also need at least ReservedMemory, since some workloads can use lots of memory and not much CPU.
And of course it should be configurable per "application group".
This feature should also be able to work with (or like) KEDA, tracking some metrics and scaling these values accordingly, since the reserved values can and will be dynamic: for example, they should be higher during huge update waves or massive scaling events, and lower during quiet hours at night.

Without all of these requirements implemented, it is better to stay with the overprovisioning dirty hack and manage it outside of Karpenter, since for now that is more flexible.
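For the schedule-driven part, that external overprovisioning approach can already be driven by something like KEDA's cron scaler pointed at a pause-pod Deployment; a minimal sketch with placeholder names, times, and counts:

```yaml
# Sketch: a KEDA cron trigger scaling a placeholder/overprovisioning Deployment
# up during business hours and down to zero at night. Names, times, and counts
# are placeholders.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: capacity-placeholder
spec:
  scaleTargetRef:
    name: capacity-placeholder       # the pause-pod Deployment holding capacity
  minReplicaCount: 0
  triggers:
    - type: cron
      metadata:
        timezone: Etc/UTC
        start: "0 6 * * 1-5"         # scale up on weekday mornings
        end: "0 20 * * 1-5"          # scale back down in the evening
        desiredReplicas: "4"
```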

@grosser (Author) commented Dec 27, 2022

  • min-memory only needs to be considered when using different instance families
  • per application group: this could be per provisioner
  • KEDA-like features were not in scope for me and don't "need" to be added

@ellistarn (Contributor) commented Dec 27, 2022

I think this change warrants a design. See https://karpenter.sh/v0.20.0/contributing/design-guide/

@njtran (Contributor) commented Mar 2, 2023

Hey @grosser, thanks a ton for discussing this with us on this PR. Unfortunately I don't think we'll be accepting this PR, due to the number of design unknowns that come with it. There are a lot of moving parts in the deprovisioning logic where adding static capacity could create some funky circumstances. All in all, if you're willing to help work on this design, please come talk with us in https://kubernetes.slack.com/archives/C04JW2J5J5P; we're more than willing to help get you up to speed on the current state of the project.

@njtran closed this Mar 2, 2023