Consolidation ttl: spec.disruption.consolidateAfter #735

Closed
runningman84 opened this issue Dec 19, 2022 · 87 comments · Fixed by #1453
Labels: cost-optimization, deprovisioning, kind/api-change, kind/feature, v1

Comments

@runningman84

Tell us about your request

We have a cluster with a lot of cron jobs that run every 5 minutes...

This means we have 5 nodes for our base workloads, and every 5 minutes we get additional nodes for 2-3 minutes, which are then scaled down or consolidated into the existing nodes.

This leads to a constant flow of nodes joining and leaving the cluster. It looks like the Docker image pulls and node initialization generate more network traffic fees than we save by not running the instances all the time.

It would be great if we could configure some consolidation period, maybe together with ttlSecondsAfterEmpty, which would only clean up or consolidate nodes if the capacity had been idling for x amount of time.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

Creating a special provisioner is quite time consuming because all app deployments have to be changed to leverage it...

Are you currently working around this issue?

We are thinking about putting cron jobs into a special provisioner that would not use consolidation but would rely on the ttlSecondsAfterEmpty feature.
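
For illustration, a minimal sketch of that workaround using the v1alpha5 API current at the time; the provisioner name, label, and taint are hypothetical, and the providerRef assumes an existing AWSNodeTemplate named "default":

    apiVersion: karpenter.sh/v1alpha5
    kind: Provisioner
    metadata:
      name: cronjobs                 # hypothetical pool dedicated to cron jobs
    spec:
      # Only empty-node cleanup; consolidation is not enabled for this pool
      ttlSecondsAfterEmpty: 300
      labels:
        workload-type: cronjobs      # hypothetical label for job pods to select on
      taints:
        - key: workload-type
          value: cronjobs
          effect: NoSchedule
      providerRef:
        name: default                # assumes an existing AWSNodeTemplate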

Additional Context

No response

Attachments

No response

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@runningman84 added the kind/feature label Dec 19, 2022
@ellistarn changed the title from "Consolidation with cronjobs" to "Consolidation ttl: ttlSecondsAfterUnderutilized" Dec 19, 2022
@ellistarn
Contributor

We've talked about this a fair bit -- I think it should be combined/collapsed w/ ttlSecondsAfterEmpty.

@ellistarn
Contributor

The challenge with this issue is more technical than anything. Computing ttlSecondsAfterEmpty is cheap, since we can cheaply compute empty nodes. Computing a consolidatable node requires a scheduling simulation across the rest of the cluster. Computing this for all nodes is really computationally expensive. We could potentially compute this once on the initial scan, and again once the TTL is about to expire. However, this can lead to weird scenarios like:

  • t0, node detected underutilized, enqueued to 30s TTL
  • t0+25s, pod added somewhere else in the cluster, making the node no longer consolidatable
  • t0+29s, pod removed somewhere else in the cluster, making the node consolidatable again, should restart TTL
  • t0+30s, node is consolidated, even though it's only been consolidatable for 1 second.

The only way to get the semantic to be technically correct is to recompute the consolidatability for the entire cluster on every single pod creation/deletion. The algorithm described above is a computationally feasible way (equivalent to current calculations), but has weird edge cases. Would you be willing to accept those tradeoffs?

@kylebisley

The only way to get the semantic to be technically correct is to recompute the consolidatability for the entire cluster on every single pod creation/deletion. The algorithm described above is a computationally feasible way (equivalent to current calculations), but has weird edge cases. Would you be willing to accept those tradeoffs?

I'm a little unclear on this, and I think it's in how I'm reading it, not in what you've said. What I think I'm reading is that recomputing consolidatability on every single pod creation/deletion is too expensive. As an alternative, the algorithm above is acceptable, but in some cases it could result in node consolidation in less than TTLSecondsAfterConsolidatable due to fluctuation in cluster capacity between the initial check (t0) and the confirmation check (t0+30s in the example).

Have I understood correctly?

@ellistarn
Contributor

Yeah exactly. Essentially, the TTL wouldn't flip flop perfectly. We'd be taking a rough sample (rather than a perfect sample) of the data.

@kylebisley

Thanks for the clarity. For my usage I'd not be concerned about the roughness of the sample. As long as there was a configurable time frame and the confirmation check needed to pass both times I'd be satisfied.

What I thought I wanted before being directed to this issue was to be able to specify how the consolidator was configured a bit like the descheduler project because I'm not really sure if the 'if it fits it sits' approach to scheduling is what I need in all cases.

@ellistarn
Contributor

Specifically, what behavior of descheduler did you want?

@kylebisley

Generally I was looking for something like the deschedulerPolicy.strategies config block, which I generally interact with through the Helm values file.
More specifically, I was looking for deschedulerPolicy.strategies.LowNodeUtilization.params.nodeResourceUtilizationThresholds (targetThresholds and thresholds).
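
For context, a sketch of those descheduler Helm values (field path as named above; the percentage values are purely illustrative):

    deschedulerPolicy:
      strategies:
        LowNodeUtilization:
          enabled: true
          params:
            nodeResourceUtilizationThresholds:
              thresholds:            # nodes below all of these are considered underutilized
                cpu: 20
                memory: 20
                pods: 20
              targetThresholds:      # nodes above any of these may have pods evicted from them
                cpu: 50
                memory: 50
                pods: 50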


@c3mb0

c3mb0 commented Feb 9, 2023

To give another example of this need, I have a cluster that runs around 1500 pods - there are lots of pods coming in and out at any given moment. It would be great to be able to specify a consolidation cooldown period so that we are not constantly adding/removing nodes. Cluster Autoscaler has the flag --scale-down-unneeded-time that helps with this scenario.

@sichiba

sichiba commented Feb 24, 2023

Is this feature available yet?

@agustin-dona-peya

We are facing the same issue with high node rotation due to too-aggressive consolidation. It would be nice to tune and control the behaviour, e.g. a minimum node TTL, a threshold TTL since a node became empty or underutilised, and merging nodes.

@calvinbui

cluster-autoscaler has other options too like:

the --scale-down-delay-after-add, --scale-down-delay-after-delete, and --scale-down-delay-after-failure flags, e.g. --scale-down-delay-after-add=5m to decrease the scale-down delay to 5 minutes after a node has been added.

I'm looking forward to something like scale-down-delay-after-add to pair with consolidation. Our hourly cronjobs are also causing node thrashing.
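
For reference, a sketch of how these flags are typically passed to the cluster-autoscaler container; the flag names are real CAS flags, while the surrounding manifest fragment and values are illustrative:

    containers:
      - name: cluster-autoscaler
        image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.27.2   # illustrative tag
        command:
          - ./cluster-autoscaler
          - --scale-down-delay-after-add=5m
          - --scale-down-delay-after-delete=0s
          - --scale-down-delay-after-failure=3m
          - --scale-down-unneeded-time=10m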

@tareks

tareks commented Apr 13, 2023

Another couple of situations that currently lead to high node churn are:

  • A "high impact" rollout across various namespaces or workloads results in a large amount of resources being allocated. This spike in allocation is temporary, but Karpenter will provision new nodes as a result. After the rollout is complete, capacity returns to normal, resulting in a consolidation attempt. This means node lifetimes can be ~10-15 minutes depending on how long the rollout takes.
  • A batch of cron jobs that are scheduled at the same time and have a specific set of requests. This will also likely result in new node(s) being created. Once the jobs are complete, there will likely be free capacity that will prompt Karpenter to consolidate.

In both situations above, some workloads end up being restarted multiple times within a short time frame due to node churn, and if not enough replicas are configured with sufficient anti-affinity/skew, there is a chance of downtime while pods become ready again on new nodes.

It would be nice to be able to control the consolidation period, say every 24 hours or every week as described by the OP so it's less disruptive. Karpenter is doing the right thing though!

I suspect some workarounds could be:

  • simply provisioning additional capacity to accommodate rollouts
  • possible use of nodeSelectors for the scheduled jobs to run on without impacting other longer running workloads
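
A sketch of the second workaround above, pinning a CronJob to a dedicated provisioner via a nodeSelector and toleration; the names, label, and image are hypothetical:

    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: periodic-report          # hypothetical
    spec:
      schedule: "*/5 * * * *"
      jobTemplate:
        spec:
          template:
            spec:
              nodeSelector:
                workload-type: cronjobs        # hypothetical label set by the dedicated provisioner
              tolerations:
                - key: workload-type
                  value: cronjobs
                  effect: NoSchedule
              containers:
                - name: job
                  image: public.ecr.aws/docker/library/busybox:1.36
                  command: ["sh", "-c", "echo running"]
              restartPolicy: Never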

Any other ideas or suggestions appreciated.

@thelabdude

Adding here as another use case where we need better controls over consolidation, especially around utilization. For us, there's a trade-off between utilization efficiency and the disruption caused by pod evictions. For instance, let's say I have 3 nodes, each utilized at 60%; the current behavior is that Karpenter will consolidate down to 2 nodes at 90% capacity. But in some cases, evicting the pods on the node to be removed is more harmful than achieving optimal utilization. It's not that these pods can't be evicted (for that we have the do-not-drain annotation), it's just that it's not ideal. A good example would be Spark executor pods: while they can recover from a restart, it's better if they are allowed to finish their work at the expense of some temporary inefficiency in node utilization.

CAS has the --scale-down-utilization-threshold flag (along with the other flags mentioned), and it seems like Karpenter needs a similar tunable. Unfortunately, we're seeing so much disruption to running pods because of consolidation that we can't use Karpenter in any of our active clusters.

@FernandoMiguel

@thelabdude can't your pods set terminationGracePeriodSeconds https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination ?

@thelabdude

I'll have to think about whether the termination grace period could help us, but I wouldn't know what value to set, and it would probably vary by workload...

My point was more, I'd like better control over the consolidate decision with Karpenter. If I have a node hosting expensive pods (in terms of restart cost), then a node running at 55% utilization (either memory / cpu) may be acceptable in the short term even if the ideal case is to drain off the pods on that node to reschedule on other nodes. Cluster Auto-scaler provides this threshold setting and doesn't require special termination settings on the pods.

I'm not saying a utilization threshold is the right answer for Karpenter but the current situation makes it hard to use in practice because we get too much pod churn due to consolidation and our nodes are never empty, so turning consolidation off isn't a solution either.

@njtran
Contributor

njtran commented May 8, 2023

Hey @thelabdude, this is a good callout of core differences between CA's deprovisioning and Karpenter's deprovisioning. Karpenter has intentionally chosen not to use a threshold: for any threshold you create, the heterogeneous nature of pod resource requests can produce unwanted edge cases that constantly need to be fine-tuned.

For more info, ConsolidationTTL here would simply act as a waiting mechanism between consolidation actions, which you can read more about here. Since this would essentially just be a wait, it will simply slow down the time Karpenter takes to reach the end state you've described. One idea that might help is if Karpenter allowed some configuration of the cost-benefit analysis that Consolidation does. This would need to be framed as either cost or utilization, both tough to get right.

If you're able to in the meantime, you can set do-not-evict on the pods you don't want consolidated, and you can also use the do-not-consolidate node annotation. More here.
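
For illustration, a sketch of those two annotations as they existed in the alpha API at the time (they were later folded into karpenter.sh/do-not-disrupt); the pod and node names are hypothetical:

    # Pod that consolidation should not evict
    apiVersion: v1
    kind: Pod
    metadata:
      name: batch-worker                       # hypothetical
      annotations:
        karpenter.sh/do-not-evict: "true"
    spec:
      containers:
        - name: worker
          image: public.ecr.aws/docker/library/busybox:1.36
    ---
    # Node opted out of consolidation
    apiVersion: v1
    kind: Node
    metadata:
      name: ip-10-0-0-1.example.internal       # hypothetical
      annotations:
        karpenter.sh/do-not-consolidate: "true"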

@tareks

tareks commented May 14, 2023

Are there any plans to implement or accept such a feature that adds some sort of time delay between node provisioning and consolidation? Perhaps based on the age of a node? The main advantage would be to increase stability during situations where there are surges in workload (scaling, scheduling, or roll outs).

@Hronom

Hronom commented May 18, 2023

Hey, can you just add a delay before starting consolidation after a pod change?

You can add several delays:

  1. After last deployed pod
  2. After last consolidation round

This would help run consolidation during periods of low activity on the cluster.

@sftim

sftim commented Jun 5, 2023

Also see issue #696: Exponential decay for cluster desired size

@sftim

sftim commented Jun 5, 2023

This comment suggests another approach we might consider.

My point was more, I'd like better control over the consolidate decision with Karpenter. If I have a node hosting expensive pods (in terms of restart cost), then a node running at 55% utilization (either memory / cpu) may be acceptable in the short term even if the ideal case is to drain off the pods on that node to reschedule on other nodes. Cluster Auto-scaler provides this threshold setting and doesn't require special termination settings on the pods.

(from #735)

Elsewhere in Kubernetes, ReplicaSets can pay attention to a Pod deletion cost.

For Karpenter, we could have a Machine or Node level deletion cost, and possibly a contrib controller that raises that cost based on what is running there.

Imagine that you have a controller that detects when Pods are bound to a Node, and updates the node deletion cost based on some quality of the Pod. For example: if you have a Pod annotated as starts-up-slowly, you set the node deletion cost for that node to 7 instead of the base value of 0. You'd also reset the value once the node didn't have any slow starting Pods.
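
For reference, the existing Pod-level mechanism looks like this; the annotation key is the real controller.kubernetes.io/pod-deletion-cost, while the node-level deletion cost discussed above is hypothetical, and the pod name and value are illustrative:

    apiVersion: v1
    kind: Pod
    metadata:
      name: starts-up-slowly                   # hypothetical
      annotations:
        # Higher cost = the ReplicaSet controller prefers to delete other pods first
        controller.kubernetes.io/pod-deletion-cost: "7"
    spec:
      containers:
        - name: app
          image: public.ecr.aws/docker/library/busybox:1.36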

@vainkop

vainkop commented Jul 8, 2024

Has there been any update on when this may be implemented?

It seems the solution was already implemented in this PR but it hasn't been merged yet 😞

@edas-smith

Has there been any update on when this may be implemented?

It seems the solution was already implemented in this PR but it hasn't been merged yet 😞

Ah I see! Thanks for linking me to it :) glad this is just now pending a merge, which hopefully will come soon!

@njtran
Contributor

njtran commented Jul 8, 2024

Hey all, after some more design discussion, we've decided not to go with the approach initially taken in #1218, as it was computationally expensive, hard to reason about, and had to take too many guesses as to how the kube-scheduler would schedule pods in relation to our simulations.

Alternatively, the approach we're now running with for v1 is using consolidateAfter to filter out nodes for consolidation. When you set consolidationPolicy: WhenUnderutilized and (for example) set consolidateAfter: 1m, Karpenter will not consolidate nodes that have had a pod scheduled or removed in the last minute. To achieve the same Consolidation behavior prior to v1, you can set consolidateAfter: 0s.
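
As a sketch of the shape described here (field names as used in this comment; the policy value ultimately shipped in v1 as WhenEmptyOrUnderutilized, see later comments):

    # NodePool disruption block (other NodePool fields omitted)
    spec:
      disruption:
        consolidationPolicy: WhenUnderutilized
        consolidateAfter: 1m    # skip nodes whose pods changed within the last minute
        # consolidateAfter: 0s  # matches the pre-v1 consolidation behavior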

Let us know what you think of this approach.

@vainkop

vainkop commented Jul 8, 2024

Hey all, after some more design discussion, we've decided not to go with the approach initially taken in #1218, as it was computationally expensive, hard to reason about, and had to take too many guesses as to how the kube-scheduler would schedule pods in relation to our simulations.

Alternatively, the approach we're now running with for v1 is using consolidateAfter to filter out nodes for consolidation. When you set consolidationPolicy: WhenUnderutilized and (for example) set consolidateAfter: 1m, Karpenter will not consolidate nodes that have had a pod scheduled or removed in the last minute. To achieve the same Consolidation behavior prior to v1, you can set consolidateAfter: 0s.

Let us know what you think of this approach.

@njtran please clarify: does it mean that starting with the v1 API, consolidateAfter can be configured along with consolidationPolicy: WhenUnderutilized? I believe an error or a warning is printed if v1beta1 is used.

Is there any ETA for a Karpenter v1 API release with which the above config can actually be used?

Thank you.

@njtran
Contributor

njtran commented Jul 8, 2024

@vainkop I mean when this is implemented, if you set consolidateAfter: 0s, it'll be backwards compatible, and present the same behavior as before this was implemented.

@wmgroot
Contributor

wmgroot commented Jul 8, 2024

Karpenter will not consolidate nodes that have had a pod scheduled or removed in the last minute.

Are there any plans to support consolidation constraints that do not depend on pod scheduling at all, and instead depend on total node lifetime? I think relying solely on the time since pod scheduling occurred could result in the following issues, and may not meet our needs as cluster administrators.

  1. Nodes may never be consolidated if pods are churning too frequently (due to frequent deploys or ephemeral jobs scheduling), resulting in inefficient cluster binpacking and wasted resources.
  2. Nodes may be torn down before reaching a desired minimum age. We have license agreements with 3rd-party providers such as Datadog that bill us per node with a 1-hour minimum. If a node exists for only 30m, we are still billed for the full hour. The implementation as described would require that we set our consolidateAfter to a minimum of 1 hour to avoid this issue, even if we would rather set it below an hour to better match our desired pod churn rates.

@wmgroot
Contributor

wmgroot commented Jul 8, 2024

I do support this and think having a consolidation control that's based on pod scheduling is a good idea, but I think it may fall short of meeting all the use cases that are asking for better consolidation control. Perhaps having additional fields that limit consolidation would cover all use cases better. I think something like this is what we'd likely want to use if each of these fields were available.

consolidationOptions:
  minimumAge: 6h
  afterPodScheduledPeriod: 30m
  afterPodRemovedPeriod: 15m

@njtran
Contributor

njtran commented Jul 8, 2024

  1. Nodes may never be consolidated if pods are churning too frequently (due to frequent deploys or ephemeral jobs scheduling), resulting in inefficient cluster binpacking and wasted resources.
  2. Nodes may be torn down before reaching a desired minimum age. We have license agreements with 3rd-party providers such as Datadog that bill us per node with a 1-hour minimum. If a node exists for only 30m, we are still billed for the full hour. The implementation as described would require that we set our consolidateAfter to a minimum of 1 hour to avoid this issue, even if we would rather set it below an hour to better match our desired pod churn rates.

Totally agree here. I think a minimum age would be better solved through the do-not-disrupt annotation, supported on the node for your use case, but also potentially on the pod (#752).
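
For illustration, a sketch of that node-level annotation (annotation key as documented for the beta/v1 APIs; the node name is hypothetical, and any minimum-age automation around it would be custom):

    apiVersion: v1
    kind: Node
    metadata:
      name: ip-10-0-0-2.example.internal       # hypothetical
      annotations:
        karpenter.sh/do-not-disrupt: "true"    # blocks voluntary disruption of this node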

afterPodScheduledPeriod: 30m
afterPodRemovedPeriod: 15m

On these two configurations, I wonder if this would be better addressed with some sort of node headroom or static capacity. Is this really to combat the case where a set of job pods goes down and you want to reserve the capacity for some period of time because you know the pods will come back soon?

@ellistarn
Contributor

We've discussed a minimumNodeLifetime that would apply to all disruption methods. I see this as orthogonal to this effort, and something we should explore in a follow-on.

@wmgroot
Contributor

wmgroot commented Jul 8, 2024

I think a minimum age would be better solved through the do-not-disrupt annotation

Are you suggesting that I write a controller to set the do-not-disrupt annotation based on node age myself, or that Karpenter would support this through use of the do-not-disrupt annotation under the hood as an implementation option?

This is really to combat the case where a set of job pods go down, and you want to reserve the capacity for some period of time because you know that pods will come back soon?

afterPodScheduledPeriod would typically be to help with particular applications that have slow startup speeds (we have some stateful apps that rebalance data over several minutes on startup). I could also see it being a potentially better option than the do-not-disrupt annotation for ephemeral pods like CI jobs that typically run less than an hour but can lose a lot of progress or require manual retries if disrupted.

afterPodRemovedPeriod is me taking a guess that a small buffer like this could be helpful with reducing node churn in the cluster, since pods could more easily schedule into gaps left by removed pods. This setting seems identical to CA's scale-down-unneeded-time

Scaling down of unneeded nodes can be configured by setting --scale-down-unneeded-time. Increasing value will make nodes stay up longer, waiting for pods to be scheduled while decreasing value will make nodes be deleted sooner.
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-modify-cluster-autoscaler-reaction-time

We've discussed a minimumNodeLifetime, that would apply to all disruption methods. I see this as orthogonal to this effort, and something we should explore in a followon.

Sure, but are we considering the case where I want to block consolidation disruption for nodes <6h old but drift nodes immediately regardless of age? Implementing this as an isolated consolidation control may be the best way forward.

@vainkop

vainkop commented Jul 8, 2024

@vainkop I mean when this is implemented, if you set consolidateAfter: 0s, it'll be backwards compatible, and present the same behavior as before this was implemented.

I was of course more interested in the new functionality solving the issue we're discussing rather than something that presents "the same behavior as before this was implemented"...

@njtran
Contributor

njtran commented Jul 8, 2024

@wmgroot

Are you suggesting that I write a controller to set the do-not-disrupt annotation based on node age myself, or that Karpenter would support this through use of the do-not-disrupt annotation under the hood as an implementation option?

This would be a feature supported natively in Karpenter; I've linked the issue here: #752.

afterPodRemovedPeriod is me taking a guess that a small buffer like this could be helpful with reducing node churn in the cluster, since pods could more easily schedule into gaps left by removed pods. This setting seems identical to CA's scale-down-unneeded-time

Makes sense here, we sort of follow this with our 15s validation period in between finding and executing consolidation. We could consider making this configurable in the future too.

@njtran
Contributor

njtran commented Jul 8, 2024

@njtran please clarify: does it mean that starting with the v1 API, consolidateAfter can be configured along with consolidationPolicy: WhenUnderutilized? I believe an error or a warning is printed if v1beta1 is used.

@vainkop, yes you will be able to set consolidateAfter with WhenUnderutilized.

@tareks

tareks commented Jul 9, 2024

It may be worth going forward with the proposed approach to see how it holds up in the real world.

At first glance, it seems like consolidation based on pod scheduling may not yield the desired results, at least in our use case. The trigger to delete nodes appears to be more of a "recentlyUtilized" condition (as opposed to WhenUnderutilized).

Here's one scenario:

  1. Nodes are relatively bin packed. (initial state)
  2. Multiple deployments kick off at the same time requiring more resources.
  3. Karpenter provisions new nodes and pods are scheduled on them.
  4. CronJob based pods start being scheduled on the new node(s) resulting in pods scheduled within a few minutes.
  5. One or more of the new nodes may never be consolidated despite being underutilised because pods are constantly being scheduled on them.

I would expect that at some point the cluster will reach some state of equilibrium. It's unclear whether there may be runaway situations where one could end up with several underutilised nodes.

Similar to what @Shadowssong mentioned, we've found that this sort of configuration may work as a tactical approach, especially in nonprod clusters:

    disruption:
      budgets:
      - nodes: 10%
      - nodes: "2"
      - duration: 8h
        nodes: "0"
        schedule: 0 9 * * mon-fri
      consolidationPolicy: WhenUnderutilized
      expireAfter: 720h0m0s

It reduces node churn by allowing a short window for Karpenter to run its consolidation cycle. This, in turn, has reduced the amount of disruption over the past week or so, with (Spot) nodes now showing significantly longer lifetimes/ages.

@njtran
Contributor

njtran commented Jul 9, 2024

One or more of the new nodes may never be consolidated despite being underutilised because pods are constantly being scheduled on them.

Totally makes sense. I imagine you would tune your consolidateAfter to a lower value in this case. Tuning it to a higher value means recently scheduled-to nodes are consolidated less aggressively. Tuning it to a smaller value, smaller than the interval at which your job pods come in, makes Karpenter more aggressive.

@njtran
Contributor

njtran commented Jul 9, 2024

Another point worth noting for the approach mentioned in #735 (comment):

Karpenter prioritizes consolidating nodes that have the least number of pods scheduled. Compare this to the default plugin for kube-scheduler, which schedules pods to the nodes that are least allocated. Generally, this means Karpenter's heuristic is least compatible with the kube-scheduler default, which schedules pods onto the very nodes Karpenter wants to disrupt. If the kube-scheduler were configured with MostAllocated, Karpenter would run into fewer race cases and be most compatible with the proposed solution.
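
For context, a sketch of configuring the default scheduler with the MostAllocated scoring strategy (KubeSchedulerConfiguration as documented for recent Kubernetes versions; the weights are illustrative):

    apiVersion: kubescheduler.config.k8s.io/v1
    kind: KubeSchedulerConfiguration
    profiles:
      - schedulerName: default-scheduler
        pluginConfig:
          - name: NodeResourcesFit
            args:
              scoringStrategy:
                type: MostAllocated
                resources:
                  - name: cpu
                    weight: 1
                  - name: memory
                    weight: 1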

@wmgroot
Contributor

wmgroot commented Jul 15, 2024

It reduces node churn by allowing a short window for Karpenter to run its consolidation cycle.

Implementing a maintenance window is undesired for us because we do not want to create special periods of time where disruption can occur vs not. This makes on-call awkward, increases the complexity of information we have to communicate to users of our clusters, and requires unique configuration of our NodePools for each region we operate in.

I'd also like to note that we need to be able to configure consolidation controls separately per consolidation type in some cases. We've currently implemented a patch to disable single-node consolidation altogether after finding it wasn't providing much value in return for the large amount of disruption to our clusters. Given there's a tangible $$$ cost to restarting applications in some cases, it's entirely possible that single-node consolidation wastes more money than it saves with a naive implementation.

Since single-node consolidation exists solely to replace one EC2 instance with a cheaper variant, having a way to control the threshold at which karpenter will decide to replace the node would be wonderful (eg, only replace if you'll reduce the cost of the node by >15%).

@ellistarn
Contributor

Since single-node consolidation exists solely to replace one EC2 instance with a cheaper variant, having a way to control the threshold at which karpenter will decide to replace the node would be wonderful (eg, only replace if you'll reduce the cost of the node by >15%).

+1 for price improvement threshold. I think this is orthogonal to consolidateAfter, though. Do you mind cutting a new issue for this, and referencing https://github.com/kubernetes-sigs/karpenter/blob/main/designs/spot-consolidation.md#2-price-improvement-factor?

@wmgroot
Contributor

wmgroot commented Jul 17, 2024

Done.
#1440

@miadabrin

Is there a way to have this implemented/released sooner than v1? I think this will give everyone a way to handle the aggressiveness of the scaling down a little better

@samox73

samox73 commented Jul 19, 2024

@miadabrin I think the urgency of this issue is evident from the lively discussion. Development for the issue is ongoing at #1218 and the discussion seems semi-active, but I'm unsure how close this PR is to completion.

@wmgroot
Contributor

wmgroot commented Jul 19, 2024

@samox73 The maintainers have stated that the PR you linked has been abandoned in favor of a new implementation of consolidateAfter.

#735 (comment)

@sherifabdlnaby

Is there a way to have this implemented/released sooner than v1? I think this will give everyone a way to handle the aggressiveness of the scaling down a little better

I second that; it wouldn't be a breaking change per se, because setting consolidateAfter together with WhenUnderutilized was previously disallowed.

@tm-nadavsh

tm-nadavsh commented Aug 29, 2024

@Shadowssong

I am also experiencing the high-churn issue that has been reported here, and until Karpenter supports more configuration around this, we found a semi-hacky middle ground between always scaling up and constantly consolidating by using budget schedules:

  spec:
    disruption:
      budgets:
      - duration: 45m
        nodes: "0"
        schedule: '@hourly'
      - nodes: "1"
      consolidationPolicy: WhenUnderutilized

This basically limits consolidation to only happen during the last 15 minutes of each hour and only 1 node at a time. This is by no means a good final solution but it seems to work as expected and you can fine tune the times to prevent constant consolidation. Hope this helps anyone else looking for a temporary workaround.

With the new Karpenter version 1.0.1, from the blog post "Introducing consolidateAfter consolidation control for underutilized nodes":

Karpenter prioritizes nodes to consolidate based on the least number of pods scheduled. Users with workloads that experience rapid surges in demand or interruptible jobs might have high pod churn and have asked to be able to tune how quickly Karpenter attempts to consolidate nodes to retain capacity and minimize node churn. Previously, consolidateAfter could only be used when consolidationPolicy=WhenEmpty, which is when the last pod is removed. consolidateAfter can now be used when consolidationPolicy=WhenEmptyOrUnderutilized, thus allowing users to specify in hours, minutes, or seconds how long Karpenter waits when a pod is added or removed before consolidating. If you would like the same behavior as v1beta1, then set consolidateAfter to 0 when consolidationPolicy=WhenEmptyOrUnderutilized.

This gives a better solution.
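
For reference, a minimal sketch of the v1 fields described in that quote (the NodePool name and the 5m value are illustrative; the rest of the NodePool spec is omitted):

    apiVersion: karpenter.sh/v1
    kind: NodePool
    metadata:
      name: default
    spec:
      disruption:
        consolidationPolicy: WhenEmptyOrUnderutilized
        consolidateAfter: 5m     # wait 5 minutes after a pod is added or removed
        # consolidateAfter: 0s   # same behavior as v1beta1
      # template, limits, etc. omitted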

@joewragg

joewragg commented Sep 5, 2024

Think this can be closed as it's fixed in karpenter 1.0.0

@mamoit

mamoit commented Sep 5, 2024

@joewragg It was closed on the 31st of July.
