
Warm Up Nodes Options (Hibernation) #3798

Open
abebars opened this issue Apr 24, 2023 · 14 comments
Labels
feature New feature or request

Comments

@abebars

abebars commented Apr 24, 2023

Tell us about your request

Allow Karpenter to provision extra nodes in a hibernated state, which would decrease provisioning time for new nodes during rapid scaling.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

Karpenter is excellent at optimizing cluster capacity; on the other hand, applications that require rapid scaling have to wait until new nodes are provisioned.
There is a proposal here to add headroom logic, but that still means running nodes with no workloads that we are being charged for.

Another option is to support hibernation (stopped instances) that are already bootstrapped and ready to join the cluster once needed. This capability is already supported out of the box as Warm Pools for Auto Scaling groups.
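
For reference, this is roughly how the equivalent is configured on a plain Auto Scaling group today (a CloudFormation sketch; the resource names are hypothetical):

# Warm pool of pre-initialized, hibernated instances attached to an ASG
BuildWarmPool:
  Type: AWS::AutoScaling::WarmPool
  Properties:
    AutoScalingGroupName: !Ref BuildAutoScalingGroup   # hypothetical ASG
    PoolState: Hibernated          # instances bootstrap once, then hibernate until needed
    MinSize: 2                     # always keep two warm instances on standby
    MaxGroupPreparedCapacity: 5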

Are you currently working around this issue?

Using low-priority pods, which could be less practical from a cost-saving perspective. Similar to #3240.
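
For context, that workaround looks roughly like this (a sketch of the common pause-pod overprovisioning pattern; names and sizes are illustrative):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1            # lower than the default (0), so these pods are preempted first
globalDefault: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning-placeholder
spec:
  replicas: 1
  selector:
    matchLabels:
      app: overprovisioning-placeholder
  template:
    metadata:
      labels:
        app: overprovisioning-placeholder
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "3"        # sized to roughly one node's worth of capacity
              memory: 8Gi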

Additional Context

No response

Attachments

No response

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@abebars abebars added the feature (New feature or request) label on Apr 24, 2023
@jonathan-innis
Contributor

It seems like you may need some combination of kubernetes-sigs/karpenter#749 with an option to designate that manually provisioned capacity as a "warm pool"?

Do you know what the capacity is going to look like, and you want the warm pool to be right-sized? Or are you just looking to put some constraints on a manually provisioned warm pool, i.e. being able to manually launch Karpenter capacity as described in kubernetes-sigs/karpenter#749?

@abebars
Author

abebars commented Apr 26, 2023

It seems like you may need some combination of kubernetes-sigs/karpenter#749 with an option to designate that manually provisioned capacity as a "warm pool"?

Do you know what the capacity is going to look like, and you want the warm pool to be right-sized? Or are you just looking to put some constraints on a manually provisioned warm pool, i.e. being able to manually launch Karpenter capacity as described in kubernetes-sigs/karpenter#749?

@jonathan-innis I think having a manual node could be helpful to some extent, but it doesn't really align well with the provisioner idea unless it references the provisioner in some way.
So if we were doing a manual node, I would expect something like:

apiVersion: karpenter.sh/v1alpha5
kind: NodeGroup
metadata:
  name: default
spec:
  replicas: 2
  provisionerRef:
    name: my-provisioner

However, I am looking for something more like

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  ......
  # Resource limits constrain the total size of the cluster.
  # Limits prevent Karpenter from creating new instances once the limit is exceeded.
  limits:
    resources:
      cpu: "1000"
      memory: 1000Gi
  # Buffer is added on top of the total required capacity to ensure there is extra room for scaling.
  # It can be an absolute quantity or a percentage of the total provisioned capacity.
  buffer:
    resources:
      cpu: "10"      # or "10%"
      memory: 100Gi  # or "10%"
    warm: true # If true, buffer nodes are hibernated; otherwise they are running and in a ready state.

@jonathan-innis
Contributor

Yeah, I think this is being tracked over in #3240. Do you mind including your use-case over there? This issue looks like a duplicate of the discussion that's occurring there.

@jonathan-innis
Contributor

Closing this as a duplicate of #3240

@a7i

a7i commented Jun 14, 2023

@jonathan-innis Why was this closed as a duplicate? This issue is about an option for Karpenter similar to Warm Pools for ASGs.
The referenced duplicate is about overprovisioning.

@andrewleech

I agree that this is not a duplicate of #3240

That one is about keeping extra nodes active all the time, ready to pick up jobs.

This issue is about having some nodes (AWS instances) in a shut-down state rather than terminated, so that when a new node is needed the existing machine can be restarted rather than creating a new machine from scratch.

I use Karpenter for managing GitLab CI build machines: when a new build job comes in, it starts a new machine to run that job, then shuts the machine down again afterwards. For most of the day there are no machines running, just occasional ones started when a git commit is pushed.

Currently, I have a ~1.5 minute delay before a build job starts while the machine is created and provisioned, but at least I'm only paying while the job is running.

I'm in the process of getting started with the new Windows support for Windows build jobs; it's looking like up to 20 minutes to provision a Windows machine and pull a (rather large) Docker build image.

With #3240 I'd basically end up with at least one "warm" machine running 24/7, incurring significant cost.

With the proposal in this issue, I'd have one shut-down machine in AWS ready to restart when a job comes in, which should start up significantly faster but only incur a small storage fee while shut down.

@FernandoMiguel
Contributor

@andrewleech You can bake EBS snapshots containing the images you most frequently need and attach them to Karpenter nodes, avoiding having to download them on every new node.
That should improve your boot time considerably.
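
Roughly what that could look like on the node template (a sketch against the v1alpha5-era AWSNodeTemplate API; the snapshot ID and discovery tags are placeholders):

apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: prebaked-images
spec:
  subnetSelector:
    karpenter.sh/discovery: my-cluster          # placeholder discovery tag
  securityGroupSelector:
    karpenter.sh/discovery: my-cluster
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        snapshotID: snap-0123456789abcdef0      # snapshot pre-baked with the common images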

@andrewleech

Thanks @FernandoMiguel, that's interesting; I didn't realise that was possible.

On Windows, I guess almost everything is based on one of two Windows base/core images, so it'd certainly be good to have those preloaded. On Linux we use a range of different things, so I'm not sure what I'd load there, but it's worth thinking about.

However, on any OS it would mean extra processes to create and maintain those snapshots (security updates, etc.).

It's definitely worth testing, at least to see how much time it saves versus the initial time to create the machine from scratch.

@andrewleech

I've tested building a custom Windows AMI (using EC2 Image Builder) for my Windows nodes, with a bunch of container images pre-pulled with crictl.

I was also able to enable EC2 Fast Launch on the image.

Using this image is faster with Karpenter, but there's still a ~6 minute start-up time.

The pod logs show the pre-pulled images are all being used, so that did help. I was really hoping for a lot faster, though.
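
For anyone trying the same thing, the pre-pull step can be expressed as an EC2 Image Builder component roughly like this (a sketch; the image names are placeholders, and it assumes containerd and crictl are already installed on the build instance):

name: prepull-build-images
description: Pre-pull large container images so nodes boot with a warm cache
schemaVersion: 1.0
phases:
  - name: build
    steps:
      - name: PullImages
        action: ExecutePowerShell
        inputs:
          commands:
            - crictl pull mcr.microsoft.com/windows/servercore:ltsc2022
            - crictl pull registry.example.com/windows-build-image:latest   # placeholder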

@jonathan-innis
Contributor

Apologies for missing the back-and-forth here and not re-opening this one earlier. You're correct that I misclassified it at first glance.

The pod logs show the pre-pulled images are all being used, so that did help. I was really hoping for a lot faster, though.

Would shut-down instances still help here, or are there other bottlenecks that you can see?

@Bryce-Soghigian
Contributor

Bryce-Soghigian commented Apr 3, 2024

Another data point: the managed cluster autoscaler on AKS has a "deallocate" scale-down mode, where rather than deleting VMs we put them in a deallocated state, which is essentially the same as hibernation. Then when you need to scale up, you wake one of the deallocated instances back up.

Jack is taking a stab at upstreaming the change here, for reference.

Some users who require 1s latency are OK paying for the OS disk, with the tradeoff that the VM will start immediately when they need it.

Would shut-down instances still help here, or are there other bottlenecks that you can see?

I am also curious about the full breakdown of the bottlenecks you are facing. If the bottleneck is image pull, hibernated instances may not save you as much time, and optimizing image pull may make more sense, like you tried, but you can probably go deeper.

Hibernated instances may save you 30-45s, but for some larger container images, such as sagemathinc/cocalc, image start time can be reduced from 405.3s to 2.9s using things like Artifact Streaming and overlaybd.
[Screenshot: container image start-time comparison with Artifact Streaming / overlaybd]

Source

Solving things at the node-bootstrapping layer addresses just one layer of potential latency. I haven't dived deep on the AWS side, but I imagine similar results are achievable by fully optimizing image pull.

@myloginid

Given the number of upvotes on this and linked issues, will this feature be made available soon?

@jtdoepke

jtdoepke commented Jun 3, 2024

Here's a blog post showing how using shut-down instances can decrease boot time: https://depot.dev/blog/faster-ec2-boot-time

I imagine something like that, combined with pre-loading images, could make adding new nodes very fast.
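
Worth noting if this goes the hibernation route rather than plain stop/start: EC2 hibernation has to be enabled when the instance launches and requires an encrypted root volume, so Karpenter would need to set it on the launch templates it creates. A CloudFormation sketch with hypothetical names:

WarmNodeLaunchTemplate:
  Type: AWS::EC2::LaunchTemplate
  Properties:
    LaunchTemplateData:
      HibernationOptions:
        Configured: true          # opt in to Stop (hibernate) at launch time
      BlockDeviceMappings:
        - DeviceName: /dev/xvda
          Ebs:
            Encrypted: true       # hibernation requires an encrypted root volume
            VolumeSize: 100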

@abebars
Author

abebars commented Jan 21, 2025

@jonathan-innis How can we make this a reality? In theory and in practice, I have seen that this works, and I have a good idea of how we may be able to execute it. What do you think the next steps should be?

It's the 2nd most-requested open feature: https://github.com/aws/karpenter-provider-aws/issues?q=is%3Aissue%20state%3Aopen%20sort%3Areactions-%2B1-desc

