Warm Up Nodes Options (Hibernation) #3798
It seems like you may need some combination of kubernetes-sigs/karpenter#749 with an option to designate that manually provisioned capacity as a "warm pool." Do you know what the capacity is going to look like, and do you want the warm pool to be right-sized? Or are you just looking to specify some constraints on a manually provisioned warm pool, i.e., being able to manually launch Karpenter capacity as described in kubernetes-sigs/karpenter#749?
@jonathan-innis I think having a manual node could be helpful to some extent, but it doesn't really align well with the provisioner idea unless it references the provisioner in some way.
However, I am looking for something more like …
Yeah, I think this is being tracked over here: #3240. Do you mind including your use case over there? This issue looks like a duplicate of the discussion that's occurring there.
Closing this as a duplicate of #3240.
@jonathan-innis Why was this closed as a duplicate? This issue is about an option similar to Warm Pools for ASGs, but in Karpenter.
I agree that this is not a duplicate of #3240. That one is about keeping extra nodes active all the time, ready to pick up jobs. This issue is about keeping some nodes (AWS instances) in a shutdown state rather than terminated, so that when a new node is needed the existing machine can be restarted rather than a new machine being created from scratch.

I use Karpenter for managing GitLab CI build machines: when a new build job comes in, it starts a new machine to run that job, then shuts the machine down again afterwards. For most of the day there are no machines running, just occasional ones started when a git commit is pushed. Currently, I have a ~1.5 minute delay on a build job while it creates and provisions the machine, but at least I'm only paying while the job is running.

I'm in the process of getting going with the new Windows support for Windows build jobs, and it's looking like up to 20 minutes to provision a Windows machine and pull a (rather large) Docker build image. With #3240 I'd basically end up with at least one "warm" machine running 24/7, incurring significant cost. With the proposal in this issue, I'd have one shutdown machine in AWS ready to restart when a job comes in, which should start up significantly faster but only cost a small storage fee while shut down.
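To make the stop/restart pattern concrete, here's a minimal boto3 sketch of what's being asked for, assuming AWS credentials are configured. The instance ID, region, and function names are placeholders; this is not something Karpenter does today.

```python
# Sketch (not current Karpenter behaviour) of the stop/restart pattern
# described above, using boto3. The instance ID and region are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def park_after_job(instance_id: str) -> None:
    """Stop (not terminate) the build machine; only EBS storage is billed."""
    # Passing Hibernate=True would additionally persist RAM to disk, but only
    # works if the instance was launched with hibernation enabled and has an
    # encrypted root volume.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

def wake_for_job(instance_id: str) -> None:
    """Restart the parked machine; typically much faster than a fresh launch."""
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
```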
@andrewleech you can bake EBS snapshots containing the common images you frequently need and attach those to Karpenter nodes, avoiding having to download them on every new node.
Thanks @FernandoMiguel, that's interesting; I didn't realise that was possible. On Windows, almost everything is based on one of two Windows base/core images, so it would certainly be good to have them preloaded. We use a range of different things on Linux, so I'm not sure what I'd load there, but it's worth thinking about. However, on any OS it would mean extra processes to create and maintain those snapshots (security updates, etc.). It's definitely worth testing to see how much time it saves versus the initial time to just create the machine.
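As a rough sketch of the snapshot-baking idea with boto3 (the seeding step and names are assumptions, not a documented Karpenter workflow): pull images onto an EBS volume once, snapshot it, and then reference the snapshot from the node template's block device mappings.

```python
# Sketch: baking an image-cache snapshot with boto3. The volume is assumed to
# have been seeded already (e.g. by pulling images on a throwaway instance).
import boto3

ec2 = boto3.client("ec2")

def bake_image_cache_snapshot(volume_id: str) -> str:
    """Snapshot a volume whose container-image cache was pre-populated."""
    resp = ec2.create_snapshot(
        VolumeId=volume_id,
        Description="Pre-pulled container image cache",
        TagSpecifications=[{
            "ResourceType": "snapshot",
            "Tags": [{"Key": "purpose", "Value": "image-cache"}],
        }],
    )
    # Wait until the snapshot is usable before referencing it from node templates.
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[resp["SnapshotId"]])
    return resp["SnapshotId"]
```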
I've tested building a custom Windows AMI (using EC2 Image Builder) for my Windows nodes, with a bunch of container images pre-pulled with crictl. I was also able to enable EC2 Fast Launch on the image. Using this image is faster with Karpenter, but there's still a ~6 minute startup time. The pod logs show the pre-pulled images are all being used, so that did help. I was really hoping for a lot faster, though.
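For anyone wanting to reproduce the Fast Launch part, here's a hedged boto3 sketch; the AMI ID is a placeholder and the snapshot count is just an example.

```python
# Sketch: enable EC2 Fast Launch on a custom Windows AMI so launches can skip
# most of the Sysprep/boot work. The AMI ID below is a placeholder.
import boto3

ec2 = boto3.client("ec2")

resp = ec2.enable_fast_launch(
    ImageId="ami-0123456789abcdef0",                   # placeholder Windows AMI
    ResourceType="snapshot",                           # keep pre-provisioned snapshots
    SnapshotConfiguration={"TargetResourceCount": 2},  # snapshots to keep ready
    MaxParallelLaunches=6,                             # API minimum is 6
)
print(resp["State"])  # e.g. "enabling"
```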
Apologies for missing the back-and-forth here and not re-opening this one earlier. You're correct that I misclassified this one at first glance.
Would shutdown instances still help here, or are there other areas you can see that are bottlenecking?
Another data point: Cluster Autoscaler managed on AKS has a "deallocate" scale-down mode, where rather than deleting VMs, we put them in a "deallocated" state, which is essentially the same as hibernation. Then, when you need to scale up, you wake one of the hibernated instances. Jack is taking a stab at upstreaming the change here, for reference. Some users who require 1s latency are OK paying for the OS disk, with the trade-off that the VM will start immediately when they need it.
I am also curious about the full breakdown of the bottlenecks you are facing. If the bottleneck is image pull, hibernated instances may not save you as much time, and optimizing image pull may make more sense, as you tried, though you can probably go deeper. Hibernated instances may save you 30-45s, but for some larger container images that saving is dwarfed by the pull time. Solving at the node bootstrapping layer is just one layer of potential latency. I haven't dived deep on the AWS side, but I imagine similar gains are achievable by fully optimizing image pull.
Given the number of upvotes on this and linked issues, will this feature be made available soon? |
Here's a blog post showing how using shutdown instances can decrease boot time: https://depot.dev/blog/faster-ec2-boot-time

I imagine something like that, combined with pre-loading images, could make adding new nodes very fast.
@jonathan-innis How can we make this a reality? In theory and in practice, I have seen that this works, and I have a good idea of how we may be able to execute it. What do you think the next steps should be? It's the 2nd most requested feature: https://github.com/aws/karpenter-provider-aws/issues?q=is%3Aissue%20state%3Aopen%20sort%3Areactions-%2B1-desc
Tell us about your request
Allow Karpenter to provision nodes in a hibernated state, which would decrease new-node provisioning time for rapid scaling.
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
Karpenter is excellent at optimizing cluster capacity; however, applications that require rapid scaling must wait until new nodes are provisioned.
There is a proposal to add headroom logic, but that means we would still have running nodes with no workloads, which we would be charged for.
Another option is to support hibernation (stopped instances): instances that are already bootstrapped and ready to join the cluster once needed. This capability is already supported out of the box as Warm Pools for Auto Scaling groups.
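For comparison, the existing ASG feature looks roughly like this through boto3 (the group name and sizes are illustrative); the request here is for Karpenter to offer an equivalent.

```python
# Sketch: configuring an ASG Warm Pool with boto3, for comparison with what
# this issue asks Karpenter to support. The ASG name is illustrative.
import boto3

asg = boto3.client("autoscaling")

asg.put_warm_pool(
    AutoScalingGroupName="ci-build-nodes",   # illustrative ASG name
    PoolState="Hibernated",                  # or "Stopped" / "Running"
    MinSize=1,                               # always keep one pre-warmed instance
    MaxGroupPreparedCapacity=3,
)
```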
Are you currently working around this issue?
Using low-priority pods, similar to #3240, could be less practical from a cost-saving perspective (a sketch of that workaround follows below).
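As a sketch of that workaround, here's a negative-priority PriorityClass for placeholder "headroom" pods, created with the official Kubernetes Python client; the class name and priority value are illustrative.

```python
# Sketch: a low-priority "headroom" PriorityClass, created with the official
# Kubernetes Python client. Name and value are illustrative. Placeholder pods
# using this class reserve capacity and are preempted first when real pods arrive.
from kubernetes import client, config

config.load_kube_config()

pc = client.V1PriorityClass(
    metadata=client.V1ObjectMeta(name="overprovisioning"),
    value=-10,                 # lower than any real workload's priority
    global_default=False,
    description="Placeholder pods that reserve headroom and are preempted first",
)
client.SchedulingV1Api().create_priority_class(pc)
```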
Additional Context
No response
Attachments
No response