Mega Issue: Manual node provisioning #749

Open
ellistarn opened this issue Jul 5, 2022 · 68 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. v1.x Issues prioritized for post-1.0

Comments

@ellistarn
Contributor

ellistarn commented Jul 5, 2022

Tell us about your request
What do you want us to build?

I'm seeing a number of feature requests to launch nodes separately from pending pods. This issue is intended to broadly track this discussion.

Use Cases:

  • Create a System pool to run components like karpenter, loadbalancer, coredns, etc
  • Provision baseline capacity that never scales down
  • Manually preprovision a set of nodes before a large event

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@ellistarn ellistarn added the kind/feature Categorizes issue or PR as related to a new feature. label Jul 5, 2022
@ellistarn ellistarn changed the title from "Mega issue: Support manual node provisioning" to "Manual node provisioning" Jul 5, 2022
@ellistarn
Contributor Author

One design option would be to introduce a Node Group custom resource that maintains a group of nodes with a node template + replica count. This CR would be identical to the Provisioner CR, except TTLs are replaced with replicas.

apiVersion: karpenter.sh/v1alpha5
kind: NodeGroup
metadata:
  name: default
spec:
  replicas: 1
  taints: [...]
  requirements:
    - key: karpenter.k8s.aws/instance-size
      operator: In
      values: ["large"]
    - key: karpenter.k8s.aws/instance-family
      operator: In
      values: ["c5", "r5", "m5"]
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot"]
  providerRef:
    name: my-provider
---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: my-provider
spec:
  subnetSelector:
    karpenter.sh/discovery: "${CLUSTER_NAME}" 
  securityGroupSelector:
    karpenter.sh/discovery: "${CLUSTER_NAME}"

@ellistarn ellistarn changed the title from "Manual node provisioning" to "Mega Issue: Manual node provisioning" Jul 5, 2022
@FernandoMiguel

most of us have aws managed node groups on ASG with at least 2 nodes to handle this

@ellistarn
Contributor Author

> most of us have aws managed node groups on ASG with at least 2 nodes to handle this

I agree that many of these cases are handled by simply using ASG or MNG. Still worth collating these requests to see if this assumption is bad for some cases.

@FernandoMiguel

> > most of us have aws managed node groups on ASG with at least 2 nodes to handle this
>
> I agree that many of these cases are handled by simply using ASG or MNG. Still worth collating these requests to see if this assumption is bad for some cases.

I would love to have karpenter handle it all... But we still need a place to run karpenter from.

The only (dirty) way I see of doing that is to deploy an EC2 instance and run karpenter there with two replicas and hostname anti-affinity. Karpenter would then launch a second node that it manages and schedule the second replica there, and we would kill off the first, manually deployed VM.

Or we can just have it tagged with something that makes karpenter manage it until it hits its TTL.

@gazal-k

gazal-k commented Jul 10, 2022

We're considering running karpenter and coredns on Fargate and karpenter then provisioning capacity for everything else.

I believe there was some documentation about this somewhere. Also, it was an AWS SA who suggested running coredns on Fargate as well (we were originally thinking about just running karpenter on Fargate).

@gazal-k

gazal-k commented Jul 10, 2022

> Manually preprovision a set of nodes before a large event

For this use case, wouldn't changing the minReplicas on desired application HPAs work better? That's what we do, so that there is no delay in spinning up more Pods for a rapid surge in traffic.
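
As a rough illustration of that approach, here is a minimal sketch of an HPA whose floor is raised ahead of an event. The Deployment name `web`, the replica counts, and the CPU target are all assumptions made up for this example. The idea is that the extra replicas go Pending until Karpenter launches nodes for them, so the capacity is already in place when traffic arrives.

```yaml
# Hypothetical HPA for a Deployment named "web"; every name and number here is illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  # Raise this floor before the event (for example from 3 to 30) and lower it again afterwards.
  minReplicas: 30
  maxReplicas: 100
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
```

Note that this only pre-provisions capacity for workloads that have an HPA; it does not by itself create generic spare nodes.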

@FernandoMiguel

> We're considering running karpenter and coredns on Fargate and karpenter then provisioning capacity for everything else.

I tried that a couple of weeks ago and it was a very frustrating experience, with huge deployment times and many, many timeouts and failures.
Not the best ops experience.
Until coredns is Fargate-native, without stupid hacks to modify the deployment, I don't believe this is the best path.

@gazal-k

gazal-k commented Jul 10, 2022

> I tried that a couple weeks ago and was a very frustrating experience with huge deployment times, and many many timeouts and failures. Not the best ops experience. Until coredns is fargate native without stupid hacks to modify the deployment, I don't believe this is the best path.

Does this not work: https://docs.aws.amazon.com/eks/latest/userguide/fargate-getting-started.html?

@FernandoMiguel

> > I tried that a couple weeks ago and was a very frustrating experience with huge deployment times, and many many timeouts and failures. Not the best ops experience. Until coredns is fargate native without stupid hacks to modify the deployment, I don't believe this is the best path.
>
> Does this not work: https://docs.aws.amazon.com/eks/latest/userguide/fargate-getting-started.html?

We are a Terraform house, so the steps are slightly different.
I've had EKS clusters with Fargate-only workloads work a few times, but it's a really hit-and-miss kind of deployment. Patching coredns is a hard problem.

@ermiaqasemi

In some cases, especially with critical workloads, like auto-scaling, that always need some nodes to be up and running, it would be great if we could specify that a minimum of X nodes is always available for scheduling.

@realvz

realvz commented Oct 13, 2022

Another use case is when you'd like to prewarm nodes at scheduled times. Currently, customers have to wait for Karpenter to provision nodes when pods are pending, or create dummy pods that trigger scaling before the production workload begins. Reactive scaling is slow, and the alternative seems like a workaround.

Ideally, customers should be able to create a provisioner schedule that creates and deletes nodes based on a defined schedule. Alternatively, Karpenter can have a CRD that customers can manipulate themselves to precreate nodes (without having pending pods).

@cove

cove commented Oct 20, 2022

Our use case is that we need to increase our node count during version upgrades, which can take hours or days. During that time we cannot have any scale-downs, so it would be ideal for our upgrading app to be able to manually control what's going on.

(Also, for context, our case isn't a web app but an app that maintains a large in-memory state that needs to be replicated during an upgrade before being swapped out.)

@mattsre

mattsre commented Nov 6, 2022

Following from the "reserve capacity" or Ocean's "headroom" issue here: aws/karpenter-provider-aws#987

Our specific use case is we have some vendored controller that polls an API for workloads, and then schedules pods to execute workloads as they come in. The vendored controller checks to see if nodes have the resources to execute the workload before creating a pod for it. Because of this, no pods are ever created once the cluster is considered "full" by the controller. We've put in a feature request to the vendor to enable a feature flag on this behavior, but I still think there could be benefit to having some headroom functionality as described in Ocean docs here: https://docs.spot.io/ocean/features/headroom for speedier scaling

Maybe headroom could be implemented on a per provisioner level? The provisioners already know exactly how much cpu/memory they provision, and with the recent consolidation work I'd assume there's already some logic for knowing how utilized the nodes themselves are.

@grosser

grosser commented Nov 6, 2022 via email

@grosser

grosser commented Nov 9, 2022

FYI, here is a rough draft of how I think this feature could look
... basically have a new field on the provisioner and then add fake pods before the scheduler makes its decisions
#62

@crazywill

We would also like Karpenter to support warm pools. Right now it takes 6 minutes to spin up a node, which is too long for us. We would like to have a warm pool feature similar to the one in ASG.

@abebars

abebars commented Apr 26, 2023

+1 to warm pool, I opened another issue which is more towards the warm pool options. @crazywill feel free to chime in there if you have a chance

@billrayburn billrayburn added the v1.x Issues prioritized for post-1.0 label Apr 26, 2023
@jackfrancis
Contributor

Adding some thoughts here.

We could emulate cluster-autoscaler's min-replica-count and max-nodes-total approach?

  1. If you set a minimum, then the karpenter provisioner will, if necessary, statically scale out to that number of nodes
  2. If you set a maximum, then the karpenter provisioner will not scale beyond that node count
  3. If you set minimum and maximum to the same value, the karpenter provisioner will set the cluster node count to a static number (the value of minimum and maximum)
  4. If you do not provide a minimum configuration, the default is "no minimum" (effectively we give karpenter provisioner permission to scale in to zero nodes)
  5. If you do not provide a maximum configuration, the default is to place no node count limit on karpenter provisioner's ability to create more nodes, if necessary

One obvious difference between cluster-autoscaler and karpenter is that by design "number of nodes" is not a first class operational attribute in terms of describing cluster node capacity (because nodes are not homogeneous). So using "minimum number of nodes" to express desired configuration for solving some of the stories here (specifically the warm pool story) isn't sufficient by itself: you would also need "type/SKU of node". With both "number" + "SKU" you can deterministically guarantee a known capacity, and now you're sort of copying the cluster-autoscaler approach.

However, the above IMO isn't really super karpenter-idiomatic. It would seem better to express "guaranteed minimum capacity" in a way that was closer to the operational reality that actually informs the karpenter provisioner. Something like:

  • minCPURequestsAvailable
  • minMemRequestsAvailable

etc.

Basically, some sufficient amount of input that karpenter could use to simulate a "pod set" to then sort of "dry run" into the scheduler:

  • Take my existing Ready nodes and assume that nothing is actually running on them
  • Do said Ready nodes fulfill the "dry run" configuration? (e.g., could I schedule 500 CPU cores, 1TB memory?, whatever)

It gets trickier when you consider the full set of critical criteria that folks use in the pod/scheduler ecosystem: GPU, confidential compute, etc. But I think it's doable.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 1, 2024
@chamarakera

When migrating from Cluster Auto Scaler to Karpenter, we would like Karpenter to provision a node beforehand, before we perform a drain on the old node. It takes time for Karpenter to provision a new node based on the unscheduled workloads, and due to this, the pods are kept in a Pending state for too long.

@Bryce-Soghigian
Member

Bryce-Soghigian commented Feb 7, 2024

@chamarakera would just requesting a node via a NodeClaim be enough in this case? No managed solution is required if you just want to create a node. From my testing, just applying a NodeClaim with a reference to a valid NodePool is enough to create a node ahead of time in Karpenter.

Note that this example uses an instance type from Azure.

k apply -f nodeclaim.yaml

apiVersion: karpenter.sh/v1beta1
kind: NodeClaim
metadata:
  name: temporary-capacity
  labels:
    karpenter.sh/nodepool: general-purpose
  annotations:
    karpenter.sh/do-not-disrupt: "true"
spec:
  nodeClassRef:
    name: default
  requirements:
  - key: kubernetes.io/arch
    operator: In
    values:
    - amd64
  - key: kubernetes.io/os
    operator: In
    values:
    - linux
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - on-demand
  - key: node.kubernetes.io/instance-type
    operator: In
    values:
    - Standard_D32_v3
  resources:
    requests:
      cpu: 2310m
      memory: 725280Ki
      pods: "7"
status: {}

@njtran
Contributor

njtran commented Feb 7, 2024

@chamarakera how bad is the pod latency you're describing here? Do you have a long bootstrap/startup time on your instance? What would be an acceptable amount of pod latency to not need prewarmed instances?

@chamarakera

chamarakera commented Feb 8, 2024

@Bryce-Soghigian - To use NodeClaims, we would need to generate several NodeClaim resources depending on the cluster's size. I would like to see a configurable option within the NodePool itself to provision nodes prior to the migration. I think, having a way to configure min size/max size parameters in the NodePool itself would be a good solution.

@njtran - Startup usually takes 1-2 minutes. This is OK in non-prod, but in production I would like pending pods to be scheduled on a node as soon as possible (within a few seconds).

@cdenneen

cdenneen commented Feb 8, 2024 via email

@garvinp-stripe
Contributor

garvinp-stripe commented Feb 16, 2024

After chatting a bit in Slack on the Karpenter channel with awiesner4, I think I have some thoughts around this problem.

First, I want to bring up Karpenter's primary objective and then break down Karpenter's current responsibilities; maybe this will help drive the design choice. I think Karpenter's main objective is to be an efficient cluster autoscaler, so leaving around nodes that aren't doing work goes against what Karpenter is trying to achieve. I think adding something that works around that would likely be problematic, because you would have to work around everything Karpenter is built to do.

However, at the moment Karpenter is doing more than just autoscaling, which is where I think the problems and usability issues arise. It autoscales, but it also manages nodes. It takes over how nodes are configured and the lifecycle of nodes, and it closes the door on other things managing nodes.

#742
#688
#740
and this issue

What does this mean? I agree with those who are saying a NodeClaim should be able to create nodes essentially outside of Karpenter's main autoscaling logic, so that we don't change what Karpenter is trying to do (save money). I think that at this time NodePool holds the logic and concept of how Karpenter tries to optimize the cluster, so I don't think manual provisioning should live there. On max/min nodes on NodePool, it was pointed out to me that a NodePool would have no way of knowing what instance types those min nodes should be, and that in order to protect those min nodes Karpenter would have to cut through most of its disruption logic to support "don't drop the node count below min".

That isn't to say that supporting node management or a different autoscaling priority isn't possible, but I think the entity that contains that logic should not be NodePool in its current form. If Karpenter expands NodePool such that it is extensible, _karpenter_nodepool_provider_N, that allows users to group nodeclaims with an intention that differs from the primary objective of Karpenter. Users can create NodePool variants where keeping a min makes sense, where scheduling logic is different, and so on.

From my view, I think it's important to keep Karpenter's main focus clear and clean, because it's complicated enough. But if we allow for more extension on the base concepts, we may be able to support use cases that fall outside of what Karpenter is trying to do.

@Bryce-Soghigian
Member

Bryce-Soghigian commented Feb 16, 2024

> However, at the moment is doing more than just autoscaling which is where I think problems and usability issue arises.

There are mentions of node autohealing, using budgets to manage upgrades, etc. It seems it's moving toward being a node lifecycle management tool. It has more value than just autoscaling. So I am for a static provisioning CR that helps manage the lifecycle of static pools.

@SatishReddyBethi

Hi. I am really looking forward to this feature too. Is there an open PR for it? Or is it still in the discussions phase?

@cloudwitch

We have 9 minimum nodes in our ASG for a batch job workload that gets kicked off by users through a UI.

The users find the EC2 spin-up time unacceptable and expect their pod to spin up quicker than the EC2 can start.

We must have a way to run a minimum number of nodes in a nodepool.

@sftim

sftim commented Apr 12, 2024

> We must have a way to run a minimum number of nodes in a nodepool.

You can already do that (run low-priority placeholder Pods), but AFAIK there's no controller that does exactly this.
Maybe I'll put some time in and try to write one.

@tuxillo

tuxillo commented Apr 12, 2024

Same case here: we need to have a minimum set of nodes, which should be evenly spread among the AZs (AWS). We want to always have extra capacity available for workload peaks, and most of the time we can't wait for the spin up/down dance.

@sftim not sure but maybe the cluster autoscaler provides something similar? About the low-prio placeholder Pods, have you seen any good guide to do it? Sounds like a hack tho.

@z0rc

z0rc commented Apr 12, 2024

Cluster Autoscaler documents how to overprovision cluster to offset node provisioning by running preemptable pods. https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-configure-overprovisioning-with-cluster-autoscaler

At the same time, this solution isn't exclusive to Cluster Autoscaler; it works just fine with Karpenter and any other potential autoscaler. I wouldn't consider it a hack: it's implemented via stable Kubernetes resources using common practices.

There is a ready-to-use Helm chart at https://github.com/deliveryhero/helm-charts/tree/master/stable/cluster-overprovisioner.
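
For readers who have not seen the pattern, here is a minimal sketch of overprovisioning with preemptible placeholder pods. The resource names, replica count, and request sizes are assumptions chosen for illustration, not values taken from the FAQ or the chart. The placeholder pods run at a priority below the default of 0, so the scheduler evicts them as soon as a real workload needs the room; the evicted placeholders then go Pending, which prompts the autoscaler to add a node.

```yaml
# Priority below the default pod priority (0), so real workloads preempt these pods.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning          # hypothetical name
value: -10
globalDefault: false
description: "Priority class for placeholder pods that reserve headroom."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning          # hypothetical name
  namespace: default
spec:
  replicas: 2                     # number of headroom "slots" (illustrative)
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning
      terminationGracePeriodSeconds: 0
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "1"            # size of each slot; tune to the pods you expect to land here
              memory: 1Gi
```

The total headroom is roughly `replicas` times the per-pod requests, so it is expressed in CPU and memory rather than in a node count.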

@tallaxes

Cluster Autoscaler now also supports ProvisioningRequest CRD

@evanlivingston

There is an additional financial impact caused by the de facto solution for a warm pool of capacity, overprovisioning. When high-priority pods are scheduled, preempting the overprovision workloads, Karpenter will scale the nodegroup immediately in order to regain capacity to reschedule the overprovision workloads. In some cases it may be preferable for the headroom to be elastic, such that the desired headroom is restored after a deployment or batch job completes. This could be accomplished by setting a desired range of overprovisioned resources. But I also believe @sftim's minimumPodPriority solution is an acceptable approach.

@jwcesign
Contributor

jwcesign commented May 21, 2024

> Cluster Autoscaler now also supports ProvisioningRequest CRD

Does Karpenter have a plan to implement this? It's really helpful for AI workloads.

@riiv-hexagon

+1

@ellistarn
Contributor Author

Cross ref-ing the ProvisioningRequest ask here: #742 (comment)

@jackfrancis
Contributor

cc @raywainman who is tracking warm replicas stories on behalf of WG Serving here:

https://docs.google.com/document/d/1QsN4ubjerEqo5L4bQamOFFS2lmCv5zNPis2Z8gcIITg

@jonathan-innis
Member

For everyone's context, we did a little bit of ideating and came up with an API that we were pretty happy with from the Karpenter side (see https://github.com/jonathan-innis/karpenter-headroom-poc). We're having an open discussion with the CAS folks about the difference between how we are thinking about the Headroom API and the ProvisioningRequest API. Feel free to take a look and comment on the doc if you have any thoughts: https://docs.google.com/document/d/1SyqStWUt407Rcwdtv25yG6MpHdNnbfB3KPmc4zQuz1M/edit?usp=sharing

@balmha

balmha commented Jul 22, 2024

Hi guys, how's it going? I have been looking for a fixed provisioning solution with Karpenter.

We need to have a minimum number of on-demand nodes (I'd say 2, one per AZ) and then scale the rest of the workload up and down with spot instances. I haven't found a solution yet. We tried to play with "On-Demand/Spot Ratio Split", but it didn't work and Karpenter keeps spreading Spot instances across the whole workload.

Any workarounds or thoughts on how to solve this? We really want to use Karpenter fully for our workload.

@jwcesign
Contributor

Hi, @balmha

Based on Karpenter, we developed a feature to ensure a minimum number of non-spot replicas for each workload, as illustrated below:
image

Under the hood, it's a webhook component that monitors the distribution of each workload and modifies the pods' affinity to prefer spot instances while requiring some to run on-demand. You can check it out here: CloudPilot Console.

This is not a promotion, just a technical communication.

@gazal-k

gazal-k commented Jul 22, 2024

> We need to have a minimum amount of on-demand nodes (I'd say 2, one per AZ) and then scale-up/down with spot instances the rest of the workload, I haven't found a solution yet, we tried to play with "On-Demand/Spot Ratio Split" but didn't work and continues spreading Spot instances for the whole workload.

You don't need this issue resolved for your use case @balmha.

  • Create 1 NodePool with requirement karpenter.sh/capacity-type set to ["on-demand"] and higher weight. Set limits to control how much on-demand "base capacity" you need.
  • Create another NodePool with requirement karpenter.sh/capacity-type set to ["spot", "on-demand"] (it'll schedule spot nodes and only fall back to on-demand in case of availability issues) with lower weight.

Even better, you can let your workloads that can't tolerate interruptions, set nodeSelectors or affinity to run on on-demand nodes. That's something which is cleaner to do with Karpenter than on CA.
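
A minimal sketch of the two NodePools described above, assuming the `karpenter.sh/v1beta1` API used elsewhere in this thread. The pool names, the weights, the CPU limit, and the `default` node class are illustrative assumptions, and field names may differ in other Karpenter versions. Keep in mind that `limits` is a cap on how much the on-demand pool may provision, not a guaranteed floor.

```yaml
# Higher-weight pool: tried first, on-demand only, capped by limits.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: on-demand-base            # hypothetical name
spec:
  weight: 100                     # higher weight is preferred
  limits:
    cpu: "16"                     # illustrative cap on on-demand "base capacity"
  template:
    spec:
      nodeClassRef:
        name: default             # assumption: an existing node class called "default"
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
---
# Lower-weight pool: used once the base pool reaches its limits.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: spot-burst                # hypothetical name
spec:
  weight: 10
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
```

Workloads that cannot tolerate interruption can additionally set a `nodeSelector` on `karpenter.sh/capacity-type: on-demand`, as suggested above.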

@stevehipwell

@jonathan-innis I really like the look of the headroom APIs, I think they'd cover the requirements I was talking about in #993. Is there a separate issue to track the headroom APIs? Google docs are blocked from our corporate machines.

@Scrat94

Scrat94 commented Sep 29, 2024

Is this still planned to be implemented? Any kind of ETA?

Our use case would also greatly benefit from it: we scale up GitLab Runners for our organization. However, the cold starts (60-90 secs) are not ideal for CI/CD. Having 1-2 nodes always available as headroom would make sure that every incoming pipeline starts immediately. Ideally, the minimum number of nodes could be set for a specific schedule (e.g. office hours only), so that during nights and weekends it ramps down to zero to minimize costs.
(Or is there any other approach I can choose to fulfill my use case with e.g. EKS + Karpenter?)

@viettranquoc

Our team uses blue/green deployments and always needs to wait a few minutes for the new nodes to come up. Karpenter then disrupts nodes, because after the deployment there is more capacity than the workload needs.
It would be great if we could configure a buffer of CPU or memory. For example, if we constantly use 1000m of CPU, the target would be 1200m, which leaves a 20% buffer for deployments and running cronjobs.

@woehrl01

woehrl01 commented Nov 2, 2024

One suggestion for always having a fixed set of nodes available is a Deployment with proper topologySpreadConstraints plus a PDB, to force one node per replica (set very low resource requirements).

Don't set a lower priority, so the pods won't get preempted (as in the prewarm/headroom scenario mentioned before).

If you combine this with KEDA, you can even do that dynamically based on external factors, e.g. cron time.
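
A minimal sketch of this idea, under stated assumptions: the `node-holder` name, the replica count, the time zone, and the office-hours schedule are all made up for illustration. The required pod anti-affinity is an addition not mentioned in the comment; it is included here because a hostname spread constraint on its own may not stop two replicas from sharing a node when few nodes exist. The KEDA `cron` trigger scales the holder Deployment to two replicas during the window and back to zero outside it.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: node-holder               # hypothetical name
spec:
  replicas: 2                     # one replica per node you want kept around
  selector:
    matchLabels:
      app: node-holder
  template:
    metadata:
      labels:
        app: node-holder
    spec:
      # No priorityClassName: default priority, so these pods are not preempted.
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: node-holder
      affinity:
        podAntiAffinity:          # assumption: added to forbid two holders on one node
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: node-holder
              topologyKey: kubernetes.io/hostname
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: 10m            # very low requests; the pod only pins the node
              memory: 16Mi
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: node-holder
spec:
  maxUnavailable: 0               # blocks voluntary evictions of the holder pods
  selector:
    matchLabels:
      app: node-holder
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: node-holder-office-hours  # hypothetical name
spec:
  scaleTargetRef:
    name: node-holder
  minReplicaCount: 0
  maxReplicaCount: 2
  triggers:
    - type: cron
      metadata:
        timezone: Europe/Berlin   # illustrative
        start: 0 8 * * 1-5        # weekdays 08:00
        end: 0 18 * * 1-5         # weekdays 18:00
        desiredReplicas: "2"
```

One trade-off to be aware of: a PDB with `maxUnavailable: 0` also blocks ordinary node drains, so it may need to be relaxed during node upgrades.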

@m00lecule

m00lecule commented Dec 16, 2024

I believe Karpenter should follow the Kubernetes Deployment design and implement the following rolling update fields at the NodePool CRD level:

In general, those options should take effect during NodePool reconciliation, and everybody would be happy. The project's goal is to replace EKS node groups, which means that some basic Auto Scaling group features should be mapped into Karpenter, and rolling updates with high availability are essential.
