
Adds design doc to discuss Karpenter limits #788

Merged

suket22 merged 3 commits into aws:main from limitsDesign on Nov 19, 2021
Conversation

@suket22 suket22 (Contributor) commented Nov 9, 2021

Discusses limits to the amount of resources that Karpenter provisions. I've left implementation details out of here but wanted to document and align on our thinking about this before I move forward with code changes.

1. Issue, if available:
N/A

2. Description of changes:
Just a doc.

3. Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: link to issue
  • No

Docs will come in once we've implemented the thing.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@netlify netlify bot commented Nov 9, 2021

✔️ Deploy Preview for karpenter-docs-prod ready!

🔨 Explore the source changes: b2f9d2f

🔍 Inspect the deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/618d73539051a50007c86158

😎 Browse the preview: https://deploy-preview-788--karpenter-docs-prod.netlify.app

@bwagner5 bwagner5 changed the title Adds doc to discuss Karpenter limits Adds design doc to discuss Karpenter limits Nov 10, 2021
@bwagner5 bwagner5 added the needs-design Design required label Nov 10, 2021
Contributor

@bwagner5 bwagner5 left a comment

Nice design! I have a few questions and comments.


```yaml
spec:
  limits:
```
Contributor

This will get hard to manage if users have a lot of provisioners, especially if we end up having pods that could schedule to multiple provisioners. I wonder if we should also have an optional limit, passed directly to the Karpenter binary, that would cap the total amount of resources Karpenter is managing across all provisioners.
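For illustration, a global cap passed to the controller binary might look something like the snippet below. This is a sketch only: the `--global-limits` flag is hypothetical and was not an implemented Karpenter option at the time of this discussion, and the image tag is assumed.

```yaml
# Hypothetical sketch: a cluster-wide resource cap passed to the Karpenter
# controller binary, applying across all provisioners. The --global-limits
# flag is illustrative only, not a real Karpenter flag.
spec:
  containers:
    - name: controller
      image: public.ecr.aws/karpenter/controller:v0.5.0 # assumed image/tag
      args:
        - --global-limits=cpu=10000,memory=10000Gi
```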

Contributor Author

Oh right, that makes a lot of sense. I don't think I considered what limits would mean with multiple provisioners.
I like the idea of having a single global limit we pass directly to the controller binary. I'll add it in.

Contributor Author

We discussed this a little more in the working group, and @ellistarn raised some other concerns with establishing a cluster-wide limit, like what if there are multiple Karpenter installations, etc.

It sounds like we have a pretty compelling use case for having limits per provisioner but not so sure about the cluster-wide case just yet. Should we punt on that for now?


This is a much simpler semantic to follow: you provide an upper cap on the amount of resources being provisioned, irrespective of their health. This is more in line with alternative solutions like Cluster Autoscaler (CA) and should be easier to reason about.

The actual limit on the amount of resources being spun up can be defined either through something similar to resource requests or as a flat count of instances.
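For concreteness, a resource-request-style limit on a provisioner might look like the sketch below. The exact field names were still open at this point in the discussion, so treat them as illustrative (this is close to the `spec.limits.resources` shape the v1alpha5 Provisioner eventually adopted):

```yaml
# Sketch of per-provisioner limits, expressed like resource requests.
# Field names are illustrative of the shape under discussion.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  limits:
    resources:
      cpu: "1000"    # total vCPUs this provisioner may have provisioned
      memory: 1000Gi # total memory this provisioner may have provisioned
```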
Contributor

Should we consider supporting number of pods as an upper limit?

Contributor

Also, if we're going to do absolute limits, should we break it out by capacity type? You may, for example, not want any limit on Spot capacity but have a restrictive limit on on-demand capacity. This would be useful when you're trying to maintain high availability, but only within a reasonable cost.

Contributor Author

@suket22 suket22 commented Nov 10, 2021

> Should we consider supporting number of pods as an upper limit?

I'm not sure how I'd be able to set that value reasonably. In a cluster I'd probably have multiple deployments running with HPA behind the scenes, so it'll be difficult for me to guess an appropriate bound. I also think that since Karpenter provisions your infrastructure, modeling its limits as infrastructure characteristics like CPU / memory / GPU would be a little more intuitive. Limiting the number of pods could be left to pod autoscalers like HPA?

> Also, if we're going to do absolute limits, should we break it out by capacity type?

Ooh, this would be nice. I know we just made capacity-type a well-known concept too, so I wouldn't be against it. Although I suppose we could be even more generic, and also break limits down by other well-known constraints like availability zone? With AZs, we could effectively enforce that Karpenter does some form of balancing across zones for higher availability. I wonder if this is too much complexity to start with?

Contributor

I think the balancing is probably a bit advanced and overloads this limits concept a bit, but allowing the limit to be set per capacity type or across all capacity types still seems reasonable. Especially since a big part of why this limit will exist is to control costs, and different capacity types can (and in AWS's case certainly do) have drastically different cost implications.

Contributor Author

Yeah, that's fair on the balancing bit. Following up on the conversation in the working group meeting, though, this might be difficult to represent and use if we keep adding new dimensions to split the limits across. Even with just capacity type, it might end up looking something like:

```yaml
limits:
  resources:
    cpu: 1000 # irrespective of capacity type?
  capacity-type: on-demand
    cpu: 1000
  capacity-type: spot
    cpu: 500
```

or

```yaml
limits:
  resources:
    cpu:
      capacity-types:
        spot: 500
        on-demand: 1000
    mem:
      ...
```

And it'd start to get more complicated from there. Given the other feedback that these limits are mostly meant to bound scaling when something seems really off, and aren't so much about precise cost control, I think we could probably leave this optimization out. Does that sound reasonable to you?

@suket22 suket22 merged commit 2f18231 into aws:main Nov 19, 2021
@suket22 suket22 deleted the limitsDesign branch November 19, 2021 00:01
gfcroft pushed a commit to gfcroft/karpenter-provider-aws that referenced this pull request Nov 25, 2023
Labels
needs-design Design required
5 participants