
Adds design doc to discuss Karpenter limits #788

Merged

suket22 merged 3 commits into aws:main from limitsDesign on Nov 19, 2021
Conversation

@suket22 suket22 (Contributor) commented Nov 9, 2021

Discusses limits to the amount of resources that Karpenter provisions. I've left implementation details out of here but wanted to document and align on our thinking about this before I move forward with code changes.

1. Issue, if available:
N/A

2. Description of changes:
Just a doc.

3. Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: link to issue
  • No

Docs will come in once we've implemented the thing.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@netlify netlify bot commented Nov 9, 2021

✔️ Deploy Preview for karpenter-docs-prod ready!

🔨 Explore the source changes: b2f9d2f

🔍 Inspect the deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/618d73539051a50007c86158

😎 Browse the preview: https://deploy-preview-788--karpenter-docs-prod.netlify.app

@bwagner5 bwagner5 changed the title Adds doc to discuss Karpenter limits Adds design doc to discuss Karpenter limits Nov 10, 2021
@bwagner5 bwagner5 added the needs-design Design required label Nov 10, 2021
Contributor

@bwagner5 bwagner5 left a comment

Nice design! I have a few questions and comments.


```yaml
spec:
  limits:
```
Contributor

This will get hard to manage if users have a lot of provisioners, especially if we end up having pods that could schedule to multiple provisioners. I wonder if we should also have an optional limit, passed directly to the Karpenter binary, that would cap the total amount of resources Karpenter is managing across all provisioners.
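For illustration, a global cap passed to the controller binary might look something like the snippet below. This is a sketch only: the `--global-limits` flag is hypothetical and was not an implemented Karpenter option at the time of this discussion, and the image tag is assumed.

```yaml
# Hypothetical sketch: a cluster-wide resource cap passed to the Karpenter
# controller binary, applying across all provisioners. The --global-limits
# flag is illustrative only, not a real Karpenter flag.
spec:
  containers:
    - name: controller
      image: public.ecr.aws/karpenter/controller:v0.5.0 # assumed image/tag
      args:
        - --global-limits=cpu=10000,memory=10000Gi
```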

Contributor Author

Oh right, that makes a lot of sense. I don't think I considered what limits would mean with multiple provisioners.
I like the idea of having a single global limit we pass directly to the controller binary. I'll add it in.

Contributor Author

We discussed this a little more in the working group, and @ellistarn raised some other concerns with establishing a cluster-wide limit, like what if there are multiple Karpenter installations, etc.

It sounds like we have a pretty compelling use case for having limits per provisioner but not so sure about the cluster-wide case just yet. Should we punt on that for now?


This is a much simpler semantic to follow: you provide an upper cap on the amount of resources being provisioned, irrespective of their health. This is more in line with alternative solutions like Cluster Autoscaler (CA) and should be easier to reason about.

The actual limit on the amount of resources being spun up can be defined either through something similar to resource requests or as a flat count of instances.
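For concreteness, a resource-request-style limit on a provisioner might look like the sketch below. The exact field names were still open at this point in the discussion, so treat them as illustrative (this is close to the `spec.limits.resources` shape the v1alpha5 Provisioner eventually adopted):

```yaml
# Sketch of per-provisioner limits, expressed like resource requests.
# Field names are illustrative of the shape under discussion.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  limits:
    resources:
      cpu: "1000"    # total vCPUs this provisioner may have provisioned
      memory: 1000Gi # total memory this provisioner may have provisioned
```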
Contributor

Should we consider supporting number of pods as an upper limit?

Contributor

Also, if we're going to do absolute limits, should we break it out by capacity type? You may, for example, not want any limit on Spot capacity but have a restrictive limit on on-demand capacity. This would be useful when you're trying to maintain high availability, but only within a reasonable cost.

Contributor Author

@suket22 suket22 commented Nov 10, 2021

> Should we consider supporting number of pods as an upper limit?

I'm not sure how I'd be able to set that value reasonably. In a cluster I'd probably have multiple deployments running with HPA behind the scenes, so it'll be difficult for me to guess an appropriate bound. I also think that since Karpenter provisions your infrastructure, modeling its limits as infrastructure characteristics like CPU / memory / GPU would be a little more intuitive. Limiting the number of pods could be left to pod autoscalers like HPA?

> Also, if we're going to do absolute limits, should we break it out by capacity type?

Ooh, this would be nice. I know we just made capacity-type a well-known concept too, so I wouldn't be against it. Although I suppose we could be even more generic, and also break limits down by other well-known constraints like availability zone? With AZs, we could effectively enforce that Karpenter does some form of balancing across zones for higher availability. I wonder if this is too much complexity to start with?

Contributor

I think the balancing is probably a bit advanced and overloads this limits concept a bit, but allowing the limit to be set per capacity type or across all capacity types still seems reasonable. Especially since a big part of why this limit will exist is to control costs, and different capacity types can (and in AWS's case certainly do) have drastically different cost implications.

Contributor Author

Yeah, that's fair on the balancing bit. Following up on the conversation in the working group meeting, though, this might be difficult to represent and use if we keep adding new dimensions to split the limits across. Even with just capacity type, it might end up looking something like:

```yaml
limits:
  resources:
    cpu: 1000 # irrespective of capacity type?
  capacity-type: on-demand
    cpu: 1000
  capacity-type: spot
    cpu: 500
```

or

```yaml
limits:
  resources:
    cpu:
      capacity-types:
        spot: 500
        on-demand: 1000
    mem:
      ...
```

And it'd start to get more complicated from there. Given the other feedback that these limits are mostly meant to bound scaling when something seems really off, and aren't so much about precise cost control, I think we could probably leave this optimization out. Does that sound reasonable to you?

@suket22 suket22 merged commit 2f18231 into aws:main Nov 19, 2021
@suket22 suket22 deleted the limitsDesign branch November 19, 2021 00:01
gfcroft pushed a commit to gfcroft/karpenter-provider-aws that referenced this pull request Nov 25, 2023
Labels
needs-design Design required
5 participants