
remove the specialness from GPU requests #1489

Merged (4 commits, Mar 21, 2022)

Conversation

@tzneal (Contributor) commented Mar 9, 2022

1. Issue, if available:

N/A

2. Description of changes:

This modifies Karpenter to treat all resource requests identically. It
removes the special-case logic for some AWS-specific GPU types and
allows CloudProvider implementers to provide their own custom
resource types, along with an ordering used as a hint for binpacking
about which instance types to prefer (see the sketch after this checklist).

3. How was this change tested?

Unit tests & deploying GPU/non-GPU workloads. Quota limits prevented GPU instances from being
created, but I could see the requests.

4. Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: link to issue
  • No
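
To make the description concrete, here is a minimal sketch of the direction described above. All names in it (CustomResource, CloudProvider.CustomResources, exampleProvider, example.com/gpu) are hypothetical illustrations, not Karpenter's actual API: a provider advertises its vendor-specific resource types plus a relative weight the binpacker can use as a preference hint.

```go
package main

import "fmt"

// CustomResource is a hypothetical descriptor a cloud provider could return
// for a vendor-specific resource (e.g. a GPU count), together with a relative
// weight used purely as a binpacking hint, not a real price.
type CustomResource struct {
	Name   string  // resource name as it appears in pod requests
	Weight float64 // higher weight => prefer instance types without this resource
}

// CloudProvider is a hypothetical slice of the interface: implementers list
// the custom resources they understand so the core scheduler needs no
// vendor-specific knowledge.
type CloudProvider interface {
	CustomResources() []CustomResource
}

type exampleProvider struct{}

func (exampleProvider) CustomResources() []CustomResource {
	return []CustomResource{
		{Name: "example.com/gpu", Weight: 10},
		{Name: "example.com/inference-chip", Weight: 5},
	}
}

func main() {
	var cp CloudProvider = exampleProvider{}
	for _, r := range cp.CustomResources() {
		fmt.Printf("%s weight=%v\n", r.Name, r.Weight)
	}
}
```

The weight plays the same role as the "relative weighting" comment discussed later in the review: it orders instance types for binpacking, it is not a dollar amount.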

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@netlify (bot) commented Mar 9, 2022

✅ Deploy Preview for karpenter-docs-prod canceled.

🔨 Explore the source changes: 908be5e

🔍 Inspect the deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/6238cd1e12a6f200081a3611

pkg/cloudprovider/types.go (outdated review thread, resolved)
@bwagner5 (Contributor) left a comment:

Nice! It is nice to get the AWS stuff out of the core code!

pkg/cloudprovider/aws/instancetype.go (outdated review thread, resolved)
pkg/cloudprovider/types.go (outdated review thread, resolved)
pkg/controllers/provisioning/scheduling/scheduler.go (outdated review thread, resolved)
pkg/utils/resources/resources.go (outdated review thread, resolved)
@tzneal force-pushed the make-gpu-requests-generic branch 6 times, most recently from 6560c2e to 1e148d7 on March 17, 2022 15:29
@@ -58,6 +60,9 @@ const (

func init() {
v1alpha5.NormalizedLabels = functional.UnionStringMaps(v1alpha5.NormalizedLabels, map[string]string{"topology.ebs.csi.aws.com/zone": v1.LabelTopologyZone})
A reviewer (Contributor) commented:

Man it would be sweet to register NormalizedLabels in the same way!
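
For readers skimming the diff above: the provider merges its own label aliases into the core NormalizedLabels map at package init. Below is a self-contained sketch of that union-at-init pattern using plain maps rather than Karpenter's functional.UnionStringMaps helper; the names normalizedLabels and registerNormalizedLabels are illustrative, not the project's actual identifiers.

```go
package main

import "fmt"

// normalizedLabels maps provider-specific label keys to their well-known
// Kubernetes equivalents; core code consults only this map.
var normalizedLabels = map[string]string{}

// registerNormalizedLabels merges provider-supplied aliases into the core map,
// mirroring the UnionStringMaps call in the diff above.
func registerNormalizedLabels(aliases map[string]string) {
	for k, v := range aliases {
		normalizedLabels[k] = v
	}
}

func init() {
	// The AWS provider normalizes the EBS CSI zone label to the standard
	// topology zone label, as in the diff.
	registerNormalizedLabels(map[string]string{
		"topology.ebs.csi.aws.com/zone": "topology.kubernetes.io/zone",
	})
}

func main() {
	fmt.Println(normalizedLabels)
}
```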

@tzneal force-pushed the make-gpu-requests-generic branch 2 times, most recently from 85cd89a to 2f3a0fe on March 18, 2022 17:38
@tzneal requested review from ellistarn and bwagner5 on March 18, 2022 17:41
@tzneal force-pushed the make-gpu-requests-generic branch 2 times, most recently from dfc4d26 to 0a91fdd on March 19, 2022 02:14
}
}

// These are meant to give some relative weighting
A reviewer (Contributor) commented:

This comment makes me uneasy as a reader. They're meant to do X, but do they? Is it accurate to be more concrete?

@tzneal (author) replied:

Maybe I should just remove that comment entirely. Trying to indicate that it's not a real "price", but people should be able to figure that out easily enough.

// creation unless a pod specifically requests this resource type. This is useful for preventing non-GPU workloads
// from possibly scheduling to more expensive GPU instance type, or from causing a GPU instance type to scale up to
// the next larger type due to a non-GPU workload
ResourceFlagMinimizeUsage
A reviewer (Contributor) commented:

I keep thinking about this complexity. I'm wondering if it's enough to purely delineate on WellKnown vs NotWellKnown resource types? If we know about it (e.g., CPU, Memory, EphemeralStorage) then we can use the ResourceFlagNone behavior. If we don't know about it, we can use ResourceFlagMinimizeUsage behavior. This will enable us to simply not configure any of this (for now).

My guess is that we will need to be more sophisticated with binpacking in the future, but I don't think we have enough use cases to have clarity on what the parameters are. Given the currently known resource types, the above suggestion is equivalent in implementation.

Happy to move forward in either direction.

@tzneal (author) replied:

I left a longer comment below, but I'm starting to think this complexity isn't worth it since kube-scheduler won't follow this logic and we're left with using taints/tolerations again which are already built-in.
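
For reference, here is a minimal sketch of the delineation suggested above: well-known resources (CPU, memory, ephemeral storage, pods) keep the default behavior, and anything else gets the minimize-usage behavior. The map, constants, and flagFor helper are simplified stand-ins, not the PR's actual code.

```go
package main

import "fmt"

// wellKnownResources are the resource names the scheduler binpacks normally.
var wellKnownResources = map[string]bool{
	"cpu":               true,
	"memory":            true,
	"ephemeral-storage": true,
	"pods":              true,
}

// Simplified stand-ins for the flags discussed in this thread.
const (
	resourceFlagNone = iota
	resourceFlagMinimizeUsage
)

// flagFor returns the binpacking behavior for a resource name: well-known
// resources get the default behavior, anything else (e.g. a vendor GPU
// resource) is only considered when a pod explicitly requests it.
func flagFor(resourceName string) int {
	if wellKnownResources[resourceName] {
		return resourceFlagNone
	}
	return resourceFlagMinimizeUsage
}

func main() {
	fmt.Println(flagFor("memory"))          // 0: binpack normally
	fmt.Println(flagFor("example.com/gpu")) // 1: minimize usage unless requested
}
```

Under this delineation no per-resource configuration is needed for now, which matches the suggestion that the behavior be inferred rather than declared.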
