Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Offerings to InstanceType to hold zone and capacity type, remove OS from InstanceType, promote capacity-type label #780

Merged
merged 11 commits into from
Nov 10, 2021

Conversation

eptiger
Copy link
Contributor

@eptiger eptiger commented Nov 4, 2021

1. Issue, if available:
Part of #371

2. Description of changes:

  • OS is not an instance type level concept - this seems to have been shoehorned in and causes unnecessary confusion. Removing it until we figure out proper support for OS constraints.
  • When we do ICE caching we need to be able to filter out combinations of zone and capacity type when we see ICEs. Since packing otherwise does not care about them it seemed preferable to put a slice of these tuples ("Offerings") inside InstanceType rather than having to permute all the possible combinations in bin packing (i.e. if there are 3 zones and 2 capacity types then we can still consider just a single InstanceType for an actual instance type with this Offering re-factoring rather than having to create 5 InstanceType instances in the status quo to cover the permutations of zone and capacity type).
  • Added a note on tailing logs to the dev guide. Maybe obvious to those of you very familiar with K8s, but was not obvious to me =P
  • In talking to Ellis, decided to bite the bullet and re-name the capacity-type label to make it a first class citizen rather than being relegated to the AWS cloud provider. This simplifies some of the scheduler code a little and honestly just makes more sense.

I'm not entirely sure if I need to add additional unit tests at this time - I think we're covered with what's there but open to feedback.

3. Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: link to issue
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Sorry, something went wrong.

@netlify
Copy link

netlify bot commented Nov 4, 2021

✔️ Deploy Preview for karpenter-docs-prod canceled.

🔨 Explore the source changes: 4652547

🔍 Inspect the deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/618c0f66cd96c400085956c3

@eptiger eptiger changed the title Add Offerings to InstanceType to hold zone and capacity type, remove OS Add Offerings to InstanceType to hold zone and capacity type, remove OS from InstanceType, promote capacity-type label Nov 4, 2021
for _, offering := range instanceType.Offerings() {
zone := offering.Zone
// we can't assume that all zones will be available for all capacity types, hence this check
// (we shouldn't ever have duplicate zones, just being paranoid to create a well-formed set of overrides)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This type of check is like wearing kevlar in your own living room in case you accidentally shoot yourself. There's no customer exposed surface here and Offerings() is a set (right?). IMO we should skip the zone uniqueness check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Offerings() is not a set, it's a slice, so I fundamentally/philosophically disagree with you here. A duplicate is very much possible. A bug could easily be introduced where we have duplicate offerings. Because there's no constructor like Java, it's not like I can validate the initialization of the InstanceType struct (correct me if I'm wrong here), so there's nothing to maintain the invariant that the Offerings are all unique, unless I invent a set type specifically for Offering, which seems like way more work than adding 2 lines of code here.

The downside if I'm wrong and we never have a duplicate is we have an extra set that has a handful of elements in it and an extra conditional check. The downside if you're wrong is that users will fail to scale out because of invalid LT overrides.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hear ya'. I don't think there's going to be a great solution to this type of thing until Go adds generics to the language, at which point it would be pretty easy to make Offerings() return a set. FWIW, I think that the "house style" of a lot of Go code (in Kubernetes land) is to assume validation has occurred elsewhere and not keep checking invariants. Personally, I would suggest you just make the call here whether to keep or not, although I would probably remove the check myself.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ellis and I came up with a good compromise - I'm going to move the validation to the instance type provider. Update coming momentarily.

Team work makes the dream work 🌈

@@ -65,16 +65,15 @@ IDs.](https://docs.aws.amazon.com/ram/latest/userguide/working-with-az-ids.html)

### Capacity Type

- key: `node.k8s.aws/capacity-type`
- key: `karpenter.sh/capacity-type`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to be a lil' careful on updating docs before releasing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not following your comment here. Are you saying I shouldn't update the docs? Or is there some other issue with my doc update?

This change is essentially backwards incompatible so it's important IMO that the docs match the functionality.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docs are published immediately when the PR is merged, but we won't release a new version immediately when merged. So the docs will be updated with the new karpenter.sh/capacity-type but our latest published container image will not include it. Anyone following the docs between that time will get errors.

It might be better to open a second PR w/ the docs changes and label it with the release to coordinate the docs release at the same time as the artifact releases? We don't have a great way of handling it today :(

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's a PITA. Should we have a doc branch then? I don't know what to do with this second PR exactly - what does approval of it mean, because we can't pull it until we're ready to cut a release, right? And then, if there are multiple doc changes from multiple PRs someone has to go through the fun exercise of resolving conflicts. Am I missing something?

Is there some special syntax with labeling the commit? Do you mean just putting v0.5 in the commit message? Or do you mean adding the label in the PR in the GitHib UI? If so, I also have no idea how to do this - on this very PR on the right side the labels say "None yet" and there's no edit button or anything.

I'll back this change out, but I don't know how we make sure everyone currently contributing to Karpenter knows about this in the interim because it is a backwards incompatible change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re: labeling - am I supposed to do something like this?

git tag v0.4.2 <commit_sha>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant in the github UI, you should have labeling permissions. Maybe we should just cut a v0.4.3 release when this gets merged since we are in a state of rapid iteration here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty sure I don't have those permissions, how can I get them? I should already be in awslabs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eptiger I just made you an Admin on this repo, so hopefully that'll work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh Brandon gave me the permissions I needed, but I don't see a label for v0.4.3 yet - should I create one?

@geoffcline
Copy link
Contributor

I appreciate you including docs changes, it's helping me keep the thread.

func ExpectPodNotScheduled(c client.Client, name string, namespace string) {
pod := &v1.Pod{}
Expect(c.Get(context.Background(), client.ObjectKey{Name: name, Namespace: namespace}, pod)).To(Succeed())

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: empty line

Copy link
Contributor

@ellistarn ellistarn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't had a chance to fully review this. Feel free to push forward w/ Jacob's review. I agree with the high level direction.

@@ -59,6 +60,15 @@ func ExpectNotFound(c client.Client, objects ...client.Object) {
}
}

func ExpectPodNotScheduled(c client.Client, name string, namespace string) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scheduling only happens to pods, so you could simplify to ExpectNotScheduled

@@ -59,6 +60,15 @@ func ExpectNotFound(c client.Client, objects ...client.Object) {
}
}

func ExpectPodNotScheduled(c client.Client, name string, namespace string) {
pod := &v1.Pod{}
Expect(c.Get(context.Background(), client.ObjectKey{Name: name, Namespace: namespace}, pod)).To(Succeed())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could use pod = ExpectPodExists() here

@bwagner5 bwagner5 mentioned this pull request Nov 9, 2021
3 tasks
bwagner5
bwagner5 previously approved these changes Nov 9, 2021
Copy link
Contributor

@bwagner5 bwagner5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

//TODO filter out possible zones and capacity types using an ICE cache https://github.com/awslabs/karpenter/issues/371
offerings := []cloudprovider.Offering{}
for zone := range subnetZones.Intersection(instanceTypeZones[instanceType.Name()]) {
for _, capacityType := range instanceType.SupportedUsageClasses {
Copy link
Contributor

@ellistarn ellistarn Nov 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sugg: may want to protect against uniqueness here
for capacityType := range sets.NewString(aws.StringValue(instanceType.SupportedUsageClasses)...)

Copy link
Contributor

@ellistarn ellistarn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one final comment

ellistarn
ellistarn previously approved these changes Nov 10, 2021
Copy link
Contributor

@ellistarn ellistarn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! I noticed a couple of optional things. You should feel totally free to merge as is. Happy to approve again in case you want to tweak.

@@ -172,7 +172,12 @@ func (p *InstanceProvider) getLaunchTemplateConfigs(ctx context.Context, constra
func (p *InstanceProvider) getOverrides(instanceTypeOptions []cloudprovider.InstanceType, subnets []*ec2.Subnet, capacityType string) []*ec2.FleetLaunchTemplateOverridesRequest {
var overrides []*ec2.FleetLaunchTemplateOverridesRequest
for i, instanceType := range instanceTypeOptions {
for zone := range instanceType.Zones() {
for _, offering := range instanceType.Offerings() {
zone := offering.Zone
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optional: I might just inline this where its used. I don't think there's much of a perf angle, and it adds an extra concept to think about as you read.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this was an artifact of the uniqueness check

if constraints.Requirements.CapacityTypes().Has(o.CapacityType) {
if constraints.Requirements.Zones().Has(o.Zone) {
zone = o.Zone
capacityType = o.CapacityType
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to break here? Does it matter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh shoot yeah, good catch. I mean it ultimately doesn't matter, but yeah I was supposed to. I don't know how I missed that =P

amdGPUs resource.Quantity
awsNeurons resource.Quantity
name string
zones sets.String
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

possible to remove zone/capacityTypes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Darn it, in the midst of test debugging I missed this - thanks!

if schedule.Requirements.OperatingSystems().Intersection(p.OperatingSystems()).Len() == 0 {
return fmt.Errorf("operating system %s is not in %v", p.OperatingSystems(), schedule.Requirements.OperatingSystems().List())
func (p *Packable) validateZones(schedule *scheduling.Schedule) error {
packableZones := sets.String{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional:

Not sure if I mentioned earlier, but go style would be to let the "packable" piece of this variable be inferred by context of it being within a Packable struct. This is known as "stuttering". Sometimes if you have multiple concepts that collide (i.e. two different sources of zone), you need to get a bit clever about naming.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is I think actually an artifact of me doing something stupidly complicated here where I had two sets of zones and had to differentiate them. Fixed.

ellistarn
ellistarn previously approved these changes Nov 10, 2021
@eptiger eptiger merged commit 68e1981 into aws:main Nov 10, 2021
@eptiger eptiger deleted the ice-cache-2 branch November 10, 2021 18:45
gfcroft pushed a commit to gfcroft/karpenter-provider-aws that referenced this pull request Nov 25, 2023

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants