Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement ICE caching for AWS cloud provider and fix bug in filtering for SubnetProvider #816

Merged
merged 11 commits into from
Nov 25, 2021

Conversation

eptiger
Copy link
Contributor

@eptiger eptiger commented Nov 19, 2021

1. Issue, if available:
#371

2. Description of changes:
Finally - the exhilirating conclusion to my set of code changes on ice caching! All the changes I've made fit nicely within the AWS cloud provider

  • I added a cache in the InstanceProvider. I debated a bunch with myself for where this should live and decided that it was cleaner to do that than to have the InstanceProvider pass the ICEs back to the cloudprovider to manage it. I liked the idea that the InstanceProvider would encapsulate the interactions with Fleet.
  • After a discussion with Ellis regarding my confusion on the constraints being passed into GetInstanceTypes, I removed the erroneous filtering in the SubnetProvider that was constraining AvailabilityZones (the only place the filtering was supposed to really happen was in the InstanceProvider when creating LT overrides for launch, so I moved the filtering into the InstanceProvider exclusively). I added brief comments to help others understand this better.
  • I hope it's not too controversial that I added AvailabilityZone to our LT overrides. It's redundant but it makes our lives easier for processing ICEs and given that we're not near our Fleet request size limits it feels pretty safe.

I haven't done real cluster testing yet, only unit tests (but I believe Brandon has). Will do a sanity test on a real cluster before pushing. I also didn't add a unit test yet where we fall back to another zone - I may still try to add this when addressing the next round of comments.

3. Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: link to issue
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@eptiger eptiger requested a review from ellistarn November 19, 2021 08:11
@netlify
Copy link

netlify bot commented Nov 19, 2021

✔️ Deploy Preview for karpenter-docs-prod canceled.

🔨 Explore the source changes: 408a81a

🔍 Inspect the deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/619f044900afa600076d5236

}
if len(cleanedOfferings) > 0 {
instanceType.AvailableOfferings = cleanedOfferings
cleanedInstanceTypes = append(cleanedInstanceTypes, &instanceTypes[index])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This syntax seemed funky to me but doing I get a linting error if I just append *instanceType - there's certainly some Go idiosyncrasy I don't follow here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is the issue I mentioned above where the range variable in a for loop uses one memory address and overrides it for each iteration. So you'd need to set another temp variable in the loop, and take the address of it when appending.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I think I've fixed this in my new commit - please let me know if there's a cleaner way that I'm spacing on for how to do this.

pkg/cloudprovider/aws/cloudprovider.go Outdated Show resolved Hide resolved
pkg/cloudprovider/aws/errors.go Outdated Show resolved Hide resolved
pkg/cloudprovider/aws/errors.go Outdated Show resolved Hide resolved
pkg/cloudprovider/aws/instance.go Outdated Show resolved Hide resolved
pkg/cloudprovider/aws/instance.go Outdated Show resolved Hide resolved
pkg/cloudprovider/aws/instance.go Outdated Show resolved Hide resolved
}
if len(cleanedOfferings) > 0 {
instanceType.AvailableOfferings = cleanedOfferings
cleanedInstanceTypes = append(cleanedInstanceTypes, &instanceTypes[index])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is the issue I mentioned above where the range variable in a for loop uses one memory address and overrides it for each iteration. So you'd need to set another temp variable in the loop, and take the address of it when appending.

pkg/cloudprovider/aws/instance.go Outdated Show resolved Hide resolved
if err != nil {
return nil, err
}
return c.instanceProvider.DiscardICEdInstanceTypes(ctx, allInstanceTypes), nil
Copy link
Contributor

@ellistarn ellistarn Nov 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that this is the only place that this is used, does it make sense to hide this behavior inside the instance provider? That is to say, instanceTypeProvider.Get() could inherently factor in unavailable offerings. Is there ever a case where we want to include unavailable offerings on this call?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered that, but it felt messy as I would need to get the cached instance types from instance provider and then hand it to instance type provider (like an exclusion list). Unless you had something more clever in mind? I liked the idea that the ICE cache was entirely encapsulated by the instance provider, but if folks feel strongly I'm happy to disagree and commit if you want to share how you think it should look.

In terms of possibly including unavailable offerings - I can't think of why we'd want to do that. If you have something in mind, happy to hear it out. We would have to modify the overall CloudProvider interface if wanted to make ice caching behavior optional, which then leaks that detail into the higher level abstraction, so I'm inclined to say that we probably wouldn't want to do that.

@@ -36,11 +40,30 @@ import (
"github.com/awslabs/karpenter/pkg/utils/options"
)

const (
iceCacheODTTL = 15 * time.Second
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
iceCacheODTTL = 15 * time.Second
InsufficientCapacityExceptionCacheOnDemandTTL = 15 * time.Second

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the spot and OD cache TTLs should be the same 45 seconds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re: the value - do you want to chat more about this over Slack? I can disagree and commit on this, but I like the idea that we control them separately and that they are actually different values.

Re: the name - just so I understand the repo style for future changes, can I get clarity on when to use short versus long variable names? I took it that the convention in Go was to use short variable names whenever possible, but I must have misunderstood this convention. Is it because ICE is an EC2 acronym whereas TTL is more widely used?

@@ -105,6 +128,37 @@ func (p *InstanceProvider) Terminate(ctx context.Context, node *v1.Node) error {
return nil
}

// The intention is to remove Offerings from the provided possible InstanceTypes based on recently observed ICEs, which will
// indirectly impact the packer to keep it from spinning its wheels on packings that are doomed to fail due to EC2 capacity constraints.
func (p *InstanceProvider) DiscardICEdInstanceTypes(ctx context.Context, instanceTypes []InstanceType) []cloudprovider.InstanceType {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naming suggestion:

Suggested change
func (p *InstanceProvider) DiscardICEdInstanceTypes(ctx context.Context, instanceTypes []InstanceType) []cloudprovider.InstanceType {
func (p *InstanceProvider) WithoutUnavailableOfferings(ctx context.Context, instanceTypes []InstanceType) []InstanceType {

It also might be a bit more idiomatic to do type conversion at the cloud provider interface and just think in terms of aws.InstanceTypes here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While you're probably right, it means that we'll have to loop through all the InstanceType instances in the cloud provider to convert them, IIUC - right? I didn't do this because it seemed like extra effort I could eliminate by doing the conversion here. If you think being idiomatic is more valuable than the tiny performance saving, I can change it - just LMK.

@@ -304,3 +385,7 @@ func combineReservations(reservations []*ec2.Reservation) []*ec2.Instance {
}
return instances
}

func createIceCacheKey(instanceType string, capacityType string) string {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func createIceCacheKey(instanceType string, capacityType string) string {
func iceCacheKey(instanceType string, capacityType string) string {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went with unavailableOfferingsCacheKey

type InstanceProvider struct {
ec2api ec2iface.EC2API
instanceTypeProvider *InstanceTypeProvider
subnetProvider *SubnetProvider
launchTemplateProvider *LaunchTemplateProvider
iceCache *cache.Cache
Copy link
Contributor

@ellistarn ellistarn Nov 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional name suggestion

Suggested change
iceCache *cache.Cache
offeringAvailability *cache.Cache

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about unavailableOfferings?

@@ -176,7 +237,7 @@ func (p *InstanceProvider) getLaunchTemplateConfigs(ctx context.Context, constra
return launchTemplateConfigs, nil
}

func (p *InstanceProvider) getOverrides(instanceTypeOptions []cloudprovider.InstanceType, subnets []*ec2.Subnet, capacityType string) []*ec2.FleetLaunchTemplateOverridesRequest {
func (p *InstanceProvider) getOverrides(instanceTypeOptions []cloudprovider.InstanceType, subnets []*ec2.Subnet, constrainedZones sets.String, capacityType string) []*ec2.FleetLaunchTemplateOverridesRequest {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func (p *InstanceProvider) getOverrides(instanceTypeOptions []cloudprovider.InstanceType, subnets []*ec2.Subnet, constrainedZones sets.String, capacityType string) []*ec2.FleetLaunchTemplateOverridesRequest {
func (p *InstanceProvider) getOverrides(instanceTypeOptions []cloudprovider.InstanceType, subnets []*ec2.Subnet, zones sets.String, capacityType string) []*ec2.FleetLaunchTemplateOverridesRequest {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm gonna push back on this comment. I appreciate that we're trying to be concise, but there's 3 kinds of zones here:

  1. The zones in the instance type offerings representing what is supported
  2. The zones associated with the constraints that constrain what we can choose
  3. The zone associated with the subnet

I think there's value in naming them separately as there is legitimate ambiguity. Do you feel strongly about this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a simplification below that might help with this, but I'm happy either way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compromise: I'm adding a comment because I still think zones is confusing =P You can take your pick between the new comment or me reverting the variable name change, I think we need at least one of the two.

@@ -185,10 +246,14 @@ func (p *InstanceProvider) getOverrides(instanceTypeOptions []cloudprovider.Inst
continue
}
for _, subnet := range subnets {
if aws.StringValue(subnet.AvailabilityZone) == offering.Zone {
subnetZone := aws.StringValue(subnet.AvailabilityZone)
if subnetZone == offering.Zone && (constrainedZones == nil || constrainedZones.Has(subnetZone)) {
Copy link
Contributor

@ellistarn ellistarn Nov 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're guaranteed that constrainedZones won't be nil, since Karpenter has to produce the full set of viable zones for the requirement NotIn operator to work. e.g. us-west-2a not in nil is impossible to compute.

Suggested change
if subnetZone == offering.Zone && (constrainedZones == nil || constrainedZones.Has(subnetZone)) {
if zone := aws.StringValue(subnet.AvailabilityZone); zone == offering.Zone && zones.has(zone) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are saying it's always guaranteed, I'll go ahead and remove the check. I'm hoping that we already have some unit tests that would fail in this case?

zones.(sets.String).Insert(zone)
p.iceCache.Set(cacheKey, zones, cacheTTL)
} else {
p.iceCache.Set(cacheKey, sets.String{zone: sets.Empty{}}, cacheTTL)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you need to include the zone in the cache?

Suggested change
p.iceCache.Set(cacheKey, sets.String{zone: sets.Empty{}}, cacheTTL)
p.iceCache.Set(cacheKey, sets.String{zone: sets.NewString(zones)}, cacheTTL)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, maybe I'm confused but I thought a sets.String was a hash map of strings to empty values. Looking at the source code, the insert does this:

// Insert adds items to the set.
func (s String) Insert(items ...string) String {
	for _, item := range items {
		s[item] = Empty{}
	}
	return s
}

Am I missing something?

}
cacheKey := createIceCacheKey(instanceType, capacityType)
if zones, exists := p.iceCache.Get(cacheKey); exists {
zones.(sets.String).Insert(zone)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because sets.String is a pointer (the underlying type is a map, which is a pointer), you should be able to just read and inject the zone and not call cache.Set() again

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for confirming that - this was admittedly me being lazy about not verifying this suspicion I had =P That being said, don't I need to do a Set() to update the TTL?

cacheKey := createIceCacheKey(instanceType, capacityType)
if zones, exists := p.iceCache.Get(cacheKey); exists {
zones.(sets.String).Insert(zone)
p.iceCache.Set(cacheKey, zones, cacheTTL)
Copy link
Contributor

@ellistarn ellistarn Nov 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this, I think there might be a weird case

  • zone-a is added to the cache, 5m
  • 4 minutes pass
  • zone-b is added to the cache, 5m
  • both zone-a and zone-b are now cached for 5m
  • 1 minute passes
  • zone-a might be available again, but is cached for 4 more minutes

You could potentially store a timestamp in the cache to keep track of this, but it feels a bit complicated to me. It's probably fine in the grand scheme. Happy to discuss this more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I, too, thought of this and decided it was an edge case not worth trying to fix.

The more I thought about it after reading your comment though the more it bothered me. I think I've come up with a solution that isn't perfect but balances ease of use with users not hitting an awful edge case. The fact that I can't make a set of structs in Go makes this very hard to solve otherwise is a nice way.

@@ -75,6 +76,7 @@ var _ = BeforeSuite(func() {
securityGroupProvider: NewSecurityGroupProvider(fakeEC2API),
cache: launchTemplateCache,
},
cache.New(5*time.Second, 30*time.Second),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you export these values you can reference them directly in the tests. Assuming they're not different.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll also likely need to use injectable.Time to test this, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will play with this when I add tests. I thought lower values were better for testing, unless you think testing TTL is excessive.

Copy link
Contributor Author

@eptiger eptiger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About to post the next update, just give me a few minutes to run through the checks.

if err != nil {
return nil, err
}
return c.instanceProvider.DiscardICEdInstanceTypes(ctx, allInstanceTypes), nil
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered that, but it felt messy as I would need to get the cached instance types from instance provider and then hand it to instance type provider (like an exclusion list). Unless you had something more clever in mind? I liked the idea that the ICE cache was entirely encapsulated by the instance provider, but if folks feel strongly I'm happy to disagree and commit if you want to share how you think it should look.

In terms of possibly including unavailable offerings - I can't think of why we'd want to do that. If you have something in mind, happy to hear it out. We would have to modify the overall CloudProvider interface if wanted to make ice caching behavior optional, which then leaks that detail into the higher level abstraction, so I'm inclined to say that we probably wouldn't want to do that.

pkg/cloudprovider/aws/errors.go Outdated Show resolved Hide resolved
@@ -36,11 +40,30 @@ import (
"github.com/awslabs/karpenter/pkg/utils/options"
)

const (
iceCacheODTTL = 15 * time.Second
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re: the value - do you want to chat more about this over Slack? I can disagree and commit on this, but I like the idea that we control them separately and that they are actually different values.

Re: the name - just so I understand the repo style for future changes, can I get clarity on when to use short versus long variable names? I took it that the convention in Go was to use short variable names whenever possible, but I must have misunderstood this convention. Is it because ICE is an EC2 acronym whereas TTL is more widely used?

type InstanceProvider struct {
ec2api ec2iface.EC2API
instanceTypeProvider *InstanceTypeProvider
subnetProvider *SubnetProvider
launchTemplateProvider *LaunchTemplateProvider
iceCache *cache.Cache
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about unavailableOfferings?

@@ -105,6 +128,37 @@ func (p *InstanceProvider) Terminate(ctx context.Context, node *v1.Node) error {
return nil
}

// The intention is to remove Offerings from the provided possible InstanceTypes based on recently observed ICEs, which will
// indirectly impact the packer to keep it from spinning its wheels on packings that are doomed to fail due to EC2 capacity constraints.
func (p *InstanceProvider) DiscardICEdInstanceTypes(ctx context.Context, instanceTypes []InstanceType) []cloudprovider.InstanceType {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While you're probably right, it means that we'll have to loop through all the InstanceType instances in the cloud provider to convert them, IIUC - right? I didn't do this because it seemed like extra effort I could eliminate by doing the conversion here. If you think being idiomatic is more valuable than the tiny performance saving, I can change it - just LMK.

@@ -304,3 +385,7 @@ func combineReservations(reservations []*ec2.Reservation) []*ec2.Instance {
}
return instances
}

func createIceCacheKey(instanceType string, capacityType string) string {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went with unavailableOfferingsCacheKey

}
if len(cleanedOfferings) > 0 {
instanceType.AvailableOfferings = cleanedOfferings
cleanedInstanceTypes = append(cleanedInstanceTypes, &instanceTypes[index])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I think I've fixed this in my new commit - please let me know if there's a cleaner way that I'm spacing on for how to do this.

zones.(sets.String).Insert(zone)
p.iceCache.Set(cacheKey, zones, cacheTTL)
} else {
p.iceCache.Set(cacheKey, sets.String{zone: sets.Empty{}}, cacheTTL)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, maybe I'm confused but I thought a sets.String was a hash map of strings to empty values. Looking at the source code, the insert does this:

// Insert adds items to the set.
func (s String) Insert(items ...string) String {
	for _, item := range items {
		s[item] = Empty{}
	}
	return s
}

Am I missing something?

@@ -75,6 +76,7 @@ var _ = BeforeSuite(func() {
securityGroupProvider: NewSecurityGroupProvider(fakeEC2API),
cache: launchTemplateCache,
},
cache.New(5*time.Second, 30*time.Second),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will play with this when I add tests. I thought lower values were better for testing, unless you think testing TTL is excessive.

cacheKey := createIceCacheKey(instanceType, capacityType)
if zones, exists := p.iceCache.Get(cacheKey); exists {
zones.(sets.String).Insert(zone)
p.iceCache.Set(cacheKey, zones, cacheTTL)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I, too, thought of this and decided it was an edge case not worth trying to fix.

The more I thought about it after reading your comment though the more it bothered me. I think I've come up with a solution that isn't perfect but balances ease of use with users not hitting an awful edge case. The fact that I can't make a set of structs in Go makes this very hard to solve otherwise is a nice way.

}
Expect(len(nodeNames)).To(Equal(1))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We test binpacking elsewhere, so you could reduce the scope of this test to not test len(nodes).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you don't feel strongly about it, I'd rather keep these checks. I don't think it hurts to sanity test that the test is working as the author expected.

@ellistarn ellistarn merged commit 0ad68e8 into aws:main Nov 25, 2021
@eptiger eptiger deleted the ice-cache-3 branch November 25, 2021 05:02
return &ec2.CreateFleetOutput{Instances: []*ec2.CreateFleetInstance{{InstanceIds: instanceIds}}}, nil
result := &ec2.CreateFleetOutput{
Instances: []*ec2.CreateFleetInstance{{InstanceIds: instanceIds}}}
if len(skippedPools) > 0 {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, meant to remove this. Sorry :)

gfcroft pushed a commit to gfcroft/karpenter-provider-aws that referenced this pull request Nov 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants