-
Notifications
You must be signed in to change notification settings - Fork 979
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement ICE caching for AWS cloud provider and fix bug in filtering for SubnetProvider #816
Conversation
✔️ Deploy Preview for karpenter-docs-prod canceled. 🔨 Explore the source changes: 408a81a 🔍 Inspect the deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/619f044900afa600076d5236 |
pkg/cloudprovider/aws/instance.go
Outdated
} | ||
if len(cleanedOfferings) > 0 { | ||
instanceType.AvailableOfferings = cleanedOfferings | ||
cleanedInstanceTypes = append(cleanedInstanceTypes, &instanceTypes[index]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This syntax seemed funky to me but doing I get a linting error if I just append *instanceType
- there's certainly some Go idiosyncrasy I don't follow here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is the issue I mentioned above where the range variable in a for loop uses one memory address and overrides it for each iteration. So you'd need to set another temp variable in the loop, and take the address of it when appending.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I think I've fixed this in my new commit - please let me know if there's a cleaner way that I'm spacing on for how to do this.
pkg/cloudprovider/aws/instance.go
Outdated
} | ||
if len(cleanedOfferings) > 0 { | ||
instanceType.AvailableOfferings = cleanedOfferings | ||
cleanedInstanceTypes = append(cleanedInstanceTypes, &instanceTypes[index]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is the issue I mentioned above where the range variable in a for loop uses one memory address and overrides it for each iteration. So you'd need to set another temp variable in the loop, and take the address of it when appending.
if err != nil { | ||
return nil, err | ||
} | ||
return c.instanceProvider.DiscardICEdInstanceTypes(ctx, allInstanceTypes), nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that this is the only place that this is used, does it make sense to hide this behavior inside the instance provider? That is to say, instanceTypeProvider.Get() could inherently factor in unavailable offerings. Is there ever a case where we want to include unavailable offerings on this call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I considered that, but it felt messy as I would need to get the cached instance types from instance provider and then hand it to instance type provider (like an exclusion list). Unless you had something more clever in mind? I liked the idea that the ICE cache was entirely encapsulated by the instance provider, but if folks feel strongly I'm happy to disagree and commit if you want to share how you think it should look.
In terms of possibly including unavailable offerings - I can't think of why we'd want to do that. If you have something in mind, happy to hear it out. We would have to modify the overall CloudProvider interface if wanted to make ice caching behavior optional, which then leaks that detail into the higher level abstraction, so I'm inclined to say that we probably wouldn't want to do that.
pkg/cloudprovider/aws/instance.go
Outdated
@@ -36,11 +40,30 @@ import ( | |||
"github.com/awslabs/karpenter/pkg/utils/options" | |||
) | |||
|
|||
const ( | |||
iceCacheODTTL = 15 * time.Second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iceCacheODTTL = 15 * time.Second | |
InsufficientCapacityExceptionCacheOnDemandTTL = 15 * time.Second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the spot and OD cache TTLs should be the same 45 seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re: the value - do you want to chat more about this over Slack? I can disagree and commit on this, but I like the idea that we control them separately and that they are actually different values.
Re: the name - just so I understand the repo style for future changes, can I get clarity on when to use short versus long variable names? I took it that the convention in Go was to use short variable names whenever possible, but I must have misunderstood this convention. Is it because ICE is an EC2 acronym whereas TTL is more widely used?
pkg/cloudprovider/aws/instance.go
Outdated
@@ -105,6 +128,37 @@ func (p *InstanceProvider) Terminate(ctx context.Context, node *v1.Node) error { | |||
return nil | |||
} | |||
|
|||
// The intention is to remove Offerings from the provided possible InstanceTypes based on recently observed ICEs, which will | |||
// indirectly impact the packer to keep it from spinning its wheels on packings that are doomed to fail due to EC2 capacity constraints. | |||
func (p *InstanceProvider) DiscardICEdInstanceTypes(ctx context.Context, instanceTypes []InstanceType) []cloudprovider.InstanceType { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
naming suggestion:
func (p *InstanceProvider) DiscardICEdInstanceTypes(ctx context.Context, instanceTypes []InstanceType) []cloudprovider.InstanceType { | |
func (p *InstanceProvider) WithoutUnavailableOfferings(ctx context.Context, instanceTypes []InstanceType) []InstanceType { |
It also might be a bit more idiomatic to do type conversion at the cloud provider interface and just think in terms of aws.InstanceTypes here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While you're probably right, it means that we'll have to loop through all the InstanceType instances in the cloud provider to convert them, IIUC - right? I didn't do this because it seemed like extra effort I could eliminate by doing the conversion here. If you think being idiomatic is more valuable than the tiny performance saving, I can change it - just LMK.
pkg/cloudprovider/aws/instance.go
Outdated
@@ -304,3 +385,7 @@ func combineReservations(reservations []*ec2.Reservation) []*ec2.Instance { | |||
} | |||
return instances | |||
} | |||
|
|||
func createIceCacheKey(instanceType string, capacityType string) string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func createIceCacheKey(instanceType string, capacityType string) string { | |
func iceCacheKey(instanceType string, capacityType string) string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Went with unavailableOfferingsCacheKey
pkg/cloudprovider/aws/instance.go
Outdated
type InstanceProvider struct { | ||
ec2api ec2iface.EC2API | ||
instanceTypeProvider *InstanceTypeProvider | ||
subnetProvider *SubnetProvider | ||
launchTemplateProvider *LaunchTemplateProvider | ||
iceCache *cache.Cache |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional name suggestion
iceCache *cache.Cache | |
offeringAvailability *cache.Cache |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about unavailableOfferings
?
pkg/cloudprovider/aws/instance.go
Outdated
@@ -176,7 +237,7 @@ func (p *InstanceProvider) getLaunchTemplateConfigs(ctx context.Context, constra | |||
return launchTemplateConfigs, nil | |||
} | |||
|
|||
func (p *InstanceProvider) getOverrides(instanceTypeOptions []cloudprovider.InstanceType, subnets []*ec2.Subnet, capacityType string) []*ec2.FleetLaunchTemplateOverridesRequest { | |||
func (p *InstanceProvider) getOverrides(instanceTypeOptions []cloudprovider.InstanceType, subnets []*ec2.Subnet, constrainedZones sets.String, capacityType string) []*ec2.FleetLaunchTemplateOverridesRequest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func (p *InstanceProvider) getOverrides(instanceTypeOptions []cloudprovider.InstanceType, subnets []*ec2.Subnet, constrainedZones sets.String, capacityType string) []*ec2.FleetLaunchTemplateOverridesRequest { | |
func (p *InstanceProvider) getOverrides(instanceTypeOptions []cloudprovider.InstanceType, subnets []*ec2.Subnet, zones sets.String, capacityType string) []*ec2.FleetLaunchTemplateOverridesRequest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm gonna push back on this comment. I appreciate that we're trying to be concise, but there's 3 kinds of zones here:
- The zones in the instance type offerings representing what is supported
- The zones associated with the constraints that constrain what we can choose
- The zone associated with the subnet
I think there's value in naming them separately as there is legitimate ambiguity. Do you feel strongly about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a simplification below that might help with this, but I'm happy either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Compromise: I'm adding a comment because I still think zones
is confusing =P You can take your pick between the new comment or me reverting the variable name change, I think we need at least one of the two.
pkg/cloudprovider/aws/instance.go
Outdated
@@ -185,10 +246,14 @@ func (p *InstanceProvider) getOverrides(instanceTypeOptions []cloudprovider.Inst | |||
continue | |||
} | |||
for _, subnet := range subnets { | |||
if aws.StringValue(subnet.AvailabilityZone) == offering.Zone { | |||
subnetZone := aws.StringValue(subnet.AvailabilityZone) | |||
if subnetZone == offering.Zone && (constrainedZones == nil || constrainedZones.Has(subnetZone)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're guaranteed that constrainedZones won't be nil, since Karpenter has to produce the full set of viable zones for the requirement NotIn
operator to work. e.g. us-west-2a
not in nil
is impossible to compute.
if subnetZone == offering.Zone && (constrainedZones == nil || constrainedZones.Has(subnetZone)) { | |
if zone := aws.StringValue(subnet.AvailabilityZone); zone == offering.Zone && zones.has(zone) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you are saying it's always guaranteed, I'll go ahead and remove the check. I'm hoping that we already have some unit tests that would fail in this case?
pkg/cloudprovider/aws/instance.go
Outdated
zones.(sets.String).Insert(zone) | ||
p.iceCache.Set(cacheKey, zones, cacheTTL) | ||
} else { | ||
p.iceCache.Set(cacheKey, sets.String{zone: sets.Empty{}}, cacheTTL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't you need to include the zone in the cache?
p.iceCache.Set(cacheKey, sets.String{zone: sets.Empty{}}, cacheTTL) | |
p.iceCache.Set(cacheKey, sets.String{zone: sets.NewString(zones)}, cacheTTL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, maybe I'm confused but I thought a sets.String was a hash map of strings to empty values. Looking at the source code, the insert does this:
// Insert adds items to the set.
func (s String) Insert(items ...string) String {
for _, item := range items {
s[item] = Empty{}
}
return s
}
Am I missing something?
pkg/cloudprovider/aws/instance.go
Outdated
} | ||
cacheKey := createIceCacheKey(instanceType, capacityType) | ||
if zones, exists := p.iceCache.Get(cacheKey); exists { | ||
zones.(sets.String).Insert(zone) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because sets.String is a pointer (the underlying type is a map, which is a pointer), you should be able to just read and inject the zone and not call cache.Set() again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for confirming that - this was admittedly me being lazy about not verifying this suspicion I had =P That being said, don't I need to do a Set()
to update the TTL?
pkg/cloudprovider/aws/instance.go
Outdated
cacheKey := createIceCacheKey(instanceType, capacityType) | ||
if zones, exists := p.iceCache.Get(cacheKey); exists { | ||
zones.(sets.String).Insert(zone) | ||
p.iceCache.Set(cacheKey, zones, cacheTTL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about this, I think there might be a weird case
- zone-a is added to the cache, 5m
- 4 minutes pass
- zone-b is added to the cache, 5m
- both zone-a and zone-b are now cached for 5m
- 1 minute passes
- zone-a might be available again, but is cached for 4 more minutes
You could potentially store a timestamp in the cache to keep track of this, but it feels a bit complicated to me. It's probably fine in the grand scheme. Happy to discuss this more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I, too, thought of this and decided it was an edge case not worth trying to fix.
The more I thought about it after reading your comment though the more it bothered me. I think I've come up with a solution that isn't perfect but balances ease of use with users not hitting an awful edge case. The fact that I can't make a set of structs in Go makes this very hard to solve otherwise is a nice way.
pkg/cloudprovider/aws/suite_test.go
Outdated
@@ -75,6 +76,7 @@ var _ = BeforeSuite(func() { | |||
securityGroupProvider: NewSecurityGroupProvider(fakeEC2API), | |||
cache: launchTemplateCache, | |||
}, | |||
cache.New(5*time.Second, 30*time.Second), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you export these values you can reference them directly in the tests. Assuming they're not different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll also likely need to use injectable.Time to test this, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will play with this when I add tests. I thought lower values were better for testing, unless you think testing TTL is excessive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About to post the next update, just give me a few minutes to run through the checks.
if err != nil { | ||
return nil, err | ||
} | ||
return c.instanceProvider.DiscardICEdInstanceTypes(ctx, allInstanceTypes), nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I considered that, but it felt messy as I would need to get the cached instance types from instance provider and then hand it to instance type provider (like an exclusion list). Unless you had something more clever in mind? I liked the idea that the ICE cache was entirely encapsulated by the instance provider, but if folks feel strongly I'm happy to disagree and commit if you want to share how you think it should look.
In terms of possibly including unavailable offerings - I can't think of why we'd want to do that. If you have something in mind, happy to hear it out. We would have to modify the overall CloudProvider interface if wanted to make ice caching behavior optional, which then leaks that detail into the higher level abstraction, so I'm inclined to say that we probably wouldn't want to do that.
pkg/cloudprovider/aws/instance.go
Outdated
@@ -36,11 +40,30 @@ import ( | |||
"github.com/awslabs/karpenter/pkg/utils/options" | |||
) | |||
|
|||
const ( | |||
iceCacheODTTL = 15 * time.Second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re: the value - do you want to chat more about this over Slack? I can disagree and commit on this, but I like the idea that we control them separately and that they are actually different values.
Re: the name - just so I understand the repo style for future changes, can I get clarity on when to use short versus long variable names? I took it that the convention in Go was to use short variable names whenever possible, but I must have misunderstood this convention. Is it because ICE is an EC2 acronym whereas TTL is more widely used?
pkg/cloudprovider/aws/instance.go
Outdated
type InstanceProvider struct { | ||
ec2api ec2iface.EC2API | ||
instanceTypeProvider *InstanceTypeProvider | ||
subnetProvider *SubnetProvider | ||
launchTemplateProvider *LaunchTemplateProvider | ||
iceCache *cache.Cache |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about unavailableOfferings
?
pkg/cloudprovider/aws/instance.go
Outdated
@@ -105,6 +128,37 @@ func (p *InstanceProvider) Terminate(ctx context.Context, node *v1.Node) error { | |||
return nil | |||
} | |||
|
|||
// The intention is to remove Offerings from the provided possible InstanceTypes based on recently observed ICEs, which will | |||
// indirectly impact the packer to keep it from spinning its wheels on packings that are doomed to fail due to EC2 capacity constraints. | |||
func (p *InstanceProvider) DiscardICEdInstanceTypes(ctx context.Context, instanceTypes []InstanceType) []cloudprovider.InstanceType { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While you're probably right, it means that we'll have to loop through all the InstanceType instances in the cloud provider to convert them, IIUC - right? I didn't do this because it seemed like extra effort I could eliminate by doing the conversion here. If you think being idiomatic is more valuable than the tiny performance saving, I can change it - just LMK.
pkg/cloudprovider/aws/instance.go
Outdated
@@ -304,3 +385,7 @@ func combineReservations(reservations []*ec2.Reservation) []*ec2.Instance { | |||
} | |||
return instances | |||
} | |||
|
|||
func createIceCacheKey(instanceType string, capacityType string) string { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Went with unavailableOfferingsCacheKey
pkg/cloudprovider/aws/instance.go
Outdated
} | ||
if len(cleanedOfferings) > 0 { | ||
instanceType.AvailableOfferings = cleanedOfferings | ||
cleanedInstanceTypes = append(cleanedInstanceTypes, &instanceTypes[index]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I think I've fixed this in my new commit - please let me know if there's a cleaner way that I'm spacing on for how to do this.
pkg/cloudprovider/aws/instance.go
Outdated
zones.(sets.String).Insert(zone) | ||
p.iceCache.Set(cacheKey, zones, cacheTTL) | ||
} else { | ||
p.iceCache.Set(cacheKey, sets.String{zone: sets.Empty{}}, cacheTTL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, maybe I'm confused but I thought a sets.String was a hash map of strings to empty values. Looking at the source code, the insert does this:
// Insert adds items to the set.
func (s String) Insert(items ...string) String {
for _, item := range items {
s[item] = Empty{}
}
return s
}
Am I missing something?
pkg/cloudprovider/aws/suite_test.go
Outdated
@@ -75,6 +76,7 @@ var _ = BeforeSuite(func() { | |||
securityGroupProvider: NewSecurityGroupProvider(fakeEC2API), | |||
cache: launchTemplateCache, | |||
}, | |||
cache.New(5*time.Second, 30*time.Second), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will play with this when I add tests. I thought lower values were better for testing, unless you think testing TTL is excessive.
pkg/cloudprovider/aws/instance.go
Outdated
cacheKey := createIceCacheKey(instanceType, capacityType) | ||
if zones, exists := p.iceCache.Get(cacheKey); exists { | ||
zones.(sets.String).Insert(zone) | ||
p.iceCache.Set(cacheKey, zones, cacheTTL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I, too, thought of this and decided it was an edge case not worth trying to fix.
The more I thought about it after reading your comment though the more it bothered me. I think I've come up with a solution that isn't perfect but balances ease of use with users not hitting an awful edge case. The fact that I can't make a set of structs in Go makes this very hard to solve otherwise is a nice way.
… for SubnetProvider
pkg/cloudprovider/aws/suite_test.go
Outdated
} | ||
Expect(len(nodeNames)).To(Equal(1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We test binpacking elsewhere, so you could reduce the scope of this test to not test len(nodes).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you don't feel strongly about it, I'd rather keep these checks. I don't think it hurts to sanity test that the test is working as the author expected.
…move unnecessary expectation method
return &ec2.CreateFleetOutput{Instances: []*ec2.CreateFleetInstance{{InstanceIds: instanceIds}}}, nil | ||
result := &ec2.CreateFleetOutput{ | ||
Instances: []*ec2.CreateFleetInstance{{InstanceIds: instanceIds}}} | ||
if len(skippedPools) > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops, meant to remove this. Sorry :)
chore: Update owner alias for ellistarn
1. Issue, if available:
#371
2. Description of changes:
Finally - the exhilirating conclusion to my set of code changes on ice caching! All the changes I've made fit nicely within the AWS cloud provider
I haven't done real cluster testing yet, only unit tests (but I believe Brandon has). Will do a sanity test on a real cluster before pushing. I also didn't add a unit test yet where we fall back to another zone - I may still try to add this when addressing the next round of comments.
3. Does this change impact docs?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.