-
Notifications
You must be signed in to change notification settings - Fork 983
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AWS launch template deletion on cache eviction #1278
Conversation
✔️ Deploy Preview for karpenter-docs-prod ready! 🔨 Explore the source changes: 9f0602d 🔍 Inspect the deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/62040bb0ea16dc0008b78b8e 😎 Browse the preview: https://deploy-preview-1278--karpenter-docs-prod.netlify.app |
@@ -48,6 +48,7 @@ Resources: | |||
- ec2:CreateTags | |||
- iam:PassRole | |||
- ec2:TerminateInstances | |||
- ec2:DeleteLaunchTemplate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should make some upgrade instructions to help users migrate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we definitely should... not sure how we'd structure them. I was thinking release notes, but another way would be adding an upgrade section in the versioned docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just want a single command to run which updates my IAM. Maybe cfn deploy works out of the box.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah cfn should work fine, but I suspect most users are integrating our template (cfn or terraform or cdk) into their own infrastructure-as-code so there won't be a one-size fits all solution.
@@ -211,6 +218,20 @@ func (p *LaunchTemplateProvider) createLaunchTemplate(ctx context.Context, optio | |||
return output.LaunchTemplate, nil | |||
} | |||
|
|||
func (p *LaunchTemplateProvider) onCacheEvicted(key string, lt interface{}) { | |||
p.Lock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to lock since the cache eviction is already threadsafe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cache eviction is threadsafe, however onCacheEvicted is not threadsafe. For example, if an LT is evicted from the cache, ensureLaunchTemplate can be executed that will receive a cache miss, and then query LTs and find the LT that was evicted but has not been deleted yet. After it finds that LT, onCacheEvicted can run (if the lock is removed) and delete the LT before it is used which will propagate as a launch failure in Fleet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The lock isn't saving us in this case, since the LT creation has already happened. The only thing saving us from this race is the expiration timeout -- I guess this is fine (necessary?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the lock does save us in the scenario I mentioned. The LT creation I mentioned would occur in the ensureLaunchTemplate
func which also takes a lock. Locking in both of these funcs ensures that the cache is always consistent with the state of EC2.
} | ||
launchTemplate := lt.(*ec2.LaunchTemplate) | ||
if _, err := p.ec2api.DeleteLaunchTemplate(&ec2.DeleteLaunchTemplateInput{LaunchTemplateId: launchTemplate.LaunchTemplateId}); err != nil { | ||
p.logger.Errorf("Unable to delete launch template, %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any way we'll want to retry this deletion if we fail? Is this called within a controller somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not retried (doesn't reconcile) since this is only executed on cache eviction, BUT it does rehydrate on startup, so if something did happen, a restart of Karpenter would clean them up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
1. Issue, if available:
N/A
2. Description of changes:
3. How was this change tested?
4. Does this change impact docs?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.