AWS launch template deletion on cache eviction #1278

Merged: 7 commits, Feb 9, 2022

@bwagner5 bwagner5 commented Feb 4, 2022

1. Issue, if available:
N/A

2. Description of changes:

  • The LaunchTemplate provider in the AWS Cloud Provider will now delete launch templates when they are evicted from the provider's local cache. This helps keep the number of dynamically generated launch templates under control.

3. How was this change tested?

  • Tested via a manual cluster deployment. I set the Cache TTL and Cache Cleanup Interval to 2 seconds and ran a series of launches that generated different launch templates. All launch templates were cleaned up with no launch issues.

4. Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: link to issue
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.


netlify bot commented Feb 4, 2022

✔️ Deploy Preview for karpenter-docs-prod ready!

🔨 Explore the source changes: 9f0602d

🔍 Inspect the deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/62040bb0ea16dc0008b78b8e

😎 Browse the preview: https://deploy-preview-1278--karpenter-docs-prod.netlify.app

@bwagner5 bwagner5 requested a review from ellistarn February 7, 2022 17:50
@@ -48,6 +48,7 @@ Resources:
- ec2:CreateTags
- iam:PassRole
- ec2:TerminateInstances
- ec2:DeleteLaunchTemplate

@ellistarn ellistarn Feb 7, 2022

We should make some upgrade instructions to help users migrate.

Contributor Author

we definitely should... not sure how we'd structure them. I was thinking release notes, but another way would be adding an upgrade section in the versioned docs.

Contributor

I just want a single command to run which updates my IAM. Maybe cfn deploy works out of the box.

Contributor Author

yeah cfn should work fine, but I suspect most users are integrating our template (cfn or terraform or cdk) into their own infrastructure-as-code so there won't be a one-size-fits-all solution.

@@ -211,6 +218,20 @@ func (p *LaunchTemplateProvider) createLaunchTemplate(ctx context.Context, optio
return output.LaunchTemplate, nil
}

func (p *LaunchTemplateProvider) onCacheEvicted(key string, lt interface{}) {
p.Lock()

@ellistarn ellistarn Feb 7, 2022

I don't think we need to lock since the cache eviction is already threadsafe.

Contributor Author

cache eviction is threadsafe, however onCacheEvicted is not threadsafe. For example, if an LT is evicted from the cache, ensureLaunchTemplate can run, get a cache miss, query LTs, and find the LT that was evicted but has not been deleted yet. After it finds that LT, onCacheEvicted can run (if the lock is removed) and delete the LT before it is used, which will surface as a launch failure in Fleet.

Contributor

The lock isn't saving us in this case, since the LT creation has already happened. The only thing saving us from this race is the expiration timeout -- I guess this is fine (necessary?).

Contributor Author

the lock does save us in the scenario I mentioned. The LT creation I mentioned would occur in the ensureLaunchTemplate func which also takes a lock. Locking in both of these funcs ensures that the cache is always consistent with the state of EC2.

@bwagner5 bwagner5 requested a review from ellistarn February 7, 2022 19:15
}
launchTemplate := lt.(*ec2.LaunchTemplate)
if _, err := p.ec2api.DeleteLaunchTemplate(&ec2.DeleteLaunchTemplateInput{LaunchTemplateId: launchTemplate.LaunchTemplateId}); err != nil {
p.logger.Errorf("Unable to delete launch template, %v", err)
Contributor

Is there any way we'll want to retry this deletion if we fail? Is this called within a controller somewhere?

Contributor Author

It's not retried (doesn't reconcile) since this is only executed on cache eviction, BUT it does rehydrate on startup, so if something did happen, a restart of Karpenter would clean them up.


@njtran njtran left a comment

Nice!

@bwagner5 bwagner5 merged commit 6b52c28 into aws:main Feb 9, 2022
@bwagner5 bwagner5 deleted the lt-reaper branch February 9, 2022 19:39
3 participants