-
Notifications
You must be signed in to change notification settings - Fork 983
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Select GPU variant of eks optimized ami for nvidia & neuron #684
Conversation
✔️ Deploy Preview for karpenter-docs-prod ready! 🔨 Explore the source changes: a34346b 🔍 Inspect the deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/614a1a25e6c82f000720ab61 😎 Browse the preview: https://deploy-preview-684--karpenter-docs-prod.netlify.app |
3068267
to
b3ef776
Compare
pkg/cloudprovider/aws/ami.go
Outdated
} | ||
amiNameSuffix = "-gpu" | ||
} | ||
name := fmt.Sprintf("/aws/service/eks/optimized-ami/%s/amazon-linux-2%s/recommended/image_id", version, amiNameSuffix) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move all of this ami name logic into a separate function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aye, aye, cap'n
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
// NeedsDocker returns true if the instance type is unable to use | ||
// conatinerd directly | ||
func NeedsDocker(is []cloudprovider.InstanceType) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO these functions are better encapsulated by launchtemplate.go. Until now, this file was just a pure implementation of the cloudprovider.InstanceType interface. If you move them, I'd make the methods private, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
func (p *LaunchTemplateProvider) getUserData(ctx context.Context, provisioner *v1alpha3.Provisioner, constraints *Constraints) (string, error) { | ||
func (p *LaunchTemplateProvider) getUserData(ctx context.Context, provisioner *v1alpha3.Provisioner, constraints *Constraints, instanceTypes []cloudprovider.InstanceType) (string, error) { | ||
var containerRuntimeArg string | ||
if !NeedsDocker(instanceTypes) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How hard is it to remove the docker dependency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I do not understand the question, can you clarify?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How hard is it to use containerd for GPUs? Happy if the answer is "out of scope for now"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is, it's not hard. The EKS optimized AMI does already support it for Nvidia GPUs, just not for Neurons, but that should be available soon, IIUC.
Does Karpenter play well w/ the nvidia driver installer? How long does it take to bring up a new gpu node? |
It's a good question - I'm not sure yet. |
pkg/cloudprovider/aws/ami.go
Outdated
@@ -45,7 +45,7 @@ func NewAMIProvider(ssm ssmiface.SSMAPI, clientSet *kubernetes.Clientset) *AMIPr | |||
} | |||
} | |||
|
|||
func (p *AMIProvider) Get(ctx context.Context, constraints *Constraints, instanceTypes []cloudprovider.InstanceType) (string, error) { | |||
func (p *AMIProvider) getSSMParameter(ctx context.Context, constraints *Constraints, instanceTypes []cloudprovider.InstanceType) (string, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
completely optional: I might move this function below Get()
. I tend to follow a mixed pattern of DFS and "bigger concepts to the top"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Is this working or still WIP?
4ec6515
to
53ea2e0
Compare
53ea2e0
to
a34346b
Compare
I'd like to check it in. I'm still testing, and it seems likely there is some kind of race going on with the nvidia daemonset, but I think this code is arguably a step in the right direction (gives things a chance of working), and I'm getting tired of rebasing all the time :-) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm!
@@ -149,6 +145,28 @@ func (p *LaunchTemplateProvider) ensureLaunchTemplate(ctx context.Context, optio | |||
return launchTemplate, nil | |||
} | |||
|
|||
func needsGPUAmi(is []cloudprovider.InstanceType) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i might've misled you earlier. Should these live in ami.go (where they're used)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor, optional
1. Issue, if available:
#683
2. Description of changes:
Uses the "-gpu" variant of eks optimized AMI if there are any nvidia instances in the list of instance types
Prelim test results:
Applied to cluster w/ no nvidia GPU instances:
karpenter-controller logs:
nvidia-smi
logs:3. Does this change impact docs?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.