Select GPU variant of eks optimized ami for nvidia & neuron #684

Merged
JacobGabrielson merged 5 commits into aws:main from jeepy-ewe on Sep 21, 2021

Conversation

Contributor

@JacobGabrielson JacobGabrielson commented Sep 16, 2021

1. Issue, if available:

#683

2. Description of changes:

Uses the "-gpu" variant of the EKS optimized AMI if the list of candidate instance types contains any NVIDIA GPU instances.
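
The selection described above can be sketched as a small pure function. This is a minimal sketch, not the code from this PR: the `instanceType` struct, its `NvidiaGPUs` and `AWSNeurons` fields, and the function `ssmParameterName` are hypothetical stand-ins for `cloudprovider.InstanceType` and the provider methods. Treating Neuron instances the same as NVIDIA ones is an assumption based on the PR title. Only the SSM parameter path format and the `-gpu` suffix come from the diff and logs in this conversation.

```go
package main

import "fmt"

// instanceType is a hypothetical stand-in for cloudprovider.InstanceType,
// reduced to the accelerator counts this sketch needs.
type instanceType struct {
	Name       string
	NvidiaGPUs int
	AWSNeurons int
}

// needsGPUAmi reports whether any candidate instance type carries an
// NVIDIA GPU or an AWS Neuron device. Grouping Neuron with NVIDIA is an
// assumption taken from the PR title, not from the visible diff.
func needsGPUAmi(types []instanceType) bool {
	for _, t := range types {
		if t.NvidiaGPUs > 0 || t.AWSNeurons > 0 {
			return true
		}
	}
	return false
}

// ssmParameterName builds the SSM path of the recommended EKS optimized AMI
// for the given Kubernetes version, adding "-gpu" when accelerators are present.
func ssmParameterName(version string, types []instanceType) string {
	suffix := ""
	if needsGPUAmi(types) {
		suffix = "-gpu"
	}
	return fmt.Sprintf("/aws/service/eks/optimized-ami/%s/amazon-linux-2%s/recommended/image_id", version, suffix)
}

func main() {
	gpu := []instanceType{{Name: "g2.8xlarge", NvidiaGPUs: 4}}
	cpu := []instanceType{{Name: "m5.large"}}
	fmt.Println(ssmParameterName("1.20", gpu))
	fmt.Println(ssmParameterName("1.20", cpu))
}
```

With a GPU instance type in the list, the sketch yields the same parameter path that appears in the controller logs in this description.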

Prelim test results:

Applied to cluster w/ no nvidia GPU instances:

apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  restartPolicy: OnFailure
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:latest
    args:
    - "nvidia-smi"
    resources:
      limits:
        nvidia.com/gpu: 4

karpenter-controller logs:

2021-09-16T21:57:54.552884348Z 2021-09-16T21:57:54.552Z	INFO	controller.allocation.provisioner/default	Starting provisioning loop	{"commit": "569c3bc"}
2021-09-16T21:57:54.552932259Z 2021-09-16T21:57:54.552Z	INFO	controller.allocation.provisioner/default	Waiting to batch additional pods	{"commit": "569c3bc"}
2021-09-16T21:57:55.689190601Z 2021-09-16T21:57:55.689Z	INFO	controller.allocation.provisioner/default	Found 1 provisionable pods	{"commit": "569c3bc"}
2021-09-16T21:57:56.602742898Z 2021-09-16T21:57:56.602Z	DEBUG	controller.allocation.provisioner/default	Discovered 309 EC2 instance types	{"commit": "569c3bc"}
2021-09-16T21:57:56.604835556Z 2021-09-16T21:57:56.604Z	DEBUG	controller.allocation.provisioner/default	Excluding instance type t3.nano because there are not enough resources for kubelet and system overhead	{"commit": "569c3bc"}
2021-09-16T21:57:56.605042600Z 2021-09-16T21:57:56.604Z	DEBUG	controller.allocation.provisioner/default	Excluding instance type t3a.nano because there are not enough resources for kubelet and system overhead	{"commit": "569c3bc"}
2021-09-16T21:57:56.606514867Z 2021-09-16T21:57:56.606Z	INFO	controller.allocation.provisioner/default	Computed packing for 1 pod(s) with instance type option(s) [g2.8xlarge g4dn.12xlarge p3.8xlarge g3.16xlarge p2.8xlarge p3.16xlarge p3dn.24xlarge p4d.24xlarge p2.16xlarge]	{"commit": "569c3bc"}
2021-09-16T21:57:56.850586631Z 2021-09-16T21:57:56.850Z	DEBUG	controller.allocation.provisioner/default	Discovered 3 subnets for cluster jacob-karpenter-demo	{"commit": "569c3bc"}
2021-09-16T21:57:56.857076091Z 2021-09-16T21:57:56.856Z	DEBUG	controller.allocation.provisioner/default	Discovered kubernetes version 1.20	{"commit": "569c3bc"}
2021-09-16T21:57:56.924673505Z 2021-09-16T21:57:56.924Z	DEBUG	controller.allocation.provisioner/default	Discovered ami ami-0b1f0fc1fb3651d9f for query /aws/service/eks/optimized-ami/1.20/amazon-linux-2-gpu/recommended/image_id	{"commit": "569c3bc"}
2021-09-16T21:57:57.131200476Z 2021-09-16T21:57:57.130Z	DEBUG	controller.allocation.provisioner/default	Discovered 1 security groups for cluster jacob-karpenter-demo	{"commit": "569c3bc"}
2021-09-16T21:57:57.131240937Z 2021-09-16T21:57:57.131Z	DEBUG	controller.allocation.provisioner/default	Discovered caBundle, length 1066	{"commit": "569c3bc"}
2021-09-16T21:57:57.251842641Z 2021-09-16T21:57:57.251Z	DEBUG	controller.allocation.provisioner/default	Created launch template, Karpenter-jacob-karpenter-demo-15956394333737086452	{"commit": "569c3bc"}
2021-09-16T21:57:58.939682393Z 2021-09-16T21:57:58.939Z	INFO	controller.allocation.provisioner/default	Launched instance: i-0f9d2818ce653bdb0, type: g2.8xlarge, zone: us-west-2b, hostname: ip-192-168-132-145.us-west-2.compute.internal	{"commit": "569c3bc"}
2021-09-16T21:57:58.984805548Z 2021-09-16T21:57:58.984Z	INFO	controller.allocation.provisioner/default	Bound 1 pod(s) to node ip-192-168-132-145.us-west-2.compute.internal	{"commit": "569c3bc"}

nvidia-smi logs:

kubectl logs nvidia-smi
Thu Sep 16 22:43:05 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID K520           On   | 00000000:00:03.0 N/A |                  N/A |
| N/A   32C    P8    N/A /  N/A |      0MiB /  4037MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GRID K520           On   | 00000000:00:04.0 N/A |                  N/A |
| N/A   34C    P8    N/A /  N/A |      0MiB /  4037MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  GRID K520           On   | 00000000:00:05.0 N/A |                  N/A |
| N/A   34C    P8    N/A /  N/A |      0MiB /  4037MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  GRID K520           On   | 00000000:00:06.0 N/A |                  N/A |
| N/A   37C    P8    N/A /  N/A |      0MiB /  4037MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

3. Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: link to issue
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.


netlify bot commented Sep 16, 2021

✔️ Deploy Preview for karpenter-docs-prod ready!

🔨 Explore the source changes: a34346b

🔍 Inspect the deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/614a1a25e6c82f000720ab61

😎 Browse the preview: https://deploy-preview-684--karpenter-docs-prod.netlify.app

@JacobGabrielson JacobGabrielson linked an issue Sep 16, 2021 that may be closed by this pull request
@JacobGabrielson JacobGabrielson changed the title from "[WIP] first cut at selecting -gpu ami for nvidia" to "elect GPU variant of eks optimized ami for nvidia" Sep 16, 2021
@JacobGabrielson JacobGabrielson changed the title from "elect GPU variant of eks optimized ami for nvidia" to "Select GPU variant of eks optimized ami for nvidia" Sep 16, 2021
@JacobGabrielson JacobGabrielson marked this pull request as ready for review September 16, 2021 22:47
@JacobGabrielson JacobGabrielson changed the title from "Select GPU variant of eks optimized ami for nvidia" to "Select GPU variant of eks optimized ami for nvidia & neuron" Sep 17, 2021
}
amiNameSuffix = "-gpu"
}
name := fmt.Sprintf("/aws/service/eks/optimized-ami/%s/amazon-linux-2%s/recommended/image_id", version, amiNameSuffix)
Contributor

Can you move all of this ami name logic into a separate function?

Contributor Author

Aye, aye, cap'n

Contributor Author

Done


// NeedsDocker returns true if the instance type is unable to use
// containerd directly
func NeedsDocker(is []cloudprovider.InstanceType) bool {
Contributor

IMO these functions are better encapsulated by launchtemplate.go. Until now, this file was just a pure implementation of the cloudprovider.InstanceType interface. If you move them, I'd make the methods private, too.

Contributor Author

Sounds good.

Contributor Author

Done

-func (p *LaunchTemplateProvider) getUserData(ctx context.Context, provisioner *v1alpha3.Provisioner, constraints *Constraints) (string, error) {
+func (p *LaunchTemplateProvider) getUserData(ctx context.Context, provisioner *v1alpha3.Provisioner, constraints *Constraints, instanceTypes []cloudprovider.InstanceType) (string, error) {
var containerRuntimeArg string
if !NeedsDocker(instanceTypes) {
Contributor

How hard is it to remove the docker dependency?

Contributor Author

Sorry, I do not understand the question, can you clarify?

Contributor

How hard is it to use containerd for GPUs? Happy if the answer is "out of scope for now"

Contributor Author

@JacobGabrielson JacobGabrielson Sep 19, 2021

My understanding is, it's not hard. The EKS optimized AMI does already support it for Nvidia GPUs, just not for Neurons, but that should be available soon, IIUC.
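
The runtime choice discussed in this thread could look roughly like the following. It is a minimal sketch, not the merged code: `instanceType`, `needsDocker`, and `bootstrapCommand` are hypothetical names, the rule that only Neuron instances still need Docker is taken from the comment above, and the sketch assumes the EKS optimized AMI's `/etc/eks/bootstrap.sh` accepts a `--container-runtime containerd` argument.

```go
package main

import (
	"fmt"
	"strings"
)

// instanceType is a hypothetical stand-in for cloudprovider.InstanceType,
// carrying only the Neuron device count used by this sketch.
type instanceType struct {
	Name       string
	AWSNeurons int
}

// needsDocker reports whether any candidate instance type cannot use
// containerd directly. Per the discussion above, that is assumed to mean
// instance types with AWS Neuron devices.
func needsDocker(types []instanceType) bool {
	for _, t := range types {
		if t.AWSNeurons > 0 {
			return true
		}
	}
	return false
}

// bootstrapCommand builds the node bootstrap command line, opting into
// containerd unless one of the instance types still needs Docker.
// The flag name is an assumption about the AMI's bootstrap script.
func bootstrapCommand(clusterName string, types []instanceType) string {
	args := []string{"/etc/eks/bootstrap.sh", clusterName}
	if !needsDocker(types) {
		args = append(args, "--container-runtime", "containerd")
	}
	return strings.Join(args, " ")
}

func main() {
	fmt.Println(bootstrapCommand("jacob-karpenter-demo", []instanceType{{Name: "g2.8xlarge"}}))
	fmt.Println(bootstrapCommand("jacob-karpenter-demo", []instanceType{{Name: "inf1.xlarge", AWSNeurons: 1}}))
}
```

Because the decision is made over the whole list, a single Neuron instance type among the candidates keeps every node launched from that template on Docker, which is the conservative choice when the launched type is not known in advance.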

@ellistarn
Contributor

Does Karpenter play well w/ the nvidia driver installer? How long does it take to bring up a new gpu node?

@JacobGabrielson
Contributor Author

> Does Karpenter play well w/ the nvidia driver installer? How long does it take to bring up a new gpu node?

It's a good question - I'm not sure yet.

@JacobGabrielson JacobGabrielson changed the title from "Select GPU variant of eks optimized ami for nvidia & neuron" to "[WIP] Select GPU variant of eks optimized ami for nvidia & neuron" Sep 18, 2021
@@ -45,7 +45,7 @@ func NewAMIProvider(ssm ssmiface.SSMAPI, clientSet *kubernetes.Clientset) *AMIPr
}
}

-func (p *AMIProvider) Get(ctx context.Context, constraints *Constraints, instanceTypes []cloudprovider.InstanceType) (string, error) {
+func (p *AMIProvider) getSSMParameter(ctx context.Context, constraints *Constraints, instanceTypes []cloudprovider.InstanceType) (string, error) {
Contributor

completely optional: I might move this function below Get(). I tend to follow a mixed pattern of DFS and "bigger concepts to the top"

Contributor

@ellistarn ellistarn left a comment

LGTM. Is this working or still WIP?

@JacobGabrielson JacobGabrielson force-pushed the jeepy-ewe branch 2 times, most recently from 4ec6515 to 53ea2e0 Compare September 19, 2021 05:07
@JacobGabrielson JacobGabrielson changed the title from "[WIP] Select GPU variant of eks optimized ami for nvidia & neuron" to "Select GPU variant of eks optimized ami for nvidia & neuron" Sep 21, 2021
@JacobGabrielson
Contributor Author

> LGTM. Is this working or still WIP?

I'd like to check it in. I'm still testing, and it seems likely there is some kind of race going on with the nvidia daemonset, but I think this code is arguably a step in the right direction (gives things a chance of working), and I'm getting tired of rebasing all the time :-)

Contributor

@bwagner5 bwagner5 left a comment

lgtm!

@@ -149,6 +145,28 @@ func (p *LaunchTemplateProvider) ensureLaunchTemplate(ctx context.Context, optio
return launchTemplate, nil
}

func needsGPUAmi(is []cloudprovider.InstanceType) bool {
Contributor

i might've misled you earlier. Should these live in ami.go (where they're used)?

Contributor

@ellistarn ellistarn left a comment

minor, optional

@JacobGabrielson JacobGabrielson merged commit 903fffc into aws:main Sep 21, 2021
@JacobGabrielson JacobGabrielson deleted the jeepy-ewe branch September 21, 2021 22:11
Successfully merging this pull request may close these issues.

karpenter should use gpu optimized ami when needed
4 participants