
[Feature] Allow managed nodegroup creation with Ubuntu AMIs and GPU instances #6452

Open
JamesMaki opened this issue Mar 20, 2023 · 15 comments
Labels
kind/feature (New feature or request), needs-investigation, priority/important-longterm (Important over the long term, but may not be currently staffed and/or may require multiple releases)

Comments

@JamesMaki

JamesMaki commented Mar 20, 2023

What feature/behavior/change do you want?

Allow creation of managed nodegroups with Ubuntu AMIs when selecting a GPU instance.

Example cluster.yaml

# cluster.yaml
# A cluster with a managed Ubuntu nodegroup.
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: test-cluster
  region: us-west-2

managedNodeGroups:
  - name: gpu-nodegroup
    instanceType: g4dn.xlarge
    amiFamily: Ubuntu2004
    minSize: 1
    desiredCapacity: 1
    maxSize: 1

Currently, eksctl prints the expected warning that Ubuntu2004 does not ship with NVIDIA GPU drivers installed, but the warning is followed by an error and cluster creation is terminated.

 $ eksctl create cluster --install-nvidia-plugin=false --config-file cluster.yaml 
2023-03-20 20:50:00 [!]  Ubuntu2004 does not ship with NVIDIA GPU drivers installed, hence won't support running GPU-accelerated workloads out of the box
2023-03-20 20:50:00 [ℹ]  eksctl version 0.134.0
2023-03-20 20:50:00 [ℹ]  using region us-west-2
2023-03-20 20:50:00 [ℹ]  skipping us-west-2d from selection because it doesn't support the following instance type(s): g4dn.xlarge
2023-03-20 20:50:00 [ℹ]  setting availability zones to [us-west-2a us-west-2c us-west-2b]
2023-03-20 20:50:00 [✖]  image family Ubuntu2004 doesn't support GPU image class

Why do you want this feature?

A managed Ubuntu 20.04 nodegroup with no GPU drivers installed works well with the NVIDIA GPU Operator, which is installed via Helm and includes both the NVIDIA GPU device plugin and a GPU driver container. Supporting this would provide a quick and easy way to create a managed GPU nodegroup with up-to-date GPU drivers.

The GPU drivers included in the default Amazon Linux 2 AMI are typically out of date; for example, the drivers in the current AMI release are version 470.161.03. Making it easier to use the GPU Operator on EKS would give users a straightforward way to create EKS clusters with the latest recommended drivers, currently version 525.85.12.
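
For reference, a quick way to check the installed driver version on a running GPU node (assuming shell access to the node via SSH or SSM; this check is an illustration, not part of the original report):

# Prints only the NVIDIA driver version, e.g. 470.161.03 on the stock AL2 GPU AMI
nvidia-smi --query-gpu=driver_version --format=csv,noheader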

For example, this currently works with eksctl if you make the nodegroup self-managed, provide an overrideBootstrapCommand section, and supply the correct Ubuntu EKS AMI ID from here: https://cloud-images.ubuntu.com/docs/aws/eks/.

# cluster.yaml
# A cluster with a self-managed Ubuntu nodegroup.
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: test-cluster
  region: us-west-2

nodeGroups:
  - name: gpu-nodegroup
    instanceType: g4dn.xlarge
    amiFamily: Ubuntu2004
    # grab AMI ID for Ubuntu EKS AMI here: https://cloud-images.ubuntu.com/aws-eks/
    # using AMI ID for us-west-2 region: ami-06cd6fdaf5a24b728
    ami: ami-06cd6fdaf5a24b728
    minSize: 1
    desiredCapacity: 1
    maxSize: 1
    overrideBootstrapCommand: |
      #!/bin/bash
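      # bootstrap.helper.sh (placed on the node by eksctl) is expected to define the
      # CLUSTER_NAME and NODE_LABELS variables used below.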
      source /var/lib/cloud/scripts/eksctl/bootstrap.helper.sh
      /etc/eks/bootstrap.sh ${CLUSTER_NAME} --container-runtime containerd --kubelet-extra-args "--node-labels=${NODE_LABELS}"

Create the cluster and update the kubeconfig:

$ eksctl create cluster --install-nvidia-plugin=false --config-file cluster.yaml
$ aws eks --region us-west-2 update-kubeconfig --name test-cluster
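
As a quick sanity check (not part of the original steps), confirm the Ubuntu node has joined the cluster:

$ kubectl get nodes -o wide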

Install the NVIDIA GPU Operator via its Helm chart.

$ helm install --repo https://helm.ngc.nvidia.com/nvidia --wait --generate-name -n gpu-operator \
      --create-namespace gpu-operator
NAME: gpu-operator-1670843572
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None

Wait until all pods are deployed (about 6-7 minutes). This installs the GPU drivers and the GPU device plugin on the node.

watch -n 5 kubectl get pods -n gpu-operator

# Completed after about 7 minutes in testing
Every 5.0s: kubectl get pods -n gpu-operator

NAME                                                              READY   STATUS      RESTARTS   AGE                                                                                                                                                                                                                                                                  
gpu-feature-discovery-sk42n                                       1/1     Running     0          6m34s
gpu-operator-1679256184-node-feature-discovery-master-5cfdc2bx9   1/1     Running     0          7m2s
gpu-operator-1679256184-node-feature-discovery-worker-n8k9v       1/1     Running     0          7m2s
gpu-operator-79f94979f9-trnlp                                     1/1     Running     0          7m2s
nvidia-container-toolkit-daemonset-zp8wb                          1/1     Running     0          6m34s
nvidia-device-plugin-daemonset-djjqf                              1/1     Running     0          6m34s
nvidia-driver-daemonset-nw7h7                                     1/1     Running     0          6m43s
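
At this point the node should advertise nvidia.com/gpu as a schedulable resource. One way to confirm (an optional check, not part of the original steps):

$ kubectl describe nodes | grep -i 'nvidia.com/gpu'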

To verify that the drivers are working, create a test pod. Filename: nvidia-smi.yaml

# nvidia-smi.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  restartPolicy: OnFailure
  containers:
  - name: nvidia-smi
    image: ubuntu:22.04
    args:
    - "nvidia-smi"
    resources:
      limits:
        nvidia.com/gpu: 1

Apply the pod and check its logs once it has completed:

$ kubectl apply -f nvidia-smi.yaml
$ kubectl logs nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   33C    P8     8W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
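
Once verified, the test pod can be deleted:

$ kubectl delete -f nvidia-smi.yaml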

It would be nice to streamline this: support managed nodegroups and spare users from having to look up and hard-code an AMI ID.

@JamesMaki JamesMaki added the kind/feature label Mar 20, 2023
@github-actions
Contributor

Hello JamesMaki 👋 Thank you for opening an issue in the eksctl project. The team will review the issue and aim to respond within 1-5 business days. Meanwhile, please read about the Contribution and Code of Conduct guidelines here. You can find out more information about eksctl on our website.

@Himangini
Collaborator

We need to investigate how best to support this request.
Spike: 1 day

@angudadevops

It would be good to add this feature.

@JamesMaki
Author

Re-opened this issue as a bug since it works with CPU instances but not GPU instances.

@Himangini
Collaborator

#6499 (comment)

@Himangini Himangini reopened this May 9, 2023
@JamesMaki
Author

Thank you @Himangini!

@tanmatth

It would be good to add this feature, as one of our use cases requires using Ubuntu AMIs. Thank you.

@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the stale label Jun 26, 2023
@JamesMaki
Author

Please remove the stale label and keep this request open. Thank you!

@github-actions github-actions bot removed the stale label Jun 27, 2023
@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the stale label Oct 28, 2023
@JamesMaki
Author

JamesMaki commented Oct 28, 2023

I would like to keep this issue open please. Commenting so GitHub Actions will remove the stale label. Thank you.

@github-actions github-actions bot removed the stale label Oct 29, 2023
@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the stale label Nov 28, 2023
@cPu1 cPu1 added the priority/important-longterm label and removed the stale label Nov 28, 2023
@JamesMaki
Author

Hi, I wanted to point out that this already works today for CPU instances but is hard-coded to disallow creation for GPU instances. I think the relevant bit of code is here: https://github.com/eksctl-io/eksctl/blob/main/pkg/ami/auto_resolver.go#L81
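
For comparison, the same nodegroup spec with a non-GPU instance type (m5.xlarge here is purely illustrative) is reported to create successfully today; only the managedNodeGroups section is shown:

# Same managed nodegroup as above, but with a non-GPU instance type
managedNodeGroups:
  - name: cpu-nodegroup
    instanceType: m5.xlarge   # non-GPU instance type, chosen for illustration
    amiFamily: Ubuntu2004
    minSize: 1
    desiredCapacity: 1
    maxSize: 1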

@xyfleet

xyfleet commented Jan 29, 2024

Is this feature request done?

@montanaflynn

This would be very nice indeed!
