
[Feature] Allow managed nodegroup creation with Ubuntu AMIs and GPU instances #6452

Open
JamesMaki opened this issue Mar 20, 2023 · 15 comments
Labels
kind/feature (New feature or request), needs-investigation, priority/important-longterm (Important over the long term, but may not be currently staffed and/or may require multiple releases)

Comments

@JamesMaki

JamesMaki commented Mar 20, 2023

What feature/behavior/change do you want?

Allow creation of managed nodegroups with Ubuntu AMIs when selecting a GPU instance.

Example cluster.yaml

# cluster.yaml
# A cluster with a managed Ubuntu nodegroup.
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: test-cluster
  region: us-west-2

managedNodeGroups:
  - name: gpu-nodegroup
    instanceType: g4dn.xlarge
    amiFamily: Ubuntu2004
    minSize: 1
    desiredCapacity: 1
    maxSize: 1

Currently, eksctl prints the expected warning that Ubuntu2004 does not ship with NVIDIA GPU drivers installed, but the warning is followed by an error and cluster creation is terminated.

 $ eksctl create cluster --install-nvidia-plugin=false --config-file cluster.yaml 
2023-03-20 20:50:00 [!]  Ubuntu2004 does not ship with NVIDIA GPU drivers installed, hence won't support running GPU-accelerated workloads out of the box
2023-03-20 20:50:00 [ℹ]  eksctl version 0.134.0
2023-03-20 20:50:00 [ℹ]  using region us-west-2
2023-03-20 20:50:00 [ℹ]  skipping us-west-2d from selection because it doesn't support the following instance type(s): g4dn.xlarge
2023-03-20 20:50:00 [ℹ]  setting availability zones to [us-west-2a us-west-2c us-west-2b]
2023-03-20 20:50:00 [✖]  image family Ubuntu2004 doesn't support GPU image class

Why do you want this feature?

A managed Ubuntu 20.04 nodegroup with no GPU drivers installed works well with the NVIDIA GPU Operator, which is installed via Helm and includes both the NVIDIA GPU device plugin and a GPU driver container. Supporting this would provide a quick and easy way to create a managed GPU nodegroup with up-to-date GPU drivers.

The GPU drivers included in the default Amazon Linux 2 AMI are typically out of date; for example, the drivers in the current AMI release are version 470.161.03. Making it easier to use the GPU Operator on EKS would give users a straightforward way to create EKS clusters with the latest recommended drivers, currently version 525.85.12.
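
For reference, a quick way to check the installed driver version on a running GPU node (assuming shell access to the node via SSH or SSM; this check is an illustration, not part of the original report):

# Prints only the NVIDIA driver version, e.g. 470.161.03 on the stock AL2 GPU AMI
nvidia-smi --query-gpu=driver_version --format=csv,noheader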

For example, this currently works with eksctl if you make the nodegroup self-managed, provide an overrideBootstrapCommand section, and supply the correct Ubuntu EKS AMI ID from here: https://cloud-images.ubuntu.com/docs/aws/eks/.

# cluster.yaml
# A cluster with a self-managed Ubuntu nodegroup.
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: test-cluster
  region: us-west-2

nodeGroups:
  - name: gpu-nodegroup
    instanceType: g4dn.xlarge
    amiFamily: Ubuntu2004
    # grab AMI ID for Ubuntu EKS AMI here: https://cloud-images.ubuntu.com/aws-eks/
    # using AMI ID for us-west-2 region: ami-06cd6fdaf5a24b728
    ami: ami-06cd6fdaf5a24b728
    minSize: 1
    desiredCapacity: 1
    maxSize: 1
    overrideBootstrapCommand: |
      #!/bin/bash
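      # bootstrap.helper.sh (placed on the node by eksctl) is expected to define the
      # CLUSTER_NAME and NODE_LABELS variables used below.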
      source /var/lib/cloud/scripts/eksctl/bootstrap.helper.sh
      /etc/eks/bootstrap.sh ${CLUSTER_NAME} --container-runtime containerd --kubelet-extra-args "--node-labels=${NODE_LABELS}"

Create the cluster and update the kubeconfig:

$ eksctl create cluster --install-nvidia-plugin=false --config-file cluster.yaml
$ aws eks --region us-west-2 update-kubeconfig --name test-cluster
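
As a quick sanity check (not part of the original steps), confirm the Ubuntu node has joined the cluster:

$ kubectl get nodes -o wide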

Install the NVIDIA GPU Operator via its Helm chart.

$ helm install --repo https://helm.ngc.nvidia.com/nvidia --wait --generate-name -n gpu-operator \
      --create-namespace gpu-operator
NAME: gpu-operator-1670843572
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None

Wait until all pods are deployed (about 6-7 minutes). This installs the GPU drivers and the GPU device plugin on the node.

watch -n 5 kubectl get pods -n gpu-operator

# Completed after about 7 minutes in testing
Every 5.0s: kubectl get pods -n gpu-operator

NAME                                                              READY   STATUS      RESTARTS   AGE                                                                                                                                                                                                                                                                  
gpu-feature-discovery-sk42n                                       1/1     Running     0          6m34s
gpu-operator-1679256184-node-feature-discovery-master-5cfdc2bx9   1/1     Running     0          7m2s
gpu-operator-1679256184-node-feature-discovery-worker-n8k9v       1/1     Running     0          7m2s
gpu-operator-79f94979f9-trnlp                                     1/1     Running     0          7m2s
nvidia-container-toolkit-daemonset-zp8wb                          1/1     Running     0          6m34s
nvidia-device-plugin-daemonset-djjqf                              1/1     Running     0          6m34s
nvidia-driver-daemonset-nw7h7                                     1/1     Running     0          6m43s
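
At this point the node should advertise nvidia.com/gpu as a schedulable resource. One way to confirm (an optional check, not part of the original steps):

$ kubectl describe nodes | grep -i 'nvidia.com/gpu'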

To verify that the drivers are working, create a test pod. Filename: nvidia-smi.yaml

# nvidia-smi.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  restartPolicy: OnFailure
  containers:
  - name: nvidia-smi
    image: ubuntu:22.04
    args:
    - "nvidia-smi"
    resources:
      limits:
        nvidia.com/gpu: 1

Apply the pod and check its logs once it has completed:

$ kubectl apply -f nvidia-smi.yaml
$ kubectl logs nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   33C    P8     8W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
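
Once verified, the test pod can be deleted:

$ kubectl delete -f nvidia-smi.yaml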

It would be nice to streamline this: support managed nodegroups and spare users from having to look up and hard-code an AMI ID.

@JamesMaki JamesMaki added the kind/feature label Mar 20, 2023
@github-actions
Contributor

Hello JamesMaki 👋 Thank you for opening an issue in the eksctl project. The team will review the issue and aim to respond within 1-5 business days. Meanwhile, please read about the Contribution and Code of Conduct guidelines here. You can find out more information about eksctl on our website.

@Himangini
Collaborator

We need to investigate how best to support this request.
Spike: 1 day

@angudadevops

It would be good to add this feature.

@JamesMaki
Author

Re-opened this issue as a bug since it works with CPU instances but not GPU instances.

@Himangini
Collaborator

#6499 (comment)

@Himangini Himangini reopened this May 9, 2023
@JamesMaki
Author

Thank you @Himangini!

@tanmatth

It would be good to add this feature, as one of our use cases requires using Ubuntu AMIs. Thank you.

@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the stale label Jun 26, 2023
@JamesMaki
Author

Please remove the stale label and keep this request open. Thank you!

@github-actions github-actions bot removed the stale label Jun 27, 2023
@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the stale label Oct 28, 2023
@JamesMaki
Author

JamesMaki commented Oct 28, 2023

I would like to keep this issue open please. Commenting so GitHub Actions will remove the stale label. Thank you.

@github-actions github-actions bot removed the stale label Oct 29, 2023
@github-actions
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the stale label Nov 28, 2023
@cPu1 cPu1 added the priority/important-longterm label and removed the stale label Nov 28, 2023
@JamesMaki
Author

Hi, I wanted to point out that this already works today for CPU instances but is hard-coded to disallow creation for GPU instances. I think the relevant bit of code is here: https://github.com/eksctl-io/eksctl/blob/main/pkg/ami/auto_resolver.go#L81
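
For comparison, the same nodegroup spec with a non-GPU instance type (m5.xlarge here is purely illustrative) is reported to create successfully today; only the managedNodeGroups section is shown:

# Same managed nodegroup as above, but with a non-GPU instance type
managedNodeGroups:
  - name: cpu-nodegroup
    instanceType: m5.xlarge   # non-GPU instance type, chosen for illustration
    amiFamily: Ubuntu2004
    minSize: 1
    desiredCapacity: 1
    maxSize: 1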

@xyfleet

xyfleet commented Jan 29, 2024

Is this feature request done?

@montanaflynn

This would be very nice indeed!
