[Feature] Allow managed nodegroup creation with Ubuntu AMIs and GPU instances #6452
Comments
We need to investigate how to best support this request.
It's good to add this feature.
Re-opened this issue as a bug since it works with CPU instances but not GPU instances.
Thank you @Himangini!
It would be good to add this feature, as one of our use cases requires using Ubuntu AMIs. Thank you.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Please remove the stale label and keep this request open. Thank you!
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
I would like to keep this issue open please. Commenting so GitHub Actions will remove the stale label. Thank you.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Hi, I wanted to point out that this already works today for CPU instances but is hard-coded not to allow creation for GPU instances. I think the relevant bit of code is here: https://github.com/eksctl-io/eksctl/blob/main/pkg/ami/auto_resolver.go#L81
Is this feature request done?
This would be very nice indeed!
What feature/behavior/change do you want?
Allow creation of managed nodegroups with Ubuntu AMIs when selecting a GPU instance.
Example cluster.yaml
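A minimal sketch of what such a cluster.yaml might look like, assuming the Ubuntu2004 amiFamily and a GPU instance type such as g4dn.xlarge (cluster name, region, and capacity are placeholder values, not taken from the original issue):

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: ubuntu-gpu-demo   # placeholder cluster name
  region: us-west-2       # placeholder region

managedNodeGroups:
  - name: gpu-workers           # placeholder nodegroup name
    amiFamily: Ubuntu2004       # Ubuntu EKS image instead of AmazonLinux2
    instanceType: g4dn.xlarge   # GPU instance type; this combination is what is currently rejected
    desiredCapacity: 1
```

As noted in the comments above, the same kind of config with a CPU instance type is accepted; it is the combination of a Ubuntu amiFamily with a GPU instance type that triggers the error described next.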
Currently eksctl prints the expected warning that Ubuntu2004 does not ship with NVIDIA GPU drivers installed, but the warning is followed by an error and cluster creation is terminated.
Why do you want this feature?
A managed Ubuntu 20.04 nodegroup with no GPU drivers installed works well with the NVIDIA GPU Operator, which is installed via Helm and includes the NVIDIA GPU device plugin as well as a GPU driver container. This would provide a quick and easy way to create a managed GPU nodegroup with up-to-date GPU drivers.
The GPU drivers included in the default Amazon Linux 2 AMI are typically out of date; for example, the GPU drivers in the current AMI release are version 470.161.03, while the latest recommended drivers are version 525.85.12. Making it easier to use the GPU Operator on EKS would give users a straightforward way to create EKS clusters with the recommended drivers.
For example, this currently works with eksctl if you make the nodegroup unmanaged, provide the overrideBootstrapCommand section, and supply the correct Ubuntu EKS AMI ID from https://cloud-images.ubuntu.com/docs/aws/eks/. Then install the NVIDIA GPU Operator via its Helm chart (a sketch of typical commands is shown below).
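The exact Helm invocation is not spelled out in the issue; a typical installation following the GPU Operator documentation looks roughly like this (the release name, namespace, and use of --wait are assumptions for this sketch):

```sh
# Add NVIDIA's Helm repository and install the GPU Operator.
# Release name and namespace are arbitrary choices for this sketch.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --wait
```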
Wait until all pods are deployed (roughly 6-7 minutes in testing). This will add the GPU drivers and the GPU device plugin.
```
watch -n 5 kubectl get pods -n gpu-operator   # Completed after about 7 minutes in testing

Every 5.0s: kubectl get pods -n gpu-operator

NAME                                                               READY   STATUS    RESTARTS   AGE
gpu-feature-discovery-sk42n                                        1/1     Running   0          6m34s
gpu-operator-1679256184-node-feature-discovery-master-5cfdc2bx9    1/1     Running   0          7m2s
gpu-operator-1679256184-node-feature-discovery-worker-n8k9v        1/1     Running   0          7m2s
gpu-operator-79f94979f9-trnlp                                      1/1     Running   0          7m2s
nvidia-container-toolkit-daemonset-zp8wb                           1/1     Running   0          6m34s
nvidia-device-plugin-daemonset-djjqf                               1/1     Running   0          6m34s
nvidia-driver-daemonset-nw7h7                                      1/1     Running   0          6m43
```
Filename: nvidia-smi.yaml
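The contents of nvidia-smi.yaml are not reproduced above; a test pod along these lines is one way to verify the setup (the image tag and GPU resource request are assumptions, not the file from the original issue):

```yaml
# Hypothetical test pod that runs nvidia-smi once on a GPU node.
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  restartPolicy: OnFailure
  containers:
    - name: nvidia-smi
      image: nvidia/cuda:11.8.0-base-ubuntu20.04   # assumed image tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # request a single GPU via the device plugin
```

Applying it with kubectl apply -f nvidia-smi.yaml and then checking kubectl logs nvidia-smi should print the driver and GPU details if the operator-installed driver and device plugin are working.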
It would be nice to streamline this, enable managed nodegroups, and avoid users having to look up and hard-code their AMI ID.