Please add nvidia driver version 520 for GPU enabled EKS AMI image #1060

Closed
jsuto opened this issue Oct 26, 2022 · 12 comments · Fixed by #1439
Labels
enhancement New feature or request Work in Progress

Comments

@jsuto

jsuto commented Oct 26, 2022

What would you like to be added:

nvidia driver version 520 and related packages are needed on a GPU enabled EKS host.

Why is this needed:

The current EKS AMI ships nvidia driver version 470. However, we have software that requires a newer version. nvidia driver 510 seems to work for us, though it might be better to ship the latest version, 520.

@aschleck

Has there been any movement on this? We're using jax which is very particular about matching cuda and nvidia driver releases (so 470 means the highest cuda we can use is 11.4.) Now that cuda 12 is out, any chance we can get the driver version bumped?
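A quick way to check a node against a CUDA requirement is to compare the installed driver version with the minimum that the CUDA release needs. The sketch below is only illustrative: `version_ge` is a hypothetical helper name, it assumes GNU `sort -V` is available, and the minimum version you pass in should come from NVIDIA's CUDA compatibility documentation rather than from this thread.

```shell
#!/usr/bin/env bash
# Sketch only: version_ge is a hypothetical helper, not part of any NVIDIA or EKS tooling.
# Assumes GNU coreutils `sort -V` (version sort) is available on the node.

# Succeeds (exit 0) when version $1 is greater than or equal to version $2.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Prints whether the given driver version satisfies the given minimum.
check_driver() {
  local installed="$1" required="$2"
  if version_ge "${installed}" "${required}"; then
    echo "driver ${installed} satisfies minimum ${required}"
  else
    echo "driver ${installed} is older than minimum ${required}"
  fi
}
```

On a GPU node, the installed version can be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and passed as the first argument to `check_driver`.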

@al1y

al1y commented May 30, 2023

Any progress here? Tried building my own custom image but had 0 luck

@aschleck

I haven't tried it yet but I believe the best solution here is:

  1. Move off the Amazon AMIs completely to https://cloud-images.ubuntu.com/docs/aws/eks/
  2. Install this operator: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/amazon-eks.html which lets you use whatever version of cuda you want. Then you can dump the device plugin too.

@al1y

al1y commented May 30, 2023

👍 will give that a go - ty for response!

@sjkoelle

how'd it go? we'd very much like to use torch 2 and need 510 drivers, which unfortunately still don't seem to be supported by default.

@aschleck

We disabled GPU support on our EKS cluster and moved everything to GKE, where the default image is currently on driver version 525. Plus A100s come in more shapes and are cheaper there.

@sjkoelle

I tried basically following only step 2 of your suggestion and was unsuccessful (NVIDIA/gpu-operator#542). I think perhaps it is impossible to update prebuilt drivers (NVIDIA/gpu-operator#525). We are pretty ensconced in AWS, wondering if there even is a solution there... maybe running https://aws.amazon.com/marketplace/pp/prodview-h3v6xvwe36v74 as you suggest?

fwiw other AWS AMIs do run 510 driver versions, but my understanding is that these don't come with EKS support.

@sjkoelle

sjkoelle commented Jun 27, 2023

Okay, step 1 was clearly critical - working now. Thanks for the useful thread!

ptailor1193 added the enhancement (New feature or request) and Work in Progress labels Jun 28, 2023
@ptailor1193

We plan to upgrade the NVIDIA drivers in our EKS Optimized Accelerated AMI to the newer 525 series with a future Kubernetes version. For customers who want to stay on older Kubernetes versions, we will also provide a way of upgrading the NVIDIA drivers with the existing Accelerated AMI via documentation.

@mlschindler

We plan to upgrade the NVIDIA drivers in our EKS Optimized Accelerated AMI to the newer 525 series with a future Kubernetes version. For customers who want to stay on older Kubernetes versions, we will also provide a way of upgrading the NVIDIA drivers with the existing Accelerated AMI via documentation.

Can you share a link to this documentation? What we're seeing on EKS 1.25 is that the NVIDIA driver version differs depending on the node being used. So I am not convinced it's related to just the EKS AMI, unless I am not understanding something.

@bryantbiggs
Contributor

bryantbiggs commented Aug 4, 2023

If needed, you can run the following on the EKS GPU AMI to install a newer driver; just provide the intended driver version:

# Versions
# Driver 525.125.06 / CUDA 12.0
# Driver 535.54.03 / CUDA 12.2

# DRIVER=525.125.06
DRIVER=535.54.03

sudo yum install gcc10 -y
sudo wget -O /tmp/NVIDIA-Linux-driver.run "https://us.download.nvidia.com/tesla/${DRIVER}/NVIDIA-Linux-x86_64-${DRIVER}.run"
sudo CC=gcc10-cc sh /tmp/NVIDIA-Linux-driver.run -q -a --ui=none

You could do this in the user data and install it during instance startup. However, this adds a bit of time to instance startup. Instead, launch a standalone EC2 instance using the EKS GPU AMI (you don't need to supply the cluster bootstrap script; it's not meant to connect to a cluster at this time), run the commands above, and then create a snapshot from the instance to create an AMI for use in your nodegroups.
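If you do go the user data route, or want to re-run the install safely while preparing an AMI, the commands above can be wrapped so they skip work when the requested driver is already present. This is a rough sketch, not a tested recipe: the function names `driver_url`, `installed_driver_version`, and `install_driver` are hypothetical, and the only commands used are the ones shown above plus `nvidia-smi --query-gpu=driver_version`.

```shell
#!/usr/bin/env bash
# Sketch only: these helper names are hypothetical and not part of the EKS AMI.

# Build the download URL for a driver version (same URL pattern as the commands above).
driver_url() {
  local version="$1"
  echo "https://us.download.nvidia.com/tesla/${version}/NVIDIA-Linux-x86_64-${version}.run"
}

# Print the currently installed driver version, or nothing if nvidia-smi is unavailable.
installed_driver_version() {
  command -v nvidia-smi >/dev/null 2>&1 || return 0
  nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n 1
}

# Install the requested driver unless that exact version is already installed.
install_driver() {
  local version="$1"
  if [ "$(installed_driver_version)" = "${version}" ]; then
    echo "NVIDIA driver ${version} already installed; skipping"
    return 0
  fi
  sudo yum install -y gcc10
  sudo wget -O /tmp/NVIDIA-Linux-driver.run "$(driver_url "${version}")"
  sudo CC=gcc10-cc sh /tmp/NVIDIA-Linux-driver.run -q -a --ui=none
}
```

Calling `install_driver 535.54.03` at the end of the script runs the install; nothing is installed just by defining the functions.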

⚠️ This information is provided to help folks install their own drivers and devices. You should thoroughly test and validate before deploying your workload. The configuration/guidance provided is not part of an AWS service and support is provided as best-effort by the maintainers. As stated here, official EKS support for newer drivers and devices will come on a future Kubernetes version of EKS

@bryantbiggs
Contributor

bryantbiggs commented Aug 14, 2023

Here is an initial Packer configuration to build an EKS AMI for use with NVIDIA GPUs (it is suitable for P5 instances as well): https://github.com/clowdhaus/amazon-eks-gpu-ami

This will be moving over to https://github.com/aws-samples/amazon-eks-custom-amis this week

⚠️ This information is provided to help folks install their own drivers and devices. You should thoroughly test and validate before deploying your workload. The configuration/guidance provided is not part of an AWS service and support is provided as best-effort by the maintainers. As stated here, official EKS support for newer drivers and devices will come on a future Kubernetes version of EKS


7 participants