[GPU AMI] Incompatibility of EFA, NVIDIA drivers with latest Linux kernels #1494
Comments
My team builds EKS GPU images and I think we're facing this issue. We're using
Any advice?
Today's release, v20240227, includes changes for Kubernetes 1.29 that address this issue. These changes will be backported to earlier Kubernetes versions in upcoming releases. There are a few things to note with this change:
Please reach out here or to AWS Support if these changes cause issues with your workload. This is a significant change and we expect some wrinkles will need ironing out. 😄
I'm running the latest version of the AMI. Here's the error message that I have in my app container:
Meanwhile, both the dcgm-exporter and feature-discovery pods failed to start with the message:
I had to add
We've resolved this issue in all currently-supported Kubernetes versions.
I am not sure if that's the case. When the gpu-operator was installing the driver everything worked, but when I switched to Amazon's AMI with the preinstalled driver, everything went bonkers. The good news is that the CUDA validation works, so perhaps the issue is coming from somewhere else. Let me know if you need additional logs or information. I would appreciate anyone's input at this point. The cuda-validation pod (privileged container) is terminated and ready, Completed (exit code: 0); this is from the nvidia operator, which seems to work as expected. dcgm-exporter and feature-discovery also start correctly. My app container (not privileged), though, yields the same error message as described above:
imageID: ami-02f8e26b00d356736
Kubelet version:
Kernel version:
Container runtime:
Did you find a solution?
The gpu-operator isn't useful on the EKS accelerated AMI variants and usually results in various package conflict errors. The NVIDIA device plugin is usually all you need for the majority of use cases, and there is a stand-alone Helm chart to deploy it.
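For reference, a minimal sketch of deploying the stand-alone device plugin chart; the repository URL and chart name are taken from the upstream k8s-device-plugin project, and the release name is arbitrary, so verify against the current NVIDIA docs before use:

```bash
# Add NVIDIA's device plugin Helm repository (URL assumed from the upstream project)
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update

# Install the device plugin into kube-system; "nvdp" is an illustrative release name
helm upgrade --install nvdp nvdp/nvidia-device-plugin --namespace kube-system
```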
With the latest Amazon Linux 2 kernels, customers running EC2 P4d, P4de and P5 instances may be unable to use the GPUDirect RDMA feature, which allows for faster communication between the NVIDIA driver and the EC2 Elastic Fabric Adapter (EFA).
This issue is caused by a change accepted by the Linux kernel community that introduced an incompatibility between the NVIDIA driver and the EFA driver. This change prevents the proprietary NVIDIA driver from dynamically linking against open-source drivers such as EFA. We are currently working towards a solution to allow the use of the GPUDirect RDMA feature with the affected kernels.
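As a rough, hedged way to see whether a node is in the affected state, one could check the running kernel and whether the peer-memory and EFA modules loaded; the module names (nvidia_peermem, efa) are assumptions based on current NVIDIA and EFA packaging:

```bash
# Illustrative check only: print the running kernel version and see whether the
# GPUDirect RDMA peer-memory module (nvidia_peermem) and the EFA driver (efa) are loaded.
uname -r
lsmod | grep -E 'nvidia_peermem|^efa' || echo "peer-memory / EFA modules not loaded"
```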
Linux kernels with versions equal to or above the following are affected:
The EKS-Optimized Accelerated AMI does not contain the affected kernel versions. By default, these AMIs have locked the kernel version and are not affected, unless the kernel version lock is manually removed. We recommend that customers using custom AMIs lock their kernel to a version lower than those listed above until we have determined a solution, to prevent any impact on their workloads. The kernel version can be locked with the following command:
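A minimal sketch, assuming Amazon Linux 2 and the yum versionlock plugin; the exact command and version spec for your AMI may differ:

```bash
# Hedged example: install the versionlock plugin and pin the currently running kernel
sudo yum install -y yum-plugin-versionlock
sudo yum versionlock "kernel-$(uname -r)"
```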