[GPU AMI] Incompatibility of EFA, NVIDIA drivers with latest Linux kernels #1494

Closed
cartermckinnon opened this issue Oct 26, 2023 · 8 comments

Comments

@cartermckinnon
Member

cartermckinnon commented Oct 26, 2023

With the latest Amazon Linux 2 kernels, customers running EC2 P4d, P4de and P5 instances may be unable to use the GPUDirect RDMA feature, which allows for faster communication between the NVIDIA driver and the EC2 Elastic Fabric Adapter (EFA).

This issue is caused by a change accepted by the Linux kernel community which introduced an incompatibility between the NVIDIA driver and the EFA driver. This change prevents the proprietary NVIDIA driver from dynamically linking to open source ones, such as EFA. We are currently working towards a solution to allow the use of the GPUDirect RDMA feature with the affected kernels.

Linux kernels at or above the following versions are affected:

  • 4.14.326
  • 5.4.257
  • 5.10.195
  • 5.15.131
  • 6.1.52
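As a rough way to apply the thresholds above to a given host, the sketch below compares a kernel release string against the listed minimum for its stable series. The function name `kernel_affected` is hypothetical, and the sketch assumes the thresholds apply per series (for example, a 5.10.x kernel is compared only against 5.10.195). Series that are not listed are reported as `unknown` rather than guessed.

```shell
# Hypothetical helper, not part of any AMI tooling.
# Prints "affected", "not-affected", or "unknown" for a kernel release string.
# With no argument it inspects the running kernel via `uname -r`.
kernel_affected() (
  release="${1:-$(uname -r)}"
  version="${release%%-*}"          # "5.10.219-208.866.amzn2.x86_64" -> "5.10.219"
  major="${version%%.*}"
  rest="${version#*.}"
  minor="${rest%%.*}"
  case "$rest" in
    *.*) patch="${rest#*.}" ;;      # text after the second dot
    *)   patch=0 ;;                 # no patch level present
  esac
  patch="${patch%%[!0-9]*}"         # drop any non-numeric suffix

  # Thresholds copied from the list above, one per stable series.
  case "${major}.${minor}" in
    4.14) threshold=326 ;;
    5.4)  threshold=257 ;;
    5.10) threshold=195 ;;
    5.15) threshold=131 ;;
    6.1)  threshold=52 ;;
    *)    echo "unknown"; exit 0 ;;
  esac

  if [ "${patch:-0}" -ge "$threshold" ]; then
    echo "affected"
  else
    echo "not-affected"
  fi
)
```

For example, `kernel_affected 5.10.219-208.866.amzn2.x86_64` prints `affected`, while calling `kernel_affected` with no argument checks the kernel the node is currently running.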

The EKS-Optimized Accelerated AMI does not contain the affected kernel versions. By default, these AMIs lock the kernel version and are not affected unless the kernel version lock is manually removed. We recommend that customers using custom AMIs lock their kernel to a version lower than those listed above to prevent any impact on their workloads until we have determined a solution. The kernel version can be locked with the following command:

sudo yum versionlock 'kernel*'
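To confirm that a lock is actually in place, one option is to filter the output of `yum versionlock list` for kernel entries. The sketch below is a hypothetical helper, not part of the AMI. It assumes the versionlock plugin is installed (on Amazon Linux 2 the package is believed to be `yum-plugin-versionlock`) and that lock entries are printed in the form `0:kernel-<version>-<release>.*`.

```shell
# Hypothetical helper: reads `yum versionlock list` output on stdin and
# prints only the kernel lock entries. Exits non-zero when none are found.
# Assumption: entries look like "0:kernel-<version>-<release>.*".
kernel_lock_entries() {
  grep -E '^[0-9]+:kernel'
}
```

A possible use is `yum versionlock list | kernel_lock_entries || echo "no kernel lock found"`.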
@pfuntner

My team builds EKS GPU images and I think we're facing this issue. We're using amazon-eks-gpu-node-1.28-v20231201 (ami-0a2b1b38a4684df6a in us-east-1 region) and I believe the kernel packages are already initially pinned.

yum upgrade output: eks gpu upgrade errors.txt

Any advice?

@cartermckinnon
Member Author

Today's release, v20240227, includes changes for Kubernetes 1.29 that address this issue. These changes will be backported to earlier Kubernetes versions in upcoming releases.

There are a few things to note with this change:

  1. The open-source NVIDIA kernel module will be used on supported instance types. This is necessary for EFA to function.
  2. The proprietary NVIDIA kernel module will be used on instance types that are not supported by the open-source module.
  3. We've migrated from the legacy nvidia-docker2 package to the nvidia-container-toolkit.
  4. The latest version of the 535-series NVIDIA driver is used, 535.161.07.
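To see which of the two kernel modules from points 1 and 2 a node ended up with, one approach is to look at the license string the `nvidia` module reports. The sketch below is an assumption-laden illustration: the function name `nvidia_module_flavor` is hypothetical, and it assumes the open-source module reports a license containing `GPL` (believed to be `Dual MIT/GPL`) while the proprietary module reports `NVIDIA`.

```shell
# Hypothetical helper: classifies the NVIDIA kernel module by its license
# string. With no argument it asks modinfo about the installed "nvidia" module.
# Assumption: open module -> license contains "GPL"; proprietary -> "NVIDIA".
nvidia_module_flavor() (
  license="${1:-$(modinfo -F license nvidia 2>/dev/null)}"
  case "$license" in
    *GPL*)  echo "open" ;;
    NVIDIA) echo "proprietary" ;;
    "")     echo "not-found" ;;
    *)      echo "unknown" ;;
  esac
)
```

Running `nvidia_module_flavor` on a node prints one of `open`, `proprietary`, `not-found`, or `unknown`. A license string can also be passed explicitly to classify it.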

Please reach out here or to AWS Support if these changes cause issues with your workload. This is a significant change and we expect some wrinkles will need ironing out. 😄

@farioas

farioas commented Mar 3, 2024

I'm running the latest version of AMI v1.29.0-eks-5e0fdde and nvidia-gpu-operator v23.9.1 on g5.48xlarge

Here's the error message that I have in my app container:

    Last State:     Terminated
      Reason:       StartError
      Message:      failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: /usr/local/nvidia/toolkit/nvidia-container-cli.real: /lib64/libc.so.6: version `GLIBC_2.27' not found (required by /usr/local/nvidia/toolkit/libnvidia-container.so.1): unknown
      Exit Code:    128

While both dcgm-exporter and feature-discovery pods failed to start with the message:

    Last State:    Terminated
      Reason:      StartError
      Message:     failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: detection error: nvml error: unknown error: unknown

@IshwarChandra

I'm running the latest version of AMI v1.29.0-eks-5e0fdde and nvidia-gpu-operator v23.9.1 on g5.48xlarge

Here's the error message that I have in my app container:

    Last State:     Terminated
      Reason:       StartError
      Message:      failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: /usr/local/nvidia/toolkit/nvidia-container-cli.real: /lib64/libc.so.6: version `GLIBC_2.27' not found (required by /usr/local/nvidia/toolkit/libnvidia-container.so.1): unknown
      Exit Code:    128

While both dcgm-exporter and feature-discovery pods failed to start with the message:

    Last State:    Terminated
      Reason:      StartError
      Message:     failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: detection error: nvml error: unknown error: unknown

I had to add privileged: true to the securityContext in the DaemonSet's definition file, but I am not sure whether this is recommended at all.

@cartermckinnon
Member Author

We've resolved this issue in all currently-supported Kubernetes versions.

@the-veloper

the-veloper commented Jul 10, 2024

@cartermckinnon

We've resolved this issue in all currently-supported Kubernetes versions.

I am not sure if that's the case. When the GPU operator was installing the driver, everything worked, but when I switched to the Amazon AMI with the preinstalled driver, everything went bonkers. The good news is that the CUDA validation works, so perhaps the issue is coming from somewhere else. Let me know if you need additional logs or information. I would appreciate anyone's input at this point.

cuda-validation - privileged container - terminated, ready - Completed (exit code: 0) - This is from the nvidia operator, which seems to work as expected.

dcgm-exporter and feature discovery also start correctly.

My app container (not privileged) though yields the same error message as described above:

Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: /usr/local/nvidia/toolkit/nvidia-container-cli.real: /lib64/libc.so.6: version `GLIBC_2.27' not found (required by /usr/local/nvidia/toolkit/libnvidia-container.so.1): unknown
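The error above says that `/lib64/libc.so.6` on the host does not provide `GLIBC_2.27`, which suggests the host's glibc is older than the version the toolkit binary was built against. A minimal sketch for checking this on a node follows. The function name `glibc_at_least` is hypothetical, and it relies on `getconf GNU_LIBC_VERSION` and on `sort -V` from GNU coreutils being available.

```shell
# Hypothetical helper: succeeds when the glibc version is at least the
# required one. The second argument is optional; when omitted, the host's
# version is read from getconf, which prints output such as "glibc 2.26".
glibc_at_least() (
  required="$1"
  current="${2:-$(getconf GNU_LIBC_VERSION | awk '{print $2}')}"
  # After a version sort, the required version must be the lower (or equal) one.
  lowest=$(printf '%s\n%s\n' "$required" "$current" | sort -V | head -n 1)
  [ "$lowest" = "$required" ]
)
```

On a node, `glibc_at_least 2.27 || echo "host glibc is older than 2.27"` would flag the mismatch named in the error message.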

node.kubernetes.io/instance-type=g3s.xlarge

imageID: ami-02f8e26b00d356736

Kubelet version
v1.28.8-eks-ae9a62a (upgraded from 1.27, although both are still supported as of July 2024) (both didn't work)

Kernel version
5.10.219-208.866.amzn2.x86_64

Container runtime
containerd://1.7.11

@Rayshard

@the-veloper

Did you find a solution?

@bryantbiggs
Contributor

The gpu-operator isn't useful on the EKS accelerated AMI variants and usually results in various package conflict errors. The NVIDIA device plugin is usually all you need for the majority of use cases, and there is a stand-alone Helm chart to deploy it.
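For reference, deploying the stand-alone device plugin chart might look like the sketch below. None of the specifics come from this thread: the repository URL, the `nvdp` repo alias and release name, and the `nvidia-device-plugin` namespace are assumptions based on the upstream k8s-device-plugin project's documented install steps, and should be checked against that project before use. The wrapper function `install_device_plugin` is hypothetical and only prints the commands unless `RUN_HELM=1` is set.

```shell
# Hypothetical wrapper: prints the helm commands by default, and executes
# them only when RUN_HELM=1. Repo URL, alias, release name, and namespace
# are assumptions taken from the upstream project, not from this thread.
install_device_plugin() (
  run() {
    if [ "${RUN_HELM:-0}" = "1" ]; then "$@"; else echo "$*"; fi
  }
  run helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
  run helm repo update
  run helm upgrade --install nvdp nvdp/nvidia-device-plugin \
    --namespace nvidia-device-plugin --create-namespace
)
```

Calling `install_device_plugin` shows what would be run, and `RUN_HELM=1 install_device_plugin` applies it to the current cluster context.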
