[GPU AMI] Incompatibility of EFA, NVIDIA drivers with latest Linux kernels #1494
Comments
My team builds EKS GPU images and I think we're facing this issue. We're using
Any advice?
Today's release, v20240227, includes changes for Kubernetes 1.29 that address this issue. These changes will be backported to earlier Kubernetes versions in upcoming releases. There are a few things to note with this change:
Please reach out here or to AWS Support if these changes cause issues with your workload. This is a significant change and we expect some wrinkles will need ironing out. 😄
I'm running the latest version of the AMI. Here's the error message that I have in my app container:
Meanwhile, both the dcgm-exporter and feature-discovery pods failed to start with the message:
I had to add
We've resolved this issue in all currently-supported Kubernetes versions.
I am not sure if that's the case. When the gpu-operator was installing the driver everything worked, but when I switched to Amazon's AMI with the preinstalled driver, everything went bonkers. The good news is that the CUDA validation works, so perhaps the issue is coming from somewhere else. Let me know if you need additional logs or information. I would appreciate anyone's input at this point. The cuda-validation pod (privileged container) is terminated and ready, Completed (exit code: 0); this is from the nvidia operator, which seems to work as expected. dcgm-exporter and feature-discovery also start correctly. My app container (not privileged), though, yields the same error message as described above:
imageID: ami-02f8e26b00d356736
Kubelet version:
Kernel version:
Container runtime:
Did you find a solution?
The gpu-operator isn't useful on the EKS accelerated AMI variants and usually results in various package conflict errors. The NVIDIA device plugin is usually all you need for the majority of use cases, and there is a stand-alone Helm chart to deploy it.
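For reference, a minimal sketch of deploying the stand-alone device plugin chart; the repository URL and chart name are taken from the upstream k8s-device-plugin project, and the release name is arbitrary, so verify against the current NVIDIA docs before use:

```bash
# Add NVIDIA's device plugin Helm repository (URL assumed from the upstream project)
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update

# Install the device plugin into kube-system; "nvdp" is an illustrative release name
helm upgrade --install nvdp nvdp/nvidia-device-plugin --namespace kube-system
```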
With the latest Amazon Linux 2 kernels, customers running EC2 P4d, P4de and P5 instances may be unable to use the GPUDirect RDMA feature, which allows for faster communication between the NVIDIA driver and the EC2 Elastic Fabric Adapter (EFA).
This issue is caused by a change accepted by the Linux kernel community that introduced an incompatibility between the NVIDIA driver and the EFA driver. This change prevents the proprietary NVIDIA driver from dynamically linking against open-source drivers such as EFA. We are currently working towards a solution to allow the use of the GPUDirect RDMA feature with the affected kernels.
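As a rough, hedged way to see whether a node is in the affected state, one could check the running kernel and whether the peer-memory and EFA modules loaded; the module names (nvidia_peermem, efa) are assumptions based on current NVIDIA and EFA packaging:

```bash
# Illustrative check only: print the running kernel version and see whether the
# GPUDirect RDMA peer-memory module (nvidia_peermem) and the EFA driver (efa) are loaded.
uname -r
lsmod | grep -E 'nvidia_peermem|^efa' || echo "peer-memory / EFA modules not loaded"
```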
Linux kernels with versions equal to or above the following are affected:
The EKS-Optimized Accelerated AMI does not contain the affected kernel versions. By default, these AMIs have locked the kernel version and are not affected, unless the kernel version lock is manually removed. We recommend that customers using custom AMIs lock their kernel to a version lower than those listed above until we have determined a solution, to prevent any impact on their workloads. The kernel version can be locked with the following command:
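A minimal sketch, assuming Amazon Linux 2 and the yum versionlock plugin; the exact command and version spec for your AMI may differ:

```bash
# Hedged example: install the versionlock plugin and pin the currently running kernel
sudo yum install -y yum-plugin-versionlock
sudo yum versionlock "kernel-$(uname -r)"
```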