vGPU pods stuck/fail after the installation #1018

Open
tunahanertekin opened this issue Sep 27, 2024 · 0 comments
tunahanertekin commented Sep 27, 2024

Hi,

I'm trying to use the GPU Operator with vGPU support on k3s, following this article. After I install the operator, the vGPU pods get stuck in the Init state, and the vGPU manager pod then goes into CrashLoopBackOff. I haven't been able to find the root cause or a similar issue in the forum/issues yet. I can provide outputs from the host if requested. Any kind of help is appreciated.

  • The server is vGPU certified. (Supermicro 1029U-TR4 with two NVIDIA T4 GPUs)
  • SR-IOV is enabled. (BIOS)
  • VT-d is enabled. (BIOS)
  • intel_iommu is enabled. (/etc/default/grub; see the verification sketch right after this list)
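
These host-side settings can be double-checked roughly as follows; this is a sketch, and the exact dmesg wording varies by kernel.

# confirm the IOMMU is actually active (message wording differs across kernels)
dmesg | grep -i -e DMAR -e IOMMU
# confirm the kernel command line carries the intel_iommu flag
grep intel_iommu /proc/cmdline
# confirm both T4 GPUs are visible on the PCI bus
lspci -nn | grep -i nvidia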

Pod status in the gpu-operator namespace:

NAMESPACE      NAME                                                              READY   STATUS      RESTARTS         AGE
gpu-operator   gpu-operator-1727453251-node-feature-discovery-gc-854cf464lp2ck   1/1     Running     0                53m
gpu-operator   gpu-operator-1727453251-node-feature-discovery-master-8656xbdjm   1/1     Running     0                53m
gpu-operator   gpu-operator-1727453251-node-feature-discovery-worker-25mqp       1/1     Running     0                53m
gpu-operator   gpu-operator-84c6b4697b-hlshg                                     1/1     Running     0                53m
gpu-operator   nvidia-sandbox-device-plugin-daemonset-8vp72                      1/1     Running     0                50m
gpu-operator   nvidia-sandbox-validator-zw6w7                                    1/1     Running     0                53m
gpu-operator   nvidia-vgpu-device-manager-zddgb                                  0/1     Init:0/1    0                40m
gpu-operator   nvidia-vgpu-manager-daemonset-4cm5s                               0/1     Init:0/1    12 (5m42s ago)   49m

When I check the allocatable resources on the node, I can see the vGPU device I'm trying to use, as below.

allocatable:
  cpu: "80"
  ephemeral-storage: "4411267110320"
  hugepages-1Gi: "0"
  hugepages-2Mi: 2Gi
  memory: 261654520Ki
  nvidia.com/GRID_T4-2Q: "1"
  pods: "110"
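
For reference, that list can be pulled straight from the node status; <node-name> is a placeholder for the actual node name.

# print only the Allocatable block of the node status
kubectl describe node <node-name> | grep -A 10 "Allocatable:"
# or dump it as JSON
kubectl get node <node-name> -o jsonpath='{.status.allocatable}'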

Here is my installation command. I disabled the driver and the toolkit because they are already installed on the host.

helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set sandboxWorkloads.enabled=true \
  --set driver.enabled=false \
  --set toolkit.enabled=false \
  --set vgpuManager.enabled=true \
  --set vgpuManager.repository=${PRIVATE_REGISTRY} \
  --set vgpuManager.image=vgpu-manager \
  --set vgpuManager.version=550.90.05 \
  --set vgpuManager.imagePullSecrets={${REGISTRY_SECRET_NAME}}
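
To rule out a values problem, the settings the release actually went in with can be checked with something like the following; <release-name> is a placeholder for the generated release name, and the ClusterPolicy resource is whatever the chart created.

# show the user-supplied values of the installed release
helm get values -n gpu-operator <release-name>
# the operator renders these values into a ClusterPolicy custom resource
kubectl get clusterpolicy -o yaml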

Logs of the crashing/stuck pods

kubectl logs -f nvidia-vgpu-manager-daemonset-4cm5s -n gpu-operator -c k8s-driver-manager
NVIDIA GPU driver is already pre-installed on the node, disabling the containerized driver on the node
node/robolaunch-internal labeled

kubectl logs -f nvidia-vgpu-device-manager-zddgb -n gpu-operator -c vgpu-manager-validation
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
... (the same message keeps repeating)

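To dig further into the restart loop, something like the following should show which init container is failing and its previous crash logs; <container-name> is a placeholder that has to be read from the describe output.

# show init containers, restart counts and events for the stuck pod
kubectl describe pod nvidia-vgpu-manager-daemonset-4cm5s -n gpu-operator
# fetch the logs of the previous (crashed) attempt of that container
kubectl logs nvidia-vgpu-manager-daemonset-4cm5s -n gpu-operator -c <container-name> --previous
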
Verification of vGPU creation

mdevctl list
16d7dda2-f888-4c28-9c3c-2352daa88a8c 0000:af:00.0 nvidia-231 (defined)
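
The device is reported as (defined) rather than active; if needed, a defined mdev can usually be started explicitly (a sketch, using the UUID from the listing above):

# activate the defined mediated device by UUID
mdevctl start -u 16d7dda2-f888-4c28-9c3c-2352daa88a8c
# list again; an active device is shown without the "(defined)" suffix
mdevctl list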

Host driver

It's installed using a .deb file (Host Drivers) downloaded from the NVIDIA Licensing Portal (NLP) - Software Download, with the command sudo apt install ./nvidia-vgpu-ubuntu-550_550.90.05_amd64.deb.
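
The host-side driver pieces can be sanity-checked roughly like this (module names may differ slightly between driver branches):

# confirm the vGPU host driver package is installed
dpkg -l | grep nvidia-vgpu
# confirm the vGPU host kernel modules are loaded
lsmod | grep -e nvidia_vgpu_vfio -e nvidia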

nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.05              Driver Version: 550.90.05      CUDA Version: N/A      |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       On  |   00000000:AF:00.0 Off |                  Off |
| N/A   53C    P8             18W /   70W |      97MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla T4                       On  |   00000000:D8:00.0 Off |                  Off |
| N/A   54C    P8             17W /   70W |      97MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
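
Since this is the vGPU host driver, nvidia-smi vgpu output might also help; at this stage it is expected to list no active vGPU instances (a sketch):

# list active vGPU instances known to the host driver
nvidia-smi vgpu
# detailed per-vGPU query
nvidia-smi vgpu -q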
tunahanertekin changed the title from "vGPU pods fail after the installation" to "vGPU pods stuck/fail after the installation" on Sep 27, 2024