vGPU pods stuck/fail after the installation #1018

Open
tunahanertekin opened this issue Sep 27, 2024 · 0 comments
tunahanertekin commented Sep 27, 2024

Hi,

I'm trying to use the GPU Operator with vGPU support on k3s, following this article. After I install the operator, the vGPU pods get stuck in the Init state, and the vGPU manager pod then goes into CrashLoopBackOff. I haven't been able to find the root cause or a similar issue in the forum/issues yet. I can provide outputs from the host if requested. Any kind of help is appreciated.

  • The server is vGPU certified. (Supermicro 1029U-TR4 with two NVIDIA T4 GPUs)
  • SR-IOV is enabled. (BIOS)
  • VT-d is enabled. (BIOS)
  • intel_iommu is enabled. (/etc/default/grub; see the verification sketch right after this list)
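
These host-side settings can be double-checked roughly as follows; this is a sketch, and the exact dmesg wording varies by kernel.

# confirm the IOMMU is actually active (message wording differs across kernels)
dmesg | grep -i -e DMAR -e IOMMU
# confirm the kernel command line carries the intel_iommu flag
grep intel_iommu /proc/cmdline
# confirm both T4 GPUs are visible on the PCI bus
lspci -nn | grep -i nvidia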

Pod status in the gpu-operator namespace:

NAMESPACE      NAME                                                              READY   STATUS      RESTARTS         AGE
gpu-operator   gpu-operator-1727453251-node-feature-discovery-gc-854cf464lp2ck   1/1     Running     0                53m
gpu-operator   gpu-operator-1727453251-node-feature-discovery-master-8656xbdjm   1/1     Running     0                53m
gpu-operator   gpu-operator-1727453251-node-feature-discovery-worker-25mqp       1/1     Running     0                53m
gpu-operator   gpu-operator-84c6b4697b-hlshg                                     1/1     Running     0                53m
gpu-operator   nvidia-sandbox-device-plugin-daemonset-8vp72                      1/1     Running     0                50m
gpu-operator   nvidia-sandbox-validator-zw6w7                                    1/1     Running     0                53m
gpu-operator   nvidia-vgpu-device-manager-zddgb                                  0/1     Init:0/1    0                40m
gpu-operator   nvidia-vgpu-manager-daemonset-4cm5s                               0/1     Init:0/1    12 (5m42s ago)   49m

When I check the allocatable resources on the node, I can see the vGPU device I'm trying to use, as below.

allocatable:
  cpu: "80"
  ephemeral-storage: "4411267110320"
  hugepages-1Gi: "0"
  hugepages-2Mi: 2Gi
  memory: 261654520Ki
  nvidia.com/GRID_T4-2Q: "1"
  pods: "110"
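
For reference, that list can be pulled straight from the node status; <node-name> is a placeholder for the actual node name.

# print only the Allocatable block of the node status
kubectl describe node <node-name> | grep -A 10 "Allocatable:"
# or dump it as JSON
kubectl get node <node-name> -o jsonpath='{.status.allocatable}'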

Here is my installation command. I disabled the driver and the toolkit because they are already installed on the host.

helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --set sandboxWorkloads.enabled=true \
  --set driver.enabled=false \
  --set toolkit.enabled=false \
  --set vgpuManager.enabled=true \
  --set vgpuManager.repository=${PRIVATE_REGISTRY} \
  --set vgpuManager.image=vgpu-manager \
  --set vgpuManager.version=550.90.05 \
  --set vgpuManager.imagePullSecrets={${REGISTRY_SECRET_NAME}}
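
To rule out a values problem, the settings the release actually went in with can be checked with something like the following; <release-name> is a placeholder for the generated release name, and the ClusterPolicy resource is whatever the chart created.

# show the user-supplied values of the installed release
helm get values -n gpu-operator <release-name>
# the operator renders these values into a ClusterPolicy custom resource
kubectl get clusterpolicy -o yaml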

Logs of the crashing/stuck pods

kubectl logs -f nvidia-vgpu-manager-daemonset-4cm5s -n gpu-operator -c k8s-driver-manager
NVIDIA GPU driver is already pre-installed on the node, disabling the containerized driver on the node
node/robolaunch-internal labeled

kubectl logs -f nvidia-vgpu-device-manager-zddgb -n gpu-operator -c vgpu-manager-validation
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
... (the same message keeps repeating)

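To dig further into the restart loop, something like the following should show which init container is failing and its previous crash logs; <container-name> is a placeholder that has to be read from the describe output.

# show init containers, restart counts and events for the stuck pod
kubectl describe pod nvidia-vgpu-manager-daemonset-4cm5s -n gpu-operator
# fetch the logs of the previous (crashed) attempt of that container
kubectl logs nvidia-vgpu-manager-daemonset-4cm5s -n gpu-operator -c <container-name> --previous
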
Verification of vGPU creation

mdevctl list
16d7dda2-f888-4c28-9c3c-2352daa88a8c 0000:af:00.0 nvidia-231 (defined)
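
The device is reported as (defined) rather than active; if needed, a defined mdev can usually be started explicitly (a sketch, using the UUID from the listing above):

# activate the defined mediated device by UUID
mdevctl start -u 16d7dda2-f888-4c28-9c3c-2352daa88a8c
# list again; an active device is shown without the "(defined)" suffix
mdevctl list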

Host driver

It's installed using a .deb file (Host Drivers) downloaded from the NVIDIA Licensing Portal (NLP) - Software Download, with the command sudo apt install ./nvidia-vgpu-ubuntu-550_550.90.05_amd64.deb.
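
The host-side driver pieces can be sanity-checked roughly like this (module names may differ slightly between driver branches):

# confirm the vGPU host driver package is installed
dpkg -l | grep nvidia-vgpu
# confirm the vGPU host kernel modules are loaded
lsmod | grep -e nvidia_vgpu_vfio -e nvidia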

nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.05              Driver Version: 550.90.05      CUDA Version: N/A      |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       On  |   00000000:AF:00.0 Off |                  Off |
| N/A   53C    P8             18W /   70W |      97MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla T4                       On  |   00000000:D8:00.0 Off |                  Off |
| N/A   54C    P8             17W /   70W |      97MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
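
Since this is the vGPU host driver, nvidia-smi vgpu output might also help; at this stage it is expected to list no active vGPU instances (a sketch):

# list active vGPU instances known to the host driver
nvidia-smi vgpu
# detailed per-vGPU query
nvidia-smi vgpu -q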
tunahanertekin changed the title from "vGPU pods fail after the installation" to "vGPU pods stuck/fail after the installation" on Sep 27, 2024