Hi,
I'm trying to use the GPU Operator with vGPU support on k3s, following this article. After I install the operator, the vGPU pods get stuck in the Init state, and the vGPU manager pod then goes into CrashLoopBackOff. I couldn't find the root cause or a similar issue in the forum/issues yet. I can provide outputs from the host if requested. Any kind of help is appreciated.
The server is vGPU certified (Supermicro 1029U-TR4 with two NVIDIA T4 GPUs), and intel_iommu is enabled in /etc/default/grub. When I check the allocatable resources on the node, I can see the vGPU device that I'm trying to use, and I verified that the vGPU device was created on the host. I installed the operator with the driver and the toolkit disabled, because both are already available on the host.
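The install command looked roughly like this (a sketch of the general form; the exact vGPU-related values follow the article, so treat everything beyond the driver/toolkit flags as placeholders):

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
# driver and toolkit disabled because both are already present on the host;
# the vGPU-related values (e.g. vGPU Manager image settings) follow the article
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set driver.enabled=false \
  --set toolkit.enabled=false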
Logs of the crashing/stuck pods:
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
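These lines come from the init container of the stuck pods; I'm collecting them roughly like this (namespace and pod/container names below are placeholders, mine may differ):

kubectl get pods -n gpu-operator
kubectl describe pod <stuck-pod> -n gpu-operator
kubectl logs <stuck-pod> -n gpu-operator -c <init-container>
kubectl logs <vgpu-manager-pod> -n gpu-operator --previous   # last log of the crashed container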
Host driver:
It's installed using the .deb file (Host Drivers) downloaded from NLP - Software Download, with the command sudo apt install ./nvidia-vgpu-ubuntu-550_550.90.05_amd64.deb.
nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.05 Driver Version: 550.90.05 CUDA Version: N/A |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla T4 On | 00000000:AF:00.0 Off | Off |
| N/A 53C P8 18W / 70W | 97MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 Tesla T4 On | 00000000:D8:00.0 Off | Off |
| N/A 54C P8 17W / 70W | 97MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
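CUDA Version shows N/A, which I understand is expected for the vGPU host driver. If more host-side output would help, I can also share checks along these lines (a sketch; the service names are the ones shipped with the standard vGPU host driver package):

dpkg -l | grep nvidia-vgpu                      # installed host package and version
systemctl status nvidia-vgpud nvidia-vgpu-mgr   # vGPU host services
nvidia-smi vgpu                                 # vGPU-specific view of the GPUs
ls /sys/class/mdev_bus/                         # mdev-capable devices on the host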
tunahanertekin changed the title from "vGPU pods fail after the installation" to "vGPU pods stuck/fail after the installation" on Sep 27, 2024.