This repository was archived by the owner on Jan 22, 2024. It is now read-only.

Nvidia runtime fails to run any container #1252

Closed
8 tasks done
sdmitriev1 opened this issue Apr 13, 2020 · 5 comments

@sdmitriev1

1. Issue or feature description

Running any container with the nvidia runtime fails with the following error:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request\\\\n\\\"\"": unknown.
ERRO[0000] error waiting for container: context canceled 

2. Steps to reproduce the issue

docker run --runtime=nvidia nvidia/cuda:10.0-base nvidia-smi

3. Information to attach (optional if deemed irrelevant)

  • Some nvidia-container information: nvidia-container-cli -k -d /dev/tty info
I0413 02:22:54.143361 17253 nvc.c:281] initializing library context (version=1.0.7, build=b71f87c04b8eca8a16bf60995506c35c937347d9)
I0413 02:22:54.143387 17253 nvc.c:255] using root /
I0413 02:22:54.143390 17253 nvc.c:256] using ldcache /etc/ld.so.cache
I0413 02:22:54.143393 17253 nvc.c:257] using unprivileged user 1000:1000
W0413 02:22:54.145007 17255 nvc.c:186] failed to set inheritable capabilities
W0413 02:22:54.145036 17255 nvc.c:187] skipping kernel modules load due to failure
I0413 02:22:54.145345 17256 driver.c:133] starting driver service
I0413 02:22:54.749964 17253 nvc_info.c:438] requesting driver information with ''
I0413 02:22:54.750352 17253 nvc_info.c:152] selecting /var/lib/nvidia/lib64/vdpau/libvdpau_nvidia.so.440.64
I0413 02:22:54.750427 17253 nvc_info.c:152] selecting /var/lib/nvidia/lib64/libnvoptix.so.440.64
I0413 02:22:54.750499 17253 nvc_info.c:152] selecting /var/lib/nvidia/lib64/libnvidia-tls.so.440.64
I0413 02:22:54.750540 17253 nvc_info.c:152] selecting /var/lib/nvidia/lib64/libnvidia-rtcore.so.440.64
I0413 02:22:54.750583 17253 nvc_info.c:152] selecting /var/lib/nvidia/lib64/libnvidia-ptxjitcompiler.so.440.64
I0413 02:22:54.750670 17253 nvc_info.c:152] selecting /var/lib/nvidia/lib64/libnvidia-opticalflow.so.440.64
I0413 02:22:54.750739 17253 nvc_info.c:152] selecting /var/lib/nvidia/lib64/libnvidia-opencl.so.440.64
I0413 02:22:54.750769 17253 nvc_info.c:152] selecting /var/lib/nvidia/lib64/libnvidia-ml.so.440.64
I0413 02:22:54.750815 17253 nvc_info.c:152] selecting /var/lib/nvidia/lib64/libnvidia-ifr.so.440.64
I0413 02:22:54.750900 17253 nvc_info.c:152] selecting /var/lib/nvidia/lib64/libnvidia-glvkspirv.so.440.64
I0413 02:22:54.750969 17253 nvc_info.c:152] selecting /var/lib/nvidia/lib64/libnvidia-glsi.so.440.64
I0413 02:22:54.751013 17253 nvc_info.c:152] selecting /var/lib/nvidia/lib64/libnvidia-glcore.so.440.64
I0413 02:22:54.751047 17253 nvc_info.c:152] selecting /var/lib/nvidia/lib64/libnvidia-fbc.so.440.64
I0413 02:22:54.751132 17253 nvc_info.c:152] selecting /var/lib/nvidia/lib64/libnvidia-fatbinaryloader.so.440.64
I0413 02:22:54.751199 17253 nvc_info.c:152] selecting /var/lib/nvidia/lib64/libnvidia-encode.so.440.64
I0413 02:22:54.751263 17253 nvc_info.c:152] selecting /var/lib/nvidia/lib64/libnvidia-eglcore.so.440.64
I0413 02:22:54.751319 17253 nvc_info.c:152] selecting /var/lib/nvidia/lib64/libnvidia-compiler.so.440.64
I0413 02:22:54.751373 17253 nvc_info.c:152] selecting /var/lib/nvidia/lib64/libnvidia-cfg.so.440.64
I0413 02:22:54.751447 17253 nvc_info.c:152] selecting /var/lib/nvidia/lib64/libnvidia-cbl.so.440.64
I0413 02:22:54.751485 17253 nvc_info.c:152] selecting /var/lib/nvidia/lib64/libnvcuvid.so.440.64
I0413 02:22:54.751561 17253 nvc_info.c:152] selecting /var/lib/nvidia/lib64/libcuda.so.440.64
I0413 02:22:54.751615 17253 nvc_info.c:152] selecting /var/lib/nvidia/lib64/libGLX_nvidia.so.440.64
I0413 02:22:54.751656 17253 nvc_info.c:152] selecting /var/lib/nvidia/lib64/libGLESv2_nvidia.so.440.64
I0413 02:22:54.751701 17253 nvc_info.c:152] selecting /var/lib/nvidia/lib64/libGLESv1_CM_nvidia.so.440.64
I0413 02:22:54.751739 17253 nvc_info.c:152] selecting /var/lib/nvidia/lib64/libEGL_nvidia.so.440.64
W0413 02:22:54.751762 17253 nvc_info.c:307] missing compat32 library libnvidia-ml.so
W0413 02:22:54.751773 17253 nvc_info.c:307] missing compat32 library libnvidia-cfg.so
W0413 02:22:54.751784 17253 nvc_info.c:307] missing compat32 library libcuda.so
W0413 02:22:54.751790 17253 nvc_info.c:307] missing compat32 library libnvidia-opencl.so
W0413 02:22:54.751796 17253 nvc_info.c:307] missing compat32 library libnvidia-ptxjitcompiler.so
W0413 02:22:54.751803 17253 nvc_info.c:307] missing compat32 library libnvidia-fatbinaryloader.so
W0413 02:22:54.751809 17253 nvc_info.c:307] missing compat32 library libnvidia-compiler.so
W0413 02:22:54.751815 17253 nvc_info.c:307] missing compat32 library libvdpau_nvidia.so
W0413 02:22:54.751822 17253 nvc_info.c:307] missing compat32 library libnvidia-encode.so
W0413 02:22:54.751828 17253 nvc_info.c:307] missing compat32 library libnvidia-opticalflow.so
W0413 02:22:54.751834 17253 nvc_info.c:307] missing compat32 library libnvcuvid.so
W0413 02:22:54.751840 17253 nvc_info.c:307] missing compat32 library libnvidia-eglcore.so
W0413 02:22:54.751861 17253 nvc_info.c:307] missing compat32 library libnvidia-glcore.so
W0413 02:22:54.751867 17253 nvc_info.c:307] missing compat32 library libnvidia-tls.so
W0413 02:22:54.751874 17253 nvc_info.c:307] missing compat32 library libnvidia-glsi.so
W0413 02:22:54.751880 17253 nvc_info.c:307] missing compat32 library libnvidia-fbc.so
W0413 02:22:54.751886 17253 nvc_info.c:307] missing compat32 library libnvidia-ifr.so
W0413 02:22:54.751892 17253 nvc_info.c:307] missing compat32 library libnvidia-rtcore.so
W0413 02:22:54.751898 17253 nvc_info.c:307] missing compat32 library libnvoptix.so
W0413 02:22:54.751905 17253 nvc_info.c:307] missing compat32 library libGLX_nvidia.so
W0413 02:22:54.751911 17253 nvc_info.c:307] missing compat32 library libEGL_nvidia.so
W0413 02:22:54.751917 17253 nvc_info.c:307] missing compat32 library libGLESv2_nvidia.so
W0413 02:22:54.751923 17253 nvc_info.c:307] missing compat32 library libGLESv1_CM_nvidia.so
W0413 02:22:54.751929 17253 nvc_info.c:307] missing compat32 library libnvidia-glvkspirv.so
W0413 02:22:54.751935 17253 nvc_info.c:307] missing compat32 library libnvidia-cbl.so
W0413 02:22:54.752436 17253 nvc_info.c:329] missing binary nvidia-smi
W0413 02:22:54.752455 17253 nvc_info.c:329] missing binary nvidia-debugdump
W0413 02:22:54.752461 17253 nvc_info.c:329] missing binary nvidia-persistenced
W0413 02:22:54.752467 17253 nvc_info.c:329] missing binary nvidia-cuda-mps-control
W0413 02:22:54.752473 17253 nvc_info.c:329] missing binary nvidia-cuda-mps-server
I0413 02:22:54.752492 17253 nvc_info.c:370] listing device /dev/nvidiactl
I0413 02:22:54.752499 17253 nvc_info.c:370] listing device /dev/nvidia-uvm
I0413 02:22:54.752505 17253 nvc_info.c:370] listing device /dev/nvidia-uvm-tools
I0413 02:22:54.752511 17253 nvc_info.c:370] listing device /dev/nvidia-modeset
W0413 02:22:54.752588 17253 nvc_info.c:278] missing ipc /var/run/nvidia-persistenced/socket
W0413 02:22:54.752664 17253 nvc_info.c:278] missing ipc /tmp/nvidia-mps
I0413 02:22:54.752682 17253 nvc_info.c:494] requesting device information with ''
I0413 02:22:54.758934 17253 nvc_info.c:524] listing device /dev/nvidia0 (GPU-5d7d29f8-7d6a-beba-8515-b60967679bf6 at 00000000:00:05.0)
NVRM version:   440.64
CUDA version:   10.2

Device Index:   0
Device Minor:   0
Model:          GeForce RTX 2080 Ti
Brand:          GeForce
GPU UUID:       GPU-5d7d29f8-7d6a-beba-8515-b60967679bf6
Bus Location:   00000000:00:05.0
Architecture:   7.5
I0413 02:22:54.759027 17253 nvc.c:318] shutting down library context
I0413 02:22:54.759224 17256 driver.c:192] terminating driver service
I0413 02:22:54.831967 17253 driver.c:233] driver service terminated successfully

  • Kernel version from uname -a
    Linux host-192-168-3-10 5.5.10-200.fc31.x86_64 #1 SMP Wed Mar 18 14:21:38 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Any relevant kernel output lines from dmesg
    [ 2360.831112] nvc:[driver][14579]: segfault at 38 ip 00007ff1a9fe6082 sp 00007fff7e2df3d0 error 4 in libc-2.30.so[7ff1a9eda000+14f000]
  • Driver information from nvidia-smi -a
==============NVSMI LOG==============

Timestamp                           : Mon Apr 13 02:29:33 2020
Driver Version                      : 440.64
CUDA Version                        : 10.2

Attached GPUs                       : 1
GPU 00000000:00:05.0
    Product Name                    : GeForce RTX 2080 Ti
    Product Brand                   : GeForce
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Disabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 4000
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : N/A
    GPU UUID                        : GPU-5d7d29f8-7d6a-beba-8515-b60967679bf6
    Minor Number                    : 0
    VBIOS Version                   : 90.02.17.00.57
    MultiGPU Board                  : No
    Board ID                        : 0x5
    GPU Part Number                 : N/A
    Inforom Version
        Image Version               : G001.0000.02.04
        OEM Object                  : 1.1
        ECC Object                  : N/A
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization Mode         : None
        Host VGPU Mode              : N/A
    IBMNPU
        Relaxed Ordering Mode       : N/A
    PCI
        Bus                         : 0x00
        Device                      : 0x05
        Domain                      : 0x0000
        Device Id                   : 0x1E0710DE
        Bus Id                      : 00000000:00:05.0
        Sub System Id               : 0x150319DA
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 3
            Link Width
                Max                 : 16x
                Current             : 4x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays Since Reset         : 0
        Replay Number Rollovers     : 0
        Tx Throughput               : 1000 KB/s
        Rx Throughput               : 0 KB/s
    Fan Speed                       : 35 %
    Performance State               : P0
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
            HW Thermal Slowdown     : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost                  : Not Active
        SW Thermal Slowdown         : Not Active
        Display Clock Setting       : Not Active
    FB Memory Usage
        Total                       : 11019 MiB
        Used                        : 0 MiB
        Free                        : 11019 MiB
    BAR1 Memory Usage
        Total                       : 256 MiB
        Used                        : 2 MiB
        Free                        : 254 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Encoder Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    FBC Stats
        Active Sessions             : 0
        Average FPS                 : 0
        Average Latency             : 0
    Ecc Mode
        Current                     : N/A
        Pending                     : N/A
    ECC Errors
        Volatile
            SRAM Correctable        : N/A
            SRAM Uncorrectable      : N/A
            DRAM Correctable        : N/A
            DRAM Uncorrectable      : N/A
        Aggregate
            SRAM Correctable        : N/A
            SRAM Uncorrectable      : N/A
            DRAM Correctable        : N/A
            DRAM Uncorrectable      : N/A
    Retired Pages
        Single Bit ECC              : N/A
        Double Bit ECC              : N/A
        Pending Page Blacklist      : N/A
    Temperature
        GPU Current Temp            : 59 C
        GPU Shutdown Temp           : 94 C
        GPU Slowdown Temp           : 91 C
        GPU Max Operating Temp      : 89 C
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A
    Power Readings
        Power Management            : Supported
        Power Draw                  : 63.14 W
        Power Limit                 : 260.00 W
        Default Power Limit         : 260.00 W
        Enforced Power Limit        : 260.00 W
        Min Power Limit             : 100.00 W
        Max Power Limit             : 300.00 W
    Clocks
        Graphics                    : 1350 MHz
        SM                          : 1350 MHz
        Memory                      : 7000 MHz
        Video                       : 1245 MHz
    Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Default Applications Clocks
        Graphics                    : N/A
        Memory                      : N/A
    Max Clocks
        Graphics                    : 2205 MHz
        SM                          : 2205 MHz
        Memory                      : 7000 MHz
        Video                       : 1950 MHz
    Max Customer Boost Clocks
        Graphics                    : N/A
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A
    Processes                       : None

  • Docker version from docker version
Client:
 Version:           18.09.8
 API version:       1.39
 Go version:        go1.13beta1
 Git commit:        0dd43dd
 Built:             Fri Jul 26 03:04:01 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.09.8
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.13beta1
  Git commit:       0dd43dd
  Built:            Thu Jul 25 00:00:00 2019
  OS/Arch:          linux/amd64
  Experimental:     false
  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
libnvidia-container-tools-1.0.7-1.x86_64
nvidia-docker-1.0.1-1.x86_64
libnvidia-container1-1.0.7-1.x86_64
nvidia-container-toolkit-1.0.5-2.x86_64
nvidia-container-runtime-3.1.4-1.x86_64
  • NVIDIA container library version from nvidia-container-cli -V
version: 1.0.7
build date: 2020-01-21T19:05+0000
build revision: b71f87c04b8eca8a16bf60995506c35c937347d9
build compiler: gcc 4.8.5 20150623 (Red Hat 4.8.5-39)
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

-- WARNING, the following logs are for debugging purposes only --

I0413 02:06:37.102912 14574 nvc.c:281] initializing library context (version=1.0.7, build=b71f87c04b8eca8a16bf60995506c35c937347d9)
I0413 02:06:37.102933 14574 nvc.c:255] using root /
I0413 02:06:37.102937 14574 nvc.c:256] using ldcache /etc/ld.so.cache
I0413 02:06:37.102940 14574 nvc.c:257] using unprivileged user 65534:65534
I0413 02:06:37.104046 14578 nvc.c:191] loading kernel module nvidia
I0413 02:06:37.104161 14578 nvc.c:203] loading kernel module nvidia_uvm
I0413 02:06:37.104287 14578 nvc.c:211] loading kernel module nvidia_modeset
I0413 02:06:37.104470 14579 driver.c:133] starting driver service
I0413 02:06:37.105519 14579 driver.c:192] terminating driver service
I0413 02:06:37.110923 14574 driver.c:233] driver service terminated with signal 11
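
The debug log above ends with the driver service being terminated by signal 11, which matches the segfault reported in dmesg. When triaging longer logs of this kind, a small filter can pull that line out. The following is only a sketch: the line layout (severity letter, MMDD date, time, pid, source file and line, message) is inferred from the logs in this issue rather than taken from any libnvidia-container documentation, and the names `LogEntry`, `parse_line` and `find_driver_crash` are hypothetical.

```python
import re
from typing import Iterable, NamedTuple, Optional

# Assumed layout, inferred from the logs pasted in this issue:
#   <severity letter><MMDD> <HH:MM:SS.micro> <pid> <file>:<line>] <message>
LOG_LINE = re.compile(
    r"^(?P<severity>[IWEF])(?P<date>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<pid>\d+) "
    r"(?P<source>[\w.]+:\d+)\] "
    r"(?P<message>.*)$"
)


class LogEntry(NamedTuple):
    severity: str  # "I", "W", "E" or "F"
    pid: int
    source: str    # e.g. "driver.c:233"
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Parse one log line; return None for lines that do not match the layout."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        return None
    return LogEntry(m["severity"], int(m["pid"]), m["source"], m["message"])


def find_driver_crash(lines: Iterable[str]) -> Optional[LogEntry]:
    """Return the first entry saying the driver service was killed by a signal."""
    for line in lines:
        entry = parse_line(line)
        if entry is not None and "terminated with signal" in entry.message:
            return entry
    return None


# Sample lines copied from the debug log in this issue.
sample = [
    "I0413 02:06:37.104470 14579 driver.c:133] starting driver service",
    "I0413 02:06:37.105519 14579 driver.c:192] terminating driver service",
    "I0413 02:06:37.110923 14574 driver.c:233] driver service terminated with signal 11",
]
crash = find_driver_crash(sample)
if crash is not None:
    print(crash.source, "->", crash.message)
```

Lines that are not log entries, such as the "NVRM version" summary printed by the info command, are skipped because `parse_line` returns None for them.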

@josiahlaivins

nvidia-docker2 or nvidia-docker? I have the same issue. I think that --runtime=nvidia is deprecated. I guess you're supposed to use the device option instead, like --gpus all.

@sdmitriev1
Author

I believe "--gpus all" is only supported starting from Docker 19.03, but here we have 18.09.8.
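
To make the version constraint concrete, here is a minimal sketch that picks the GPU-related arguments for `docker run` from a Docker version string. It assumes, as stated in this thread, that `--gpus` is available from 19.03 onward, and that on older versions the nvidia runtime is already registered with the daemon. The names `parse_docker_version` and `gpu_args` are hypothetical helpers, not part of any Docker tooling.

```python
import re
from typing import List, Tuple

# Assumption taken from this thread: --gpus is available from Docker 19.03.
GPUS_FLAG_MIN_VERSION = (19, 3)


def parse_docker_version(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '18.09.8'."""
    m = re.match(r"(\d+)\.(\d+)", version.strip())
    if m is None:
        raise ValueError(f"unrecognised Docker version: {version!r}")
    return int(m.group(1)), int(m.group(2))


def gpu_args(version: str) -> List[str]:
    """Return the GPU arguments to pass to `docker run` for this version."""
    if parse_docker_version(version) >= GPUS_FLAG_MIN_VERSION:
        return ["--gpus", "all"]
    # Older engines: fall back to the legacy runtime selection.
    return ["--runtime=nvidia"]


# The version reported in this issue falls back to the legacy flag.
print(gpu_args("18.09.8"))
print(gpu_args("19.03.0"))
```

With the 18.09.8 engine from this report the helper returns the `--runtime=nvidia` form, which is the path that currently fails here.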

@ardenpm

ardenpm commented May 19, 2020

I am having this issue with nvidia-docker2, which, while deprecated, is stated by the k8s-device-plugin as required since K8s doesn't yet recognise the --gpus option for Docker. If I use the nvidia-container-toolkit and the --gpus option, everything works perfectly. However, if I use nvidia-docker2 I get the issue reported here, which means I can't currently get it running with K8s.

@klueska
Contributor

klueska commented May 22, 2020

This doesn't address your issue directly, but hopefully the following link helps clear up the confusion around nvidia-docker2, nvidia-container-toolkit and the deprecation announcement.

#1268 (comment)

@ardenpm

ardenpm commented May 24, 2020

Yes, that makes it very clear. Unfortunately, in the end I had to switch from CentOS to Ubuntu to get things running, because of the issue I mentioned in this comment.

Hopefully K8s directly supports this soon and things can be unified.

@elezar closed this as completed Oct 30, 2023