This repository has been archived by the owner on Oct 24, 2023. It is now read-only.

Seeing nvidia installation errors in 5.3.0-1020-azure-backed GPU nodes #3269

Closed
jackfrancis opened this issue May 15, 2020 · 9 comments · Fixed by #3366
Labels
bug Something isn't working gpu GPU-related issues and fixes
Milestone

Comments

@jackfrancis
Member

Example:

Error from deployment for kubernetes-westeurope-4290 in resource group kubernetes-westeurope-4290:exit status 1
 2020/05/15 15:26:32 Command Output: Deployment failed. Correlation ID: 7ce64e49-5230-4ea1-85ff-6243e9899bc9. {
   "status": "Failed",
   "error": {
     "code": "ResourceDeploymentFailure",
     "message": "The resource operation completed with terminal provisioning state 'Failed'.",
     "details": [
       {
         "code": "VMExtensionProvisioningError",
         "message": "VM has reported a failure when processing extension 'vmssCSE'. Error message: \"Enable failed: failed to execute command: command terminated with exit status=84\n[stdout]\nFri May 15 15:19:48 UTC 2020,k8s-poolgpu-10230551-vmss000000\n\n[stderr]\n\"\r\n\r\nMore information on troubleshooting is available at https://aka.ms/VMExtensionCSELinuxTroubleshoot "
       }
     ]
   }
 }
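When triaging deployment errors like the one above, it can help to pull the extension's exit status out of the nested error payload instead of reading the escaped message by eye. The following is a minimal sketch, assuming the payload has already been isolated as JSON; the function name `cse_exit_statuses` and the trimmed `sample` payload are illustrative and not part of aks-engine.

```python
import json
import re


def cse_exit_statuses(deployment_error):
    """Return the exit statuses reported in VMExtensionProvisioningError details."""
    statuses = []
    for detail in deployment_error.get("error", {}).get("details", []):
        if detail.get("code") != "VMExtensionProvisioningError":
            continue
        # The exit status is embedded in free text, e.g. "... exit status=84".
        match = re.search(r"exit status=(\d+)", detail.get("message", ""))
        if match:
            statuses.append(int(match.group(1)))
    return statuses


# Trimmed copy of the payload from this issue (assumption: isolated from the log line).
sample = json.loads(r'''
{
  "status": "Failed",
  "error": {
    "code": "ResourceDeploymentFailure",
    "details": [
      {
        "code": "VMExtensionProvisioningError",
        "message": "VM has reported a failure when processing extension 'vmssCSE'. Error message: \"Enable failed: failed to execute command: command terminated with exit status=84\n[stdout]\n\""
      }
    ]
  }
}
''')

print(cse_exit_statuses(sample))
```

The sketch only extracts the number; what a given exit status means is defined by the provisioning scripts and is not inferred here.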
@jackfrancis jackfrancis added the bug Something isn't working label May 15, 2020
@jackfrancis jackfrancis added this to the Next milestone May 15, 2020
@jackfrancis
Member Author

ERROR: Failed to run `/usr/sbin/dkms build -m nvidia -v 418.40.04 -k
       5.3.0-1020-azure`: 
       Kernel preparation unnecessary for this kernel.  Skipping...
       
       Building module:
       cleaning build area...
       'make' -j6 NV_EXCLUDE_BUILD_MODULES='nvidia-drm '
       KERNEL_UNAME=5.3.0-1020-azure IGNORE_CC_MISMATCH='1' modules....(bad
       exit status: 2)
       ERROR (dkms apport): binary package for nvidia: 418.40.04 not found
       Error! Bad return status for module build on kernel:
       5.3.0-1020-azure (x86_64)
       Consult /var/lib/dkms/nvidia/418.40.04/build/make.log for more
       information.

@jackfrancis
Member Author

Here's the full log output from the failed install:

cat /var/log/nvidia-installer-1589561704.log 
nvidia-installer log file '/var/log/nvidia-installer-1589561704.log'
creation time: Fri May 15 16:55:20 2020
installer version: 418.40.04

PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin

nvidia-installer command line:
    ./nvidia-installer
    -s
    -k=5.3.0-1020-azure
    --log-file-name=/var/log/nvidia-installer-1589561704.log
    -a
    --no-drm
    --dkms
    --utility-prefix=/usr/local/nvidia
    --opengl-prefix=/usr/local/nvidia

Using built-in stream user interface
-> Detected 6 CPUs online; setting concurrency level to 6.
-> Installing NVIDIA driver version 418.40.04.
WARNING: The nvidia-drm module will not be installed. As a result, DRM-KMS will not function with this installation of the NVIDIA driver.
-> Would you like to register the kernel module sources with DKMS? This will allow DKMS to automatically build a new module, if you install a different kernel later. (Answer: Yes)
WARNING: nvidia-installer was forced to guess the X library path '/usr/lib' and X module path '/usr/lib/xorg/modules'; these paths were not queryable from the system.  If X fails to find the NVIDIA X driver module, please install the `pkg-config` utility and the X.Org SDK/development package for your distribution and reinstall the driver.
-> Install NVIDIA's 32-bit compatibility libraries? (Answer: Yes)
-> Will install GLVND GLX client libraries.
-> Will install GLVND EGL client libraries.
-> Skipping GLX non-GLVND file: "libGL.so.418.40.04"
-> Skipping GLX non-GLVND file: "libGL.so.1"
-> Skipping GLX non-GLVND file: "libGL.so"
-> Skipping EGL non-GLVND file: "libEGL.so.418.40.04"
-> Skipping EGL non-GLVND file: "libEGL.so"
-> Skipping EGL non-GLVND file: "libEGL.so.1"
-> Skipping GLX non-GLVND file: "./32/libGL.so.418.40.04"
-> Skipping GLX non-GLVND file: "libGL.so.1"
-> Skipping GLX non-GLVND file: "libGL.so"
-> Skipping EGL non-GLVND file: "./32/libEGL.so.418.40.04"
-> Skipping EGL non-GLVND file: "libEGL.so"
-> Skipping EGL non-GLVND file: "libEGL.so.1"
Looking for install checker script at ./libglvnd_install_checker/check-libglvnd-install.sh
   executing: '/bin/sh ./libglvnd_install_checker/check-libglvnd-install.sh'...
   Checking for libglvnd installation.
   Checking libGLdispatch...
   Can't load library libGLdispatch.so.0: libGLdispatch.so.0: cannot open shared object file: No such file or directory
Will install libglvnd libraries.
Will install libEGL vendor library config file to /usr/share/glvnd/egl_vendor.d
-> Searching for conflicting files:
-> done.
-> Installing 'NVIDIA Accelerated Graphics Driver for Linux-x86_64' (418.40.04):
   executing: '/sbin/ldconfig'...
   /sbin/ldconfig.real: Cannot lstat /usr/lib/x86_64-linux-gnu/libGLdispatch.so.0: No such file or directory
-> done.
-> Driver file installation is complete.
-> Installing DKMS kernel module:
ERROR: Failed to run `/usr/sbin/dkms build -m nvidia -v 418.40.04 -k 5.3.0-1020-azure`: 
Kernel preparation unnecessary for this kernel.  Skipping...

Building module:
cleaning build area...
'make' -j6 NV_EXCLUDE_BUILD_MODULES='nvidia-drm ' KERNEL_UNAME=5.3.0-1020-azure IGNORE_CC_MISMATCH='1' modules....(bad exit status: 2)
ERROR (dkms apport): binary package for nvidia: 418.40.04 not found
Error! Bad return status for module build on kernel: 5.3.0-1020-azure (x86_64)
Consult /var/lib/dkms/nvidia/418.40.04/build/make.log for more information.
-> error.
ERROR: Failed to install the kernel module through DKMS. No kernel module was installed; please try installing again without DKMS, or check the DKMS logs for more information.
ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer-1589561704.log' for details.  You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.
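The key facts in a log like the one above are the module, the driver version, the target kernel, and the path to the DKMS `make.log`. A small sketch, assuming the log text is available as a string, can extract them so that failures across nodes are easier to compare. The function name `summarize_dkms_failure` is hypothetical, and the regular expressions are based only on the wording seen in this log.

```python
import re


def summarize_dkms_failure(log_text):
    """Extract module, version, kernel and make.log path from an installer log.

    Returns None when no failed `dkms build` line is present.
    """
    build = re.search(
        r"Failed to run `\S*dkms build -m (\S+) -v (\S+) -k\s+([^`\s]+)`",
        log_text,
    )
    if build is None:
        return None
    # \s+ tolerates the hard-wrapped form that the installer prints to the console.
    make_log = re.search(r"Consult (\S+) for more\s+information", log_text)
    return {
        "module": build.group(1),
        "version": build.group(2),
        "kernel": build.group(3),
        "make_log": make_log.group(1) if make_log else None,
    }


# Lines copied from the log in this issue.
excerpt = (
    "ERROR: Failed to run `/usr/sbin/dkms build -m nvidia -v 418.40.04 "
    "-k 5.3.0-1020-azure`: \n"
    "Error! Bad return status for module build on kernel: 5.3.0-1020-azure (x86_64)\n"
    "Consult /var/lib/dkms/nvidia/418.40.04/build/make.log for more information.\n"
)

print(summarize_dkms_failure(excerpt))
```

The actual compiler error lives in the `make.log` file that the summary points to; the installer log alone does not say why the module build failed.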

@jackfrancis
Member Author

@yangl900 @tonyxu-io @palma21 @jluk FYI, the Azure kernel 5.3.0-1020-azure which is delivered with the current 18.04-LTS image doesn't appear to work with our existing nvidia drivers. We should probably discourage customers from running GPU nodes on 18.04-LTS.

@tonyxu-io
Member

> @yangl900 @tonyxu-io @palma21 @jluk FYI, the Azure kernel 5.3.0-1020-azure which is delivered with the current 18.04-LTS image doesn't appear to work with our existing nvidia drivers. We should probably discourage customers from running GPU nodes on 18.04-LTS.

I’m not sure if you tagged the right person.

@jackfrancis
Member Author

Thanks @tonyxu-io :)

cc @xuto2

@jackfrancis
Member Author

Confirmed that the latest 16.04-LTS kernel (4.15.0-1082-azure) accommodates nvidia drivers.
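To compare the two kernels mentioned in this thread programmatically, the release string can be split into its version, ABI number, and flavor. This is a rough sketch, assuming the `X.Y.Z-ABI-flavor` shape seen in `5.3.0-1020-azure` and `4.15.0-1082-azure`; other distributions format `uname -r` differently, and `parse_kernel_release` is a hypothetical helper. It does not establish which kernel versions a given driver supports.

```python
import re


def parse_kernel_release(release):
    """Split a release such as '5.3.0-1020-azure' into (version, abi, flavor)."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)-(\d+)-(\w+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch, abi = (int(group) for group in match.groups()[:4])
    return (major, minor, patch), abi, match.group(5)


failing = parse_kernel_release("5.3.0-1020-azure")   # 18.04-LTS kernel from this issue
working = parse_kernel_release("4.15.0-1082-azure")  # 16.04-LTS kernel from this issue

# Version tuples compare numerically, so 5.3 sorts after 4.15.
print(failing[0] > working[0])
```

Comparing tuples avoids the string-comparison trap where "4.15" would sort before "4.9".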

@mboersma mboersma added the gpu GPU-related issues and fixes label May 21, 2020
@xuto2
Contributor

xuto2 commented May 27, 2020

@jackfrancis
Member Author

@xuto2 I assume we need to solve this first?:

#3307

Do you have a working prototype of how to prepare an N-series SKU for the above Tesla driver installation?

@delulu
Contributor

delulu commented May 28, 2020

According to the gpu-operator description, the GPU operator has been validated with NVIDIA Tesla Driver 440.

Also, according to the NVIDIA driver release notes, the Release 440 driver is supported on Ubuntu 18.04.3 LTS, while the Release 418 driver is supported on Ubuntu 18.04.2 LTS.

Which Ubuntu image is used on the 5.3.0-1020-azure-backed GPU nodes? Could we update the driver version to see if it works?

5 participants