Integrate the NVIDIA container toolkit #52
TLDR: as of the latest Nvidia Container Toolkit (
What's the issue?
With the latest versions I can run both Wolf and the Gstreamer pipeline just by running the container with
The last one seems to just be a symlink to
Now, this is running from an X11 host, and I can see that those additional libraries aren't present in my host system:
Can anyone confirm the output of
on an Nvidia Wayland host?
What can we do better?
I think we should keep manually downloading and linking the drivers like we are doing at the moment. We should probably add a proper check for a mismatch between the downloaded drivers and the host-installed drivers, either on startup of the containers (somewhere in the |
@ABeltramo One comment I have here is that there isn't really a concept called an "Xorg host" and a "Wayland host". It depends on what the desktop environment and login manager uses, and by default, all drivers bundle both libraries. We will discuss more in NVIDIA/libnvidia-container#118. |
Could you provide a link to that/those Dockerfiles, maybe? |
Oh sorry, I actually know what they're talking about: it's adamrehn/ue4-runtime. I think I wrote an answer and forgot to send it ^^ |
If those are the images that @ohayak was talking about, unfortunately, there's nothing there that can help us.
I'm very open to suggestions or alternative solutions! |
Our experience was related to the above kernel module. |
The core issue is whether the EGL Wayland library is installed or not, likely not the container toolkit. This is available if you use the
I don't think the Debian/Ubuntu PPA repositories install the Wayland components automatically.
But, solution: install the |
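As a hedged pointer only: on Debian/Ubuntu the EGL Wayland external platform library is typically packaged as libnvidia-egl-wayland1 (an assumption; the exact package name may differ per distro and driver branch):
# hypothetical package name, verify against your distro's repositories
sudo apt-get install libnvidia-egl-wayland1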
Hi, I am currently trying to get this flying with CRI-O and the nvidia-ctk by using the runtime and CDI config. AFAIR this can be used in Docker as well. Currently I'm facing the vblank resource unavailable issue. (Driver version 550.something, cannot look it up atm) |
Here are some short steps for what I've done so far (Nvidia Driver version
nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
nvidia-ctk runtime configure --runtime=crio
For Docker I guess it's enough to use
I used the nvidia runtime when starting the container(s) and set the following env vars:
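For illustration only, a minimal sketch of the Docker-side equivalent (assuming Docker 25+ with the CDI feature enabled in the daemon config; the image and the variables shown are assumptions, not necessarily what was used above):
# assumes /etc/cdi/nvidia.yaml was generated as above and Docker's CDI support is turned on
docker run --rm \
  --device nvidia.com/gpu=all \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  ubuntu:22.04 nvidia-smi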
As documented, this enables the CDI integration which will mount the host libs and binaries. What is working so far:
What is not working:
|
It seems that |
Edit: Is it possible that
Edit2: yes, libnvidia-vulkan-producer was removed recently: https://www.nvidia.com/Download/driverResults.aspx/214102/en-us/
|
Those libraries are for the EGLStreams backend. I believe compositors have now stopped supporting them. |
Maybe. The latest Nvidia driver comes with a GBM backend. Maybe that's something useful. I'll give
From my approach yesterday evening, glxgears is running without any error messages so far. But the output in Moonlight stays black. |
I think NVIDIA Container Toolkit 1.15.0 (released not long ago) fixes most of the problems. Please check it out. I am trying to fix the remaining issues with NVIDIA/nvidia-container-toolkit#490. Please give feedback. I've written about the situation in more detail in: NVIDIA/nvidia-container-toolkit#490 (comment)
Within the scope of Wolf, the |
I've finally managed to run Steam and Cyberpunk with Wolf using the nvidia container toolkit instead of the drivers image.
mkdir -p /usr/lib/x86_64-linux-gnu/gbm;
ln -sv ../libnvidia-allocator.so.1 /usr/lib/x86_64-linux-gnu/gbm/nvidia-drm_gbm.so;
After that, launching Steam and the game was successful. This all works without the use of |
That sounds really good, thanks for reporting back! |
If someone has knowledge of Go, could they contribute fixes for the unsolved aspects of the PR NVIDIA/nvidia-container-toolkit#490? I will give write access to https://github.com/ehfd/nvidia-container-toolkit/tree/main if they ask, in order to keep it in one PR. |
I can take a look at it later |
@ehfd Thanks for trusting me with access to your fork. I have addressed the pending issues on the code side and requested some feedback. |
No sweat. You know Go better than me, and it seems like you did a great job at it. |
Congrats on getting this merged! This is going to substantially simplify getting the drivers up and running |
I've tried upgrading Gstreamer to
It looks like |
@ABeltramo This should not be an issue as long as the NVRTC CUDA version is kept at around 11.3; yes, 11.0 will not work. |
Thanks for the very quick reply! What's the compatibility matrix for NVRTC? Would upgrading to 11.3 still work for older Cuda installations? |
Mostly the internal ABI of NVIDIA. You can probably fix the issue yourself in GStreamer and then backport it to GStreamer 1.24.6 with simple error handling in the C code. An #ifdef can probably work. |
Thanks, I've got enough on my plate already.. This is definitely lower priority compared to the rest, looks like we are going to stay on 1.22.7 for a bit longer.. |
The pull request was technically merged today because it was reverted. |
https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/7223 This issue has been resolved for 1.24.6. |
Thanks for the heads-up, I'm going to test it out again in the next few days! |
@tux-rampage did you manage to make it work with CDI? I was trying myself (because in NixOS the old nvidia wrapper method is now deprecated) but didn't manage to make Wolf use my GPU. |
I discovered that CDI overwrites |
Copying the libraries might break stuff when there's an update in the Nvidia libraries and you inevitably upgrade the installation. I've never used CDI myself, but normally with the nvidia runtime the amount and type of libraries that get mounted from the host is controlled with the
Must be >= 1.16.0 |
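As a hedged illustration, assuming the variable referred to above is NVIDIA_DRIVER_CAPABILITIES (the same one used in the compose file further down this thread):
# mount only an illustrative subset of the driver libraries
NVIDIA_DRIVER_CAPABILITIES=graphics,video,utility
# or mount everything the toolkit knows about
NVIDIA_DRIVER_CAPABILITIES=all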
@ABeltramo The thing is that with CDI (which is the advised way of using Nvidia with Docker now) Docker mounts everything defined in a json. One of the entries has
wolf/docker/gpu-drivers.Dockerfile (line 36 in 36f407b)
|
This is the docker-compose I'm trying to use, and I get the error of not finding libnvrtc.so:
services:
  wolf:
    image: ghcr.io/games-on-whales/wolf:stable
    environment:
      - XDG_RUNTIME_DIR=/tmp/sockets
      - HOST_APPS_STATE_FOLDER=/etc/wolf
      - NVIDIA_DRIVER_CAPABILITIES=all
      - NVIDIA_VISIBLE_DEVICES=nvidia.com/gpu=all
    volumes:
      - /etc/wolf/:/etc/wolf/
      - /tmp/sockets:/tmp/sockets:rw
      - /var/run/docker.sock:/var/run/docker.sock:rw
      - /dev/:/dev/:rw
      - /run/udev:/run/udev:rw
    device_cgroup_rules:
      - 'c 13:* rmw'
    devices:
      - /dev/dri
      - /dev/uinput
      - /dev/uhid
    deploy:
      resources:
        reservations:
          devices:
            - driver: cdi
              device_ids:
                - nvidia.com/gpu=all
    network_mode: host
    restart: unless-stopped |
Sorry, my bad. I misunderstood what you meant, and I forgot about that extra library that we added. Have you tried re-building the image with that change to see if it fixes it? I'm going to try moving that lib and using Wolf with the toolkit. If it doesn't break anything, I'm going to push it later so that a new image will be built automatically. |
I tried with moving
|
this is the stacktrace (sorry to add this here, no idea if you wanna make it a different issue):
|
Sounds like CDI isn't mounting the right libraries from the host, definitely worth opening up a separate issue for this.
|
Yes, but I use CRI-O. Does your Docker instance use the nvidia runtime? If not, the drivers and devices will not be populated. |
This is an issue that has been spun off from the Discord channel.
@Murazaki : It might be good to find a better workflow for providing drivers to Wolf.
On Debian, drivers are pretty old in the main stable repo, and updated ones can be found in the CUDA drivers repo, but they don't exactly match the manual installation ones.
@ABeltramo : I guess I should go back to look into the Nvidia Docker Toolkit for people that would like to use that
I agree though, it's a bit of a pain point at the moment
@Murazaki : Cuda drivers repo :
https://developer.download.nvidia.com/compute/cuda/repos/
Linux manual installer :
https://download.nvidia.com/XFree86/Linux-x86_64/
Right now, the latest in the CUDA packaged installs is 545.23.08.
It doesn't exist as a manual installer.
That breaks the Dockerfile and renders Wolf unusable.
I wanted to make a Docker image for the Debian package install, but it uses apt-add-repository, which installs a bunch of supplementary stuff.
Here it is for Debian Bookworm:
https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Debian&target_version=12&target_type=deb_network
More thorough installation steps here :
https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
@juliosueiras : There is one problem though, the Nvidia container toolkit doesn't inject the driver into the container and still requires a driver installed in the container image itself.
And here, let me start with the interventions I have made with NVIDIA over the last three years so that the NVIDIA drivers do not need to be installed in order to run Wayland inside containers using the NVIDIA container toolkit.
What NVIDIA container toolkit does: it's pretty simple. It injects (1) kernel devices, and (2) userspace libraries, into a container. (1) and (2) compose a subset of the driver.
(1) kernel devices: /dev/nvidiaN, /dev/nvidiactl, /dev/nvidia-modeset, /dev/nvidia-uvm, and /dev/nvidia-uvm-tools. In addition, /dev/dri/cardX and /dev/dri/renderDY, where N, X, and Y depend on the GPU the container toolkit provisions. The /dev/dri devices were added with NVIDIA/libnvidia-container#118.

(2) userspace libraries:
OpenGL libraries including EGL:
'/usr/lib/libGL.so.1', '/usr/lib/libEGL.so.1', '/usr/lib/libGLESv1_CM.so.525.78.01', '/usr/lib/libGLESv2.so.525.78.01', '/usr/lib/libEGL_nvidia.so.0', '/usr/lib/libOpenGL.so.0', '/usr/lib/libGLX.so.0', and '/usr/lib/libGLdispatch.so.0', '/usr/lib/libnvidia-tls.so.525.78.01'
Vulkan libraries:
'/usr/lib/libGLX_nvidia.so.0' and the configuration '/etc/vulkan/icd.d/nvidia_icd.json'
EGLStreams-Wayland and GBM-Wayland libraries:
'/usr/lib/libnvidia-egl-wayland.so.1' and the config '/usr/share/egl/egl_external_platform.d/10_nvidia_wayland.json'; '/usr/lib/libnvidia-egl-gbm.so.1' and the config '/usr/share/egl/egl_external_platform.d/15_nvidia_gbm.json'
NVENC libraries:
/usr/lib/libnvidia-encode.so.525.78.01, which depends on /usr/lib/libnvcuvid.so.525.78.01, which depends on /usr/lib/x86_64-linux-gnu/libcuda.so.1
VDPAU libraries:
/usr/lib/vdpau/libvdpau_nvidia.so.525.78.01
NVFBC libraries:
/usr/lib/libnvidia-fbc.so.525.78.01
OPTIX libraries:
/usr/lib/libnvoptix.so.1
Not very relevant but of note, perhaps for XWayland: the NVIDIA X.Org driver /usr/lib/xorg/modules/drivers/nvidia_drv.so and the NVIDIA X.Org GLX driver /usr/lib/xorg/modules/extensions/libglxserver_nvidia.so.525.78.01
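To sanity-check what actually got injected, a quick hedged sketch run from inside the container (assumes a glibc-based image with ldconfig available):
# (1) the kernel devices
ls -l /dev/nvidia* /dev/dri/
# (2) the userspace libraries registered with the dynamic loader
ldconfig -p | grep -Ei 'nvidia|libcuda'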
In many cases, things don't work because the below configuration files are absent inside the container. Without these, applications inside the container don't know which library to call (what each file does is self-explanatory):
The contents of /usr/share/glvnd/egl_vendor.d/10_nvidia.json
The contents of /etc/vulkan/icd.d/nvidia_icd.json (note that api_version is variable based on the Driver version)
The contents of /etc/OpenCL/vendors/nvidia.icd
The contents of /usr/share/egl/egl_external_platform.d/15_nvidia_gbm.json
The contents of /usr/share/egl/egl_external_platform.d/10_nvidia_wayland.json
(templates for these files are sketched just below)
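Hedged templates for these files, following current driver conventions (the api_version value is a placeholder and must match the installed driver's Vulkan version):
mkdir -p /usr/share/glvnd/egl_vendor.d /etc/vulkan/icd.d /etc/OpenCL/vendors /usr/share/egl/egl_external_platform.d
cat > /usr/share/glvnd/egl_vendor.d/10_nvidia.json <<'EOF'
{ "file_format_version" : "1.0.0", "ICD" : { "library_path" : "libEGL_nvidia.so.0" } }
EOF
# api_version below is a placeholder; set it to the Vulkan version reported by the driver
cat > /etc/vulkan/icd.d/nvidia_icd.json <<'EOF'
{ "file_format_version" : "1.0.0", "ICD" : { "library_path" : "libGLX_nvidia.so.0", "api_version" : "1.3.242" } }
EOF
echo 'libnvidia-opencl.so.1' > /etc/OpenCL/vendors/nvidia.icd
cat > /usr/share/egl/egl_external_platform.d/15_nvidia_gbm.json <<'EOF'
{ "file_format_version" : "1.0.0", "ICD" : { "library_path" : "libnvidia-egl-gbm.so.1" } }
EOF
cat > /usr/share/egl/egl_external_platform.d/10_nvidia_wayland.json <<'EOF'
{ "file_format_version" : "1.0.0", "ICD" : { "library_path" : "libnvidia-egl-wayland.so.1" } }
EOF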
I'm pretty sure that now (it was different a few months ago) the newest NVIDIA container toolkit provisions all of the required libraries plus the json configurations for Wayland (not for X11, but you don't have to care). If only the json configurations are absent, it's trivial to manually add the above templates.
https://gitlab.freedesktop.org/gstreamer/gstreamer/-/issues/3108
Now, it is correct that NVENC does require CUDA. But that doesn't mean that it requires the whole CUDA Toolkit (separate from the CUDA drivers). The CUDA drivers are the four following libraries installed with the display drivers, independent of the CUDA Toolkit:
libcuda.so, libnvidia-ptxjitcompiler.so, libnvidia-nvvm.so, libcudadebugger.so
These versions go with the display drivers, and are all injected into the container by the NVIDIA container toolkit.
GStreamer 1.22 and before in nvcodec requires just two files of the CUDA Toolkit: libnvrtc.so and libnvrtc-builtins.so. These can be installed from the network repository like the current approach, or be extracted from a PyPi package (a hedged sketch follows below).
One thing to note here is that libnvrtc.so is not minor version compatible with CUDA. Thus, it will error on any display driver version older than its corresponding display driver version. However, backwards compatibility always works. Thus, it is a good idea to use the oldest possible libnvrtc.so version.
https://docs.nvidia.com/deploy/cuda-compatibility/
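A hedged sketch of the PyPi extraction route (the nvidia-cuda-nvrtc-cu11 package name and the wheel layout are assumptions and can change between CUDA releases):
pip download nvidia-cuda-nvrtc-cu11 --no-deps -d /tmp/nvrtc   # a wheel is a plain zip archive
cd /tmp/nvrtc && unzip -o nvidia_cuda_nvrtc_cu11-*.whl 'nvidia/cuda_nvrtc/lib/*'
cp nvidia/cuda_nvrtc/lib/libnvrtc* /usr/lib/x86_64-linux-gnu/
ldconfig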
So, I have moderate to high confidence that if you guys try the newest NVIDIA container toolkit again, you won't need to install the drivers, assuming that you ensure the json files are present or written.
Environment variables that currently work for me: