CUDA kernel/userspace interface issue due to cuda.so.1 #11390
Comments
After sleeping on this, I decided to create a var/nix/lib directory, add it to LD_LIBRARY_PATH, and symlink into it any host libraries that I need Nix binaries to pick up. It seems to work okay, and it follows the same spirit as using environment variables such as SSL_CERT_FILE to interface bits of the host with Nix. I would still be interested in hearing if anyone has better ideas, though.
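The symlink-directory approach can be scripted so that it is easy to re-run after a host driver upgrade. The following is a minimal sketch, assuming the host keeps its NVIDIA libraries in a single directory; the function name, the directory arguments, and the default libcuda.so* pattern are illustrative assumptions, not anything provided by Nix:

```python
import glob
import os
import sys


def link_host_libs(host_lib_dir, link_dir, patterns=("libcuda.so*",)):
    """Symlink host libraries matching `patterns` from host_lib_dir into link_dir.

    Returns the sorted basenames of the libraries that were linked.
    """
    os.makedirs(link_dir, exist_ok=True)
    linked = []
    for pattern in patterns:
        for src in glob.glob(os.path.join(host_lib_dir, pattern)):
            dst = os.path.join(link_dir, os.path.basename(src))
            # Drop any link left over from a previous host driver version.
            if os.path.lexists(dst):
                os.remove(dst)
            os.symlink(src, dst)
            linked.append(os.path.basename(src))
    return sorted(linked)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python link_host_libs.py <host lib dir> <link dir>
    names = link_host_libs(sys.argv[1], sys.argv[2])
    print("linked:", ", ".join(names) or "(nothing)")
```

The link directory still has to be added to LD_LIBRARY_PATH. Because stale links are replaced on every run, re-running the script after a driver upgrade keeps the links pointing at the current host libraries.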
In this case, Nix can't control the kernel ABI. A pragmatic way to manage the userspace/kernel version coupling is to pin the linuxPackages, kernel headers, and kernel driver versions in nixpkgs to what the host system provides. So, in
Where
Likewise, the NVIDIA driver version would need to be pinned. It might even be possible to auto-detect the kernel ABI?
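As one possible starting point for auto-detection, the version of the NVIDIA kernel module that the host has loaded can be read from /proc/driver/nvidia/version. The sketch below assumes that the first line has the usual "NVRM version: ... <version> <build date>" shape. The wording of that line varies between driver releases, so the regex is an assumption to verify against your own host, and the function names are hypothetical:

```python
import re

# Matches the version on the "NVRM version:" line, for example
# "NVRM version: NVIDIA UNIX x86_64 Kernel Module  470.82.00  Thu Oct 14 ..."
_NVRM_RE = re.compile(r"^NVRM version:.*?\s(\d+\.\d+(?:\.\d+)?)\s", re.MULTILINE)


def parse_driver_version(text):
    """Return the kernel-module version string, or None if none is found."""
    m = _NVRM_RE.search(text)
    return m.group(1) if m else None


def host_driver_version(path="/proc/driver/nvidia/version"):
    """Read the version of the NVIDIA kernel module loaded on the host."""
    try:
        with open(path) as f:
            return parse_driver_version(f.read())
    except OSError:
        return None  # no NVIDIA module loaded, or not running on Linux
```

A detected version like this could then be compared against the driver version pinned in nixpkgs, rather than being hard-coded by hand.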
Well, on NixOS we have
For what it's worth, I have NixOS running on a small farm of servers with GPGPUs/CUDA, and I have found that it works fine as long as the package you're running is built by Nix. Here is some of my config: https://github.com/grahamc/nixos-cuda-example
Thanks for the feedback, everyone. It's timely that vcunat mentioned /run/opengl-driver/lib and #9415, because I ran into that just the other day while fighting to get OpenGL acceleration working.
@grahamc: the point here is non-NixOS. Some libraries need to be supplied impurely, based on the OS/kernel you are running, and Nix-built packages don't know how to find those libraries on non-NixOS systems. So far we don't have a good automatic solution, and in the case of libGL sometimes even symlinking isn't enough, as linked above.
Yes. I was thinking this morning that perhaps what this ticket is really looking for is the installer script to
Cheers! -Tyson
@twhitehead I am currently going the route of installing
@blogle Yup. The key is that only libcuda.so* has to match your kernel version. What you want to do is
This will cause your Nix binaries to use Ubuntu's runtime libcuda.so and Nix-built libraries for everything else.
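Whether a given libcuda actually matches the loaded kernel module can be checked mechanically. This sketch assumes the usual NVIDIA driver layout, where libcuda.so.1 is a symlink to libcuda.so.<driver version>. That layout is an assumption worth confirming on your system. The driver version to compare against could come from /proc/driver/nvidia/version. The function names are hypothetical:

```python
import os
import re


def libcuda_version(libcuda_path):
    """Extract the driver version encoded in a resolved libcuda filename.

    libcuda.so.1 is typically a symlink to libcuda.so.<version>; realpath
    follows it. Returns None if the resolved name carries no full version.
    """
    name = os.path.basename(os.path.realpath(libcuda_path))
    m = re.fullmatch(r"libcuda\.so\.(\d+\.\d+(?:\.\d+)?)", name)
    return m.group(1) if m else None


def libcuda_matches_driver(libcuda_path, driver_version):
    """True only if libcuda's version is known and equals the kernel module's."""
    version = libcuda_version(libcuda_path)
    return version is not None and version == driver_version
```

An unknown version is deliberately treated as a mismatch, because a libcuda that silently differs from the kernel module is exactly the failure mode this thread is about.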
One thing that is becoming clearer to me from running Nix inside another distribution is that LD_PRELOAD/LD_LIBRARY_PATH were designed under the implicit assumption of a standard distribution, where there is only one version of the core libraries. They apply universally and without mercy to everything, creating conflicts as soon as you have multiple versions. Even the wrappers that set LD_LIBRARY_PATH are problematic, because any child process inherits the LD_PRELOAD/LD_LIBRARY_PATH settings. For this reason I would almost suggest that support for them be removed from the Nix dynamic linker/loader ( A less drastic option might be to patch the Nix dynamic linker/loader so that it uses NIX_LD_PRELOAD/NIX_LD_LIBRARY_PATH instead of LD_PRELOAD/LD_LIBRARY_PATH. This is still broken, though, as you are effectively assuming that a particular Nix library is okay for all Nix binaries, which isn't strictly true given Nix's support for multiple versions of libraries and binaries.
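The inheritance problem is easy to demonstrate. In this sketch, Python's subprocess module stands in for a wrapper script. It sets LD_LIBRARY_PATH for a single child process and shows that the child's own child still sees the variable. The directory name is a hypothetical placeholder, and no library is actually loaded from it:

```python
import os
import subprocess
import sys

# Code run by the grandchild: print the LD_LIBRARY_PATH it inherited.
_GRANDCHILD = "import os; print(os.environ.get('LD_LIBRARY_PATH', ''))"
# Code run by the child: spawn the grandchild without passing any env explicitly.
_CHILD = (
    "import subprocess, sys; "
    f"subprocess.run([sys.executable, '-c', {_GRANDCHILD!r}], check=True)"
)


def grandchild_sees(ld_library_path):
    """Set LD_LIBRARY_PATH for one child only and report what its child sees."""
    env = dict(os.environ, LD_LIBRARY_PATH=ld_library_path)
    out = subprocess.run(
        [sys.executable, "-c", _CHILD],
        env=env, capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()
```

Calling grandchild_sees("/hypothetical/host/lib") returns the same path. The setting leaks past the one process the wrapper intended to configure, and a Nix binary launched at any depth below it would pick it up as well.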
To any future travelers, note that I also had to symlink the contents of
Any news on this issue? |
Thank you for your contributions. This has been automatically marked as stale because it has had no activity for 180 days. If this is still important to you, we ask that you leave a comment below. Your comment can be as simple as "still important to me". This lets people see that at least one person still cares about this. Someone will have to do this at most twice a year if there is no other activity. Here are suggestions that might help resolve this more quickly:
When you use the nvidia x11 drivers you get
Bibliography:
- https://discourse.nixos.org/t/on-nixpkgs-and-the-ai-follow-up-to-2023-nix-developer-dialogues/37087
- https://nixos.org/community/teams/cuda
- https://github.com/orgs/NixOS/projects/27
- https://alternativebit.fr/posts/nixos/nix-opengl-and-ubuntu-integration-nightmare
- https://lobste.rs/s/7h20zl/nix_opengl_ubuntu_integration_nightmare
- NixOS/nixpkgs#11390 (comment)
- NixOS/nixpkgs#269475
- tweag/nix-hour#61
We are running Nix as a package manager alongside CentOS 6. It works well, except that CUDA packages don't. For example, caffe (from nixpkgs tag 15.09) gives
Using strace, I discovered that the cudatoolkit libraries are trying to load a libcuda.so.1 library, even though no ELF object declares it as a shared library dependency. Digging further reveals that this is a kernel/userspace shim library shipped with the CUDA kernel driver, and that it is tightly tied to the specific version of that driver. Using LD_PRELOAD to load it into the Nix binary makes it work as expected
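Here is the LD_PRELOAD workaround as a minimal wrapper sketch. The script name and the path to the host's libcuda.so.1 are assumptions; a typical location on a 64-bit CentOS host might be under /usr/lib64, but check where your driver installed it:

```python
import os
import subprocess
import sys


def preload_env(lib, base_env=None):
    """Return a copy of the environment with `lib` prepended to LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # ld.so accepts colon- or space-separated LD_PRELOAD entries.
    env["LD_PRELOAD"] = f"{lib}:{existing}" if existing else lib
    return env


def run_with_host_libcuda(cmd, host_libcuda):
    """Run `cmd` with the host's libcuda.so.1 preloaded; return its exit code."""
    return subprocess.run(cmd, env=preload_env(host_libcuda)).returncode


if __name__ == "__main__" and len(sys.argv) > 2:
    # Usage: python preload_libcuda.py /path/to/host/libcuda.so.1 <command> [args...]
    sys.exit(run_with_host_libcuda(sys.argv[2:], sys.argv[1]))
```

Like any LD_PRELOAD setting, this is inherited by every process the command spawns. That is acceptable for a single CUDA program, but it is the same leak that makes LD_PRELOAD a poor general answer.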
How should Nix properly handle this? libcuda.so.1 could be put in the Nix store, but that quickly creates a fragile situation in which any update to the host NVIDIA driver instantly breaks every installed Nix CUDA binary. If libcuda.so.1 does not exactly match the kernel NVIDIA driver version, CUDA executables fail and the kernel emits messages of the form
To me this suggests that, because Nix uses the host kernel, it should also use the host-provided version of this library. The reasoning is that libcuda.so.1 is so tightly tied to the kernel that it should really be considered part of the kernel interface (provided by the host) rather than userspace (provided by Nix).
What do people think, and how might this best be done?
Thanks! -Tyson