
nixos/cdi.dynamic.nvidia: expose driverLink #291828

Merged: 2 commits into NixOS:master on Mar 4, 2024

Conversation

@SomeoneSerge (Contributor, author)

Description of changes

A quick follow-up to #284507. I'd hate to merge something with that many FIXME comments, but I'm not going to have time to implement these for a while yet.

Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 24.05 Release Notes (or backporting 23.05 and 23.11 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

Add a 👍 reaction to pull requests you find important.

@SomeoneSerge SomeoneSerge added the 6.topic: cuda Parallel computing platform and API label Feb 27, 2024
@github-actions github-actions bot added 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS 8.has: module (update) This PR changes an existing module in `nixos/` labels Feb 27, 2024
@SomeoneSerge SomeoneSerge changed the title Refactor/cdi nvidia nixos/cdi.dynamic.nvidia: expose driverLink Feb 27, 2024
Comment on lines +31 to +34
{
  hostPath = addDriverRunpath.driverLink;
  containerPath = addDriverRunpath.driverLink;
}
@SomeoneSerge (author) commented on Feb 27, 2024:

This only contains the symlinks, naturally. We need closureInfo to enumerate all their targets and their dependencies. Right now they are mounted "accidentally" as ${nvidia-driver}/lib.
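To make that FIXME concrete, here is a minimal, hypothetical sketch of the closureInfo approach, assuming the pkgs and lib arguments, the nvidia-driver binding used elsewhere in this module, and the { hostPath; containerPath; } mount shape from the diff. It reads the closure via import-from-derivation, which a real implementation might want to avoid:

let
  # Enumerate the full runtime closure of the driver package so that the
  # symlink targets under driverLink resolve inside the container.
  driverClosure = pkgs.closureInfo { rootPaths = [ nvidia-driver ]; };

  # ${driverClosure}/store-paths is a newline-separated list of store paths.
  closurePaths = lib.filter (p: p != "") (lib.splitString "\n"
    (builtins.readFile "${driverClosure}/store-paths"));
in
# Bind-mount every path of the closure at the same location in the container.
map (path: { hostPath = path; containerPath = path; }) closurePaths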

Comment on lines +15 to 16
{ hostPath = lib.getExe' nvidia-driver "nvidia-cuda-mps-control";
  containerPath = "/usr/bin/nvidia-cuda-mps-control"; }
@SomeoneSerge (author):

I only now realized that this isn't going to be preserved by any of the formatters we've got, but I wanted to keep the diff small(er)
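For reference, a formatter such as nixfmt would most likely expand the attribute set onto separate lines, roughly like this (shown only to illustrate the trade-off; the compact two-line form above keeps the diff smaller):

{
  hostPath = lib.getExe' nvidia-driver "nvidia-cuda-mps-control";
  containerPath = "/usr/bin/nvidia-cuda-mps-control";
}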

@ofborg ofborg bot added 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin 10.rebuild-linux: 1-10 labels Feb 27, 2024
@ereslibre (Member):

> I'd hate to merge something with that many FIXME comments, but I'm not going to have time to implement these for a while yet.

Let me have a look later today. I can try to help with those FIXME comments in a separate PR. This was my first NixOS module contribution and I am not fully aware of all the details, but I can try to move that forward.

@ereslibre (Member):

I just tried this PR and got the following error:

$ sudo nixos-rebuild --flake '.#hulk' switch
warning: Git tree '/home/ereslibre/projects/homelab' is dirty
building the system configuration...
warning: Git tree '/home/ereslibre/projects/homelab' is dirty
activating the configuration...
setting up /etc...
reloading user units for ereslibre...
restarting sysinit-reactivation.target
warning: the following units failed: nvidia-control-devices.service

× nvidia-control-devices.service
     Loaded: loaded (/etc/systemd/system/nvidia-control-devices.service; enabled; preset: enabled)
     Active: failed (Result: exit-code) since Tue 2024-02-27 21:59:17 CET; 227ms ago
   Duration: 10ms
    Process: 118225 ExecStart=/nix/store/jvd8kqg786b06vp8rsmv8fhaa24d6y0a-nvidia-x11-550.54.14-6.6.18-bin/bin/nvidia-smi (code=exited, status=18)
   Main PID: 118225 (code=exited, status=18)
         IP: 0B in, 0B out
        CPU: 10ms

Feb 27 21:59:17 hulk systemd[1]: Started nvidia-control-devices.service.
Feb 27 21:59:17 hulk nvidia-smi[118225]: Failed to initialize NVML: Driver/library version mismatch
Feb 27 21:59:17 hulk nvidia-smi[118225]: NVML library version: 550.54
Feb 27 21:59:17 hulk systemd[1]: nvidia-control-devices.service: Main process exited, code=exited, status=18/n/a
Feb 27 21:59:17 hulk systemd[1]: nvidia-control-devices.service: Failed with result 'exit-code'.
warning: error(s) occurred while switching to the new configuration

@ereslibre (Member) left a review:
Okay, after a reboot it works fine. I guess the driver was updated and I had to reboot to activate it. LGTM.

@ereslibre ereslibre added the 12.approvals: 1 This PR was reviewed and approved by one reputable person label Feb 27, 2024
@SomeoneSerge (author) commented on Mar 4, 2024:

Sanity check (using this image: https://gist.github.com/SomeoneSerge/eda63ed8b51b795ab732e678da7d0e11):

Loaded image: localhost/docker-pytorch:ijlfl22js4p2lqr8lm1bj2a1ynv381wn
~/Unsorted/docker-pytorch took 2m28s
❯ podman run --rm -it --device=nvidia.com/gpu=all docker-pytorch:ijlfl22js4p2lqr8lm1bj2a1ynv381wn python
Python 3.11.7 (main, Dec  4 2023, 18:10:11) [GCC 13.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
False
>>>
❯ sudo nixos-rebuild switch
...
❯ podman run --rm -it --device=nvidia.com/gpu=all docker-pytorch:ijlfl22js4p2lqr8lm1bj2a1ynv381wn python
Python 3.11.7 (main, Dec  4 2023, 18:10:11) [GCC 13.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True

@SomeoneSerge (author):

❯ systemctl status nvidia-container-toolkit-cdi-generator.service
● nvidia-container-toolkit-cdi-generator.service - Container Device Interface (CDI) for Nvidia generator
     Loaded: loaded (/etc/systemd/system/nvidia-container-toolkit-cdi-generator.service; enabled; preset: enabled)
     Active: active (exited) since Mon 2024-03-04 14:09:46 UTC; 1h 40min ago
   Main PID: 3304726 (code=exited, status=0/SUCCESS)
        CPU: 39ms

Mar 04 14:09:46 cs-338 systemd[1]: Starting Container Device Interface (CDI) for Nvidia generator...
Mar 04 14:09:46 cs-338 nvidia-cdi-generator[3304742]: time="2024-03-04T14:09:46Z" level=info msg="Auto-detected mode as \"nvml\""
Mar 04 14:09:46 cs-338 nvidia-cdi-generator[3304742]: time="2024-03-04T14:09:46Z" level=error msg="failed to generate CDI spec: failed to create device CDI specs: failed to initialize NVML: ERROR_LIB_RM_VERSION_MISMATCH"
Mar 04 14:09:46 cs-338 systemd[1]: Finished Container Device Interface (CDI) for Nvidia generator.

The unit doesn't correctly report the exit status (it should have been reported as failed).
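The thread doesn't pin down whether nvidia-ctk itself exits 0 on this error or whether the wrapper script swallows the exit code, but one hypothetical way to surface the failure would be a strict wrapper script, sketched below. The service name follows the log above; the nvidia-ctk invocation, the output path, and how this would merge with the module's existing script definition are assumptions, not the module's actual implementation:

systemd.services.nvidia-container-toolkit-cdi-generator.script = ''
  set -euo pipefail
  # With -e and pipefail, a non-zero exit from the generator fails the unit,
  # so it ends up "failed" instead of "active (exited)".
  nvidia-ctk cdi generate --output=/var/run/cdi/nvidia-container-toolkit.json
'';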

@SomeoneSerge SomeoneSerge merged commit 46b75bf into NixOS:master Mar 4, 2024
26 checks passed