error: requested docker runtime "nvidia" was not found #48

Open

geekodour opened this issue Jul 28, 2024 · 2 comments

geekodour commented Jul 28, 2024

I am trying to run Nomad + Docker + NVIDIA on NixOS. The drivers are installed, the https://github.com/hashicorp/nomad-device-nvidia plugin is set up correctly, and the GPU is being fingerprinted correctly. nvidia-container-toolkit is also installed, and I am able to access the GPU from a container directly using docker run, but not from Nomad.
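
For reference, a rough sketch of the client-side plugin configuration I'm describing (the plugin_dir path here is illustrative, not my actual path):

# client agent config sketch; plugin_dir is illustrative
plugin_dir = "/var/lib/nomad/plugins"

plugin "nomad-device-nvidia" {
  config {
    enabled = true
  }
}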

What happened

I am running the following as a debug job:

#file: debug.hcl
job "gpu-smi" {  # enclosing job block; the job name is assumed
  group "gpu-smi" {
    task "gpu-smi" {
      driver = "docker"

      config {
        # docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu20.04 nvidia-smi # docker 24
        # docker run --rm --device=nvidia.com/gpu=all nvidia/cuda:12.0.0-base-ubuntu20.04 nvidia-smi  # docker 25
        image   = "nvidia/cuda:12.0.0-base-ubuntu20.04"
        command = "nvidia-smi"
      }

      resources {
        device "nvidia/gpu" {
          count = 1
        }
      }
    }
  }
}

Running nomad job run debug.hcl fails with:

Driver Failure: Failed to create container configuration for image "nvidia/cuda:12.0.0-base-ubuntu20.04" 
("sha256:612aabcfe23834dde204beebc9f24dd8b8180479bfd45bdeada5ee9613997955"): requested docker runtime
"nvidia" was not found

I think the issue is more in the docker <-> nvidia-container-toolkit integration, but since docker run --rm --device=nvidia.com/gpu=all nvidia/cuda:12.0.0-base-ubuntu20.04 nvidia-smi works as expected while running the same thing through Nomad returns an error, I'm filing it here. It also seems like the name attribute mentioned in https://developer.hashicorp.com/nomad/docs/v1.6.x/job-specification/device#device-parameters does not work with HCL2; I tried setting it with no luck. I'll look into the sources and post updates if I find anything interesting.
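
For reference, the two forms of the device block as I read the docs; the second is the name-attribute form that did not seem to take effect for me under HCL2:

# the block-label form used in debug.hcl above
resources {
  device "nvidia/gpu" {
    count = 1
  }
}

# the name-parameter form I tried per my reading of the docs;
# it did not seem to take effect under HCL2
resources {
  device {
    name  = "nvidia/gpu"
    count = 1
  }
}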

A related issue:

geekodour (Author) commented:

Even after having nvidia-container-toolkit in place, I'm getting:

Failed to start container 26aa869a48b55da90438c56c5b41a831a270107da805c123786e831ee8b2615f: API error (500): failed to create task for container: failed to create shim task: OCI runtime create failed: /nix/store/dcfl52x9s397zkky85kass0liyky1i57-nvidia-docker/bin/nvidia-container-runtime did not terminate successfully: exit status 125: unknown

geekodour (Author) commented:

So I experimented with a couple of different configurations; what seems to work for now:

  1. virtualisation.docker.enableNvidia = true; (this is being deprecated, see NixOS/nixpkgs#322400: "nvidia-container-runtime fails to run containers with the -it flag")
  2. Using the nomad-docker overlay from nixpkgs 23.11 as described here: https://discourse.nixos.org/t/nvidia-container-runtime-exit-status-125-unknown/48306/3

So Nomad doesn't seem to work with the latest nvidia-docker configuration in nixpkgs. What nixpkgs is moving towards is CDI, deprecating the use of the nvidia runtime (runtime: nvidia), which is what I think Nomad relies on.

I think there are a few actions out of this:

  1. Make sure we don't completely remove virtualisation.docker.enableNvidia until this is fixed, because that's the only thing holding the nomad-device-nvidia setup together for now.
  2. See if we can adapt Nomad to use CDI instead of the nvidia runtime.

I'll be happy to work on changes on the Nomad side of things if that's the way we want to go forward.

cc: @tgross
