error: requested docker runtime "nvidia" was not found #48

Open

geekodour opened this issue Jul 28, 2024 · 2 comments

geekodour commented Jul 28, 2024

I am trying to run Nomad + Docker + NVIDIA on NixOS. The drivers are installed, the https://github.com/hashicorp/nomad-device-nvidia plugin is set up correctly, and the GPU is being fingerprinted correctly. nvidia-container-toolkit is also installed, and I am able to access the GPU from a container directly using docker run, but not from Nomad.
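
For reference, a rough sketch of the client-side plugin configuration I'm describing (the plugin_dir path here is illustrative, not my actual path):

# client agent config sketch; plugin_dir is illustrative
plugin_dir = "/var/lib/nomad/plugins"

plugin "nomad-device-nvidia" {
  config {
    enabled = true
  }
}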

What happened

I am running the following as a debug job:

#file: debug.hcl
job "gpu-smi" {  # enclosing job block; the job name is assumed
  group "gpu-smi" {
    task "gpu-smi" {
      driver = "docker"

      config {
        # docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu20.04 nvidia-smi # docker 24
        # docker run --rm --device=nvidia.com/gpu=all nvidia/cuda:12.0.0-base-ubuntu20.04 nvidia-smi  # docker 25
        image   = "nvidia/cuda:12.0.0-base-ubuntu20.04"
        command = "nvidia-smi"
      }

      resources {
        device "nvidia/gpu" {
          count = 1
        }
      }
    }
  }
}

Running nomad job run debug.hcl fails with:

Driver Failure: Failed to create container configuration for image "nvidia/cuda:12.0.0-base-ubuntu20.04" 
("sha256:612aabcfe23834dde204beebc9f24dd8b8180479bfd45bdeada5ee9613997955"): requested docker runtime
"nvidia" was not found

I think the issue is more in the docker <-> nvidia-container-toolkit integration, but since docker run --rm --device=nvidia.com/gpu=all nvidia/cuda:12.0.0-base-ubuntu20.04 nvidia-smi works as expected while running the same thing through Nomad returns an error, I'm filing it here. It also seems like the name attribute mentioned in https://developer.hashicorp.com/nomad/docs/v1.6.x/job-specification/device#device-parameters does not work with HCL2; I tried setting it with no luck. I'll look into the sources and post updates if I find anything interesting.
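
For reference, the two forms of the device block as I read the docs; the second is the name-attribute form that did not seem to take effect for me under HCL2:

# the block-label form used in debug.hcl above
resources {
  device "nvidia/gpu" {
    count = 1
  }
}

# the name-parameter form I tried per my reading of the docs;
# it did not seem to take effect under HCL2
resources {
  device {
    name  = "nvidia/gpu"
    count = 1
  }
}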

A related issue:

geekodour (Author) commented:

Even after having nvidia-container-toolkit in place, I'm getting:

Failed to start container 26aa869a48b55da90438c56c5b41a831a270107da805c123786e831ee8b2615f: API error (500): failed to create task for container: failed to create shim task: OCI runtime create failed: /nix/store/dcfl52x9s397zkky85kass0liyky1i57-nvidia-docker/bin/nvidia-container-runtime did not terminate successfully: exit status 125: unknown

geekodour (Author) commented:

So I experimented with a couple of different configurations; what seems to work for now:

  1. virtualisation.docker.enableNvidia = true; (this is being deprecated, see NixOS/nixpkgs#322400: "nvidia-container-runtime fails to run containers with the -it flag")
  2. Using the nomad-docker overlay from nixpkgs 23.11 as described here: https://discourse.nixos.org/t/nvidia-container-runtime-exit-status-125-unknown/48306/3

So Nomad doesn't seem to work with the latest nvidia-docker configuration in nixpkgs. What nixpkgs is moving towards is CDI, deprecating the use of the nvidia runtime (runtime: nvidia), which is what I think Nomad relies on.

I think there are a few actions out of this:

  1. Make sure we don't completely remove virtualisation.docker.enableNvidia until this is fixed, because that's the only thing holding the nomad-device-nvidia setup together for now.
  2. See if we can adapt Nomad to use CDI instead of the nvidia runtime.

I'll be happy to work on changes on the Nomad side of things if that's the way we want to go forward.

cc: @tgross
