I am trying to run nomad + docker + nvidia on NixOS. The drivers are installed, the https://github.com/hashicorp/nomad-device-nvidia plugin is set up correctly, and the GPU is fingerprinted correctly. nvidia-container-toolkit is also installed, and I am able to access the GPU from a container directly using `docker run`, but not from Nomad.
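For reference, the plugin is enabled in the Nomad agent configuration roughly like this (a sketch based on the plugin's README; the `fingerprint_period` value is an assumption):

```hcl
# Nomad agent config: load the external NVIDIA device plugin
plugin "nomad-device-nvidia" {
  config {
    enabled            = true
    fingerprint_period = "1m"  # how often the plugin re-fingerprints GPUs (assumed value)
  }
}
```

With this in place the GPU fingerprints fine, but the job fails with: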
```
Driver Failure: Failed to create container configuration for image "nvidia/cuda:12.0.0-base-ubuntu20.04"
("sha256:612aabcfe23834dde204beebc9f24dd8b8180479bfd45bdeada5ee9613997955"): requested docker runtime
"nvidia" was not found
```
I think the issue is more around docker <-> nvidia-container-toolkit, but since `docker run --rm --device=nvidia.com/gpu=all nvidia/cuda:12.0.0-base-ubuntu20.04 nvidia-smi` works as expected while the same workload run through Nomad errors out, I am creating an issue here. It also seems like the `name` attribute mentioned here: https://developer.hashicorp.com/nomad/docs/v1.6.x/job-specification/device#device-parameters does not work with HCL2? I tried setting it with no luck; I will try looking into the sources and post updates if I find something interesting.
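For reference, it is Nomad's Docker task driver that asks the daemon for the `nvidia` runtime whenever a task claims a GPU device; which runtime name it requests is controlled by the driver's plugin configuration, roughly like this (option names from the docker driver docs; the values shown are, as far as I know, the defaults):

```hcl
# Nomad agent config: docker task driver options
plugin "docker" {
  config {
    # Runtimes tasks are allowed to request
    allow_runtimes = ["runc", "nvidia"]

    # Runtime the driver selects for tasks that request NVIDIA GPUs
    nvidia_runtime = "nvidia"
  }
}
```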
Even with nvidia-container-toolkit in place, I am getting:
```
Failed to start container 26aa869a48b55da90438c56c5b41a831a270107da805c123786e831ee8b2615f: API error (500): failed to create task for container: failed to create shim task: OCI runtime create failed: /nix/store/dcfl52x9s397zkky85kass0liyky1i57-nvidia-docker/bin/nvidia-container-runtime did not terminate successfully: exit status 125: unknown
```
So Nomad doesn't seem to work with the latest nvidia-docker configuration that nixpkgs sets up. Upstream is moving towards CDI and deprecating the `runtime: nvidia` mechanism, which is what I believe Nomad makes use of.
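For context, with CDI the toolkit publishes a device specification that any CDI-aware engine can consume, rather than registering an `nvidia` runtime with Docker; the workflow is roughly (a sketch, assuming `nvidia-ctk` is on the PATH):

```sh
# Generate a CDI spec describing the host's NVIDIA devices
nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# Containers then request devices by CDI name; no --runtime=nvidia involved
docker run --rm --device=nvidia.com/gpu=all \
  nvidia/cuda:12.0.0-base-ubuntu20.04 nvidia-smi
```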
I think there are a few actions out of this:

- Make sure we don't completely remove `virtualisation.docker.enableNvidia` until we fix this, because that's the only straw that seems to be holding things together for now when it comes to making it work with nomad-device-nvidia.
- See if we can adapt Nomad to use CDI instead of `runtime: nvidia`.

I'll be happy to work on the Nomad side of things if that's the way we want to go forward.
What happened: I am running a debug job with `nomad job run debug.hcl`; it fails with the Driver Failure quoted at the top.
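A minimal sketch of what that debug job looks like (the exact `debug.hcl` isn't reproduced here, so treat the specifics as assumptions; `nvidia/gpu` is the device type advertised by nomad-device-nvidia):

```hcl
job "gpu-debug" {
  datacenters = ["dc1"]
  type        = "batch"

  group "debug" {
    task "nvidia-smi" {
      driver = "docker"

      config {
        image   = "nvidia/cuda:12.0.0-base-ubuntu20.04"
        command = "nvidia-smi"
      }

      resources {
        # Requesting a device is what makes the docker driver
        # ask the daemon for the "nvidia" runtime
        device "nvidia/gpu" {
          count = 1
        }
      }
    }
  }
}
```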
A related issue: `nvidia-container-runtime` fails to run containers with the `-it` flag (NixOS/nixpkgs#322400).