[shim] Change NVIDIA GPU detection method (#1945)
* Check for `/dev/nvidiactl`, not the `nvidia-smi` binary, to detect an
  NVIDIA GPU. Unlike the binary, the devfs file only exists if some
  conditions are met. Exactly how the `/dev/nvidia*` character device
  files are created is complicated and setup-specific (it involves some
  of: kernel module, udev, modprobe, nvidia-persistenced, X server, and
  more), but in general, it should be safe to assume that if an NVIDIA
  GPU is available, then `/dev/nvidiactl` exists.
* Run `nvidia-smi` to get GPU info directly on the host, not inside a
  container. Using Docker is completely unnecessary, as NVIDIA Container
  Toolkit mounts libs and executables from the host. The dstack-provided
  Docker image doesn't even contain the `nvidia-smi` binary; it's always
  a bind-mounted file from the host.

Fixes: #1942
un-def authored Nov 4, 2024
1 parent 8a61781 commit 1a3dca6
Showing 2 changed files with 4 additions and 11 deletions.
1 change: 1 addition & 0 deletions runner/README.md
@@ -97,6 +97,7 @@ These are nonexhaustive lists of external dependencies (executables, libraries)
 * `mountpoint`
 * `lsblk`
 * `mkfs.ext4`
+* (NVIDIA GPU SSH fleet instances only) `nvidia-smi`
 * ...
 
 Debian/Ubuntu packages: `mount` (`mount`, `umount`), `util-linux` (`mountpoint`, `lsblk`), `e2fsprogs` (`mkfs.ext4`)
14 changes: 3 additions & 11 deletions runner/internal/shim/gpu.go
@@ -8,14 +8,12 @@ import (
 	"io"
 	"log"
 	"os"
-	"os/exec"
 	"strconv"
 	"strings"
 
 	execute "github.com/alexellis/go-execute/v2"
 )
 
-const nvidiaSmiImage = "dstackai/base:py3.13-0.6-cuda-12.1"
 const amdSmiImage = "un1def/amd-smi:6.2.2-0"
 
 type GpuVendor string
@@ -36,7 +34,7 @@ func GetGpuVendor() GpuVendor {
 	if _, err := os.Stat("/dev/kfd"); !errors.Is(err, os.ErrNotExist) {
 		return Amd
 	}
-	if _, err := exec.LookPath("nvidia-smi"); err == nil {
+	if _, err := os.Stat("/dev/nvidiactl"); !errors.Is(err, os.ErrNotExist) {
 		return Nvidia
 	}
 	return NoVendor
@@ -56,14 +54,8 @@ func getNvidiaGpuInfo() []GpuInfo {
 	gpus := []GpuInfo{}
 
 	cmd := execute.ExecTask{
-		Command: "docker",
-		Args: []string{
-			"run",
-			"--rm",
-			"--gpus", "all",
-			nvidiaSmiImage,
-			"nvidia-smi", "--query-gpu=gpu_name,memory.total", "--format=csv,nounits",
-		},
+		Command: "nvidia-smi",
+		Args: []string{"--query-gpu=gpu_name,memory.total", "--format=csv,nounits"},
 		StreamStdio: false,
 	}
 	res, err := cmd.Execute(context.Background())
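
The rest of `getNvidiaGpuInfo` is not shown in the diff. As a rough sketch of how the output of `nvidia-smi --query-gpu=gpu_name,memory.total --format=csv,nounits` could be turned into records, the example below parses a sample string. The `gpuInfo` type and `parseNvidiaSmiCSV` function are hypothetical, and the sample text assumes the output is a header row followed by one `name, memory` row per GPU; the real `GpuInfo` struct and parsing code may differ.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

// gpuInfo is a hypothetical record type; the real GpuInfo
// struct is not shown in the diff.
type gpuInfo struct {
	Name      string
	MemoryMiB int
}

// parseNvidiaSmiCSV parses CSV text assumed to consist of a
// header row followed by "name, memory" rows, one per GPU.
func parseNvidiaSmiCSV(out string) ([]gpuInfo, error) {
	r := csv.NewReader(strings.NewReader(out))
	r.TrimLeadingSpace = true // fields are separated by ", "
	records, err := r.ReadAll()
	if err != nil {
		return nil, err
	}
	gpus := []gpuInfo{}
	for i, rec := range records {
		if i == 0 {
			continue // skip the header row
		}
		if len(rec) != 2 {
			return nil, fmt.Errorf("row %d: expected 2 fields, got %d", i, len(rec))
		}
		mem, err := strconv.Atoi(strings.TrimSpace(rec[1]))
		if err != nil {
			return nil, fmt.Errorf("row %d: bad memory value %q: %w", i, rec[1], err)
		}
		gpus = append(gpus, gpuInfo{Name: strings.TrimSpace(rec[0]), MemoryMiB: mem})
	}
	return gpus, nil
}

func main() {
	// Sample text only; not captured from a real machine.
	sample := "name, memory.total [MiB]\nNVIDIA A100-SXM4-40GB, 40960\n"
	gpus, err := parseNvidiaSmiCSV(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%+v\n", gpus)
}
```

Returning an error on a non-numeric memory field, rather than silently skipping the row, is a design choice made for this sketch.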