
[shim] Change NVIDIA GPU detection method #1945

Merged: 1 commit, merged on Nov 4, 2024
runner/README.md (1 addition, 0 deletions)
@@ -97,6 +97,7 @@ These are nonexhaustive lists of external dependencies (executables, libraries)
 * `mountpoint`
 * `lsblk`
 * `mkfs.ext4`
+* (NVIDIA GPU SSH fleet instances only) `nvidia-smi`
 * ...

 Debian/Ubuntu packages: `mount` (`mount`, `umount`), `util-linux` (`mountpoint`, `lsblk`), `e2fsprogs` (`mkfs.ext4`)
runner/internal/shim/gpu.go (3 additions, 11 deletions)
@@ -8,14 +8,12 @@ import (
     "io"
     "log"
     "os"
-    "os/exec"
     "strconv"
     "strings"

     execute "github.com/alexellis/go-execute/v2"
 )

-const nvidiaSmiImage = "dstackai/base:py3.13-0.6-cuda-12.1"
 const amdSmiImage = "un1def/amd-smi:6.2.2-0"

 type GpuVendor string
@@ -36,7 +34,7 @@ func GetGpuVendor() GpuVendor {
     if _, err := os.Stat("/dev/kfd"); !errors.Is(err, os.ErrNotExist) {
         return Amd
     }
-    if _, err := exec.LookPath("nvidia-smi"); err == nil {
+    if _, err := os.Stat("/dev/nvidiactl"); !errors.Is(err, os.ErrNotExist) {
         return Nvidia
     }
     return NoVendor
@@ -56,14 +54,8 @@ func getNvidiaGpuInfo() []GpuInfo {
     gpus := []GpuInfo{}

     cmd := execute.ExecTask{
-        Command: "docker",
-        Args: []string{
-            "run",
-            "--rm",
-            "--gpus", "all",
-            nvidiaSmiImage,
-            "nvidia-smi", "--query-gpu=gpu_name,memory.total", "--format=csv,nounits",
-        },
Comment on lines -59 to -66
Collaborator:
Running nvidia-smi in Docker actually has a small advantage, since it also validates that the NVIDIA container runtime is installed and working. However, it doesn't report errors to the server, so the erroneous instance is still added to the fleet, just without any GPUs in its resources. Considering this validation doesn't work properly, I think it's OK to drop it for now, as done in this PR, and implement it properly later in #1947.

Collaborator Author:
Right, I didn't think about it in this context. The motivation for getting rid of Docker here was to avoid relying on a third-party (from the point of view of the self-hosted dstack administrator) binary blob, i.e. the dstack-provided Docker image. Yes, we have these images pre-pulled in the dstack-provided OS images, but if I use SSH fleets or a cloud provider with my own OS images, this check may slow down provisioning a bit (the time needed to pull our Docker image) and leaves an image I have no intention of using cached on the instance.

I think it would be enough to check Client.Info to ensure that 1) Docker is installed and running, and 2) the NVIDIA Container Toolkit runtime is installed.

We still rely on the un1def/amd-smi image as a temporary solution for getting info about AMD GPUs, but I think we can either make amd-smi a hard requirement (include it in dstack-provided OS images and require it to be present on AMD-powered user-provided OS images and SSH instances) or ship our own statically-linked amd-smi binary (not sure if that is as easy to do as it is to say).
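
For illustration only, here is a minimal sketch of the Client.Info check suggested above, assuming the standard Docker Go SDK (github.com/docker/docker/client). The "nvidia" runtime key and the Runtimes field reflect the SDK at the time of writing and may differ between versions; this is not part of this PR and not the actual follow-up implementation (see #1947).

```go
// Sketch: verify (1) the Docker daemon is reachable and (2) an "nvidia"
// runtime is registered with it, per the Client.Info suggestion above.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/docker/docker/client"
)

func main() {
	// 1) Docker installed and running?
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		log.Fatalf("cannot create Docker client: %v", err)
	}
	defer cli.Close()

	info, err := cli.Info(context.Background())
	if err != nil {
		log.Fatalf("Docker daemon is not reachable: %v", err)
	}

	// 2) NVIDIA Container Toolkit runtime registered with the daemon?
	if _, ok := info.Runtimes["nvidia"]; ok {
		fmt.Println("nvidia container runtime is available")
	} else {
		fmt.Println("nvidia container runtime is NOT configured")
	}
}
```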

+        Command:     "nvidia-smi",
+        Args:        []string{"--query-gpu=gpu_name,memory.total", "--format=csv,nounits"},
         StreamStdio: false,
     }
     res, err := cmd.Execute(context.Background())
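
For context, `nvidia-smi --query-gpu=gpu_name,memory.total --format=csv,nounits` prints one CSV line per GPU after a header row, with memory in MiB. The code that consumes this output is collapsed in the diff above; the sketch below is only a hypothetical illustration of how such output could be turned into name/VRAM pairs, not the actual gpu.go parsing code, and the sample values are made up.

```go
// Hypothetical parsing sketch; the real implementation in gpu.go is not shown in this diff.
package main

import (
	"fmt"
	"strconv"
	"strings"
)

func main() {
	// Example output of the nvidia-smi query above (values are illustrative).
	output := "name, memory.total [MiB]\nNVIDIA A100-SXM4-80GB, 81920\n"

	lines := strings.Split(strings.TrimSpace(output), "\n")
	for _, line := range lines[1:] { // skip the CSV header row
		fields := strings.SplitN(line, ",", 2)
		if len(fields) != 2 {
			continue // skip malformed lines
		}
		name := strings.TrimSpace(fields[0])
		vramMiB, err := strconv.Atoi(strings.TrimSpace(fields[1]))
		if err != nil {
			continue
		}
		fmt.Printf("GPU: %s, VRAM: %d MiB\n", name, vramMiB)
	}
}
```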