Validate NVIDIA container runtime on SSH instances #1947

Open
jvstme opened this issue Nov 1, 2024 · 1 comment

Comments

Collaborator

jvstme commented Nov 1, 2024

Steps to reproduce

  1. Prepare an instance with an NVIDIA GPU, Docker, and CUDA drivers, but without the NVIDIA container runtime (nvidia-container-toolkit).
  2. Create and apply an on-prem fleet configuration with the instance.

Actual behaviour

The fleet is created successfully, but the GPU is not listed in its resources.

 FLEET    INSTANCE  BACKEND       RESOURCES                    PRICE  STATUS  CREATED     ERROR 
 on-prem  0         ssh (remote)  24xCPU, 71GB, 36.4GB (disk)  $0.0   idle    57 sec ago

The user may not notice that the GPU is missing, in which case they will only find out that something is wrong when they try to run a job on the instance.

Run failed with error code CONTAINER_EXITED_WITH_ERROR.
Error: could not select device driver "" with capabilities: []
Check CLI, server, and run logs for more details.

Expected behaviour

Fleet provisioning fails, and the user sees an error saying that the NVIDIA container runtime is missing or misconfigured on the instance.
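
One way to approximate this behaviour is a pre-flight check on the host. The sketch below is only an illustration, not dstack's implementation. `validate_nvidia_runtime` is a hypothetical helper. It assumes that the NVIDIA driver puts `nvidia-smi` on `PATH` and that nvidia-container-toolkit puts `nvidia-container-runtime` on `PATH`. Either assumption may not hold on a particular host.

```python
import shutil
from typing import Callable, Optional


def validate_nvidia_runtime(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return an error message if an NVIDIA driver seems to be installed
    but the container toolkit does not; return None otherwise.

    Hypothetical helper: `which` is injectable so the logic can be tested
    without touching the real PATH.
    """
    # Assumption: a working driver install exposes `nvidia-smi` on PATH.
    has_driver = which("nvidia-smi") is not None
    # Assumption: nvidia-container-toolkit exposes `nvidia-container-runtime`.
    has_toolkit = which("nvidia-container-runtime") is not None

    if has_driver and not has_toolkit:
        return (
            "NVIDIA driver detected, but the NVIDIA container runtime "
            "(nvidia-container-toolkit) was not found on this instance"
        )
    return None


if __name__ == "__main__":
    error = validate_nvidia_runtime()
    print(error or "NVIDIA runtime check passed (or no NVIDIA driver found)")
```

A host without any NVIDIA driver passes the check, so CPU-only instances would not be rejected.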

dstack version

0.18.22

Server logs

[22:37:27] DEBUG    dstack._internal.server.background.tasks.process_instances:388 Received a host_info {'gpu_vendor': 'none', 'gpu_name': '', 'gpu_memory': 0,
                    'gpu_count': 0, 'addresses': ['10.0.160.57/16', 'fe80::17ff:fe09:d261/64', '172.17.0.1/16'], 'disk_size': 39050715136, 'cpus': 24,         
                    'memory': 75869425664}
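
The `host_info` above reports `gpu_count: 0` even though the instance has a GPU. A server-side check could flag this, sketched here under assumptions: `check_reported_gpus` and the `expect_gpu` flag are hypothetical and do not come from dstack. Only the `gpu_count` key is taken from the log.

```python
from typing import Optional


def check_reported_gpus(host_info: dict, expect_gpu: bool) -> Optional[str]:
    """Return an error message when a GPU is expected but none is reported.

    Hypothetical helper; `expect_gpu` stands in for whatever signal says
    the instance should have a GPU.
    """
    gpu_count = host_info.get("gpu_count", 0)
    if expect_gpu and gpu_count == 0:
        return (
            "No GPU reported by the instance; check that the NVIDIA "
            "container runtime is installed and configured"
        )
    return None


# Subset of the host_info values from the log above.
host_info = {
    "gpu_vendor": "none",
    "gpu_name": "",
    "gpu_memory": 0,
    "gpu_count": 0,
    "cpus": 24,
    "memory": 75869425664,
}

print(check_reported_gpus(host_info, expect_gpu=True))
```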

Additional information

No response


github-actions bot commented Dec 2, 2024

This issue is stale because it has been open for 30 days with no activity.

github-actions bot added the stale label Dec 2, 2024