This repository was archived by the owner on Jan 22, 2024. It is now read-only.

Docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]] #1470

Closed
btalb opened this issue Mar 8, 2021 · 4 comments

Comments


btalb commented Mar 8, 2021

1. Issue or feature description

Docker containers fail to start when any form of the --gpus option is passed, reporting the following error:

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]
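
For reference, a minimal invocation that triggers it looks roughly like this (the image tag here is just an example; any image fails the same way once --gpus is passed):

docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi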

2. Steps to reproduce the issue

We cannot reproduce this on any of our machines, but it happens consistently for users of our software (which relies on GPU access within Docker containers).

Users with the issue are: @gmuraleekrishna, @hagianga21

We have tried all of the common suggestions: restarting the Docker service, rebooting, fiddling with the --gpus arguments, reverting to base CUDA images, and so on.

3. Information to attach (optional if deemed irrelevant)

Please see this issue, where we've gathered all of the information we thought might be relevant. We have tried a number of things, but can't successfully start a GPU-enabled Docker container under any circumstances.

We also can't seem to get any additional relevant information beyond that nondescript Docker daemon error.

Notes

To be fair, I'm confused by all of the conflicting information I'm finding about how the toolkit works (or even what it is):

  • this suggests you need the toolkit + runtime, but none of our working machines have the runtime installed
  • none of our working machines has a daemon.json file or the runtime, yet all of the debugging tips assume the runtime is in use
  • I've seen the toolkit referred to as this repo (which seems to be just nvidia-docker2), as an entire set of packages, and as a single package within that set. What exactly is it? And should it work by itself?
  • I think NVIDIA/nvidia-container-toolkit is the source of what we are actually using, but why does it have no README, while the README of this repo calls itself the "NVIDIA Container Toolkit"? Also, why does that config have parameters for turning on logging in nvidia-container-runtime when the apt package tree suggests it isn't installed or used?

TL;DR: I've spent a week digging through this trying to understand what's causing this bug, and I'm still no clearer on how this system works, what's legacy & what isn't, how to get debugging information, or where the issue could be coming from. Any help / tips would be greatly appreciated.

klueska (Contributor) commented Mar 8, 2021

Hi @btalb

This should hopefully help to clear up some of the confusion about how the stack is organized:
#1268 (comment)

Regarding the issue you linked: everything seems normal except that he is running Docker version 20.10.4 (the latest we test against is 19.03).

Can you have him try either:

  1. Downgrading to 19.03; or
  2. Installing nvidia-container-runtime (instead of just nvidia-container-toolkit) and then updating his /etc/docker/daemon.json to:
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
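
After updating daemon.json he'll also need to restart the Docker daemon for the change to take effect. Roughly (assuming a systemd-managed Docker, and using an arbitrary CUDA base image for the test):

sudo systemctl restart docker
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi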

klueska (Contributor) commented Mar 8, 2021

Also -- to be clear -- the exact error you are seeing is coming from docker (not the nvidia container stack). And it suggests that the --gpus option is not being recognized for some reason.

Docker 19.03 added code to call directly out to our container stack at the nvidia-container-toolkit level, instead of relying on the nvidia-container-runtime to do that for them. For some reason though, it appears the --gpus option is not being recognized. I'm wondering if they have some regression in their 20.10.4 version.
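
If I remember correctly, Docker only registers its "nvidia" device driver when it can find the toolkit's hook binary on the daemon's PATH, so a quick sanity check on the affected machine would be something like (a rough sketch, not an exhaustive diagnosis):

which nvidia-container-runtime-hook   # should print a path such as /usr/bin/nvidia-container-runtime-hook
sudo systemctl restart docker         # the lookup happens inside the daemon, so restart after (re)installing the toolkit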

btalb (Author) commented Mar 12, 2021

Thanks @klueska, that linked post is wonderful, and helps clear up a lot of confusion!

I don't know if it's feasible, but it would be nice to have that sort of information at least in the README of this repo. When I google "github nvidia-container-toolkit", this repo is the first result.

As to the actual issue, we've got a number of machines in the lab running 20.10.4 with just nvidia-container-toolkit, without error.

When you say "call directly out to our container stack at the nvidia-container-toolkit level", what command does that actually involve? Is there a way for me to run it on the command line to further isolate the issue?

I've done some digging into Docker's source, and I think this part points to the following as the command being run:

nvidia-container-runtime-hook prestart

Is that correct? When I run that manually on my machine it sits there blocking with no output. Is that to be expected?
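
My unverified guess is that, like any OCI prestart hook, it expects the container state JSON on stdin, which would explain why it blocks when run interactively. Feeding it anything on stdin seems like it should make it return (presumably with an error) rather than hang:

echo '{}' | nvidia-container-runtime-hook prestart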

Any further tips would be greatly appreciated. I am just trying to get to a point where we get some more meaningful output about what is going on.

klueska (Contributor) commented Mar 18, 2021

One thing to note -- the snap installation of docker on Ubuntu is known to have this issue.
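
A quick way to check for that (a rough sketch; exact names depend on how Docker was installed):

snap list 2>/dev/null | grep docker   # if this lists docker, it was installed via snap
which docker                          # a snap install typically resolves under /snap/bin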

@elezar elezar closed this as completed Oct 30, 2023