This repository was archived by the owner on Jan 22, 2024. It is now read-only.

Docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]] #1470

Closed
btalb opened this issue Mar 8, 2021 · 4 comments

Comments


btalb commented Mar 8, 2021

1. Issue or feature description

Docker containers fail to start when any form of the --gpus option is passed, reporting the following error:

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]
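
For reference, a minimal invocation that triggers it looks roughly like this (the image tag here is just an example; any image fails the same way once --gpus is passed):

docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi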

2. Steps to reproduce the issue

We cannot reproduce this on any of our machines, but it happens consistently for users of our software (which relies on GPU access within Docker containers).

Users with the issue are: @gmuraleekrishna, @hagianga21

We have tried all of the common suggestions: restarting the Docker service, rebooting, fiddling with the --gpus arguments, reverting to base CUDA images, and so on.

3. Information to attach (optional if deemed irrelevant)

Please see this issue, where we've gathered all of the information we thought might be relevant. We have tried a number of things, but can't successfully start a GPU-enabled Docker container under any circumstances.

We also can't seem to get any additional relevant information beyond that nondescript Docker daemon error.

Notes

To be fair, I'm confused by all of the conflicting information I'm finding about how the toolkit works (or even what it is):

  • this suggests you need the toolkit + runtime, but none of our working machines have the runtime installed
  • none of our working machines has a daemon.json file or the runtime, yet all of the debugging tips assume the runtime is in use
  • I've seen the toolkit referred to as this repo (which seems to be just nvidia-docker2), as an entire set of packages, and as a single package within that set. What exactly is it? And should it work by itself?
  • I think NVIDIA/nvidia-container-toolkit is the source of what we are actually using, but why does it have no README, while the README of this repo calls itself the "NVIDIA Container Toolkit"? Also, why does that config have parameters for turning on logging in nvidia-container-runtime when the apt package tree suggests it isn't installed or used?

TL;DR: I've spent a week digging through this trying to understand what's causing this bug, and I'm still no clearer on how this system works, what's legacy & what isn't, how to get debugging information, or where the issue could be coming from. Any help / tips would be greatly appreciated.

klueska (Contributor) commented Mar 8, 2021

Hi @btalb

This should hopefully help to clear up some of the confusion about how the stack is organized:
#1268 (comment)

Regarding the issue you linked: everything seems normal except that he is running Docker version 20.10.4 (the latest we test against is 19.03).

Can you have him try either:

  1. Downgrading to 19.03; or
  2. Installing nvidia-container-runtime (instead of just nvidia-container-toolkit) and then updating his /etc/docker/daemon.json to:
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
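
After updating daemon.json he'll also need to restart the Docker daemon for the change to take effect. Roughly (assuming a systemd-managed Docker, and using an arbitrary CUDA base image for the test):

sudo systemctl restart docker
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi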

klueska (Contributor) commented Mar 8, 2021

Also -- to be clear -- the exact error you are seeing is coming from docker (not the nvidia container stack). And it suggests that the --gpus option is not being recognized for some reason.

Docker 19.03 added code to call directly out to our container stack at the nvidia-container-toolkit level, instead of relying on the nvidia-container-runtime to do that for them. For some reason though, it appears the --gpus option is not being recognized. I'm wondering if they have some regression in their 20.10.4 version.
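
If I remember correctly, Docker only registers its "nvidia" device driver when it can find the toolkit's hook binary on the daemon's PATH, so a quick sanity check on the affected machine would be something like (a rough sketch, not an exhaustive diagnosis):

which nvidia-container-runtime-hook   # should print a path such as /usr/bin/nvidia-container-runtime-hook
sudo systemctl restart docker         # the lookup happens inside the daemon, so restart after (re)installing the toolkit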

btalb (Author) commented Mar 12, 2021

Thanks @klueska, that linked post is wonderful, and helps clear up a lot of confusion!

I don't know if it's feasible, but it would be nice to have that sort of information at least in the README of this repo. When I google "github nvidia-container-toolkit", this repo is the first result.

As to the actual issue, we've got a number of machines in the lab running 20.10.4 with just nvidia-container-toolkit, without error.

When you say "call directly out to our container stack at the nvidia-container-toolkit level", what command does that actually involve? Is there a way for me to run it on the command line to further isolate the issue?

I've done some digging into Docker's source, and I think this part points to the following as the command being run:

nvidia-container-runtime-hook prestart

Is that correct? When I run that manually on my machine it sits there blocking with no output. Is that to be expected?
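
My unverified guess is that, like any OCI prestart hook, it expects the container state JSON on stdin, which would explain why it blocks when run interactively. Feeding it anything on stdin seems like it should make it return (presumably with an error) rather than hang:

echo '{}' | nvidia-container-runtime-hook prestart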

Any further tips would be greatly appreciated. I am just trying to get to a point where we get some more meaningful output about what is going on.

klueska (Contributor) commented Mar 18, 2021

One thing to note -- the snap installation of docker on Ubuntu is known to have this issue.
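
A quick way to check for that (a rough sketch; exact names depend on how Docker was installed):

snap list 2>/dev/null | grep docker   # if this lists docker, it was installed via snap
which docker                          # a snap install typically resolves under /snap/bin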

@elezar elezar closed this as completed Oct 30, 2023