0/2 nodes are available: 2 Insufficient nvidia.com/gpu #159
I am facing the same issue. Going through the container logs, it is throwing an error, which I assume means something is wrong with the image itself.
I think you need to use the NVIDIA one as the base image in your Dockerfile:
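A minimal sketch of what that might look like; the specific CUDA image tag is an assumption, not something given in this thread:

```dockerfile
# Hypothetical Dockerfile: build on an NVIDIA CUDA base image so the
# CUDA user-space libraries are available inside the container.
# The tag is an assumption; choose one compatible with your driver.
FROM nvidia/cuda:10.2-base

# Application layers would go here.
CMD ["nvidia-smi"]
```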
You mean in the pod spec file? Even after I use the above image, I am seeing the error.
So I skipped that example pod and instead tried this deployment with a smaller number of replicas, and it worked fine: https://github.com/NVIDIA/k8s-device-plugin/blob/examples/workloads/deployment.yml
Hello! Sorry for the lag. Can you fill in the default issue template? It is usually super helpful and makes it easier to help :)
@RenaudWasTaken I think the issue is that the Docker default runtime cannot be set to "nvidia": as of Docker 19.03, `runtime: nvidia` has been deprecated, so we need a fix for that.
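For context, the commonly suggested workaround is to make "nvidia" the default runtime in /etc/docker/daemon.json. This is a sketch based on the standard nvidia-container-runtime setup; the binary path is the usual default, not something confirmed in this thread:

```json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```

After editing the file, Docker has to be restarted (e.g. `sudo systemctl restart docker`) for the change to take effect.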
Removed my previous comment with a link to this one so that there is one canonical place with a response to this issue.
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.
This issue was automatically closed due to inactivity. |
Facing this old issue. I have gone through all the relevant workarounds, but the issue still persists.
Kubernetes version: 1.14
Docker version on GPU node: 19.03.6
GPU node: 4 x GTX1080Ti
I am trying to deploy this example:
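(The spec itself did not survive here; as an illustration, a GPU pod request typically looks something like the following, where the pod name and image are placeholders and the key part is the `nvidia.com/gpu` resource limit:)

```yaml
# Hypothetical example pod, for illustration only; names and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:10.2-base   # assumption: any CUDA-enabled image
      resources:
        limits:
          nvidia.com/gpu: 1          # ask the device plugin for one GPU
```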
And I am getting the following error:
0/2 nodes are available: 2 Insufficient nvidia.com/gpu
When I specify the GPU node explicitly in the deployment YAML, I get the following error:
Update plugin resources failed due to requested number of devices unavailable for nvidia.com/gpu. Requested: 1, Available: 0, which is unexpected.
/etc/docker/daemon.json on GPU node:
I have restarted Docker and the kubelet.
I am using this NVIDIA device plugin DaemonSet:
https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
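(For reference, a manifest like this is normally applied with plain kubectl; the command below is standard usage rather than something stated in the thread:)

```console
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
```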
Should I:
- label the GPU node that has the NVIDIA GPUs somehow?
- restart the master node?
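Before either of those, one thing worth checking (a standard kubectl diagnostic, not from the original thread) is whether the device plugin has actually registered the nvidia.com/gpu resource on the node:

```console
# Shows the node's capacity/allocatable; a working plugin reports nvidia.com/gpu.
# "gpu-node" is a placeholder for the actual node name.
kubectl describe node gpu-node | grep -i nvidia.com/gpu
```

If nothing shows up, the device plugin pod's logs on that node are the next place to look.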
Any help here is more than welcome!