docker: Error response from daemon: OCI runtime create failed: unable to retrieve OCI runtime error #1017
Can anyone help me? Thanks.
We need more information. Can you do this:
@Ethyling thanks for your help.
Can you do this too, and give us the output:
Thank you!
I'm getting EXACTLY the same error as the reporter, on all my dev, staging, and production systems, as soon as ElasticBeanstalk tries to create new servers. Code / config hasn't been touched in 15 days, and everything has been stable for the last 2 weeks, so something referenced during EC2 startup is failing (I hardcode my nvidia driver install etc. into my baked AMI). The only thing it could really be is the nvidia-docker install, as that is installed on server startup.
I am also seeing this same issue @mikecouk. We use nvidia-docker on some AWS Beanstalk environments. Starting last Friday, deploys stopped working due to this issue. It looks like some automatic update triggers and updates nvidia-docker, and now it no longer works.
We are currently working on this, but first we need more information:
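Judging from the reply that follows, the requested check was presumably along these lines:

```sh
# Locate the OCI runtime binaries
whereis runc
whereis docker-runc
```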
@Ethyling :

```
[ec2-user]$ whereis runc
runc:
[ec2-user]$ whereis docker-runc
docker-runc: /usr/bin/docker-runc
```
@mikeobr Thank you, I will come back to you asap.
We're currently testing a fix by aliasing docker-runc to runc, which as you've already worked out doesn't exist anymore :) Will get back to you asap.
Hi, I am facing the same problem.
Thanks.
Working with @mikecouk on this one. While ...
The "brett hack", to give it its official name, has fixed our problem; we're currently rolling it out to all servers before they have a chance to die and respawn (we use spot instances!). The line you'll want for the moment in your scripts, AFTER the nvidia-docker2 yum install, is shown in the sketch below.
Phew, panic over for the moment, but we could do with an explanation for this (even if it's "blame AWS"!)
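For reference, a minimal sketch of that line (the same conditional symlink quoted later in this thread), assuming the stock paths `/usr/bin/docker-runc` and `/usr/bin/runc`:

```sh
# Create /usr/bin/runc only when it is missing and docker-runc is present
if [ ! -f /usr/bin/runc -a -f /usr/bin/docker-runc ]; then
  ln -s /usr/bin/docker-runc /usr/bin/runc
else
  echo "DID NOT CREATE RUNC SYMLINK"
fi
```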
Please use the 'brett hack' for the time being. This is a combination of two underlying non-issues that make an issue because of incompatibility:
What next: a new release is pending for nvidia-docker that will look for docker-runc if runc is not found. It is still better to chase the binary's new name than to chase its version, so point '1' from above is here to stay.
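An illustration only (not the actual nvidia-docker code) of the fallback behaviour described above, i.e. prefer `runc` and fall back to `docker-runc` when it is missing:

```sh
# Resolve the OCI runtime: runc first, docker-runc as the fallback
RUNTIME="$(command -v runc || command -v docker-runc)" || {
  echo "no OCI runtime found" >&2
  exit 1
}
echo "using OCI runtime: $RUNTIME"
```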
I am getting the same runc error, but I am using Ubuntu. Can someone tell me what is the "brett hack" version for this? |
Create a runc symlink pointing to docker-runc, the same way @mikecouk does it (see the sketch below).
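A minimal sketch of the Ubuntu equivalent, assuming docker-runc lives at `/usr/bin/docker-runc` as on the Amazon Linux hosts above:

```sh
# Same workaround on Ubuntu: point runc at docker-runc
sudo ln -s /usr/bin/docker-runc /usr/bin/runc
```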
The brett hack works, thanks for the quick workaround.
I'm pretty sure AWS ElasticBeanstalk runs something like a yum update during install; I've had a similar issue before where stuff started breaking with no code changes. If you find a good way to lock versions or prevent this, I'd be happy to know 😃
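Not something settled in this thread, but one common way to stop background updates from moving these packages on a yum-based host is the versionlock plugin (a sketch, assuming `yum-plugin-versionlock` is available in your repos):

```sh
# Pin the nvidia packages so a background "yum update" cannot change them
sudo yum install -y yum-plugin-versionlock
sudo yum versionlock add nvidia-docker2 nvidia-container-runtime
# Undo later with: sudo yum versionlock delete nvidia-docker2 nvidia-container-runtime
```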
Is anyone else having issues again today? I put the Brett hack into place yesterday and it resolved the issue. Today I am seeing the same problem again (unable to retrieve OCI runtime error) with the workaround in place. This is on AWS Beanstalk, so I'm not sure if it pulled a newer version. EDIT: @ramab1988's alias works for resolving the issue. Did something change that caused the initial workaround to no longer work?
Hi @mikeobr, can you do this please:
It will help us a lot, thank you!
@Ethyling Here are the results from before I added the nvidia-container-toolkit symlink: Log file
Whereis results (this is with the Brett hack)
Can you try to update nvidia-docker2 and nvidia-container-runtime? This problem should now be fixed.
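A sketch of that upgrade on a yum-based host (the package manager and service names are assumptions based on the AWS/CentOS setups mentioned above; use the apt equivalents on Ubuntu):

```sh
# Refresh repo metadata, pull the fixed packages, and restart Docker
sudo yum clean expire-cache
sudo yum update -y nvidia-docker2 nvidia-container-runtime
sudo systemctl restart docker
```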
Hey @Ethyling, I'm not as familiar with CentOS (where my stuff is hosted), but yum update and upgrade did not pull any new versions.
@mikeobr can you comment with the version of the packages?
$ rpm -qa '*nvidia*'
I'm seeing a similar problem in CentOS 7; my system info / error is posted in my reply to this (possibly related) bug report: NVIDIA/nvidia-container-runtime#68. UPDATE: Appears to be working now after a few re-arrangements of the packages deployed and changes to how ...
We are working on this; expect it to be fixed by end of day.
Hello! We released new packages yesterday; you should be good to upgrade now. Thanks for reporting the issue, closing for now!
Hi! How do I use this workaround? `if [ ! -f /usr/bin/runc -a -f /usr/bin/docker-runc ]; then ln -s /usr/bin/docker-runc /usr/bin/runc; else echo "DID NOT CREATE RUNC SYMLINK"; fi`
Solved the problem for me, thanks.
Thank you Sir! This worked for me.
Thanks @whillas, that worked for me.
@whillas holy !@#$ thank you :D That command seems really relevant and I'm surprised I didn't encounter it in the installation process.
`sudo apt install nvidia-container-runtime` also worked for me.
I had the same problem on my deep learning server, can someone help me?
@andrewssobral Rootless podman with GPU support is a separate beast; it is being discussed in nvidia-container-runtime issue 85.
@qhaas Thanks!
This really works, with containerd + k8s running on our servers.
I don't know why, but sometimes the link is missing. `rpm -qf /usr/bin/nvidia-container-runtime-hook` shows which package owns it, and reinstalling that package solves this issue too (i.e. it creates the link): `yum reinstall nvidia-container-toolkit -y`
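A quick check, assuming the toolkit 1.13 layout shown further down in this thread (where `/usr/bin/nvidia-container-toolkit` is a symlink to `/usr/bin/nvidia-container-runtime-hook`):

```sh
# Confirm the hook binary and the toolkit symlink are both present
ls -l /usr/bin/nvidia-container-toolkit /usr/bin/nvidia-container-runtime-hook
rpm -qf /usr/bin/nvidia-container-runtime-hook   # which package owns the hook
# If either is missing, reinstalling recreates it
yum reinstall -y nvidia-container-toolkit
```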
It is really sad, but I have tried all the methods above and it still does not work.

```
$ whereis runc
runc: /usr/local/bin/runc
$ whereis docker-runc
docker-runc:
$ ll /usr/bin/nvidia-container-toolkit
lrwxrwxrwx 1 root root 38 Jun 18 22:52 /usr/bin/nvidia-container-toolkit -> /usr/bin/nvidia-container-runtime-hook
```

I have installed the following packages:

```
$ rpm -qa | grep nvidia
nvidia-container-toolkit-1.13.1-1.x86_64
nvidia-container-toolkit-base-1.13.1-1.x86_64
libnvidia-container-tools-1.13.1-1.x86_64
nvidia-container-runtime-3.13.0-1.noarch
libnvidia-container1-1.13.1-1.x86_64
```

Can anyone help?

```
$ containerd -v
containerd github.com/containerd/containerd v1.6.6 10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1
```
I enabled the debug option and ran the command:

```
$ nerdctl run --runtime=nvidia --rm nvidia/cuda:12.0.1-cudnn8-runtime-centos7 nvidia-smi
FATA[0000] failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/default/71217aa3e35ca15212f54d1f7ca6111d8faa6f7b8910d9ccfec60507ee0e633f/log.json: no such file or directory): exec: "nvidia": executable file not found in $PATH: unknown
```

But there is no debug log. Here is the config:

```
# grep -v '^#' /etc/nvidia-container-runtime/config.toml
disable-require = false

[nvidia-container-cli]
environment = []
load-kmods = true
ldconfig = "@/sbin/ldconfig"

[nvidia-container-runtime]
debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"
runtimes = [
    "docker-runc",
    "runc",
]
mode = "auto"

[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"
```
The issue is that `--runtime=nvidia` resolves to a bare executable name (`nvidia`), which is not in `$PATH`. Please try with a properly specified runtime; see the sketch below.
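A sketch of two possible fixes; the binary path and the `--gpus` alternative are assumptions based on the packages listed above (nvidia-container-runtime 3.13 / libnvidia-container-tools 1.13), not something confirmed in this thread:

```sh
# Option 1: point --runtime at the actual runtime binary instead of the bare name "nvidia"
nerdctl run --runtime=/usr/bin/nvidia-container-runtime --rm \
  nvidia/cuda:12.0.1-cudnn8-runtime-centos7 nvidia-smi

# Option 2: skip the custom runtime and use nerdctl's built-in --gpus flag
nerdctl run --gpus all --rm \
  nvidia/cuda:12.0.1-cudnn8-runtime-centos7 nvidia-smi
```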
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
Also, before reporting a new issue, please make sure that:
1. Issue or feature description
I'm trying to install nvidia-docker v2 and followed the steps.
At the last step:
`docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi`
it fails with the error message:
docker: Error response from daemon: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/56ca0b73c5720021671123b7f44c885bb1e7b42957c9b18e7b509be26760b993/log.json: no such file or directory): nvidia-container-runtime did not terminate sucessfully: unknown.
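Not part of the original report, but a short diagnostic sketch in line with the checks discussed above (binary names and paths are assumptions based on the default package layout):

```sh
# Which OCI runtime binaries exist, and how is the nvidia runtime registered?
whereis runc docker-runc nvidia-container-runtime
cat /etc/docker/daemon.json
# Recent daemon logs usually show why nvidia-container-runtime did not terminate successfully
sudo journalctl -u docker --since "10 minutes ago"
```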
2. Steps to reproduce the issue
3. Information to attach (optional if deemed irrelevant)
nvidia-container-cli -k -d /dev/tty info
```
e3380-8b5c-cbf2-8bb2-1bcced59103d at 00000000:01:00.0)
NVRM version: 418.56
CUDA version: 10.1
Device Index: 0
Device Minor: 0
Model: GeForce GTX 1080 Ti
Brand: GeForce
GPU UUID: GPU-e17e3380-8b5c-cbf2-8bb2-1bcced59103d
Bus Location: 00000000:01:00.0
Architecture: 6.1
I0720 05:40:32.999897 19015 nvc.c:318] shutting down library context
I0720 05:40:33.000865 19017 driver.c:192] terminating driver service
I0720 05:40:33.010816 19015 driver.c:233] driver service terminated successfully
```
Kernel version from
uname -a
Linux cp 4.4.0-154-generic #181-Ubuntu SMP Tue Jun 25 05:29:03 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Any relevant kernel output lines from
dmesg
Driver information from
nvidia-smi -a
```
Timestamp : Sat Jul 20 13:42:03 2019
Driver Version : 418.56
CUDA Version : 10.1
Attached GPUs : 1
GPU 00000000:01:00.0
Product Name : GeForce GTX 1080 Ti
```
docker version
```
Client:
Version: 18.06.1-ce
API version: 1.38
Go version: go1.10.3
Git commit: e68fc7a
Built: Tue Aug 21 17:24:56 2018
OS/Arch: linux/amd64
Experimental: false
Server:
Engine:
Version: 18.06.1-ce
API version: 1.38 (minimum version 1.12)
Go version: go1.10.3
Git commit: e68fc7a
Built: Tue Aug 21 17:23:21 2018
OS/Arch: linux/amd64
Experimental: false
```
NVIDIA packages version from
dpkg -l '*nvidia*'
or rpm -qa '*nvidia*'
```
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version      Architecture Description
+++-==============-============-============-=================================
ii  libnvidia-cont 1.0.2-1      amd64        NVIDIA container runtime library
ii  libnvidia-cont 1.0.2-1      amd64        NVIDIA container runtime library
un  nvidia-304     <none>       <none>       (no description available)
un  nvidia-340     <none>       <none>       (no description available)
un  nvidia-384     <none>       <none>       (no description available)
un  nvidia-common  <none>       <none>       (no description available)
ii  nvidia-contain 3.0.0-1      amd64        NVIDIA container runtime
ii  nvidia-contain 1.4.0-1      amd64        NVIDIA container runtime hook
un  nvidia-docker  <none>       <none>       (no description available)
ii  nvidia-docker2 2.1.0-1      all          nvidia-docker CLI wrapper
un  nvidia-libopen <none>       <none>       (no description available)
un  nvidia-prime   <none>       <none>       (no description available)
```
NVIDIA container library version from
nvidia-container-cli -V
version: 1.0.2
NVIDIA container library logs (see troubleshooting)
Docker command, image and tag used
tensorflow/tensorflow:nightly-gpu-py3-jupyter