cuDNN status Error in: file: ./src/convolutional_kernels.cu : () when training #4366

Open
2679622694 opened this issue Nov 24, 2019 · 10 comments
Labels
Likely bug Maybe a bug, maybe not

Comments

@2679622694

Thanks for your great work!
I set the following in the Makefile:

GPU=1
CUDNN=1
CUDNN_HALF=1
OPENCV=1
AVX=0
OPENMP=0
LIBSO=0
ZED_CAMERA=0

Then I ran make, which completed without errors.
But when I start training:
./darknet detector train cfg/obj.data cfg/obj.cfg darknet53.conv.74 -map
the following error occurs:

cuDNN status Error in: file: ./src/convolutional_kernels.cu : () : line: 541 : build time: Nov 24 2019 - 12:59:52
cuDNN Error: CUDNN_STATUS_EXECUTION_FAILED

How can I fix it?
My system information:

Intel® Core™ i5-9400F CPU @ 2.90GHz × 6
GeForce GTX 1660/PCIe/SSE2

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Apr_24_19:10:27_PDT_2019
Cuda compilation tools, release 10.1, V10.1.168

@AlexeyAB
Owner

do

make clean
make
  • Show a screenshot of this error.
  • Try setting random=0 in the cfg file.
  • Does it work with CUDNN=0 CUDNN_HALF=0?
  • Which cuDNN version do you use?
  • Attach your cfg file.

@2679622694
Author

2679622694 commented Nov 26, 2019

@AlexeyAB
My cuDNN version is 7.0.5.
I am just using yolov3-tiny.cfg for training.
If I set GPU=1 CUDNN=0 CUDNN_HALF=0 in the Makefile and random=1 in yolov3-tiny.cfg, training works.
If I set GPU=1 CUDNN=1 CUDNN_HALF=1 in the Makefile and random=1 in yolov3-tiny.cfg, training fails, and it also fails with random=0. The output is as follows:

Total BFLOPS 5.454
Allocate additional workspace_size = 305.92 MB
Loading weights from yolov3-tiny.conv.15...
seen 64
Done! Loaded 15 layers from weights-file
Learning Rate: 0.001, Momentum: 0.9, Decay: 0.0005
Loaded: 0.845207 seconds
cuDNN status Error in: file: ./src/convolutional_kernels.cu : () : line: 541 : build time: Nov 26 2019 - 22:28:47
cuDNN Error: CUDNN_STATUS_EXECUTION_FAILED
cuDNN Error: CUDNN_STATUS_EXECUTION_FAILED: Permission denied
darknet: ./src/utils.c:295: error: Assertion `0' failed.

@AlexeyAB
Owner

cuDNN Error: CUDNN_STATUS_EXECUTION_FAILED: Permission denied

This is a very strange error.

  • Try to run with sudo

  • Can you run detection successfully with GPU=1 CUDNN=1 CUDNN_HALF=1?

  • Can you run any other application or DNN framework that uses cuDNN?

  • Show the output of the nvidia-smi command.

@2679622694
Author

  • Try to run with sudo

I set GPU=1 CUDNN=1 CUDNN_HALF=1 in the Makefile and random=1 in yolov3-tiny.cfg, then ran
sudo ./darknet detector train color-hat.data yolov3-tiny.cfg yolov3-tiny.conv.15
It does not work; the output is as follows:

Total BFLOPS 11.663
Allocate additional workspace_size = 59.71 MB
Loading weights from yolov3-tiny.conv.15...
seen 64
Done! Loaded 15 layers from weights-file
Learning Rate: 0.001, Momentum: 0.9, Decay: 0.0005
Resizing
896 x 896
try to allocate additional workspace_size = 129.66 MB
CUDA allocate done!
Loaded: 0.117573 seconds

cuDNN status Error in: file: ./src/convolutional_kernels.cu : () : line: 541 : build time: Nov 27 2019 - 14:30:49
cuDNN Error: CUDNN_STATUS_EXECUTION_FAILED
cuDNN Error: CUDNN_STATUS_EXECUTION_FAILED: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.

  • Can you run detection successfully with GPU=1 CUDNN=1 CUDNN_HALF=1?
    No. I used yolov3.cfg with yolov3.weights downloaded from https://pjreddie.com/darknet/yolo/ and ran ./darknet detector test cfg/coco.data cfg/yolov3.cfg yolov3.weights obj.jpg, which printed:

Total BFLOPS 65.864
Allocate additional workspace_size = 1099.43 MB
Loading weights from yolov3.weights...
seen 64
Done! Loaded 107 layers from weights-file

cuDNN status Error in: file: ./src/convolutional_kernels.cu : () : line: 541 : build time: Nov 27 2019 - 14:30:49
cuDNN Error: CUDNN_STATUS_EXECUTION_FAILED
cuDNN Error: CUDNN_STATUS_EXECUTION_FAILED: Permission denied
darknet: ./src/utils.c:295: error: Assertion `0' failed.

However, detection works with GPU=1 CUDNN=0 CUDNN_HALF=0.

  • Can you run any other application or DNN framework that uses cuDNN?
    I can train yolov3 or yolov3-tiny in pjreddie/darknet with the following Makefile settings:

GPU=1
CUDNN=1
OPENCV=1
OPENMP=0
DEBUG=0

  • Show output of command nvidia-smi
    [image]

@AlexeyAB
Owner

What error do you get when you run this command?
sudo ./darknet detector test cfg/coco.data cfg/yolov3.cfg yolov3.weights obj.jpg

cuDNN Error: CUDNN_STATUS_EXECUTION_FAILED: Permission denied

Maybe something is wrong with your permissions or with cuDNN.
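
One possible reading of the changing suffix (`Permission denied` in one run, `File exists` in another) is that it does not come from cuDNN at all. If the error helper prints its message through C's `perror`, the text after the last colon is simply the description of whatever value `errno` holds at that moment, and that value can be left over from an unrelated earlier call. That darknet's helper works this way is an assumption here, not something confirmed in this thread. A minimal Python sketch of the mechanism, where `perror_like` is a hypothetical stand-in for `perror`:

```python
import errno
import os


def perror_like(message, errno_value):
    """Mimic C's perror(): append the description of errno to the message.

    Hypothetical helper for illustration only; this is not darknet code.
    """
    return f"{message}: {os.strerror(errno_value)}"


cudnn_msg = "cuDNN Error: CUDNN_STATUS_EXECUTION_FAILED"

# The same cuDNN failure, reported while errno holds two different
# leftover values from unrelated system calls.
print(perror_like(cudnn_msg, errno.EACCES))
print(perror_like(cudnn_msg, errno.EEXIST))
```

On a typical Linux system the two suffixes are the ones seen in the logs above. If this reading is right, the suffix says little about the real cause, and running with sudo would not be expected to change the outcome.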

@2679622694
Author

@AlexeyAB

  • What error do you get when you run this command?
  `sudo ./darknet detector test cfg/coco.data cfg/yolov3.cfg yolov3.weights obj.jpg`

The output is as follows:

cuDNN status Error in: file: ./src/convolutional_kernels.cu : () : line: 541 : build time: Nov 27 2019 - 14:30:49
cuDNN Error: CUDNN_STATUS_EXECUTION_FAILED
cuDNN Error: CUDNN_STATUS_EXECUTION_FAILED: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.

By the way, I noticed that this repo states:
improved neural network performance ~7% by fusing 2 layers into 1: Convolutional + Batch-norm
I would like to know:

  • Does the ~7% performance improvement refer to an improvement in mAP?

  • If I set GPU=1 CUDNN=0 CUDNN_HALF=0 in the Makefile, do I still get this ~7% improvement after training?
    Since there is some issue with my cuDNN, I cannot train with GPU=1 CUDNN=1 CUDNN_HALF=1 in the Makefile; I can only train with GPU=1 CUDNN=0 CUDNN_HALF=0. So my concern is whether CUDNN=0 CUDNN_HALF=0 in the Makefile affects that ~7% improvement.

@AlexeyAB
Owner

AlexeyAB commented Dec 2, 2019

It's about speed.
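
Why fusing affects speed rather than mAP can be seen from the arithmetic: at inference time, batch normalization is a fixed per-channel scale and shift, so it can be folded into the weights and bias of the preceding convolution. The sketch below uses a single-channel 1x1 convolution with made-up values; the function names are illustrative and are not darknet's.

```python
import math


def conv1x1(x, w, b=0.0):
    """A one-channel 1x1 convolution: a scale plus an optional bias."""
    return w * x + b


def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalization using stored statistics."""
    return gamma * (y - mean) / math.sqrt(var + eps) + beta


def fuse_conv_bn(w, gamma, beta, mean, var, eps=1e-5):
    """Fold the batch-norm parameters into the convolution weight and bias."""
    scale = gamma / math.sqrt(var + eps)
    return w * scale, beta - mean * scale


# Example values, chosen arbitrarily for illustration.
w, gamma, beta, mean, var = 0.8, 1.5, -0.2, 0.3, 4.0
fused_w, fused_b = fuse_conv_bn(w, gamma, beta, mean, var)

x = 2.0
two_layers = batch_norm(conv1x1(x, w), gamma, beta, mean, var)
one_layer = conv1x1(x, fused_w, fused_b)
print(two_layers, one_layer)
```

Both paths produce the same output up to floating-point rounding, so the fused network does one layer's work instead of two without changing its predictions.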

To increase accuracy, you should use a new model: https://raw.githubusercontent.com/WongKinYiu/CrossStagePartialNetworks/master/cfg/csresnext50-panet-spp.cfg


So do you get an error only if you train with CUDA without cuDNN?

AlexeyAB added the Likely bug (Maybe a bug, maybe not) label on Dec 2, 2019
@2679622694
Author

@AlexeyAB

So do you get an error only if you train with CUDA without cuDNN?

  • If I set GPU=1 CUDNN=1 CUDNN_HALF=1 in the Makefile, make succeeds, but training fails as follows:

cuDNN status Error in: file: ./src/convolutional_kernels.cu : () : line: 541 : build time: Nov 27 2019 - 14:30:49
cuDNN Error: CUDNN_STATUS_EXECUTION_FAILED
cuDNN Error: CUDNN_STATUS_EXECUTION_FAILED: File exists
darknet: ./src/utils.c:295: error: Assertion `0' failed.

  • If I set GPU=1 CUDNN=0 CUDNN_HALF=0 in the Makefile, make succeeds and training also works.

By the way, I found that training with the same dataset and obj.cfg in this repo achieves a higher mAP than training in the pjreddie/darknet repo.

The first time, I trained yolov3.cfg on my own dataset in the pjreddie/darknet repo.
I trained for 50k steps. After training, I used yolov3.cfg and my final yolov3.weights to calculate mAP in this repo (not the pjreddie/darknet repo). The command was as follows:

./darknet detector map my_obj.data yolov3.cfg train-in-pjreddie/yolov3.weights -points 0

With this command, the reported mAP is 80.63.

The second time, I trained yolov3.cfg on the same dataset in this repo.
I again trained for 50k steps. After training, I again used yolov3.cfg and the final yolov3.weights to calculate mAP in this repo. The command was as follows:

./darknet detector map my_obj.data yolov3.cfg train-in-AlexeyAB/yolov3.weights -points 0

This time, the reported mAP is 85.36.

The dataset and yolov3.cfg are the same in both runs.

  • Why do I get an mAP improvement when training in this repo?

  • What have you done in this repo to improve the mAP?

  • Or which of the following contributes to the improvement in mAP?

[image]

@AlexeyAB
Owner

AlexeyAB commented Dec 3, 2019

@2679622694

Why do I get an mAP improvement when training in this repo?

Different resize approaches: #232 (comment)

What have you done in this repo to improve the mAP?

Added new layers, new params, new features and new models... https://github.com/AlexeyAB/darknet/projects/1

Or which of the following contributes to the improvement in mAP?

In your case, there are simply different approaches to resizing.
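
As a rough illustration of the two resizing approaches being contrasted (stretching the image to the network input versus scaling it to fit while keeping the aspect ratio), the sketch below computes the size of the image content inside the network input for each. It is written for this thread under those assumptions and is not copied from either repo; the function names are hypothetical.

```python
def stretch_size(img_w, img_h, net_w, net_h):
    """Resize without keeping aspect ratio: the image fills the whole input."""
    return net_w, net_h


def letterbox_size(img_w, img_h, net_w, net_h):
    """Resize keeping aspect ratio: scale to fit, the rest is padding."""
    if net_w * img_h < net_h * img_w:
        # Width is the limiting side.
        return net_w, (img_h * net_w) // img_w
    # Height is the limiting side.
    return (img_w * net_h) // img_h, net_h


print(stretch_size(1280, 720, 608, 608))
print(letterbox_size(1280, 720, 608, 608))
```

For a 1280x720 image and a 608x608 network, stretching uses all 608 rows, while the aspect-preserving variant uses only 342 of them and pads the rest. A model trained with one approach and evaluated with the other therefore sees objects at different shapes and scales.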

@15966697671

@AlexeyAB

In my case, training yolov3.cfg in this repo also gives a +4% improvement in mAP compared to training yolov3.cfg in the pjreddie/darknet repo.

The dataset and yolov3.cfg are the same for training in this repo and in the pjreddie/darknet repo.

I set width=608 height=608 and random=1 in yolov3.cfg for training and testing.

I use the following command to calculate mAP in this repo for both models after training:

./darknet detector map my.data cfg/yolov3.cfg backup/best.weights -points 0

  • This repo does not keep the aspect ratio of the image when resizing, while the pjreddie/darknet repo does keep it: Resizing : keeping aspect ratio, or not #232 (comment)
    Does this factor (not keeping the aspect ratio when resizing) account for all of the improvement in mAP in my case?
    If not, what other factors contribute to the improvement in mAP in my case?

  • I noticed that, with width=608 height=608 and random=1 in yolov3.cfg, this repo resizes the network size in the range from 608/1.4 to 608x1.4.
    With the same settings, the pjreddie/darknet repo resizes the network size in the range from 320 to 608.
    These two ranges are different, and the maximum size in this repo (608x1.4) is larger than the maximum size in the pjreddie/darknet repo (608).
    Does this factor (the different ranges and maximum sizes) contribute to the improvement in mAP in my case?

  • Apart from the two factors above, is there any other factor that contributes to the improvement in mAP in my case?
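
To put numbers on the two ranges described above, here is a small sketch. It takes the 320 to 608 range and the 1.4 coefficient from this comment; the 32-pixel step is an assumption, and any snapping of the scaled bounds to a multiple of 32 that the trainer may apply is left out.

```python
def fixed_range_sizes():
    """Sizes from 320 to 608, assuming steps of 32 (the range quoted above)."""
    return [k * 32 for k in range(10, 20)]


def scaled_bounds(width, coef=1.4):
    """Raw lower and upper bounds width/coef and width*coef, before any
    snapping to a multiple of 32 that the trainer may apply."""
    return width / coef, width * coef


sizes = fixed_range_sizes()
low, high = scaled_bounds(608)
print(sizes[0], sizes[-1])
print(round(low, 1), round(high, 1))
```

With width=608 the scaled bounds come out at roughly 434 and 851, so under these assumptions the multi-scale range in this repo starts higher and ends well above the fixed 320 to 608 range.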

By the way, I want to save .weights every 1k steps (as this repo does) or every 5k steps in the pjreddie/darknet repo.

  • Where do I need to change the code?

Thanks so much!

3 participants