RuntimeError: CUDA error: out of memory #120
It's difficult to say where the problem comes from. If your dataset might contain a large number of boxes in the same image, then I'd say that your issue might be related to #18, where we propose a few workaround solutions. Apart from that, without further information it's difficult to say what else could be causing the OOM.
thanks @fmassa
hi @fmassa, when I reduce the
Do you have any suggestions to solve this problem?
Do you have a large number of boxes per image in your dataset?
hi @fmassa
Is this the maximum number of boxes in a single image?
hi @fmassa
hi @fmassa
If the number of gt boxes is larger than or equal to 20, use the CPU to compute the IoU, otherwise use GPU mode. Besides, I use multi-scales (=(700, 800, 900)), set
thanks for your help @fmassa
Here is a simplified implementation:
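(The snippet that originally followed here was not captured in this excerpt. As a rough illustration of the approach described above, falling back to the CPU when there are many ground-truth boxes, here is a minimal sketch written against plain [N, 4] (x1, y1, x2, y2) tensors rather than maskrcnn-benchmark's BoxList. The threshold of 20 comes from the comment; everything else is illustrative.)

```python
import torch

# Sketch of the CPU-fallback idea: when there are many ground-truth boxes, the
# [N, M, 2] intermediates of the IoU computation can exhaust GPU memory, so the
# whole computation is done on the CPU instead and only the result moves back.
def box_iou_with_cpu_fallback(boxes1, boxes2, cpu_threshold=20):
    device = boxes1.device
    if boxes2.shape[0] >= cpu_threshold:
        boxes1, boxes2 = boxes1.cpu(), boxes2.cpu()

    area1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1])
    area2 = (boxes2[:, 2] - boxes2[:, 0]) * (boxes2[:, 3] - boxes2[:, 1])

    lt = torch.max(boxes1[:, None, :2], boxes2[:, :2])  # [N, M, 2]
    rb = torch.min(boxes1[:, None, 2:], boxes2[:, 2:])  # [N, M, 2]
    wh = (rb - lt).clamp(min=0)                         # [N, M, 2]
    inter = wh[:, :, 0] * wh[:, :, 1]                   # [N, M]

    iou = inter / (area1[:, None] + area2 - inter)
    # Move the (much smaller) result back to the original device.
    return iou.to(device)
```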
So, just to see if I understand it properly, now your OOM error is gone, is that right? This issue will be better fixed once we add a box IoU implementation which is entirely in CUDA. This will save a lot of memory I think.
hi @fmassa
hi @fmassa
About the OOM, it might be due to many reasons, and I might need more information on the particularities of your dataset to be able to help you more. About the hang, are you still using the same machine or different machines?
hi @fmassa
thanks
hi @fmassa
OOM as below:
I think there might be some incompatibilities with your driver and your CUDA version. So, by checking your previous driver version (384.130), you can see from here that it was before the bugfix, and thus the hang. Can you update to CUDA 9.2 and install driver >= 396.26? This will definitely fix your problems.
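(Not from the thread: a quick sanity check, using standard torch calls, to confirm which CUDA toolkit your PyTorch build was compiled against and whether the GPU is usable after the driver upgrade suggested above.)

```python
import torch

# Print the versions PyTorch was built with and whether a GPU is actually visible.
print("PyTorch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```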
thanks @fmassa.
hi @fmassa, the OOM problem has been solved. It was because I duplicated the ground-truths several times, making the number of gt bboxes reach 2k (very sorry for that).
btw, if using the CPU to compute the IoUs between predictions and gt, you not only need to modify these lines, but also need to pay attention to the few lines: so that it can deal with a large amount of gt bboxes, at the cost of slowing down the training (maybe the training time is doubled).
About the hanging: since I upgraded Ubuntu 14.04 to 16.04 and installed CUDA 9.0 (or CUDA 9.2) with different nvidia-drivers (390, 396, 410), it sometimes still happens. As @chengyangfu said, when using nvidia-driver 410, the frequency is much lower. thanks!
Cool, great that it's working now. About the modifications, I'd say that you could move the data back to the GPU at the end of boxlist if you have enough memory to hold it. Let us know if you have further questions.
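(Also not from the thread, but related to the slowdown mentioned above: one hypothetical alternative to the CPU round-trip is to keep the IoU computation on the GPU and process the boxes in row chunks, so the temporary [chunk, M, 2] tensors stay small. The function name and chunk size below are illustrative.)

```python
import torch

# Hypothetical chunked variant: bounds the size of the intermediate tensors while
# staying on the GPU. Only the final [N, M] IoU matrix is fully materialized.
def box_iou_chunked(boxes1, boxes2, chunk_size=256):
    area1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1])
    area2 = (boxes2[:, 2] - boxes2[:, 0]) * (boxes2[:, 3] - boxes2[:, 1])
    ious = []
    for start in range(0, boxes1.shape[0], chunk_size):
        b1 = boxes1[start:start + chunk_size]
        a1 = area1[start:start + chunk_size]
        lt = torch.max(b1[:, None, :2], boxes2[:, :2])  # [chunk, M, 2]
        rb = torch.min(b1[:, None, 2:], boxes2[:, 2:])  # [chunk, M, 2]
        wh = (rb - lt).clamp(min=0)
        inter = wh[:, :, 0] * wh[:, :, 1]               # [chunk, M]
        ious.append(inter / (a1[:, None] + area2 - inter))
    return torch.cat(ious, dim=0)
```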
I also have the same problem. I noticed that when N > 200 (maybe even smaller than 200) it will show this error. I didn't change the calculation to the CPU. I just use
hi @zimenglan-sysu-512, I tried making the box IoU computation run on the CPU as you do:
hi @yuchenrao-bg, could you please tell me where you add
Sorry for the late reply. I don't remember it clearly, but I think you can add it in the training code.
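(The exact call being discussed in the last few comments is not visible in this excerpt, so the following is only an assumption. One common cache-related tweak that people add to training code is to periodically release the CUDA caching allocator's unused blocks; a minimal sketch:)

```python
import torch

# Assumption: the call referred to above was not captured here. As one common
# example, cached but unused allocator blocks can be released every N iterations.
# This does not free memory held by live tensors, but it can help when the cache
# is fragmented or other processes compete for the same GPU.
def maybe_empty_cache(iteration, every=100):
    if torch.cuda.is_available() and iteration % every == 0:
        torch.cuda.empty_cache()
```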
❓ Questions and Help
When training my own dataset using the ResNet-101 backbone, after 27k iterations it always encounters this problem, as below:
btw, the input size is set to be (800, 1333).