This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

RuntimeError: CUDA error: out of memory #120

Closed
zimenglan-sysu-512 opened this issue Nov 6, 2018 · 23 comments
Labels
awaiting response, question (Further information is requested)

Comments

@zimenglan-sysu-512
Contributor

❓ Questions and Help

when training on my own dataset with a ResNet-101 backbone, after 27k iterations it always encounters this problem:

File "maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 75, in do_train
    losses.backward()
  File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA error: out of memory

btw, the input size is set to be (800, 1333).

@fmassa
Contributor

fmassa commented Nov 6, 2018

It's difficult to say where the problem comes from.

If your dataset might contain a large number of boxes in the same image, then I'd say that your issue might be related to #18, where we propose a few workaround solutions.

Apart from that, without further information it's difficult to say what else could be causing the OOM.

@fmassa added the question (Further information is requested) and awaiting response labels on Nov 6, 2018
@zimenglan-sysu-512
Contributor Author

thanks @fmassa

@zimenglan-sysu-512
Contributor Author

zimenglan-sysu-512 commented Nov 13, 2018

hi @fmassa

when i reduce IMS_PER_BATCH to 8 for 8 GPUs and use ResNet-50 as the backbone to train my own dataset, it hits the problem below:

File "maskrcnn-benchmark/maskrcnn_benchmark/structures/boxlist_ops.py", line 84, in boxlist_iou
    wh = (rb - lt + TO_REMOVE).clamp(min=0)  # [N,M,2]
RuntimeError: CUDA error: out of memory

do you have any suggestions to solve this problem?
thanks!

@fmassa
Contributor

fmassa commented Nov 13, 2018

Do you have a large number of boxes per image in your dataset?
If that's the case, then your problem might be related to #18, and a possible solution is to move the IoU computation to the CPU until we add custom CUDA kernels for box IoU.

@zimenglan-sysu-512
Contributor Author

hi @fmassa
the maximum number of gt boxes in my dataset is 60. i have no idea how to deal with it.

@fmassa
Contributor

fmassa commented Nov 14, 2018

This is the maximum number of boxes in a single image?
Can you try making the box IoU computation run on the CPU, as I explained just before, and see if you still run out of memory?

@zimenglan-sysu-512
Contributor Author

hi @fmassa
yes, it's in a single image. I have tried what you suggested, but met other problems. i will report my results in cpu mode after i fix them.

@zimenglan-sysu-512
Contributor Author

zimenglan-sysu-512 commented Nov 14, 2018

hi @fmassa
i added the code below after this line:

    USE_CPU_MODE = True
    if USE_CPU_MODE and N >= 20:
        device = box1.device
        box1 = box1.cpu() # ground-truths
        box2 = box2.cpu() # predictions
        lt = torch.max(box1[:, None, :2], box2[:, :2]).cpu()  # [N,M,2]
        rb = torch.min(box1[:, None, 2:], box2[:, 2:]).cpu()  # [N,M,2]

        TO_REMOVE = 1

        wh = (rb - lt + TO_REMOVE).clamp(min=0).cpu()  # [N,M,2]
        inter = wh[:, :, 0] * wh[:, :, 1]  # [N,M]

        iou = inter.cpu() / (area1[:, None].cpu() + area2.cpu() - inter.cpu())
        iou = iou.to(device)
        return iou

if the number of gt boxes is larger than or equal to 20, it uses the cpu to compute IoU, otherwise it stays in gpu mode. besides, i use multi-scale training (= (700, 800, 900)), set MAX_SIZE_TRAIN to 1440 and use a single image per gpu. finally it works, but the speed slows down a lot (about 16% more time than gpu mode, and the gpu memory of one or two of the gpus reaches 9489MiB).
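
For reference, here is a rough sketch (not my exact config) of those settings via maskrcnn-benchmark's yacs config object; setting INPUT.MIN_SIZE_TRAIN to a tuple assumes a build that supports multi-scale training:

from maskrcnn_benchmark.config import cfg

cfg.INPUT.MIN_SIZE_TRAIN = (700, 800, 900)  # shorter-side sizes for multi-scale training
cfg.INPUT.MAX_SIZE_TRAIN = 1440             # cap on the longer image side
cfg.SOLVER.IMS_PER_BATCH = 8                # 8 GPUs x 1 image per GPU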

thanks for your help @fmassa

@fmassa
Contributor

fmassa commented Nov 14, 2018

Here is a simplified implementation:

device = box1.device
if USE_CPU_MODE and N >= 20:
    box1 = box1.cpu()
    box2 = box2.cpu()
...
# as before, no need to cast
# to .cpu() all the time
return iou.to(device)

So, just to see if I understand it properly, now your OOM error is gone, is that right?

This issue will be better fixed once we add a box iou implementation which is entirely in cuda. This will save a lot of memory I think.

@zimenglan-sysu-512
Contributor Author

hi @fmassa
it is still OOM as below:

Traceback (most recent call last):
  File "tools/train_net.py", line 170, in <module>
    main()
  File "tools/train_net.py", line 163, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 73, in train
    arguments,
  File "maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 66, in do_train
    loss_dict = model(images, targets)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/deprecated/distributed.py", line 222, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "maskrcnn-benchmark/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py", line 50, in forward
    proposals, proposal_losses = self.rpn(images, features, targets)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/rpn.py", line 100, in forward
    return self._forward_train(anchors, objectness, rpn_box_regression, targets)
  File "maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/rpn.py", line 119, in _forward_train
    anchors, objectness, rpn_box_regression, targets
  File "maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/loss.py", line 91, in __call__
    labels, regression_targets = self.prepare_targets(anchors, targets)
  File "maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/loss.py", line 55, in prepare_targets
    anchors_per_image, targets_per_image
  File "maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/loss.py", line 38, in match_targets_to_anchors
    matched_idxs = self.proposal_matcher(match_quality_matrix)
  File "maskrcnn-benchmark/maskrcnn_benchmark/modeling/matcher.py", line 85, in __call__
    self.set_low_quality_matches_(matches, all_matches, match_quality_matrix)
  File "maskrcnn-benchmark/maskrcnn_benchmark/modeling/matcher.py", line 101, in set_low_quality_matches_
    match_quality_matrix == highest_quality_foreach_gt[:, None]
RuntimeError: CUDA error: out of memory

@zimenglan-sysu-512
Contributor Author

hi @fmassa
if i reduce the input size, it solves the OOM. but another problem is that if i use a GTX Titan instead of a 1080 Ti, the training procedure hangs and gets stuck. it is weird.

@fmassa
Contributor

fmassa commented Nov 15, 2018

About the OOM, it might be due to many reasons, and I might need more information on the particularities of your dataset to be able to help you more.

About the hang, are you still using the same machine or different machines?
If you are using different machines, maybe your nvidia drivers are not up-to-date and you are facing deadlocks similarly to #58 ?

@zimenglan-sysu-512
Contributor Author

hi @fmassa
my own dataset has 17 categories, and the maximum number of gt boxes in one image is 26. the images are not too large; the max size of these images is less than 1200. btw, my driver version is as below:

NVRM version: NVIDIA UNIX x86_64 Kernel Module  384.130  Wed Mar 21 03:37:26 PDT 2018

thanks

@zimenglan-sysu-512
Contributor Author

zimenglan-sysu-512 commented Nov 16, 2018

hi @fmassa
i updated the driver from 384 to 390 and the training procedure still hangs. i use cuda 8.0.61 and GTX Titan (12G) cards. by the way, i use the cpu to compute the IoU, and the memory usage looks a little strange, as below:

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     13630      C   /usr/bin/python3.6                          5411MiB |
|    1     13631      C   /usr/bin/python3.6                          5325MiB |
|    2     13632      C   /usr/bin/python3.6                          5009MiB |
|    3     13633      C   /usr/bin/python3.6                          4339MiB |
|    4     13634      C   /usr/bin/python3.6                          5097MiB |
|    5     13635      C   /usr/bin/python3.6                          4873MiB |
|    6     13637      C   /usr/bin/python3.6                         11099MiB |
|    7     13638      C   /usr/bin/python3.6                          4231MiB |
+-----------------------------------------------------------------------------+

OOM as below:
Tried to allocate 7.09 GiB (GPU 6; 10.92 GiB total capacity; 3.73 GiB already allocated; 6.09 GiB free; 50.97 MiB cached)

@fmassa
Contributor

fmassa commented Nov 16, 2018

I think there might be some incompatibilities with your driver and your CUDA version.

So, by checking your previous driver version (384.130), you can see from here that it was before the bugfix, and thus the hang.

Can you update to CUDA 9.2 and install driver >=396.26 ? This will definitely fix your problems.

@zimenglan-sysu-512
Contributor Author

thanks @fmassa.
after updating ubuntu 14.04 to 16.04, i will try what you suggest and then report my results here.
thanks again.

@zimenglan-sysu-512
Contributor Author

zimenglan-sysu-512 commented Nov 19, 2018

hi @fmassa,

The OOM problem has been solved. it was because i duplicated the ground-truths several times, pushing the number of gt bboxes to 2k (very sorry for that). btw, if using the cpu to compute the IoUs between predictions and gt, you not only need to modify these lines, but also need to pay attention to a few other lines, so that it can handle a large amount of gt bboxes, at the cost of slowing the training speed (maybe training time is doubled).

about the hanging: since i upgraded ubuntu 14.04 to 16.04 and installed cuda 9.0 (or cuda 9.2) with different nvidia-drivers (390, 396, 410), it still sometimes happens. as @chengyangfu said, when using nvidia-driver 410, the frequency is much lower.

thanks!

@fmassa
Contributor

fmassa commented Nov 19, 2018

Cool, great that it's working now.

About the modifications, I'd say that you could move the data back to the GPU at the end of boxlist_iou if you have enough memory to hold it.

Let us know if you have further questions.

@yuchenrao-bg

I also have the same problem. I noticed that this error shows up when N > 200 (maybe even smaller than 200). I didn't change the calculation to cpu; I just call torch.cuda.empty_cache() for each batch, which seems okay for my situation.

@hetolin

hetolin commented Jul 13, 2020

(quoting @zimenglan-sysu-512's earlier comment and traceback, which ends in RuntimeError: CUDA error: out of memory)

hi @zimenglan-sysu-512 , I tried making the box iou computation run on the CPU as you did:

    USE_CPU_MODE = True
    if USE_CPU_MODE and N >= 20:
        device = box1.device
        ...
        iou = iou.to(device)
        return iou

but I meet the same error as above. how did you solve that? Is it necessary to modify something in maskrcnn_benchmark/modeling/matcher.py?

@hetolin

hetolin commented Jul 13, 2020

I also have the same problem. I noticed that this error shows up when N > 200 (maybe even smaller than 200). I didn't change the calculation to cpu; I just call torch.cuda.empty_cache() for each batch, which seems okay for my situation.

hi @yuchenrao-bg, could you please tell me where you added torch.cuda.empty_cache()? in which file? I met the same problem.

@yuchenrao-bg

Sorry for the late reply. I don't remember it clearly, but I think you can add it in the training code.
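
For anyone looking for a concrete starting point, here is a minimal sketch of calling torch.cuda.empty_cache() once per batch; the loop structure and names below are illustrative only, not the actual code in maskrcnn_benchmark/engine/trainer.py:

import torch

def do_train(model, optimizer, data_loader, device):
    # Illustrative training loop; the real one lives in
    # maskrcnn_benchmark/engine/trainer.py.
    model.train()
    for images, targets, _ in data_loader:
        images = images.to(device)
        targets = [target.to(device) for target in targets]

        loss_dict = model(images, targets)
        losses = sum(loss for loss in loss_dict.values())

        optimizer.zero_grad()
        losses.backward()
        optimizer.step()

        # Return cached, unused blocks to the driver after every batch, so an
        # occasional image with many boxes is less likely to push the process
        # over the GPU's capacity.
        torch.cuda.empty_cache()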
