OOM? What kind of error is this? #14

wanghangege · 2021-04-14T08:16:19Z

[04/14 16:15:25 d2.engine.hooks]: Total training time: 0:00:24 (0:00:00 on hooks)
[04/14 16:15:25 d2.utils.events]: iter: 0 lr: N/A max_mem: 7597M
Traceback (most recent call last):
File "./tools/train_net.py", line 234, in
args=(args,),
File "/media/ubun/BE5A462D5A45E32F/detectron2/detectron2/engine/launch.py", line 79, in launch
daemon=False,
File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/media/ubun/BE5A462D5A45E32F/detectron2/detectron2/engine/launch.py", line 125, in _distributed_worker
main_func(*args)
File "/media/ubun/BE5A462D5A45E32F/detectron2/YOLOF/tools/train_net.py", line 221, in main
return trainer.train()
File "/media/ubun/BE5A462D5A45E32F/detectron2/detectron2/engine/defaults.py", line 480, in train
super().train(self.start_iter, self.max_iter)
File "/media/ubun/BE5A462D5A45E32F/detectron2/detectron2/engine/train_loop.py", line 149, in train
self.run_step()
File "/media/ubun/BE5A462D5A45E32F/detectron2/detectron2/engine/defaults.py", line 490, in run_step
self._trainer.run_step()
File "/media/ubun/BE5A462D5A45E32F/detectron2/detectron2/engine/train_loop.py", line 273, in run_step
loss_dict = self.model(data)
File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/media/ubun/BE5A462D5A45E32F/detectron2/YOLOF/yolof/modeling/yolof.py", line 273, in forward
features = self.backbone(images.tensor)
File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/media/ubun/BE5A462D5A45E32F/detectron2/detectron2/modeling/backbone/resnet.py", line 449, in forward
x = stage(x)
File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/nn/modules/container.py", line 119, in forward
input = module(input)
File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/media/ubun/BE5A462D5A45E32F/detectron2/detectron2/modeling/backbone/resnet.py", line 201, in forward
out = self.conv3(out)
File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/media/ubun/BE5A462D5A45E32F/detectron2/detectron2/layers/wrappers.py", line 88, in forward
x = self.norm(x)
File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/media/ubun/BE5A462D5A45E32F/detectron2/detectron2/layers/batch_norm.py", line 65, in forward
eps=self.eps,
File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/nn/functional.py", line 2150, in batch_norm
input, weight, bias, running_mean, running_var, training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 2.04 GiB (GPU 1; 11.78 GiB total capacity; 5.89 GiB already allocated; 751.50 MiB free; 9.00 GiB reserved in total by PyTorch)

chensnathan · 2021-04-14T08:23:28Z

Hi,
Could you post your training log file? Details will be much more helpful for debugging.

wanghangege · 2021-04-14T09:03:38Z

Oh, let me restate the problem.
First of all, thank you for your reply!
When I run the example command
"python ./tools/train_net.py --num-gpus 1 --config-file ./configs/yolof_R_50_C5_1x.yaml",
I get the out-of-memory error shown above.
So I edited "Base-YOLOF.yaml" and set its parameter "IMS_PER_BATCH: 32". With a batch size of 32, training runs.
My environment is:
GPU: 2 x Titan V (11 GB each)
pytorch=1.8.1,
python=3.6,
cuda10.1, cudnn=7.6.3.
Is the error caused by insufficient GPU memory, or did I make a mistake? Some config files don't set the batch size. Where can I define it?
Looking forward to your reply, thank you again!

chensnathan · 2021-04-14T09:19:56Z

The settings are designed for 8 GPUs. When you train on 2 GPUs, you should adjust them according to the Detectron2 guidelines.
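
For readers hitting the same error, the adjustment mentioned above usually means the linear scaling rule: when the total batch size shrinks, the learning rate shrinks by the same factor and the iteration counts grow by its inverse. Below is a minimal sketch of that arithmetic. The `SolverSettings` class and `scale_solver` function are hypothetical helpers written for this illustration, and the reference numbers are illustrative values, not ones read from this repository's config files.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class SolverSettings:
    """Hypothetical container mirroring a few Detectron2 SOLVER keys."""
    ims_per_batch: int       # SOLVER.IMS_PER_BATCH (total images across all GPUs)
    base_lr: float           # SOLVER.BASE_LR
    max_iter: int            # SOLVER.MAX_ITER
    steps: Tuple[int, ...]   # SOLVER.STEPS (learning-rate decay milestones)


def scale_solver(ref: SolverSettings, new_ims_per_batch: int) -> SolverSettings:
    """Apply the linear scaling rule to a reference schedule."""
    if new_ims_per_batch <= 0:
        raise ValueError("new_ims_per_batch must be positive")
    factor = new_ims_per_batch / ref.ims_per_batch
    return SolverSettings(
        ims_per_batch=new_ims_per_batch,
        base_lr=ref.base_lr * factor,                  # smaller batch, smaller LR
        max_iter=int(round(ref.max_iter / factor)),    # smaller batch, more iterations
        steps=tuple(int(round(s / factor)) for s in ref.steps),
    )


if __name__ == "__main__":
    # Illustrative reference schedule for an 8-GPU setup (assumed values).
    reference = SolverSettings(
        ims_per_batch=64, base_lr=0.12, max_iter=22500, steps=(15000, 20000)
    )
    # Shrink the total batch so that it fits on 2 GPUs.
    print(scale_solver(reference, 16))
```

The scaled values can be written into the YAML config under `SOLVER`. If this repository's `train_net.py` uses Detectron2's default argument parser (an assumption), they can also be passed as trailing `KEY VALUE` overrides on the command line, for example `SOLVER.IMS_PER_BATCH 16 SOLVER.BASE_LR 0.03`.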

wanghangege · 2021-04-14T10:40:44Z

Thank you, I solved it!

chensnathan closed this as completed Apr 14, 2021
