OOM? What is this error? #14

Closed
wanghangege opened this issue Apr 14, 2021 · 4 comments

Comments

@wanghangege

[04/14 16:15:25 d2.engine.hooks]: Total training time: 0:00:24 (0:00:00 on hooks)
[04/14 16:15:25 d2.utils.events]: iter: 0 lr: N/A max_mem: 7597M
Traceback (most recent call last):
File "./tools/train_net.py", line 234, in <module>
args=(args,),
File "/media/ubun/BE5A462D5A45E32F/detectron2/detectron2/engine/launch.py", line 79, in launch
daemon=False,
File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/media/ubun/BE5A462D5A45E32F/detectron2/detectron2/engine/launch.py", line 125, in _distributed_worker
main_func(*args)
File "/media/ubun/BE5A462D5A45E32F/detectron2/YOLOF/tools/train_net.py", line 221, in main
return trainer.train()
File "/media/ubun/BE5A462D5A45E32F/detectron2/detectron2/engine/defaults.py", line 480, in train
super().train(self.start_iter, self.max_iter)
File "/media/ubun/BE5A462D5A45E32F/detectron2/detectron2/engine/train_loop.py", line 149, in train
self.run_step()
File "/media/ubun/BE5A462D5A45E32F/detectron2/detectron2/engine/defaults.py", line 490, in run_step
self._trainer.run_step()
File "/media/ubun/BE5A462D5A45E32F/detectron2/detectron2/engine/train_loop.py", line 273, in run_step
loss_dict = self.model(data)
File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/media/ubun/BE5A462D5A45E32F/detectron2/YOLOF/yolof/modeling/yolof.py", line 273, in forward
features = self.backbone(images.tensor)
File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/media/ubun/BE5A462D5A45E32F/detectron2/detectron2/modeling/backbone/resnet.py", line 449, in forward
x = stage(x)
File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/nn/modules/container.py", line 119, in forward
input = module(input)
File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/media/ubun/BE5A462D5A45E32F/detectron2/detectron2/modeling/backbone/resnet.py", line 201, in forward
out = self.conv3(out)
File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/media/ubun/BE5A462D5A45E32F/detectron2/detectron2/layers/wrappers.py", line 88, in forward
x = self.norm(x)
File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/media/ubun/BE5A462D5A45E32F/detectron2/detectron2/layers/batch_norm.py", line 65, in forward
eps=self.eps,
File "/home/ubun/anaconda3/envs/detectrontwo/lib/python3.6/site-packages/torch/nn/functional.py", line 2150, in batch_norm
input, weight, bias, running_mean, running_var, training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 2.04 GiB (GPU 1; 11.78 GiB total capacity; 5.89 GiB already allocated; 751.50 MiB free; 9.00 GiB reserved in total by PyTorch)
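The numbers in the final `RuntimeError` line can be compared directly. The following is a minimal sketch, not part of the original report, that parses the message and checks the requested allocation against the free memory. The helper name `parse_oom` is hypothetical, and the regular expressions assume the message format shown above. The comparison is deliberately simple: memory that PyTorch has reserved but not allocated may be reusable in principle, but it can be too fragmented to serve one large request.

```python
import re

# Convert the units used in the message to GiB.
UNITS = {"MiB": 1.0 / 1024, "GiB": 1.0}

def parse_oom(message):
    """Extract the sizes (in GiB) from a PyTorch CUDA out-of-memory message."""
    sizes = {}
    requested = re.search(r"Tried to allocate ([\d.]+) (MiB|GiB)", message)
    sizes["requested"] = float(requested.group(1)) * UNITS[requested.group(2)]
    pattern = r"([\d.]+) (MiB|GiB) (total capacity|already allocated|free|reserved)"
    for value, unit, label in re.findall(pattern, message):
        sizes[label] = float(value) * UNITS[unit]
    return sizes

message = (
    "CUDA out of memory. Tried to allocate 2.04 GiB (GPU 1; 11.78 GiB total "
    "capacity; 5.89 GiB already allocated; 751.50 MiB free; 9.00 GiB reserved "
    "in total by PyTorch)"
)
sizes = parse_oom(message)

# The single batch_norm call asks for more than the device has free.
print(sizes["requested"] > sizes["free"])
# Memory held by PyTorch's allocator but not currently in use by tensors.
print(round(sizes["reserved"] - sizes["already allocated"], 2))
```

In this trace a single `batch_norm` call asks for about 2 GiB, which suggests the per-GPU batch is too large for this card.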

@chensnathan
Owner

Hi,
Could you post your training log file? Details will be much more helpful for debugging.

@wanghangege
Author

> Hi,
> Could you post your training log file? Details will be much more helpful for debugging.

First of all, thank you for your reply! To restate the problem:
When I run the example command
"python ./tools/train_net.py --num-gpus 1 --config-file ./configs/yolof_R_50_C5_1x.yaml",
the out-of-memory error appears (the traceback posted above).
So I went to the config file "Base-YOLOF.yaml"
and changed its parameter to "IMS_PER_BATCH: 32". With a batch size of 32, training runs.
My environment is:
GPU: 2 x Titan V (11 GB each),
pytorch=1.8.1,
python=3.6,
cuda=10.1, cudnn=7.6.3.
Is this caused by insufficient GPU memory, or did I make a mistake? Some config files don't have a batch size setting. Where can I define it?
Looking forward to your reply, thank you again!
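On the question of where the batch size lives: Detectron2 configs can inherit from a parent file through a `_BASE_` key, so a value missing from one YAML file is normally taken from its base file or from the library defaults. The sketch below imitates that layering with plain dictionaries. It is an illustration only: the function `merge` is hypothetical, the numeric values are placeholders rather than values read from the repository, and it assumes that `yolof_R_50_C5_1x.yaml` inherits from `Base-YOLOF.yaml`.

```python
def merge(base, override):
    """Recursively overlay `override` on `base`, returning a new dict."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Placeholder values standing in for the base file, the child file,
# and a per-run override. None of them are read from the real configs.
base_cfg = {"SOLVER": {"IMS_PER_BATCH": 64, "BASE_LR": 0.12}}
child_cfg = {"SOLVER": {"MAX_ITER": 22500}}
run_override = {"SOLVER": {"IMS_PER_BATCH": 16}}

# The child never mentions IMS_PER_BATCH, so it inherits the base value.
inherited = merge(base_cfg, child_cfg)
print(inherited["SOLVER"]["IMS_PER_BATCH"])

# A later override wins over everything merged before it.
final = merge(inherited, run_override)
print(final["SOLVER"])
```

In practice, the same effect can be had by adding the key to the child YAML file. If `train_net.py` uses Detectron2's standard argument parser (an assumption here), `KEY VALUE` pairs such as `SOLVER.IMS_PER_BATCH 16` can also be appended to the command line.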

@chensnathan
Owner

The settings are designed for 8 GPUs. When you use 2 GPUs, you should adjust them according to the Detectron2 guidelines.
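One common way to make that adjustment is the linear scaling rule: scale the total batch size and learning rate by the ratio of GPU counts, and stretch the schedule by the inverse ratio so the model sees the same number of images. The sketch below is an illustration under assumptions, not the maintainer's exact recipe. The function `scale_solver` is hypothetical, and the reference values are example numbers, not values read from the YOLOF configs.

```python
def scale_solver(solver, ref_gpus, new_gpus):
    """Apply the linear scaling rule when moving from ref_gpus to new_gpus."""
    factor = new_gpus / ref_gpus
    return {
        # Fewer GPUs: smaller total batch and proportionally smaller LR.
        "IMS_PER_BATCH": int(round(solver["IMS_PER_BATCH"] * factor)),
        "BASE_LR": solver["BASE_LR"] * factor,
        # Schedule is stretched so the number of images seen stays the same.
        "MAX_ITER": int(round(solver["MAX_ITER"] / factor)),
        "STEPS": tuple(int(round(s / factor)) for s in solver["STEPS"]),
    }

# Example reference settings for 8 GPUs (assumed values for illustration).
reference = {
    "IMS_PER_BATCH": 64,
    "BASE_LR": 0.12,
    "MAX_ITER": 22500,
    "STEPS": (15000, 20000),
}

scaled = scale_solver(reference, ref_gpus=8, new_gpus=2)
print(scaled)
```

With these example numbers, the per-GPU batch stays at 8 images while the total batch drops from 64 to 16, which is what keeps memory use per card unchanged.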

@wanghangege
Author

> The settings are designed for 8 GPUs. When you use 2 GPUs, you should adjust them according to the Detectron2 guidelines.

Thank you, I solved it!
