Default process group is not initialized #10

Closed
Ilin3170 opened this issue Apr 2, 2021 · 2 comments

Ilin3170 commented Apr 2, 2021

[04/02 17:29:16 fvcore.common.checkpoint]: The checkpoint state_dict contains keys that are not used by the model:
backbone.classifier.{bias, weight}
[04/02 17:29:16 d2.engine.train_loop]: Starting training from iteration 0
ERROR [04/02 17:29:18 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
File "/home/wangyixin/detectron2/detectron2/engine/train_loop.py", line 138, in train
self.run_step()
File "/home/wangyixin/detectron2/detectron2/engine/defaults.py", line 441, in run_step
self._trainer.run_step()
File "/home/wangyixin/detectron2/detectron2/engine/train_loop.py", line 232, in run_step
loss_dict = self.model(data)
File "/home/wangyixin/miniconda3/envs/py3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/wangyixin/YOLOF/yolof/modeling/yolof.py", line 273, in forward
features = self.backbone(images.tensor)
File "/home/wangyixin/miniconda3/envs/py3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/wangyixin/YOLOF/yolof/modeling/backbone/darknet.py", line 368, in forward
x = self.bn1(x)
File "/home/wangyixin/miniconda3/envs/py3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/wangyixin/miniconda3/envs/py3.7/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 519, in forward
world_size = torch.distributed.get_world_size(process_group)
File "/home/wangyixin/miniconda3/envs/py3.7/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 625, in get_world_size
return _get_group_size(group)
File "/home/wangyixin/miniconda3/envs/py3.7/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 220, in _get_group_size
_check_default_pg()
File "/home/wangyixin/miniconda3/envs/py3.7/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 211, in _check_default_pg
"Default process group is not initialized"
AssertionError: Default process group is not initialized
[04/02 17:29:18 d2.engine.hooks]: Total training time: 0:00:01 (0:00:00 on hooks)
[04/02 17:29:18 d2.utils.events]: iter: 0 lr: N/A max_mem: 1401M
Traceback (most recent call last):
File "./tools/train_net.py", line 234, in <module>
args=(args,),
File "/home/wangyixin/detectron2/detectron2/engine/launch.py", line 62, in launch
main_func(*args)
File "./tools/train_net.py", line 221, in main
return trainer.train()
File "/home/wangyixin/detectron2/detectron2/engine/defaults.py", line 431, in train
super().train(self.start_iter, self.max_iter)
File "/home/wangyixin/detectron2/detectron2/engine/train_loop.py", line 138, in train
self.run_step()
File "/home/wangyixin/detectron2/detectron2/engine/defaults.py", line 441, in run_step
self._trainer.run_step()
File "/home/wangyixin/detectron2/detectron2/engine/train_loop.py", line 232, in run_step
loss_dict = self.model(data)
File "/home/wangyixin/miniconda3/envs/py3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/wangyixin/YOLOF/yolof/modeling/yolof.py", line 273, in forward
features = self.backbone(images.tensor)
File "/home/wangyixin/miniconda3/envs/py3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/wangyixin/YOLOF/yolof/modeling/backbone/darknet.py", line 368, in forward
x = self.bn1(x)
File "/home/wangyixin/miniconda3/envs/py3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/wangyixin/miniconda3/envs/py3.7/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 519, in forward
world_size = torch.distributed.get_world_size(process_group)
File "/home/wangyixin/miniconda3/envs/py3.7/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 625, in get_world_size
return _get_group_size(group)
File "/home/wangyixin/miniconda3/envs/py3.7/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 220, in _get_group_size
_check_default_pg()
File "/home/wangyixin/miniconda3/envs/py3.7/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 211, in _check_default_pg
"Default process group is not initialized"
AssertionError: Default process group is not initialized

Ilin3170 commented Apr 2, 2021

Running python ./tools/train_net.py --config-file ./configs/yolof_CSP_D_53_DC5_3x.yaml raises the error above, but running python ./tools/train_net.py --config-file ./configs/yolof_R_50_C5_1x.yaml works fine. What is going wrong here?

@chensnathan
Owner

yolof_CSP_D_53_DC5_3x uses SyncBN while yolof_R_50_C5_1x uses normal BN.

You can run yolof_CSP_D_53_DC5_3x with multiple GPUs or replace all SyncBN with BN.
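
For the second option, here is a minimal sketch of swapping SyncBN layers for plain BN on an already-built model. It assumes the SyncBN layers are torch.nn.SyncBatchNorm instances (the traceback above goes through torch/nn/modules/batchnorm.py and fails in torch.distributed.get_world_size, which needs an initialized process group) and that they receive 4D image tensors, so nn.BatchNorm2d is a suitable replacement. The helper name replace_syncbn_with_bn is hypothetical and is not part of YOLOF or detectron2.

```python
import torch
from torch import nn


def replace_syncbn_with_bn(module: nn.Module) -> nn.Module:
    """Recursively swap every nn.SyncBatchNorm for an nn.BatchNorm2d.

    Hypothetical helper: assumes all SyncBN layers see (N, C, H, W) inputs.
    """
    if isinstance(module, nn.SyncBatchNorm):
        bn = nn.BatchNorm2d(
            module.num_features,
            eps=module.eps,
            momentum=module.momentum,
            affine=module.affine,
            track_running_stats=module.track_running_stats,
        )
        # Keep the new layer on the same device as the layer it replaces.
        tensors = list(module.parameters()) + list(module.buffers())
        if tensors:
            bn = bn.to(tensors[0].device)
        # Both classes share the same state_dict keys, so weights and
        # running statistics carry over unchanged.
        bn.load_state_dict(module.state_dict())
        return bn
    # Copy the child list first so reassignment does not disturb iteration.
    for name, child in list(module.named_children()):
        setattr(module, name, replace_syncbn_with_bn(child))
    return module


# Toy model standing in for the real backbone (an assumption for illustration).
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.SyncBatchNorm(8), nn.ReLU())
model = replace_syncbn_with_bn(model)
out = model(torch.randn(2, 3, 16, 16))  # runs without a process group
```

In a real run, the helper would be applied to the model after it is built and after the checkpoint is loaded, before training starts. Changing the norm type in the config instead would avoid the post-hoc swap, but the exact config keys depend on the YOLOF config files and are not shown in this thread.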
