After inference finished and evaluation started, I came across this error. The logs are below:
86%|########5 | 5173/6019 [00:09<00:01, 785.89it/s]
87%|########7 | 5260/6019 [00:09<00:01, 694.86it/s]
89%|########8 | 5336/6019 [00:10<00:01, 583.32it/s]
90%|########9 | 5401/6019 [00:10<00:01, 567.95it/s]
94%|#########3| 5651/6019 [00:10<00:00, 998.96it/s]
97%|#########7| 5853/6019 [00:10<00:00, 1245.09it/s]
[E ProcessGroupNCCL.cpp:566] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1806871 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:566] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1806762 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:566] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1806761 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1806871 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1806761 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1806762 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:566] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1807369 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:566] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1807403 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:566] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1807494 milliseconds before timing out.
Traceback (most recent call last):
File "tools/train.py", line 253, in
main()
File "tools/train.py", line 248, in main
meta=meta)
File "/running_package/detr3d-main/mmdetection3d/mmdet3d/apis/train.py", line 71, in train_model
meta=meta)
File "/running_package/detr3d-main/DETR3D/lib/python3.6/site-packages/mmdet/apis/train.py", line 208, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/running_package/detr3d-main/DETR3D/lib/python3.6/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
epoch_runner(data_loaders[i], **kwargs)
File "/running_package/detr3d-main/DETR3D/lib/python3.6/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
self.run_iter(data_batch, train_mode=True, **kwargs)
File "/running_package/detr3d-main/DETR3D/lib/python3.6/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
**kwargs)
File "/running_package/detr3d-main/DETR3D/lib/python3.6/site-packages/mmcv/parallel/distributed.py", line 52, in train_step
output = self.module.train_step(*inputs[0], **kwargs[0])
File "/running_package/detr3d-main/DETR3D/lib/python3.6/site-packages/mmdet/models/detectors/base.py", line 249, in train_step
loss, log_vars = self._parse_losses(losses)
File "/running_package/detr3d-main/DETR3D/lib/python3.6/site-packages/mmdet/models/detectors/base.py", line 209, in _parse_losses
'loss log variables are different across GPUs!\n' + message
AssertionError: loss log variables are different across GPUs!
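For context, the assertion at the bottom of the traceback is raised by mmdet's BaseDetector._parse_losses, which all_reduces the number of loss log variables so that every rank is guaranteed to log the same set; if one rank is still busy (or has already been torn down by the NCCL watchdog), that collective fails or mismatches. A paraphrased sketch of the check, not the exact mmdet source:

import torch
import torch.distributed as dist

def check_log_vars_consistent(log_vars, device):
    """Paraphrased version of the cross-GPU check in _parse_losses."""
    if dist.is_available() and dist.is_initialized():
        # Sum len(log_vars) over all ranks; every rank must contribute the
        # same count, otherwise the later per-variable all_reduce of loss
        # values would desynchronize the ranks.
        log_var_length = torch.tensor(len(log_vars), device=device)
        dist.all_reduce(log_var_length)
        message = (f'rank {dist.get_rank()} '
                   f'len(log_vars): {len(log_vars)} '
                   f'keys: {",".join(log_vars.keys())}')
        assert log_var_length == len(log_vars) * dist.get_world_size(), \
            'loss log variables are different across GPUs!\n' + message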
I worked around this by passing a longer timeout to init_dist in tools/train.py, changing:

# init distributed env first, since logger depends on the dist info.
if args.launcher == 'none':
    distributed = False
else:
    distributed = True
    init_dist(args.launcher, **cfg.dist_params)
    # re-set gpu_ids with distributed training mode
    _, world_size = get_dist_info()
    cfg.gpu_ids = range(world_size)
to:

# init distributed env first, since logger depends on the dist info.
if args.launcher == 'none':
    distributed = False
else:
    distributed = True
    # NOTE: requires `import datetime` at the top of tools/train.py
    init_dist(args.launcher,
              timeout=datetime.timedelta(seconds=18000),
              **cfg.dist_params)
    # re-set gpu_ids with distributed training mode
    _, world_size = get_dist_info()
    cfg.gpu_ids = range(world_size)
After making this change, I have not hit the error again. My guess is that the per-rank inference/evaluation takes longer than the default 30-minute collective timeout (Timeout(ms)=1800000 in the logs above), so the pending ALLREDUCE times out on the other ranks (see this issue).
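For reference, mmcv's init_dist (with the pytorch launcher) forwards extra keyword arguments to torch.distributed.init_process_group, so the change above essentially amounts to the following. A minimal sketch, assuming the RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT/LOCAL_RANK environment variables set by the distributed launcher, with the 18000-second value taken from the snippet above:

import datetime
import os

import torch
import torch.distributed as dist

def init_nccl_with_long_timeout(timeout_s=18000):
    # Bind this process to its GPU before creating the NCCL communicator.
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)
    # The timeout replaces the 30-minute default (the 1800000 ms seen in the
    # watchdog logs), so slow per-rank evaluation no longer trips it.
    dist.init_process_group(
        backend='nccl',
        timeout=datetime.timedelta(seconds=timeout_s))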