
FCOS3D Inference Error: AssertionError: loss log variables are different across GPUs! #1635

Closed
kaixinbear opened this issue Jul 18, 2022 · 2 comments

Comments

@kaixinbear

After inference finished and evaluation was about to start, I ran into this error.
The logs are below:
86%|########5 | 5173/6019 [00:09<00:01, 785.89it/s]
87%|########7 | 5260/6019 [00:09<00:01, 694.86it/s]
89%|########8 | 5336/6019 [00:10<00:01, 583.32it/s]
90%|########9 | 5401/6019 [00:10<00:01, 567.95it/s]
94%|#########3| 5651/6019 [00:10<00:00, 998.96it/s]
97%|#########7| 5853/6019 [00:10<00:00, 1245.09it/s]

[E ProcessGroupNCCL.cpp:566] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1806871 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:566] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1806762 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:566] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1806761 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1806871 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1806761 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1806762 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:566] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1807369 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:566] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1807403 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:566] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1807494 milliseconds before timing out.
Traceback (most recent call last):
File "tools/train.py", line 253, in
main()
File "tools/train.py", line 248, in main
meta=meta)
File "/running_package/detr3d-main/mmdetection3d/mmdet3d/apis/train.py", line 71, in train_model
meta=meta)
File "/running_package/detr3d-main/DETR3D/lib/python3.6/site-packages/mmdet/apis/train.py", line 208, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/running_package/detr3d-main/DETR3D/lib/python3.6/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
epoch_runner(data_loaders[i], **kwargs)
File "/running_package/detr3d-main/DETR3D/lib/python3.6/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
self.run_iter(data_batch, train_mode=True, **kwargs)
File "/running_package/detr3d-main/DETR3D/lib/python3.6/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
**kwargs)
File "/running_package/detr3d-main/DETR3D/lib/python3.6/site-packages/mmcv/parallel/distributed.py", line 52, in train_step
output = self.module.train_step(*inputs[0], **kwargs[0])
File "/running_package/detr3d-main/DETR3D/lib/python3.6/site-packages/mmdet/models/detectors/base.py", line 249, in train_step
loss, log_vars = self._parse_losses(losses)
File "/running_package/detr3d-main/DETR3D/lib/python3.6/site-packages/mmdet/models/detectors/base.py", line 209, in _parse_losses
'loss log variables are different across GPUs!\n' + message
AssertionError: loss log variables are different across GPUs!
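For context: the assertion itself comes from a consistency check in mmdet's `BaseDetector._parse_losses`. Roughly (paraphrased from mmdet 2.x, so the exact code may differ), each rank all-reduces the number of loss log variables it produced and asserts that the sum equals `len(log_vars) * world_size`. When some ranks die or hang on a collective, as the NCCL watchdog timeouts above show, the surviving ranks see a mismatched count and the assertion fires, so it is a symptom rather than the root cause:

    import torch
    import torch.distributed as dist

    def _check_log_vars_consistent(log_vars, loss):
        """Sketch of the check behind the assertion (paraphrased from
        mmdet's BaseDetector._parse_losses; not the verbatim source)."""
        if dist.is_available() and dist.is_initialized():
            # Every rank contributes the number of loss log variables it has.
            log_var_length = torch.tensor(len(log_vars), device=loss.device)
            dist.all_reduce(log_var_length)
            message = (f'rank {dist.get_rank()} '
                       f'len(log_vars): {len(log_vars)} '
                       f'keys: {",".join(log_vars.keys())}')
            # If any rank hangs or exits, the reduced count no longer matches.
            assert log_var_length == len(log_vars) * dist.get_world_size(), \
                'loss log variables are different across GPUs!\n' + message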

@Tai-Wang (Member) commented Aug 3, 2022

@lianqing11 Please share your idea here if you have any successful experiences afterward.

@lianqing11 (Collaborator)

Hi @kaixinbear, could you try changing this code in tools/train.py from

    # init distributed env first, since logger depends on the dist info.
    if args.launcher == 'none':
        distributed = False
    else:
        distributed = True
        init_dist(args.launcher, **cfg.dist_params)
        # re-set gpu_ids with distributed training mode
        _, world_size = get_dist_info()
        cfg.gpu_ids = range(world_size)

to

    # init distributed env first, since logger depends on the dist info.
    if args.launcher == 'none':
        distributed = False
    else:
        distributed = True
        # NOTE: this requires `import datetime` at the top of tools/train.py.
        init_dist(args.launcher,
                  timeout=datetime.timedelta(seconds=18000),
                  **cfg.dist_params)
        # re-set gpu_ids with distributed training mode
        _, world_size = get_dist_info()
        cfg.gpu_ids = range(world_size)

After doing this, I have not run into this error again. My guess is that inference/evaluation takes longer than the default NCCL collective timeout (30 minutes, i.e. the 1800000 ms shown in the watchdog logs above), so the process group is torn down and the subsequent loss consistency check fails. (see this issue)
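For reference, mmcv's `init_dist` passes extra keyword arguments through to `torch.distributed.init_process_group`, so the change above amounts to raising the NCCL collective timeout at process-group initialization. A minimal standalone sketch of the same idea (assuming the `pytorch` launcher, with `RANK`/`LOCAL_RANK`/`MASTER_ADDR`/`MASTER_PORT` set by the launcher):

    import datetime
    import os

    import torch
    import torch.distributed as dist

    # Sketch only: raise the collective timeout well above the 30-minute
    # default so a long evaluation on one rank does not trip the watchdog.
    local_rank = int(os.environ.get('LOCAL_RANK', 0))
    torch.cuda.set_device(local_rank)
    dist.init_process_group(
        backend='nccl',
        timeout=datetime.timedelta(seconds=18000),
    )

18000 seconds is 5 hours, matching the snippet above; any value comfortably longer than a full evaluation pass should work.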
