
FCOS3D Inference Error: AssertionError: loss log variables are different across GPUs! #1635

Closed
kaixinbear opened this issue Jul 18, 2022 · 2 comments

Comments

@kaixinbear

After inference finished and evaluation was about to start, I ran into this error.
The logs are below:
86%|########5 | 5173/6019 [00:09<00:01, 785.89it/s]
87%|########7 | 5260/6019 [00:09<00:01, 694.86it/s]
89%|########8 | 5336/6019 [00:10<00:01, 583.32it/s]
90%|########9 | 5401/6019 [00:10<00:01, 567.95it/s]
94%|#########3| 5651/6019 [00:10<00:00, 998.96it/s]
97%|#########7| 5853/6019 [00:10<00:00, 1245.09it/s]

[E ProcessGroupNCCL.cpp:566] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1806871 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:566] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1806762 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:566] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1806761 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1806871 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1806761 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1806762 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:566] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1807369 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:566] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1807403 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:566] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1807494 milliseconds before timing out.
Traceback (most recent call last):
File "tools/train.py", line 253, in
main()
File "tools/train.py", line 248, in main
meta=meta)
File "/running_package/detr3d-main/mmdetection3d/mmdet3d/apis/train.py", line 71, in train_model
meta=meta)
File "/running_package/detr3d-main/DETR3D/lib/python3.6/site-packages/mmdet/apis/train.py", line 208, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/running_package/detr3d-main/DETR3D/lib/python3.6/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
epoch_runner(data_loaders[i], **kwargs)
File "/running_package/detr3d-main/DETR3D/lib/python3.6/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
self.run_iter(data_batch, train_mode=True, **kwargs)
File "/running_package/detr3d-main/DETR3D/lib/python3.6/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
**kwargs)
File "/running_package/detr3d-main/DETR3D/lib/python3.6/site-packages/mmcv/parallel/distributed.py", line 52, in train_step
output = self.module.train_step(*inputs[0], **kwargs[0])
File "/running_package/detr3d-main/DETR3D/lib/python3.6/site-packages/mmdet/models/detectors/base.py", line 249, in train_step
loss, log_vars = self._parse_losses(losses)
File "/running_package/detr3d-main/DETR3D/lib/python3.6/site-packages/mmdet/models/detectors/base.py", line 209, in _parse_losses
'loss log variables are different across GPUs!\n' + message
AssertionError: loss log variables are different across GPUs!
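For context: the assertion itself comes from a consistency check in mmdet's `BaseDetector._parse_losses`. Roughly (paraphrased from mmdet 2.x, so the exact code may differ), each rank all-reduces the number of loss log variables it produced and asserts that the sum equals `len(log_vars) * world_size`. When some ranks die or hang on a collective, as the NCCL watchdog timeouts above show, the surviving ranks see a mismatched count and the assertion fires, so it is a symptom rather than the root cause:

    import torch
    import torch.distributed as dist

    def _check_log_vars_consistent(log_vars, loss):
        """Sketch of the check behind the assertion (paraphrased from
        mmdet's BaseDetector._parse_losses; not the verbatim source)."""
        if dist.is_available() and dist.is_initialized():
            # Every rank contributes the number of loss log variables it has.
            log_var_length = torch.tensor(len(log_vars), device=loss.device)
            dist.all_reduce(log_var_length)
            message = (f'rank {dist.get_rank()} '
                       f'len(log_vars): {len(log_vars)} '
                       f'keys: {",".join(log_vars.keys())}')
            # If any rank hangs or exits, the reduced count no longer matches.
            assert log_var_length == len(log_vars) * dist.get_world_size(), \
                'loss log variables are different across GPUs!\n' + message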

@Tai-Wang (Member) commented Aug 3, 2022

@lianqing11 Please share your idea here if you have any successful experiences afterward.

@lianqing11 (Collaborator)

Hi @kaixinbear, could you try changing this code in tools/train.py from

    # init distributed env first, since logger depends on the dist info.
    if args.launcher == 'none':
        distributed = False
    else:
        distributed = True
        init_dist(args.launcher, **cfg.dist_params)
        # re-set gpu_ids with distributed training mode
        _, world_size = get_dist_info()
        cfg.gpu_ids = range(world_size)

to

    # init distributed env first, since logger depends on the dist info.
    if args.launcher == 'none':
        distributed = False
    else:
        distributed = True
        # NOTE: this requires `import datetime` at the top of tools/train.py.
        init_dist(args.launcher,
                  timeout=datetime.timedelta(seconds=18000),
                  **cfg.dist_params)
        # re-set gpu_ids with distributed training mode
        _, world_size = get_dist_info()
        cfg.gpu_ids = range(world_size)

After doing this, I have not run into this error again. My guess is that inference/evaluation takes longer than the default NCCL collective timeout (30 minutes, i.e. the 1800000 ms shown in the watchdog logs above), so the process group is torn down and the subsequent loss consistency check fails. (see this issue)
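For reference, mmcv's `init_dist` passes extra keyword arguments through to `torch.distributed.init_process_group`, so the change above amounts to raising the NCCL collective timeout at process-group initialization. A minimal standalone sketch of the same idea (assuming the `pytorch` launcher, with `RANK`/`LOCAL_RANK`/`MASTER_ADDR`/`MASTER_PORT` set by the launcher):

    import datetime
    import os

    import torch
    import torch.distributed as dist

    # Sketch only: raise the collective timeout well above the 30-minute
    # default so a long evaluation on one rank does not trip the watchdog.
    local_rank = int(os.environ.get('LOCAL_RANK', 0))
    torch.cuda.set_device(local_rank)
    dist.init_process_group(
        backend='nccl',
        timeout=datetime.timedelta(seconds=18000),
    )

18000 seconds is 5 hours, matching the snippet above; any value comfortably longer than a full evaluation pass should work.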
