Watchdog caught collective operation timeout #14

Open
minji-o-j opened this issue Jul 27, 2023 · 4 comments
minji-o-j commented Jul 27, 2023

27 Jul 06:20    INFO Soft link created: saved/PTG-dd-2023-Jul-27_01-10-24/checkpoint_best -> /workspace/TextBox/saved/PTG-dd-2023-Jul-27_01-10-24/checkpoint_epoch-5
27 Jul 06:20    INFO ====== Finished training, best validation result at train epoch 5 ======
27 Jul 06:20    INFO Best valid result: score: 65.06, <bleu-1: 33.65>, <bleu-2: 31.41>, bleu-3: 33.47, bleu-4: 32.60, distinct-1: 3.13, distinct-2: 12.90, distinct-3: 20.85, distinct-4: 26.43
27 Jul 06:20    INFO Loading model structure and parameters from saved/PTG-dd-2023-Jul-27_01-10-24/checkpoint_best ...

[E ProcessGroupNCCL.cpp:821] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=19450, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801798 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=19450, OpType=ALLGATHER, Timeout(ms)=1800000) ran for 1801798 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 491) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/opt/conda/lib/python3.9/site-packages/accelerate/commands/launch.py", line 950, in launch_command
    multi_gpu_launcher(args)
  File "/opt/conda/lib/python3.9/site-packages/accelerate/commands/launch.py", line 642, in multi_gpu_launcher
    distrib_run.run(args)
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
====================================================
run_textbox.py FAILED
----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-07-27_06:51:11
  host      : b94bc7c0de46
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 491)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 491
====================================================
@minji-o-j (Author)

WongKinYiu/yolov7#714

@minji-o-j (Author)

export NCCL_P2P_LEVEL=NVL
Enter this in the terminal (cmd) each time!
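
For reference, a minimal sketch of applying the same setting from inside the training script instead of the shell (the placement in run_textbox.py is an assumption; the variable only takes effect if it is set before the NCCL process group is created):

import os

# Assumed placement: at the very top of run_textbox.py, before Accelerator /
# torch.distributed initialize the NCCL process group.
# NCCL_P2P_LEVEL=NVL limits GPU peer-to-peer transfers to NVLink-connected
# pairs, which is the workaround suggested above for the allgather hang.
os.environ.setdefault("NCCL_P2P_LEVEL", "NVL")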

@minji-o-j (Author)

01 Aug 11:14    INFO  Validation  6 [time: 537.22s, score: 84.19, <bleu-1: 45.70>, <bleu-2: 38.49>, bleu-3: 38.10, bleu-4: 35.58, distinct-1: 1.44, distinct-2: 7.43, distinct-3: 15.43, distinct-4: 24.47]
01 Aug 11:15    INFO Early stopped at 3 non-best validation.
01 Aug 11:15    INFO Soft link created: saved/PTG-pc-2023-Aug-01_03-24-10/checkpoint_best -> /workspace/TextBox/saved/PTG-pc-2023-Aug-01_03-24-10/checkpoint_epoch-3
01 Aug 11:15    INFO ====== Finished training, best validation result at train epoch 3 ======
01 Aug 11:15    INFO Best valid result: score: 87.05, <bleu-1: 48.09>, <bleu-2: 38.96>, bleu-3: 37.21, bleu-4: 33.89, distinct-1: 1.24, distinct-2: 6.65, distinct-3: 14.27, distinct-4: 22.85
01 Aug 11:15    INFO Loading model structure and parameters from saved/PTG-pc-2023-Aug-01_03-24-10/checkpoint_best ...
[E ProcessGroupNCCL.cpp:821] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=183888, OpType=ALLGATHER, Timeout(ms)=5400000) ran for 5408715 milliseconds before timing out.
01 Aug 12:45    ERROR Traceback (most recent call last):
  File "/workspace/TextBox/textbox/utils/dashboard.py", line 316, in new_experiment
    yield True
  File "/workspace/TextBox/textbox/quick_start/experiment.py", line 130, in run
    self._do_test()
  File "/workspace/TextBox/textbox/quick_start/experiment.py", line 112, in _do_test
    self.test_result = self.trainer.evaluate(self.test_data, load_best_model=self.do_train)
  File "/opt/conda/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/TextBox/textbox/trainer/trainer.py", line 481, in evaluate
    self.model = self.accelerator.prepare(self.model)
  File "/opt/conda/lib/python3.9/site-packages/accelerate/accelerator.py", line 1199, in prepare
    result = tuple(
  File "/opt/conda/lib/python3.9/site-packages/accelerate/accelerator.py", line 1200, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/opt/conda/lib/python3.9/site-packages/accelerate/accelerator.py", line 1027, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/opt/conda/lib/python3.9/site-packages/accelerate/accelerator.py", line 1295, in prepare_model
    model = torch.nn.parallel.DistributedDataParallel(
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 655, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/utils.py", line 112, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: DDP expects same model across all ranks, but Rank 0 has 521 params, while rank 1 has inconsistent 24 params.

[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=183888, OpType=ALLGATHER, Timeout(ms)=5400000) ran for 5408715 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 423) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/opt/conda/lib/python3.9/site-packages/accelerate/commands/launch.py", line 950, in launch_command
    multi_gpu_launcher(args)
  File "/opt/conda/lib/python3.9/site-packages/accelerate/commands/launch.py", line 642, in multi_gpu_launcher
    distrib_run.run(args)
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
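
Two things stand out in this second run: the NCCL timeout had already been raised (Timeout(ms)=5400000 instead of the default 1800000), and the traceback points to a DDP parameter check failing inside accelerator.prepare when the model is re-wrapped for evaluation (Rank 0 has 521 params, rank 1 has 24). For the timeout part, a hedged sketch of how such a limit can be configured when launching through accelerate, not TextBox's actual code:

from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Raise the NCCL collective timeout from the default 30 minutes to 90 minutes,
# matching the Timeout(ms)=5400000 value seen in the log above.
ipg_kwargs = InitProcessGroupKwargs(timeout=timedelta(milliseconds=5_400_000))
accelerator = Accelerator(kwargs_handlers=[ipg_kwargs])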

zmtttt commented Aug 30, 2024

Hello! Have you solved this problem? I've run into the same issue.
