Multi-GPU Training + Single-GPU Eval runs into timeout #223
Comments
The timeout at 30 minutes comes from PyTorch, but you can adjust it when initializing the distributed process group. Accelerate does it automatically, but only if you haven't done it yourself in the script. I'll expose that argument this week or the next, but in the meantime, you can use this line as a workaround:
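A minimal sketch of what that workaround can look like, assuming the `InitProcessGroupKwargs` handler from `accelerate` (the 7200-second value is just an example; pick whatever fits your evaluation time):

```python
from datetime import timedelta

from accelerate import Accelerator, InitProcessGroupKwargs

# Raise the distributed timeout so ranks waiting on the main process
# don't abort while it runs a long single-GPU evaluation.
# This only takes effect if the script has not already called
# torch.distributed.init_process_group itself.
init_kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=7200))
accelerator = Accelerator(kwargs_handlers=[init_kwargs])
```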
The default is 3600.
I didn't know how to set this as an argument for the setup I am using, but I found the argument I was looking for. Here is my example terminal command.
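One way such a command can look, assuming a Transformers `Trainer`-based script that accepts training arguments on the command line (the script name, flags, and the 7200-second value below are hypothetical examples, not the commenter's actual command):

```bash
torchrun --nproc_per_node 4 train.py \
  --output_dir ./outputs \
  --ddp_timeout 7200  # seconds; forwarded to the distributed process group init
```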
@sgugger if I use FSDP distributed fine-tuning, is the timeout controlled by the same parameter?
Hi everyone,
We run into a timeout when we evaluate for more than 30 minutes on a single GPU. Is there a way to tell the other GPUs to wait until the main GPU completes the evaluation?
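For context, a minimal sketch of this pattern, assuming an `accelerate`-style script (the evaluation helper is hypothetical): all ranks train, only the main process evaluates, and the other ranks block at the barrier; if evaluation takes longer than the distributed timeout, they error out.

```python
from accelerate import Accelerator

accelerator = Accelerator()

# ... multi-GPU training loop runs on every process ...

if accelerator.is_main_process:
    run_evaluation()  # hypothetical long-running single-GPU evaluation

# Non-main ranks wait here; if the evaluation above exceeds the
# distributed timeout (30 minutes by default), they raise a timeout error.
accelerator.wait_for_everyone()
```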
@sgugger Can you please have a look?