
Synchronous Multi-node RL #152

Open · rafapi wants to merge 64 commits into main
Conversation

rafapi (Collaborator) commented Dec 18, 2024

Synchronous Multi-node RL

Distributed Training Enhancements

  • Robust distributed training support across multiple nodes
  • New DistributedManager class to handle distributed operations (see the sketch after this list)
  • File-based synchronization mechanisms for non-distributed phases
  • Handling of multi-node NCCL configurations
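
A minimal sketch of how a DistributedManager along these lines could be structured; the class name comes from this PR, but the methods and defaults below are assumptions rather than the actual implementation:

```python
# Hypothetical sketch of a DistributedManager; method names and defaults are
# assumptions, not the PR's actual implementation.
import os
import datetime
import torch
import torch.distributed as dist


class DistributedManager:
    """Thin wrapper around torch.distributed for multi-node NCCL setups."""

    def __init__(self, timeout_minutes: int = 30):
        # torchrun / deepspeed launchers export these variables on every node.
        self.rank = int(os.environ.get("RANK", 0))
        self.local_rank = int(os.environ.get("LOCAL_RANK", 0))
        self.world_size = int(os.environ.get("WORLD_SIZE", 1))
        self.timeout = datetime.timedelta(minutes=timeout_minutes)

    def init(self):
        """Initialize the NCCL process group once per process."""
        if self.world_size > 1 and not dist.is_initialized():
            torch.cuda.set_device(self.local_rank)
            dist.init_process_group(backend="nccl", timeout=self.timeout)

    @property
    def is_main_process(self) -> bool:
        return self.rank == 0

    def cleanup(self):
        if dist.is_initialized():
            dist.destroy_process_group()
```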

DeepSpeed Integration

  • New DeepSpeed configuration for multi-node scenarios (ds_multinode.json; see the sketch below)
  • Fixed DeepSpeed checkpoint handling and model saving
  • Enhanced optimizer configurations for distributed training
  • Added support for CPU Adam optimizer in distributed settings
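
For illustration, a multi-node DeepSpeed configuration with optimizer state offloaded to CPU Adam could look roughly like the dict below; the actual contents of ds_multinode.json are not shown in this PR description, so all values here are placeholders:

```python
# Rough shape of a multi-node DeepSpeed config with CPU Adam offload.
# Values are placeholders; the PR's actual ds_multinode.json may differ.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "optimizer": {
        "type": "Adam",
        "params": {"lr": 1e-6, "betas": [0.9, 0.95], "weight_decay": 0.0},
    },
    "zero_optimization": {
        "stage": 2,
        # Offloading optimizer state to CPU makes DeepSpeed use its CPU Adam kernel.
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "bf16": {"enabled": True},
    "gradient_clipping": 1.0,
}
```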

Data Processing Improvements

  • Implemented data splitting across nodes for better load distribution
  • Added sample gathering from multiple ranks (sketched after this list)
  • Improved error handling and retry mechanisms for data loading
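
A common way to implement the per-node data split and cross-rank sample gathering is with torch.distributed collectives; the helper names below are hypothetical, and the logic is only a sketch of the general pattern:

```python
# Hypothetical helpers for sharding data per rank and gathering samples back;
# a sketch of the general pattern, not the PR's exact code.
import torch.distributed as dist


def split_for_rank(samples: list, rank: int, world_size: int) -> list:
    """Give each rank a contiguous, near-equal slice of the dataset."""
    per_rank = (len(samples) + world_size - 1) // world_size
    start = rank * per_rank
    return samples[start:start + per_rank]


def gather_samples(local_samples: list) -> list:
    """Collect every rank's samples onto all ranks."""
    if not dist.is_initialized():
        return local_samples
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, local_samples)
    # Flatten the per-rank lists into a single list of samples.
    return [s for chunk in gathered for s in chunk]
```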

Memory Management

  • Added GPU resource cleanup functionality (see the sketch below)
  • Improved memory status monitoring
  • Added CUDA cache management
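
The cleanup and monitoring steps above usually reduce to a handful of torch.cuda calls; a sketch with placeholder function names:

```python
# Sketch of GPU cleanup and memory reporting; function names are placeholders.
import gc
import logging

import torch

logger = logging.getLogger(__name__)


def cleanup_gpu_resources():
    """Release cached CUDA memory and collect unreachable Python objects."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()


def log_memory_status(tag: str = ""):
    """Log allocated/reserved CUDA memory in GiB for the current device."""
    if not torch.cuda.is_available():
        return
    allocated = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    logger.info("%s allocated=%.2f GiB reserved=%.2f GiB", tag, allocated, reserved)
```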

Logging and Monitoring

  • Enhanced W&B logging for distributed training
  • Added rank-aware logging (example after this list)
  • Improved error reporting across distributed processes
  • Added execution time tracking for different training phases
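
Rank-aware logging and per-phase timing typically follow the pattern sketched below, with W&B initialized only on rank 0 so the run is not duplicated across nodes; the helper names and logged keys are assumptions:

```python
# Sketch of rank-aware W&B logging and phase timing; names and keys are assumptions.
import os
import time
import logging
from contextlib import contextmanager

import wandb

logger = logging.getLogger(__name__)
RANK = int(os.environ.get("RANK", 0))


def init_wandb(project: str, config: dict):
    # Only rank 0 talks to W&B so a single run is created across all nodes.
    if RANK == 0:
        wandb.init(project=project, config=config)


@contextmanager
def track_time(phase: str):
    """Log how long a training phase took, tagged with the rank."""
    start = time.perf_counter()
    yield
    elapsed = time.perf_counter() - start
    logger.info("[rank %d] %s took %.1f s", RANK, phase, elapsed)
    if RANK == 0:
        wandb.log({f"time/{phase}": elapsed})
```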

Synchronization Mechanisms

  • Added robust barrier implementations with timeout and retry logic (sketched after this list)
  • Implemented file-based synchronization for non-distributed phases
  • Added checks to ensure all nodes complete critical sections
  • Enhanced coordination between nodes during data processing and training
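
A minimal sketch of both synchronization patterns listed above: a barrier wrapped in retry logic, and a file-based check that all nodes have finished a phase where torch.distributed is not active; paths, timeouts, and helper names are illustrative:

```python
# Sketch of retrying barriers and file-based synchronization; the paths,
# retry counts, and helper names are illustrative, not the PR's exact code.
import time
from pathlib import Path

import torch.distributed as dist


def robust_barrier(max_retries: int = 3, retry_delay_s: float = 10.0):
    """Call dist.barrier(), retrying a few times before giving up."""
    for attempt in range(max_retries):
        try:
            dist.barrier()
            return
        except RuntimeError:
            if attempt == max_retries - 1:
                raise
            time.sleep(retry_delay_s)


def wait_for_all_nodes(sync_dir: Path, rank: int, world_size: int,
                       timeout_s: float = 1800.0):
    """File-based sync: each rank drops a marker file, then waits for all of them."""
    sync_dir.mkdir(parents=True, exist_ok=True)
    (sync_dir / f"rank_{rank}.done").touch()
    deadline = time.time() + timeout_s
    while len(list(sync_dir.glob("rank_*.done"))) < world_size:
        if time.time() > deadline:
            raise TimeoutError(f"Not all ranks reached {sync_dir} in time")
        time.sleep(5.0)
```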

Best squeeze of throughput:

[throughput screenshot attached]

@rafapi rafapi changed the base branch from main to multinode_deepspeed December 18, 2024 19:42
@rafapi rafapi changed the base branch from multinode_deepspeed to main December 19, 2024 18:34
rizar (Collaborator) left a comment


Good stuff! Please see my comments below.

Review threads (resolved):
  • examples/rl_gsm8k/orchestrate_rl.py
  • examples/rl_gsm8k/utils.py
rafapi (Collaborator, Author) commented Jan 8, 2025

@rafapi rafapi changed the title [WIP] RL multinode RL multinode Jan 15, 2025
@rafapi rafapi changed the title RL multinode Synchronous Multi-node RL Jan 21, 2025
rafapi (Collaborator, Author) commented Jan 30, 2025

[screenshot attached]
