
Synchronous Multi-node RL #152

Open · rafapi wants to merge 64 commits into main
Conversation

rafapi (Collaborator) commented Dec 18, 2024

Synchronous Multi-node RL

Distributed Training Enhancements

  • Robust distributed training support across multiple nodes
  • New DistributedManager class to handle distributed operations (see the sketch after this list)
  • File-based synchronization mechanisms for non-distributed phases
  • Handling of multi-node NCCL configurations
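
A minimal sketch of how a DistributedManager along these lines could be structured; the class name comes from this PR, but the methods and defaults below are assumptions rather than the actual implementation:

```python
# Hypothetical sketch of a DistributedManager; method names and defaults are
# assumptions, not the PR's actual implementation.
import os
import datetime
import torch
import torch.distributed as dist


class DistributedManager:
    """Thin wrapper around torch.distributed for multi-node NCCL setups."""

    def __init__(self, timeout_minutes: int = 30):
        # torchrun / deepspeed launchers export these variables on every node.
        self.rank = int(os.environ.get("RANK", 0))
        self.local_rank = int(os.environ.get("LOCAL_RANK", 0))
        self.world_size = int(os.environ.get("WORLD_SIZE", 1))
        self.timeout = datetime.timedelta(minutes=timeout_minutes)

    def init(self):
        """Initialize the NCCL process group once per process."""
        if self.world_size > 1 and not dist.is_initialized():
            torch.cuda.set_device(self.local_rank)
            dist.init_process_group(backend="nccl", timeout=self.timeout)

    @property
    def is_main_process(self) -> bool:
        return self.rank == 0

    def cleanup(self):
        if dist.is_initialized():
            dist.destroy_process_group()
```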

DeepSpeed Integration

  • New DeepSpeed configuration for multi-node scenarios (ds_multinode.json; see the sketch below)
  • Fixed DeepSpeed checkpoint handling and model saving
  • Enhanced optimizer configurations for distributed training
  • Added support for CPU Adam optimizer in distributed settings
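
For illustration, a multi-node DeepSpeed configuration with optimizer state offloaded to CPU Adam could look roughly like the dict below; the actual contents of ds_multinode.json are not shown in this PR description, so all values here are placeholders:

```python
# Rough shape of a multi-node DeepSpeed config with CPU Adam offload.
# Values are placeholders; the PR's actual ds_multinode.json may differ.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "optimizer": {
        "type": "Adam",
        "params": {"lr": 1e-6, "betas": [0.9, 0.95], "weight_decay": 0.0},
    },
    "zero_optimization": {
        "stage": 2,
        # Offloading optimizer state to CPU makes DeepSpeed use its CPU Adam kernel.
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "bf16": {"enabled": True},
    "gradient_clipping": 1.0,
}
```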

Data Processing Improvements

  • Implemented data splitting across nodes for better load distribution
  • Added sample gathering from multiple ranks (sketched after this list)
  • Improved error handling and retry mechanisms for data loading
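
A common way to implement the per-node data split and cross-rank sample gathering is with torch.distributed collectives; the helper names below are hypothetical, and the logic is only a sketch of the general pattern:

```python
# Hypothetical helpers for sharding data per rank and gathering samples back;
# a sketch of the general pattern, not the PR's exact code.
import torch.distributed as dist


def split_for_rank(samples: list, rank: int, world_size: int) -> list:
    """Give each rank a contiguous, near-equal slice of the dataset."""
    per_rank = (len(samples) + world_size - 1) // world_size
    start = rank * per_rank
    return samples[start:start + per_rank]


def gather_samples(local_samples: list) -> list:
    """Collect every rank's samples onto all ranks."""
    if not dist.is_initialized():
        return local_samples
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, local_samples)
    # Flatten the per-rank lists into a single list of samples.
    return [s for chunk in gathered for s in chunk]
```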

Memory Management

  • Added GPU resource cleanup functionality (see the sketch below)
  • Improved memory status monitoring
  • Added CUDA cache management
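
The cleanup and monitoring steps above usually reduce to a handful of torch.cuda calls; a sketch with placeholder function names:

```python
# Sketch of GPU cleanup and memory reporting; function names are placeholders.
import gc
import logging

import torch

logger = logging.getLogger(__name__)


def cleanup_gpu_resources():
    """Release cached CUDA memory and collect unreachable Python objects."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()


def log_memory_status(tag: str = ""):
    """Log allocated/reserved CUDA memory in GiB for the current device."""
    if not torch.cuda.is_available():
        return
    allocated = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    logger.info("%s allocated=%.2f GiB reserved=%.2f GiB", tag, allocated, reserved)
```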

Logging and Monitoring

  • Enhanced W&B logging for distributed training
  • Added rank-aware logging (example after this list)
  • Improved error reporting across distributed processes
  • Added execution time tracking for different training phases
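
Rank-aware logging and per-phase timing typically follow the pattern sketched below, with W&B initialized only on rank 0 so the run is not duplicated across nodes; the helper names and logged keys are assumptions:

```python
# Sketch of rank-aware W&B logging and phase timing; names and keys are assumptions.
import os
import time
import logging
from contextlib import contextmanager

import wandb

logger = logging.getLogger(__name__)
RANK = int(os.environ.get("RANK", 0))


def init_wandb(project: str, config: dict):
    # Only rank 0 talks to W&B so a single run is created across all nodes.
    if RANK == 0:
        wandb.init(project=project, config=config)


@contextmanager
def track_time(phase: str):
    """Log how long a training phase took, tagged with the rank."""
    start = time.perf_counter()
    yield
    elapsed = time.perf_counter() - start
    logger.info("[rank %d] %s took %.1f s", RANK, phase, elapsed)
    if RANK == 0:
        wandb.log({f"time/{phase}": elapsed})
```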

Synchronization Mechanisms

  • Added robust barrier implementations with timeout and retry logic (sketched after this list)
  • Implemented file-based synchronization for non-distributed phases
  • Added checks to ensure all nodes complete critical sections
  • Enhanced coordination between nodes during data processing and training
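
A minimal sketch of both synchronization patterns listed above: a barrier wrapped in retry logic, and a file-based check that all nodes have finished a phase where torch.distributed is not active; paths, timeouts, and helper names are illustrative:

```python
# Sketch of retrying barriers and file-based synchronization; the paths,
# retry counts, and helper names are illustrative, not the PR's exact code.
import time
from pathlib import Path

import torch.distributed as dist


def robust_barrier(max_retries: int = 3, retry_delay_s: float = 10.0):
    """Call dist.barrier(), retrying a few times before giving up."""
    for attempt in range(max_retries):
        try:
            dist.barrier()
            return
        except RuntimeError:
            if attempt == max_retries - 1:
                raise
            time.sleep(retry_delay_s)


def wait_for_all_nodes(sync_dir: Path, rank: int, world_size: int,
                       timeout_s: float = 1800.0):
    """File-based sync: each rank drops a marker file, then waits for all of them."""
    sync_dir.mkdir(parents=True, exist_ok=True)
    (sync_dir / f"rank_{rank}.done").touch()
    deadline = time.time() + timeout_s
    while len(list(sync_dir.glob("rank_*.done"))) < world_size:
        if time.time() > deadline:
            raise TimeoutError(f"Not all ranks reached {sync_dir} in time")
        time.sleep(5.0)
```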

Best squeeze of throughput:

[throughput screenshot attached]

@rafapi rafapi changed the base branch from main to multinode_deepspeed December 18, 2024 19:42
@rafapi rafapi changed the base branch from multinode_deepspeed to main December 19, 2024 18:34
rizar (Collaborator) left a comment


Good stuff! Please see my comments below.

Review threads (resolved):
  • examples/rl_gsm8k/orchestrate_rl.py
  • examples/rl_gsm8k/utils.py
rafapi (Collaborator, Author) commented Jan 8, 2025

@rafapi rafapi changed the title [WIP] RL multinode RL multinode Jan 15, 2025
@rafapi rafapi changed the title RL multinode Synchronous Multi-node RL Jan 21, 2025
rafapi (Collaborator, Author) commented Jan 30, 2025

[screenshot attached]
