
[REQUEST] More fine-grained distributed strategies for RLHF training #884

Open

youshaox opened this issue Apr 3, 2024 · 0 comments

youshaox commented Apr 3, 2024

Is your feature request related to a problem? Please describe.
We find that the generation stage of the RLHF pipeline is time-consuming in the current training process. This is because the four models (Actor, Critic, Reward, and Ref) are all colocated on the same devices under a "Flattening" strategy. As a result, the training and inference runtimes are mixed in the current procedure, which prevents the use of training- or inference-specific optimizations. In addition, a significant amount of memory is occupied by models that sit idle during the actor's generation stage. Therefore, instead of colocating these four models on all devices, more fine-grained placement strategies could be used.
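
A minimal illustrative sketch of the colocation issue described above (not the actual deepspeed-chat code; the class and sizes are placeholders): every rank hosts all four models, so the Critic, Reward, and Ref weights stay resident in device memory even while only the Actor is running generation.

```python
import torch
import torch.nn as nn

class ColocatedRLHFEngine:
    """Stand-in for the current "Flattening" placement: all four models per rank."""

    def __init__(self, hidden: int = 16):
        # All four models live together on the same device(s) on every rank.
        self.actor = nn.Linear(hidden, hidden)
        self.critic = nn.Linear(hidden, 1)
        self.reward = nn.Linear(hidden, 1)
        self.ref = nn.Linear(hidden, hidden)

    def generation_stage(self, prompts: torch.Tensor) -> torch.Tensor:
        # Only the actor participates here; the other three models are idle
        # but still occupy memory, and the actor runs in its training-time
        # layout rather than an inference-optimized one (e.g. vLLM).
        with torch.no_grad():
            return self.actor(prompts)
```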

Describe the solution you'd like
Our team is planning to open-source our implementation of APP (https://arxiv.org/pdf/2312.11819.pdf) and contribute it to the codebase. Specifically, we are proposing two fine-grained model placement strategies:
A Separation strategy that separates the training and inference runtimes of the RLHF pipeline using additional shadow models. This allows inference-optimized techniques such as vLLM and intra-node tensor parallelism to accelerate the time-consuming generation stage, and permits different distributed strategies during generation than during training.
An Interleaving strategy that reduces memory redundancy and communication costs in RLHF training by placing models without mutual dependencies on exclusive devices, with careful orchestration. For example, inference-only models such as the reward model and reference model could be placed on separate devices. This reduces memory redundancy under DDP or ZeRO stages 1-2 by decreasing the number of participating nodes. A rough configuration sketch of both placements follows this list.
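
To make the two placements concrete, here is a minimal sketch on a hypothetical 16-GPU setup; the names (`Placement`, `device_groups`) are illustrative assumptions, not the APP API. Under Separation, the shadow actor (e.g. served by vLLM with intra-node tensor parallelism) and the inference models get their own ranks and the shadow weights are refreshed from the training actor; under Interleaving, the reward and reference models take exclusive devices so the actor/critic data-parallel group shrinks.

```python
from dataclasses import dataclass, field

@dataclass
class Placement:
    strategy: str
    device_groups: dict = field(default_factory=dict)

# Separation: training runtime on ranks 0-7; inference runtime (shadow actor,
# reward, ref) on ranks 8-15. The shadow actor syncs weights from the training
# actor after each update and serves generation with inference-side optimizations.
separation = Placement(
    strategy="separation",
    device_groups={
        "actor_train": list(range(0, 8)),
        "critic_train": list(range(0, 8)),
        "actor_shadow_infer": list(range(8, 16)),  # e.g. vLLM + intra-node TP
        "reward": list(range(8, 16)),
        "ref": list(range(8, 16)),
    },
)

# Interleaving: reward and ref have no dependency on actor/critic updates, so
# they are placed on exclusive devices, reducing the scale of the actor/critic
# data-parallel group that DDP or ZeRO 1-2 must cover.
interleaving = Placement(
    strategy="interleaving",
    device_groups={
        "actor_train": list(range(0, 12)),
        "critic_train": list(range(0, 12)),
        "reward": list(range(12, 14)),
        "ref": list(range(14, 16)),
    },
)
```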

Describe alternatives you've considered
N/A

Additional context
Thank you for sharing deepspeed-chat with the community! It has been an essential piece of infrastructure, providing an easy-to-use solution for training InstructGPT-like models. Recently, we have made some improvements that further enhance training performance while maintaining simplicity of use. These improvements are already deployed for RLHF training at Ant Group. To share our efforts with the deepspeed-chat community, we would like to integrate our implementation into the DeepSpeedExamples codebase.

To facilitate discussion and avoid potential conflicts, we have opened this issue to discuss the proposed modifications. We look forward to collaborating with the community on this.

Please feel free to comment here or reach out via email ([email protected]). Thanks!
