[CPU] Use allreduce_low_latency for AutoTP and implement low latency allreduce for CPU backend (single node) #3919
Conversation
When optimizing DeepSpeed AutoTP inference workloads, allreduce latency is critical to inference scaling. This PR introduces a new deepspeed.comm interface, allreduce_low_latency, which allows a communication backend to implement a low-latency version of allreduce, so that the backend or library can optimize allreduce performance with a different strategy. In this PR we implement a low-latency allreduce in CCLBackend for the CPU backend (single node). Experiments show that for very small messages, the low-latency allreduce brings more than a 2x performance boost compared to calling the oneCCL library, which is optimized for large message sizes with concurrency.
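For illustration, here is a minimal sketch of how the new interface might be called from an AutoTP inference layer. The PR description does not spell out the signature, so assuming the argument list mirrors deepspeed.comm.all_reduce:

```python
import torch
import deepspeed
import deepspeed.comm as dist

# Minimal usage sketch (assumed signature mirroring deepspeed.comm.all_reduce).
deepspeed.init_distributed()

# Small per-token activation, typical of an AutoTP inference partial sum.
hidden = torch.randn(1, 1, 4096)

# Hypothetical call to the new interface: a backend such as CCLBackend can
# route this small message through a latency-optimized path instead of the
# default throughput-oriented all_reduce.
dist.allreduce_low_latency(hidden)
```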
The opportunity for low-latency allreduce comes from the fact that allreduce in training is different from allreduce in inference tensor parallelism. In training, allreduce usually operates on large message sizes and relies on concurrency.
Optimizing for large message sizes and concurrency improves communication in training, but it also complicates the software infrastructure. In inference the message size is small and there is no need for concurrency, so these software designs won't help allreduce latency and might even hurt it. Separating the latency path from the throughput path gives an opportunity to optimize for latency in the inference scenario.
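To make the "separate path" idea concrete, below is a hedged sketch of how a caller could dispatch between the two paths based on message size. The threshold, helper name, and fallback are assumptions for illustration, not the actual CCLBackend implementation.

```python
import torch
import deepspeed.comm as dist

# Hypothetical size threshold (bytes) below which a latency-optimized
# single-node allreduce is preferable to the throughput-oriented library call.
LOW_LATENCY_THRESHOLD = 64 * 1024  # assumed value for illustration

def allreduce_dispatch(tensor: torch.Tensor) -> None:
    """Sketch of routing small inference messages to a low-latency path.

    The helper and threshold are illustrative; the PR adds
    allreduce_low_latency to deepspeed.comm so a backend can provide
    the fast path itself.
    """
    if tensor.numel() * tensor.element_size() <= LOW_LATENCY_THRESHOLD:
        # Small message (e.g. per-token AutoTP partial sum): latency path.
        dist.allreduce_low_latency(tensor)
    else:
        # Large message: keep the existing throughput-oriented path.
        dist.all_reduce(tensor)
```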