
[CPU] Use allreduce_low_latency for AutoTP and implement low latency allreduce for CPU backend (single node) #3919

Merged · 18 commits · Jul 19, 2023

Conversation

delock (Collaborator) commented on Jul 10, 2023

When optimizing DeepSpeed AutoTP inference workloads, allreduce latency is critical to inference scaling. This PR introduces a new deepspeed.comm interface, allreduce_low_latency, that lets a communication backend implement a low-latency version of allreduce, so that the backend or library can optimize allreduce performance with a different strategy.

In this PR we implement a low-latency allreduce in CCLBackend for the CPU backend. Experiments show that for very small messages, the low-latency allreduce brings more than a 2x performance boost compared to calling the oneCCL library, which is optimized for large message sizes with concurrency.

The opportunity for low-latency allreduce comes from the fact that allreduce in training is different from allreduce in inference tensor parallelism. In training, allreduce usually has:

  1. Large message size
  2. Concurrency

Optimizing for large message size and concurrency improves communication in training, but it also complicates the software infrastructure. In inference the message size is small and there is no need for concurrency, so these design choices do not help allreduce latency and may even hurt it. Separating the latency path from the throughput path gives an opportunity to optimize for latency in the inference scenario, as sketched below.
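To make the caller-side change concrete, here is a minimal sketch (not code from the PR) of how an AutoTP-style row-parallel linear layer could invoke the latency-oriented collective on its partial output instead of the standard all_reduce. The layer and parameter names are illustrative; the call uses inference_all_reduce, the name this thread eventually settles on (the PR first introduces it as allreduce_low_latency), and its signature is assumed to mirror deepspeed.comm.all_reduce.

```python
import torch
import deepspeed.comm as dist


class RowParallelLinear(torch.nn.Module):
    # Illustrative AutoTP-style layer: each rank holds a shard of the weight,
    # so each rank computes a partial output that must be summed across the
    # tensor-parallel group before the bias is applied.
    def __init__(self, weight_shard, bias=None, mp_group=None):
        super().__init__()
        self.weight = torch.nn.Parameter(weight_shard)  # [out, in / world_size]
        self.bias = bias                                # full bias, added after reduction
        self.mp_group = mp_group                        # tensor-parallel process group

    def forward(self, x):
        partial = torch.matmul(x, self.weight.t())
        # Latency-oriented collective for the small, non-concurrent messages
        # seen in inference, instead of the throughput-oriented all_reduce.
        dist.inference_all_reduce(partial, group=self.mp_group)
        if self.bias is not None:
            partial = partial + self.bias
        return partial
```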

@delock delock changed the title (CPU) Use allreduce_low_latency for AutoTP and implement low latency allreduce for CPU backend (single node) [CPU] Use allreduce_low_latency for AutoTP and implement low latency allreduce for CPU backend (single node) Jul 10, 2023
tjruwase (Contributor) commented
@delock, thanks for the PR. I am concerned that allreduce_low_latency could be a misleading name, since the low latency is only achieved for small message sizes and low concurrency. A user could mistakenly think that this function could also help improve allreduce performance for training. Can we explore names that are more descriptive of the target uses, such as:

  1. small_message_allreduce
  2. inference_allreduce

Thanks!

delock (Collaborator, Author) commented on Jul 12, 2023


Thanks for the comments, @tjruwase. I think inference_allreduce would be better, since in the training scenario concurrency can be assumed. I will make the change.

delock (Collaborator, Author) commented on Jul 14, 2023


I will use inference_all_reduce to keep the naming consistent with all_reduce.
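For reference, a minimal micro-benchmark sketch (not from the PR) that compares the standard all_reduce with inference_all_reduce on a single small tensor, to illustrate the small-message latency gap described above. The tensor shape, iteration counts, and the ccl backend setup are assumptions for illustration; the script would be launched across the tensor-parallel ranks with the DeepSpeed or MPI launcher.

```python
import time
import torch
import deepspeed
import deepspeed.comm as dist

deepspeed.init_distributed(dist_backend="ccl")  # oneCCL-based CPU backend

# A "small message": roughly one token's hidden states for a mid-sized model.
x = torch.randn(1, 4096)


def avg_latency(fn, iters=1000, warmup=100):
    # Warm up, synchronize, then report the average per-call wall time.
    for _ in range(warmup):
        fn()
    dist.barrier()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    dist.barrier()
    return (time.perf_counter() - start) / iters


t_std = avg_latency(lambda: dist.all_reduce(x))
t_inf = avg_latency(lambda: dist.inference_all_reduce(x))
if dist.get_rank() == 0:
    print(f"all_reduce: {t_std * 1e6:.1f} us/call, "
          f"inference_all_reduce: {t_inf * 1e6:.1f} us/call")
```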

@tjruwase tjruwase added this pull request to the merge queue Jul 19, 2023
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jul 19, 2023
@tjruwase tjruwase added this pull request to the merge queue Jul 19, 2023
Merged via the queue into deepspeedai:master with commit 1bc3b78 Jul 19, 2023