[CPU] Use allreduce_low_latency for AutoTP and implement low latency allreduce for CPU backend (single node) #3919
Conversation
When optimizing DeepSpeed AutoTP inference workloads, allreduce latency is critical to inference scaling. This PR introduces a new deepspeed.comm interface, allreduce_low_latency, which allows a communication backend to implement a low-latency version of allreduce, so that the backend or library can optimize allreduce performance with a different strategy. In this PR we implement a low-latency allreduce in CCLBackend for the CPU backend (single node). Experiments show that for very small messages, the low-latency allreduce brings more than a 2x performance boost compared to calling the oneCCL library, which is optimized for large message sizes with concurrency.
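For illustration, here is a minimal sketch of how the new interface might be called from an AutoTP inference layer. The PR description does not spell out the signature, so assuming the argument list mirrors deepspeed.comm.all_reduce:

```python
import torch
import deepspeed
import deepspeed.comm as dist

# Minimal usage sketch (assumed signature mirroring deepspeed.comm.all_reduce).
deepspeed.init_distributed()

# Small per-token activation, typical of an AutoTP inference partial sum.
hidden = torch.randn(1, 1, 4096)

# Hypothetical call to the new interface: a backend such as CCLBackend can
# route this small message through a latency-optimized path instead of the
# default throughput-oriented all_reduce.
dist.allreduce_low_latency(hidden)
```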
The opportunity for low-latency allreduce comes from the fact that allreduce in training is different from allreduce in inference tensor parallelism. In training, allreduce usually operates on large message sizes and relies on concurrency.
Optimizing for large message sizes and concurrency improves communication in training, but it also complicates the software infrastructure. In inference the message size is small and there is no need for concurrency, so these software designs won't help allreduce latency and might even hurt it. Separating the latency path from the throughput path gives an opportunity to optimize for latency in the inference scenario.
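To make the "separate path" idea concrete, below is a hedged sketch of how a caller could dispatch between the two paths based on message size. The threshold, helper name, and fallback are assumptions for illustration, not the actual CCLBackend implementation.

```python
import torch
import deepspeed.comm as dist

# Hypothetical size threshold (bytes) below which a latency-optimized
# single-node allreduce is preferable to the throughput-oriented library call.
LOW_LATENCY_THRESHOLD = 64 * 1024  # assumed value for illustration

def allreduce_dispatch(tensor: torch.Tensor) -> None:
    """Sketch of routing small inference messages to a low-latency path.

    The helper and threshold are illustrative; the PR adds
    allreduce_low_latency to deepspeed.comm so a backend can provide
    the fast path itself.
    """
    if tensor.numel() * tensor.element_size() <= LOW_LATENCY_THRESHOLD:
        # Small message (e.g. per-token AutoTP partial sum): latency path.
        dist.allreduce_low_latency(tensor)
    else:
        # Large message: keep the existing throughput-oriented path.
        dist.all_reduce(tensor)
```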