Enabled high-performance Automatic Tensor Parallelism (auto TP) for the MoE models on multiple GPUs/HPUs #6964

Open

gyou2021 wants to merge 64 commits into master from autoTP_Qwen2Moe_DeepSeekv2
Conversation

@gyou2021 (Contributor) commented Jan 21, 2025

Reduced the number of AllReduce operations on the routed experts to ONCE per MoE layer for MoE models such as Qwen2-MoE and DeepSeek-V2. The results of all selected routed experts in a layer are now gathered across GPU/HPU cards with a single AllReduce operation, instead of gathering each selected routed expert individually (one AllReduce per selected expert). This change greatly improves performance.
In addition to modifying auto_tp.py, the following files should be updated: modeling_qwen2_moe.py and modeling_deepseek_v2.py. Add the following code after the weighted sum of the outputs of the selected experts in each MoE layer:
```python
if is_deepspeed_available():
    from deepspeed import comm as dist
    if dist.is_initialized():
        dist.all_reduce(final_hidden_states, op=dist.ReduceOp.SUM)
```
Note: `final_hidden_states` is the weighted sum of the outputs of the selected experts in the MoE layer.
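For context, a minimal sketch of where this snippet would sit in the MoE block's forward pass; the function and variable names below are illustrative, not the actual transformers modeling code, and `is_deepspeed_available` is assumed to come from `transformers.integrations`:

```python
import torch
from transformers.integrations import is_deepspeed_available

def moe_combine(hidden_states, experts, routing_weights, selected_experts):
    # Weighted sum of the selected experts' outputs, computed locally on each TP rank.
    final_hidden_states = torch.zeros_like(hidden_states)
    for expert_idx in selected_experts:
        final_hidden_states += routing_weights[expert_idx] * experts[expert_idx](hidden_states)

    # Gather the partial results across GPU/HPU cards exactly ONCE per MoE layer.
    if is_deepspeed_available():
        from deepspeed import comm as dist
        if dist.is_initialized():
            dist.all_reduce(final_hidden_states, op=dist.ReduceOp.SUM)
    return final_hidden_states
```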

@delock (Collaborator) commented Jan 21, 2025

Hi @gyou2021, there is another PR for DeepSeek AutoTP: #6937. What is the relationship of your PR to that previous PR in terms of functionality?

@Yejing-Lai can you take a look at this PR?

@gyou2021 (Contributor, Author):

> Hi @gyou2021, there is another PR for DeepSeek AutoTP: #6937. What is the relationship of your PR to that previous PR in terms of functionality?
>
> @Yejing-Lai can you take a look at this PR?

The difference lies in how the results of the weighted sum of routed experts per MoE layer are gathered. In my understanding, #6937 gathers each selected routed expert individually, so the number of gather (AllReduce) operations is proportional to the number of selected routed experts per layer. In this PR, the results are gathered once per layer, regardless of the number of selected routed experts.
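A rough sketch of the two gathering strategies, using hypothetical helpers (the actual modeling code differs):

```python
import torch
from deepspeed import comm as dist

def combine_per_expert(expert_outputs, weights):
    # Roughly the per-expert approach: one all-reduce per selected routed expert.
    final = torch.zeros_like(expert_outputs[0])
    for out, w in zip(expert_outputs, weights):
        dist.all_reduce(out, op=dist.ReduceOp.SUM)  # N all-reduces per layer
        final += w * out
    return final

def combine_once(expert_outputs, weights):
    # This PR's approach: weighted sum locally, then a single all-reduce per layer.
    final = sum(w * out for out, w in zip(expert_outputs, weights))
    dist.all_reduce(final, op=dist.ReduceOp.SUM)    # 1 all-reduce per layer
    return final
```

Because the routing weights are identical on every rank, both variants produce the same result; only the number of collective calls differs.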

@delock (Collaborator) commented Feb 11, 2025

Hi @gyou2021, thanks for this optimization. As we reviewed internally, we need to figure out a way to turn on this optimization through a knob in the model's config.json file. That way, this optimization won't break workloads that do not use it.

Plus: I think it would be even more user-friendly if the coalesced allreduce could be injected by AutoTP. But as I reviewed with @gyou2021, this seems tricky because no proper injection point can be captured when processing the nn.Module. @tohtana @loadams @tjruwase @Yejing-Lai, if you have an idea we can discuss it in this thread.

@gyou2021 (Contributor, Author) commented Feb 27, 2025

> Hi @gyou2021, thanks for this optimization. As we reviewed internally, we need to figure out a way to turn on this optimization through a knob in the model's config.json file. That way, this optimization won't break workloads that do not use it.
>
> Plus: I think it would be even more user-friendly if the coalesced allreduce could be injected by AutoTP. But as I reviewed with @gyou2021, this seems tricky because no proper injection point can be captured when processing the nn.Module. @tohtana @loadams @tjruwase @Yejing-Lai, if you have an idea we can discuss it in this thread.

Added support for the environment variable DS_MOE_TP_SINGLE_ALLREDUCE, which indicates whether the single-all-reduce MoE expert optimization is enabled. If DS_MOE_TP_SINGLE_ALLREDUCE is true (i.e., the optimization is enabled), AutoTP applies the corresponding handling to the affected linear layers.
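A minimal sketch of how such a knob can be read; the helper name and the accepted truthy values are assumptions, not the exact implementation:

```python
import os

def moe_single_allreduce_enabled() -> bool:
    # DS_MOE_TP_SINGLE_ALLREDUCE=true enables the single per-layer all-reduce path.
    return os.environ.get("DS_MOE_TP_SINGLE_ALLREDUCE", "false").lower() in ("1", "true", "yes")
```

With the knob off (the default), AutoTP keeps its existing behavior, so workloads that do not use the optimization are unaffected.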

@delock (Collaborator) commented Feb 28, 2025

Hi @loadams, this PR is now ready for review. It allows the user to explicitly turn off the per-expert allreduce for AutoTP tensor parallel inference, in order to delay the allreduce until all experts have been reduced locally. Thanks!

gyou2021 and others added 22 commits February 28, 2025 03:51
Signed-off-by: gyou2021 <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
…peedai#6750)

FYI @loadams

A view operation was missing in some updates compared to the original version:
https://github.com/microsoft/DeepSpeed/blob/17ed7c77c58611a923a6c8d2a3d21d359cd046e8/deepspeed/sequence/layer.py#L56

Add the missing view operation. The shape required for the view cannot be easily obtained in the current
function, so the layout params code is refactored.

---------

Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Masahiro Tanaka <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
…orm` (deepspeedai#6931)

- [x] Update PR since `torch.norm` and `torch.linalg.norm` have
[different function
signatures](https://pytorch.org/docs/stable/generated/torch.linalg.norm.html#torch.linalg.norm).
- [x] Check if there are any numeric differences between the functions.
- [x] Determine why there appear to be performance differences from
others [here](pytorch/pytorch#136360).
- [x] Update to `torch.linalg.vectornorm`
Follow up PR handles these in the comm folder: deepspeedai#6960

Signed-off-by: gyou2021 <[email protected]>
Following the discussion in
[PR-6670](deepspeedai#6670), the explicit
upcast is much more efficient than the implicit upcast, so this PR
replaces the implicit upcast with an explicit one.

The results on a 3B model are shown below:

| Option | BWD (ms) | Speed up |
|------------|-----|------|
| Before PR-6670 | 25603.30 | 1x |
| After PR-6670 | 1174.31 | 21.8X |
| After this PR| 309.2 | 82.8X |
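As a generic illustration of the pattern (not the kernel code touched by these PRs): cast the low-precision inputs to fp32 once, compute in fp32, and cast back, instead of letting each op upcast implicitly:

```python
import torch

def weighted_add_explicit(acc: torch.Tensor, update: torch.Tensor, alpha: float) -> torch.Tensor:
    # Explicit upcast: one .float() per input, fp32 math, single cast back.
    acc32 = acc.float()
    acc32.add_(update.float(), alpha=alpha)
    return acc32.to(acc.dtype)
```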

Signed-off-by: gyou2021 <[email protected]>
**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.16.3
Author           - @loadams

Co-authored-by: loadams <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
…ead of Raising Error (deepspeedai#6979)

This pull request addresses an issue in setup_env_ranks where, under
certain conditions, the function raises an error instead of setting the
necessary MPI-related environment variables (LOCAL_RANK, RANK, and
WORLD_SIZE). The intended behavior is to properly map Open MPI variables
(OMPI_COMM_WORLD_*) to the variables expected by DeepSpeed/PyTorch, but
the code currently raises an EnvironmentError if these Open MPI
variables are not found.

With this fix, setup_env_ranks will:

- Correctly detect and map the required Open MPI environment variables.
- Only raise an error if there is genuinely no valid way to obtain rank
information from the environment (e.g., both Open MPI variables and any
fallback mechanism are unavailable).

Fix deepspeedai#6895

Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
…h 2.6) (deepspeedai#6982)

Fixes deepspeedai#6984.

The workflow was pulling the updated torch 2.6, which caused CI
failures. This keeps us on torch 2.5 for now, since installing
torchvision as a dependency later on was pulling torch 2.6 with it which
was unintended.

This PR also unsets NCCL_DEBUG to avoid a large print out in the case of
any errors.

Signed-off-by: gyou2021 <[email protected]>
…6974)

As discussed in
[PR-6918](deepspeedai#6918), padding can
occur on multiple ranks with large DP degrees.

For example, with:
- Flattened tensor size: 266240
- DP degree: 768
- Alignment: 1536
- Required padding: 1024 (1536 * 174 - 266240)
- Per-rank partition size: 348 (1536 * 174 / 768)
- The padding occurs on the last three ranks.

This PR removes the single-rank padding assumption for more general
cases.
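A small worked-numbers sketch of the arithmetic above (illustrative only):

```python
def alignment_padding(numel: int, dp_degree: int, alignment: int):
    # Round the flattened size up to a multiple of `alignment`.
    padded = ((numel + alignment - 1) // alignment) * alignment
    padding = padded - numel            # 1536 * 174 - 266240 = 1024
    partition = padded // dp_degree     # 1536 * 174 / 768 = 348
    # ceil(padding / partition) trailing ranks hold padding -> 3, not just 1.
    ranks_with_padding = -(-padding // partition)
    return padding, partition, ranks_with_padding

print(alignment_padding(266240, 768, 1536))  # (1024, 348, 3)
```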

---------

Co-authored-by: Sam Foreman <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
…deepspeedai#6967)

- Issues with nv-sd updates, will follow up with a subsequent PR

Signed-off-by: gyou2021 <[email protected]>
The NVIDIA Blackwell GPU generation has number 10. The SM code and
architecture should be `100`, but the current code generates `1.`
because it expects a 2-character string.

This change modifies the logic to treat the capability as a string that
contains a `.`, splitting it and using the resulting array of strings.
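An illustrative sketch of the new parsing (names are hypothetical, not the actual op_builder code):

```python
def capability_to_sm_code(capability: str) -> str:
    # "8.0" -> "80", "9.0" -> "90", "10.0" -> "100".
    # A fixed two-character assumption breaks on "10.0"; splitting on "." does not.
    major, minor = capability.split(".")
    return major + minor

print(capability_to_sm_code("9.0"))   # "90"
print(capability_to_sm_code("10.0"))  # "100"
```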

Signed-off-by: Fabien Dupont <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
Signed-off-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Logan Adams <[email protected]>
Signed-off-by: Fabien Dupont <[email protected]>
Co-authored-by: Fabien Dupont <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
wukong1992 and others added 20 commits February 28, 2025 05:25
With bf16 + MoE, refreshing the optimizer state from a bf16 checkpoint raises
`IndexError: list index out of range`.

Signed-off-by: shaomin <[email protected]>
Co-authored-by: shaomin <[email protected]>
Co-authored-by: Hongwei Chen <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.16.4
Author           - @loadams

Co-authored-by: loadams <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
@jeffra and I fixed this many years ago, so bringing this doc to a
correct state.

---------

Signed-off-by: Stas Bekman <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
Description
This PR includes Tecorigin SDAA accelerator support.
With this PR, DeepSpeed supports SDAA as a backend for training tasks.

---------

Signed-off-by: siqi <[email protected]>
Co-authored-by: siqi <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
More information on libuv in pytorch:
https://pytorch.org/tutorials/intermediate/TCPStore_libuv_backend.html
Issue tracking the prevalence of the error on Windows (unresolved at the
time of this PR): pytorch/pytorch#139990
LibUV github: https://github.com/libuv/libuv

Windows error:
```
  File "C:\hostedtoolcache\windows\Python\3.12.7\x64\Lib\site-packages\torch\distributed\rendezvous.py", line 189, in _create_c10d_store
    return TCPStore(
           ^^^^^^^^^
RuntimeError: use_libuv was requested but PyTorch was build without libuv support
```

use_libuv isn't well supported on Windows in pytorch <2.4, so we need to
guard around this case.
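A hedged sketch of that kind of guard (illustrative, not the exact DeepSpeed change):

```python
import os
import torch
from packaging import version

def tcp_store_use_libuv() -> bool:
    # Older Windows wheels of PyTorch (< 2.4) are commonly built without libuv,
    # so only request the libuv TCPStore backend where it is reliably available.
    if os.name == "nt" and version.parse(torch.__version__) < version.parse("2.4"):
        return False
    return True
```

How the flag is then forwarded to the store creation depends on the torch version in use.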

---------

Signed-off-by: Logan Adams <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
@fukun07 and I discovered a bug when using the `offload_states` and
`reload_states` APIs of the Zero3 optimizer. When using grouped
parameters (for example, in weight decay or grouped lr scenarios), the
order of the parameters mapping in `reload_states`
([here](https://github.com/deepspeedai/DeepSpeed/blob/14b3cce4aaedac69120d386953e2b4cae8c2cf2c/deepspeed/runtime/zero/stage3.py#L2953))
does not correspond with the initialization of `self.lp_param_buffer`
([here](https://github.com/deepspeedai/DeepSpeed/blob/14b3cce4aaedac69120d386953e2b4cae8c2cf2c/deepspeed/runtime/zero/stage3.py#L731)),
which leads to misaligned parameter loading. This issue was overlooked
by the corresponding unit tests
([here](https://github.com/deepspeedai/DeepSpeed/blob/master/tests/unit/runtime/zero/test_offload_states.py)),
so we fixed the bug in our PR and added the corresponding unit tests.

---------

Signed-off-by: Wei Wu <[email protected]>
Co-authored-by: Masahiro Tanaka <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
Following changes in PyTorch trace rules, my previous PR to avoid graph
breaks caused by the logger is no longer relevant. Instead, I've added
this functionality to torch dynamo -
pytorch/pytorch@16ea0dd
This commit allows the user to configure torch to ignore logger methods and
avoid the associated graph breaks.

To ignore all logger methods, set
os.environ["DISABLE_LOGS_WHILE_COMPILING"] = "1".
To ignore logger methods except for specific ones (for example, info and
isEnabledFor), additionally set
os.environ["LOGGER_METHODS_TO_EXCLUDE_FROM_DISABLE"] = "info, isEnabledFor".
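In script form, the two settings from the commit message (the excluded-method list is just the example given above):

```python
import os

# Ignore all logger methods while compiling to avoid the associated graph breaks...
os.environ["DISABLE_LOGS_WHILE_COMPILING"] = "1"
# ...except for the listed methods, which remain enabled.
os.environ["LOGGER_METHODS_TO_EXCLUDE_FROM_DISABLE"] = "info, isEnabledFor"
```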

Signed-off-by: ShellyNR <[email protected]>
Co-authored-by: snahir <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
The partition tensor doesn't need to move to the current device when
meta load is used.

Signed-off-by: Lai, Yejing <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
…t` (deepspeedai#7069)

With future changes coming to pip/python/etc., we need to stop calling
`python setup.py ...` and replace it:
https://packaging.python.org/en/latest/guides/modernize-setup-py-project/#should-setup-py-be-deleted

![image](https://github.com/user-attachments/assets/ea39ef7b-3cbe-4916-86f0-bc46a5fce96d)

This means we need to install the `build` package, which is added here as
well.

Additionally, we pass the `--sdist` flag to build only the sdist rather
than the wheel.

---------

Signed-off-by: Logan Adams <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
Add deepseekv3 autotp.

Signed-off-by: Lai, Yejing <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
Fixes: deepspeedai#7082

---------

Signed-off-by: Logan Adams <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
…, which indicates whether the MoE expert all-reduce-once optimization was used.

Signed-off-by: gyou2021 <[email protected]>
The latest transformers release causes failures in the cpu-torch-latest test, so we
pin it for now to unblock other PRs.

---------

Signed-off-by: Logan Adams <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
…/runner (deepspeedai#7086)

Signed-off-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
These jobs haven't been run in a long time and were originally used when
compatibility with torch <2 was more important.

Signed-off-by: Logan Adams <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
@gyou2021 gyou2021 force-pushed the autoTP_Qwen2Moe_DeepSeekv2 branch from f3c6b43 to b3c64dd on February 28, 2025 05:34
@gyou2021 gyou2021 changed the title Enabled high-performance Automatic Tensor Parallelism (auto TP) for the Qwen2-MoE and DeepSeek-V2 models on multiple GPUs/HPUs Enabled high-performance Automatic Tensor Parallelism (auto TP) for the MoE models on multiple GPUs/HPUs Feb 28, 2025
Signed-off-by: gyou2021 <[email protected]>