Enabled high-performance Automatic Tensor Parallelism (auto TP) for the MoE models on multiple GPUs/HPUs #6964

Open

gyou2021 wants to merge 64 commits into master from autoTP_Qwen2Moe_DeepSeekv2
Conversation

@gyou2021 (Contributor) commented Jan 21, 2025

Reduced the number of AllReduce operations on the routed experts to ONCE per MoE layer for MoE models such as Qwen2-MoE and DeepSeek-V2. The results of all selected routed experts in a layer are now gathered across GPU/HPU cards with a single AllReduce operation, instead of gathering each selected routed expert individually (one AllReduce per selected expert). This change greatly improves performance.
In addition to modifying auto_tp.py, the following files should be updated: modeling_qwen2_moe.py and modeling_deepseek_v2.py. Add the following code after the weighted sum of the outputs of the selected experts in each MoE layer:
```python
if is_deepspeed_available():
    from deepspeed import comm as dist
    if dist.is_initialized():
        dist.all_reduce(final_hidden_states, op=dist.ReduceOp.SUM)
```
Note: `final_hidden_states` is the weighted sum of the outputs of the selected experts in the MoE layer.
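For context, a minimal sketch of where this snippet would sit in the MoE block's forward pass; the function and variable names below are illustrative, not the actual transformers modeling code, and `is_deepspeed_available` is assumed to come from `transformers.integrations`:

```python
import torch
from transformers.integrations import is_deepspeed_available

def moe_combine(hidden_states, experts, routing_weights, selected_experts):
    # Weighted sum of the selected experts' outputs, computed locally on each TP rank.
    final_hidden_states = torch.zeros_like(hidden_states)
    for expert_idx in selected_experts:
        final_hidden_states += routing_weights[expert_idx] * experts[expert_idx](hidden_states)

    # Gather the partial results across GPU/HPU cards exactly ONCE per MoE layer.
    if is_deepspeed_available():
        from deepspeed import comm as dist
        if dist.is_initialized():
            dist.all_reduce(final_hidden_states, op=dist.ReduceOp.SUM)
    return final_hidden_states
```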

@delock (Collaborator) commented Jan 21, 2025

Hi @gyou2021, there is another PR for DeepSeek AutoTP: #6937. What is the relationship of your PR to that previous PR in terms of functionality?

@Yejing-Lai can you take a look at this PR?

@gyou2021 (Contributor, Author):

> Hi @gyou2021, there is another PR for DeepSeek AutoTP: #6937. What is the relationship of your PR to that previous PR in terms of functionality?
>
> @Yejing-Lai can you take a look at this PR?

The difference lies in how the results of the weighted sum of routed experts per MoE layer are gathered. In my understanding, #6937 gathers each selected routed expert individually, so the number of gather (AllReduce) operations is proportional to the number of selected routed experts per layer. In this PR, the results are gathered once per layer, regardless of the number of selected routed experts.
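A rough sketch of the two gathering strategies, using hypothetical helpers (the actual modeling code differs):

```python
import torch
from deepspeed import comm as dist

def combine_per_expert(expert_outputs, weights):
    # Roughly the per-expert approach: one all-reduce per selected routed expert.
    final = torch.zeros_like(expert_outputs[0])
    for out, w in zip(expert_outputs, weights):
        dist.all_reduce(out, op=dist.ReduceOp.SUM)  # N all-reduces per layer
        final += w * out
    return final

def combine_once(expert_outputs, weights):
    # This PR's approach: weighted sum locally, then a single all-reduce per layer.
    final = sum(w * out for out, w in zip(expert_outputs, weights))
    dist.all_reduce(final, op=dist.ReduceOp.SUM)    # 1 all-reduce per layer
    return final
```

Because the routing weights are identical on every rank, both variants produce the same result; only the number of collective calls differs.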

@delock (Collaborator) commented Feb 11, 2025

Hi @gyou2021, thanks for this optimization. As we reviewed internally, we need to figure out a way to turn on this optimization through a knob in the model's config.json file. That way, this optimization won't break workloads that do not use it.

Plus: I think it would be even more user-friendly if the coalesced allreduce could be injected by AutoTP. But as I reviewed with @gyou2021, this seems tricky because no proper injection point can be captured when processing the nn.Module. @tohtana @loadams @tjruwase @Yejing-Lai, if you have an idea we can discuss it in this thread.

@gyou2021 (Contributor, Author) commented Feb 27, 2025

> Hi @gyou2021, thanks for this optimization. As we reviewed internally, we need to figure out a way to turn on this optimization through a knob in the model's config.json file. That way, this optimization won't break workloads that do not use it.
>
> Plus: I think it would be even more user-friendly if the coalesced allreduce could be injected by AutoTP. But as I reviewed with @gyou2021, this seems tricky because no proper injection point can be captured when processing the nn.Module. @tohtana @loadams @tjruwase @Yejing-Lai, if you have an idea we can discuss it in this thread.

Added support for the environment variable DS_MOE_TP_SINGLE_ALLREDUCE, which indicates whether the single-all-reduce MoE expert optimization is enabled. If DS_MOE_TP_SINGLE_ALLREDUCE is true (i.e., the optimization is enabled), AutoTP applies the corresponding handling to the affected linear layers.
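A minimal sketch of how such a knob can be read; the helper name and the accepted truthy values are assumptions, not the exact implementation:

```python
import os

def moe_single_allreduce_enabled() -> bool:
    # DS_MOE_TP_SINGLE_ALLREDUCE=true enables the single per-layer all-reduce path.
    return os.environ.get("DS_MOE_TP_SINGLE_ALLREDUCE", "false").lower() in ("1", "true", "yes")
```

With the knob off (the default), AutoTP keeps its existing behavior, so workloads that do not use the optimization are unaffected.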

@delock (Collaborator) commented Feb 28, 2025

Hi @loadams, this PR is now ready for review. It allows the user to explicitly turn off the per-expert allreduce for AutoTP tensor parallel inference, in order to delay the allreduce until all experts have been reduced locally. Thanks!

gyou2021 and others added 22 commits February 28, 2025 03:51
Signed-off-by: gyou2021 <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
…peedai#6750)

FYI @loadams

A view operation was missing in some updates compared to the original version:
https://github.com/microsoft/DeepSpeed/blob/17ed7c77c58611a923a6c8d2a3d21d359cd046e8/deepspeed/sequence/layer.py#L56

Add the missing view operation. The shape required for the view cannot be easily obtained in the current
function, so the layout params code is refactored.

---------

Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Masahiro Tanaka <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
…orm` (deepspeedai#6931)

- [x] Update PR since `torch.norm` and `torch.linalg.norm` have
[different function
signatures](https://pytorch.org/docs/stable/generated/torch.linalg.norm.html#torch.linalg.norm).
- [x] Check if there are any numeric differences between the functions.
- [x] Determine why there appear to be performance differences from
others [here](pytorch/pytorch#136360).
- [x] Update to `torch.linalg.vectornorm`
Follow up PR handles these in the comm folder: deepspeedai#6960

Signed-off-by: gyou2021 <[email protected]>
Following the discussion in
[PR-6670](deepspeedai#6670), the explicit
upcast is much more efficient than the implicit upcast, so this PR
replaces the implicit upcast with an explicit one.

The results on a 3B model are shown below:

| Option | BWD (ms) | Speed up |
|------------|-----|------|
| Before PR-6670 | 25603.30 | 1x |
| After PR-6670 | 1174.31 | 21.8X |
| After this PR| 309.2 | 82.8X |
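As a generic illustration of the pattern (not the kernel code touched by these PRs): cast the low-precision inputs to fp32 once, compute in fp32, and cast back, instead of letting each op upcast implicitly:

```python
import torch

def weighted_add_explicit(acc: torch.Tensor, update: torch.Tensor, alpha: float) -> torch.Tensor:
    # Explicit upcast: one .float() per input, fp32 math, single cast back.
    acc32 = acc.float()
    acc32.add_(update.float(), alpha=alpha)
    return acc32.to(acc.dtype)
```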

Signed-off-by: gyou2021 <[email protected]>
**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.16.3
Author           - @loadams

Co-authored-by: loadams <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
…ead of Raising Error (deepspeedai#6979)

This pull request addresses an issue in setup_env_ranks where, under
certain conditions, the function raises an error instead of setting the
necessary MPI-related environment variables (LOCAL_RANK, RANK, and
WORLD_SIZE). The intended behavior is to properly map Open MPI variables
(OMPI_COMM_WORLD_*) to the variables expected by DeepSpeed/PyTorch, but
the code currently raises an EnvironmentError if these Open MPI
variables are not found.

With this fix, setup_env_ranks will:

- Correctly detect and map the required Open MPI environment variables.
- Only raise an error if there is genuinely no valid way to obtain rank
information from the environment (e.g., both Open MPI variables and any
fallback mechanism are unavailable).

Fix deepspeedai#6895

Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
…h 2.6) (deepspeedai#6982)

Fixes deepspeedai#6984.

The workflow was pulling the updated torch 2.6, which caused CI
failures. This keeps us on torch 2.5 for now, since installing
torchvision as a dependency later on was pulling torch 2.6 with it which
was unintended.

This PR also unsets NCCL_DEBUG to avoid a large print out in the case of
any errors.

Signed-off-by: gyou2021 <[email protected]>
…6974)

As discussed in
[PR-6918](deepspeedai#6918), padding can
occur on multiple ranks with large DP degrees.

For example, with:
- Flattened tensor size: 266240
- DP degree: 768
- Alignment: 1536
- Required padding: 1024 (1536 * 174 - 266240)
- Per-rank partition size: 348 (1536 * 174 / 768)
- The padding occurs on the last three ranks.

This PR removes the single-rank padding assumption for more general
cases.
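A small worked-numbers sketch of the arithmetic above (illustrative only):

```python
def alignment_padding(numel: int, dp_degree: int, alignment: int):
    # Round the flattened size up to a multiple of `alignment`.
    padded = ((numel + alignment - 1) // alignment) * alignment
    padding = padded - numel            # 1536 * 174 - 266240 = 1024
    partition = padded // dp_degree     # 1536 * 174 / 768 = 348
    # ceil(padding / partition) trailing ranks hold padding -> 3, not just 1.
    ranks_with_padding = -(-padding // partition)
    return padding, partition, ranks_with_padding

print(alignment_padding(266240, 768, 1536))  # (1024, 348, 3)
```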

---------

Co-authored-by: Sam Foreman <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
…deepspeedai#6967)

- Issues with nv-sd updates, will follow up with a subsequent PR

Signed-off-by: gyou2021 <[email protected]>
The NVIDIA Blackwell GPU generation has number 10. The SM code and
architecture should be `100`, but the current code generates `1.`
because it expects a 2-character string.

This change modifies the logic to treat the capability as a string that
contains a `.`, splitting it and using the resulting array of strings.
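An illustrative sketch of the new parsing (names are hypothetical, not the actual op_builder code):

```python
def capability_to_sm_code(capability: str) -> str:
    # "8.0" -> "80", "9.0" -> "90", "10.0" -> "100".
    # A fixed two-character assumption breaks on "10.0"; splitting on "." does not.
    major, minor = capability.split(".")
    return major + minor

print(capability_to_sm_code("9.0"))   # "90"
print(capability_to_sm_code("10.0"))  # "100"
```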

Signed-off-by: Fabien Dupont <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
Signed-off-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Logan Adams <[email protected]>
Signed-off-by: Fabien Dupont <[email protected]>
Co-authored-by: Fabien Dupont <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
wukong1992 and others added 20 commits February 28, 2025 05:25
With bf16 + MoE, refreshing the optimizer state from a bf16 checkpoint raises
`IndexError: list index out of range`.

Signed-off-by: shaomin <[email protected]>
Co-authored-by: shaomin <[email protected]>
Co-authored-by: Hongwei Chen <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.16.4
Author           - @loadams

Co-authored-by: loadams <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
@jeffra and I fixed this many years ago, so bringing this doc to a
correct state.

---------

Signed-off-by: Stas Bekman <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
Description
This PR includes Tecorigin SDAA accelerator support.
With this PR, DeepSpeed supports SDAA as a backend for training tasks.

---------

Signed-off-by: siqi <[email protected]>
Co-authored-by: siqi <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
More information on libuv in pytorch:
https://pytorch.org/tutorials/intermediate/TCPStore_libuv_backend.html
Issue tracking the prevalence of the error on Windows (unresolved at the
time of this PR): pytorch/pytorch#139990
LibUV github: https://github.com/libuv/libuv

Windows error:
```
  File "C:\hostedtoolcache\windows\Python\3.12.7\x64\Lib\site-packages\torch\distributed\rendezvous.py", line 189, in _create_c10d_store
    return TCPStore(
           ^^^^^^^^^
RuntimeError: use_libuv was requested but PyTorch was build without libuv support
```

use_libuv isn't well supported on Windows in pytorch <2.4, so we need to
guard around this case.
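A hedged sketch of that kind of guard (illustrative, not the exact DeepSpeed change):

```python
import os
import torch
from packaging import version

def tcp_store_use_libuv() -> bool:
    # Older Windows wheels of PyTorch (< 2.4) are commonly built without libuv,
    # so only request the libuv TCPStore backend where it is reliably available.
    if os.name == "nt" and version.parse(torch.__version__) < version.parse("2.4"):
        return False
    return True
```

How the flag is then forwarded to the store creation depends on the torch version in use.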

---------

Signed-off-by: Logan Adams <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
@fukun07 and I discovered a bug when using the `offload_states` and
`reload_states` APIs of the Zero3 optimizer. When using grouped
parameters (for example, in weight decay or grouped lr scenarios), the
order of the parameters mapping in `reload_states`
([here](https://github.com/deepspeedai/DeepSpeed/blob/14b3cce4aaedac69120d386953e2b4cae8c2cf2c/deepspeed/runtime/zero/stage3.py#L2953))
does not correspond with the initialization of `self.lp_param_buffer`
([here](https://github.com/deepspeedai/DeepSpeed/blob/14b3cce4aaedac69120d386953e2b4cae8c2cf2c/deepspeed/runtime/zero/stage3.py#L731)),
which leads to misaligned parameter loading. This issue was overlooked
by the corresponding unit tests
([here](https://github.com/deepspeedai/DeepSpeed/blob/master/tests/unit/runtime/zero/test_offload_states.py)),
so we fixed the bug in our PR and added the corresponding unit tests.

---------

Signed-off-by: Wei Wu <[email protected]>
Co-authored-by: Masahiro Tanaka <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
Following changes in PyTorch trace rules, my previous PR to avoid graph
breaks caused by the logger is no longer relevant. Instead, I've added
this functionality to torch dynamo -
pytorch/pytorch@16ea0dd
This commit allows the user to configure torch to ignore logger methods and
avoid the associated graph breaks.

To ignore all logger methods, set
os.environ["DISABLE_LOGS_WHILE_COMPILING"] = "1".
To ignore logger methods except for specific ones (for example, info and
isEnabledFor), additionally set
os.environ["LOGGER_METHODS_TO_EXCLUDE_FROM_DISABLE"] = "info, isEnabledFor".
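In script form, the two settings from the commit message (the excluded-method list is just the example given above):

```python
import os

# Ignore all logger methods while compiling to avoid the associated graph breaks...
os.environ["DISABLE_LOGS_WHILE_COMPILING"] = "1"
# ...except for the listed methods, which remain enabled.
os.environ["LOGGER_METHODS_TO_EXCLUDE_FROM_DISABLE"] = "info, isEnabledFor"
```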

Signed-off-by: ShellyNR <[email protected]>
Co-authored-by: snahir <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
The partition tensor doesn't need to move to the current device when
meta load is used.

Signed-off-by: Lai, Yejing <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
…t` (deepspeedai#7069)

With future changes coming to pip/python/etc., we need to stop calling
`python setup.py ...` and replace it:
https://packaging.python.org/en/latest/guides/modernize-setup-py-project/#should-setup-py-be-deleted

![image](https://github.com/user-attachments/assets/ea39ef7b-3cbe-4916-86f0-bc46a5fce96d)

This means we need to install the `build` package, which is added here as
well.

Additionally, we pass the `--sdist` flag to build only the sdist rather
than the wheel.

---------

Signed-off-by: Logan Adams <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
Add deepseekv3 autotp.

Signed-off-by: Lai, Yejing <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
Fixes: deepspeedai#7082

---------

Signed-off-by: Logan Adams <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
…, which indicates whether the MoE expert all-reduce-once optimization was used.

Signed-off-by: gyou2021 <[email protected]>
The latest transformers release causes failures in the cpu-torch-latest test, so we
pin it for now to unblock other PRs.

---------

Signed-off-by: Logan Adams <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
…/runner (deepspeedai#7086)

Signed-off-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
These jobs haven't been run in a long time and were originally used when
compatibility with torch <2 was more important.

Signed-off-by: Logan Adams <[email protected]>
Signed-off-by: gyou2021 <[email protected]>
@gyou2021 gyou2021 force-pushed the autoTP_Qwen2Moe_DeepSeekv2 branch from f3c6b43 to b3c64dd on February 28, 2025 05:34
@gyou2021 gyou2021 changed the title Enabled high-performance Automatic Tensor Parallelism (auto TP) for the Qwen2-MoE and DeepSeek-V2 models on multiple GPUs/HPUs Enabled high-performance Automatic Tensor Parallelism (auto TP) for the MoE models on multiple GPUs/HPUs Feb 28, 2025
Signed-off-by: gyou2021 <[email protected]>