Distributed Reshape Implementation #18068

Merged
merged 2 commits into main from wechi/d-reshape on Oct 27, 2023
Conversation

wschin
Contributor

@wschin wschin commented Oct 23, 2023

DistributedReshape aims to support all sharding patterns encountered in Llama 2. All patterns found are tested in TestDistributedReshape in onnxruntime_test_distributed.py. This PR implements algorithms to handle the categories below.

  • All inputs and outputs are replicas (fully replicated), so it's computed like a normal Reshape.
  • Two-axis fusion (when any of the inputs or outputs is sharded). This category covers, e.g., [batch, seq, hidden] -> [batch x seq, hidden].
  • Two-axis decomposition (when any of the inputs or outputs is sharded). This category covers, e.g., [batch x seq, hidden] -> [batch, seq, hidden]. A concrete sketch of both sharded cases follows this list.
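
To make the two sharded categories concrete, here is a small NumPy sketch (illustrative only; the shapes, the 2-device mesh, and the choice of sharding the batch axis are assumptions, not values from the PR) showing that, when the sharded axis stays the outer axis of the fused/decomposed pair, each device can run the reshape locally:

```python
import numpy as np

# Illustrative setup (not from the PR): a 1-D device mesh with 2 devices,
# and an input of shape [batch, seq, hidden] sharded on the batch axis.
num_devices = 2
batch, seq, hidden = 4, 3, 5
x = np.arange(batch * seq * hidden).reshape(batch, seq, hidden)
shards = np.split(x, num_devices, axis=0)  # device d owns shards[d]

# Two-axis fusion: [batch, seq, hidden] -> [batch x seq, hidden].
# Because the sharded axis (batch) is the outer axis of the fused pair,
# each device can reshape its local shard independently, with no communication.
local_fused = [s.reshape(-1, hidden) for s in shards]
global_fused = x.reshape(batch * seq, hidden)
assert np.array_equal(np.concatenate(local_fused, axis=0), global_fused)

# Two-axis decomposition is the inverse: [batch x seq, hidden] -> [batch, seq, hidden],
# with the output sharded on the recovered batch axis.
local_decomposed = [s.reshape(-1, seq, hidden) for s in local_fused]
assert np.array_equal(np.concatenate(local_decomposed, axis=0), x)
```

If the inner axis of the pair (seq here) were the sharded one instead, the per-device rows would interleave in the fused layout, so a purely local reshape would not reproduce the global result; reasoning about those sharding specs is what the distributed kernel is for.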

Review guideline:

  • Ignore the changes in sharding_spec.h and sharding_spec.cc since they come from another PR, Support per-tensor device mesh at op level #18025.
  • First, read onnxruntime_test_distributed.py to get familiar with the inputs/outputs of DistributedReshape.
  • Second, check the new APIs in reshape.h/reshape.cc that expose the CUDA Reshape kernel to DistributedReshape.
  • Third, for DistributedReshape, check how its ComputeInternal handles the 3 categories mentioned above (a rough shape-pattern sketch follows this list).
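
As a reading aid only, here is a hypothetical Python helper (not code from this PR; the real kernel also consults the per-tensor sharding specs, and the replica category is decided by those specs rather than by shapes) that recognizes the two shape patterns named above:

```python
def classify_shape_change(src_shape, dst_shape):
    """Hypothetical illustration: detect an adjacent two-axis fusion or decomposition."""
    src, dst = list(src_shape), list(dst_shape)
    # Two-axis fusion: [..., a, b, ...] -> [..., a*b, ...]
    if len(src) == len(dst) + 1:
        for i in range(len(dst)):
            if src[:i] + [src[i] * src[i + 1]] + src[i + 2:] == dst:
                return f"two-axis fusion of axes ({i}, {i + 1})"
    # Two-axis decomposition: [..., a*b, ...] -> [..., a, b, ...]
    if len(dst) == len(src) + 1:
        for i in range(len(src)):
            if dst[:i] + [dst[i] * dst[i + 1]] + dst[i + 2:] == src:
                return f"two-axis decomposition of axis {i}"
    return "no adjacent two-axis fusion/decomposition"


print(classify_shape_change([2, 6, 4], [12, 4]))  # two-axis fusion of axes (0, 1)
print(classify_shape_change([12, 4], [2, 6, 4]))  # two-axis decomposition of axis 0
```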

@wschin wschin force-pushed the wechi/d-reshape branch 2 times, most recently from d3a5ba2 to b0a7569 on October 25, 2023 23:07
@wschin wschin marked this pull request as ready for review October 26, 2023 00:11
@wschin wschin requested a review from souptc October 26, 2023 00:11
wschin added a commit that referenced this pull request Oct 26, 2023
Since Reshape may change the device mesh from, e.g., [0, 1] to [0, 1, 0, 1],
we can't assume a single device mesh per op. At each operator, we replace the
single operator-level device mesh attributes
- `device_mesh_shapes`
- `device_mesh_elements`

with per-tensor device meshes
- `input_device_mesh_shapes` (input_device_mesh_shapes[i] is the device
mesh's shape for the i-th input, e.g., "[3]" for a 1-D mesh with 3
devices)
- `input_device_mesh_elements` (input_device_mesh_elements[i] is the
flattened device mesh elements for the i-th input, e.g., "[0, 1, 2]" if
you have 3 devices in that mesh)
- `output_device_mesh_shapes`
- `output_device_mesh_elements`

Check out the change in onnxruntime_test_distributed.py for examples.
It's also heavily used in #18068's `onnxruntime_test_distributed.py`
change.
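
For illustration only, a hedged sketch of how these per-tensor attributes could be attached to a node with onnx.helper; the attribute names follow the commit message above, while the domain, tensor names, mesh values, and the omission of any other attributes a real node may require (e.g., shard specs) are assumptions of this example:

```python
from onnx import helper

# Hypothetical node construction; values are illustrative, not from the PR.
node = helper.make_node(
    "DistributedReshape",
    inputs=["data", "shape"],
    outputs=["reshaped"],
    domain="com.microsoft",
    # Entry i describes the device mesh of the i-th input tensor.
    input_device_mesh_shapes=["[2]", "[2]"],          # a 1-D mesh with 2 devices per input
    input_device_mesh_elements=["[0, 1]", "[0, 1]"],  # flattened device ids of each mesh
    # The output mesh may differ, e.g., [0, 1] becoming [0, 1, 0, 1] after Reshape.
    output_device_mesh_shapes=["[4]"],
    output_device_mesh_elements=["[0, 1, 0, 1]"],
)
print(node)
```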
@souptc souptc merged commit 9c32310 into main Oct 27, 2023
@souptc souptc deleted the wechi/d-reshape branch October 27, 2023 05:33
kleiti pushed a commit to kleiti/onnxruntime that referenced this pull request Mar 22, 2024
kleiti pushed a commit to kleiti/onnxruntime that referenced this pull request Mar 22, 2024