Support per-tensor device mesh at op level #18025
Merged
Conversation
wschin force-pushed the wechi/per-tensor-mesh branch 2 times, most recently from db85449 to 10bb2fd on October 19, 2023 at 18:23
souptc approved these changes on Oct 24, 2023
wschin force-pushed the wechi/per-tensor-mesh branch from 10bb2fd to 1641874 on October 24, 2023 at 17:49
wschin force-pushed the wechi/per-tensor-mesh branch from 1641874 to d358409 on October 25, 2023 at 03:02
wschin changed the title from "Support per-tensor device mesh at op level." to "Support per-tensor device mesh at op level" on Oct 25, 2023
wschin force-pushed the wechi/per-tensor-mesh branch 2 times, most recently from ab6424e to 85e3087 on October 25, 2023 at 20:58
Since Reshape may change the device mesh from, e.g., [0, 1] to [0, 1, 0, 1], we can't assume the same device mesh per op. Lint: resolve style warnings.
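To see where a mesh like [0, 1, 0, 1] can come from, here is a small self-contained numpy illustration (not ORT code; the shapes and mesh are made up): fusing a `[batch, seq]` tensor whose seq axis is sharded across mesh [0, 1] leaves the fused axis striped across devices [0, 1, 0, 1].

```python
import numpy as np

# Hypothetical illustration: fuse [batch, seq] -> [batch * seq]
# when the seq axis is sharded across the 1-D device mesh [0, 1].
batch, seq = 2, 4
mesh = [0, 1]

# Owner of each index along seq: first half on device 0, second half on device 1.
seq_owner = np.repeat(mesh, seq // len(mesh))            # [0, 0, 1, 1]

# After fusing the axes, walk the fused axis and record each block's owner.
fused_owner = np.tile(seq_owner, batch)                  # [0, 0, 1, 1, 0, 0, 1, 1]
block_owners = fused_owner.reshape(-1, seq // len(mesh))[:, 0]

print(block_owners.tolist())  # [0, 1, 0, 1] -- the mesh the fused axis needs
```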
wschin force-pushed the wechi/per-tensor-mesh branch from 85e3087 to 40f985e on October 26, 2023 at 17:23
souptc pushed a commit that referenced this pull request on Oct 27, 2023
This DistributedReshape aims at supporting all sharding patterns encountered in Llama 2. All patterns found are tested in `TestDistributedReshape` in `onnxruntime_test_distributed.py`. This PR implements algorithms to compute the categories below.
- All inputs and outputs are replica, so it's computed like a normal Reshape.
- Two-axis fusion (if any of the inputs and outputs are sharded). This category covers, e.g., `[batch, seq, hidden] -> [batch x seq, hidden]`.
- Two-axis decomposition (if any of the inputs and outputs are sharded). This category covers, e.g., `[batch x seq, hidden] -> [batch, seq, hidden]` (both patterns are sketched in the numpy example below).

Review guideline:
- Ignore the changes in sharding_spec.h and sharding_spec.cc since they come from another PR, #18025.
- First, read onnxruntime_test_distributed.py to get familiar with the input/output of DistributedReshape.
- Second, check the new APIs in reshape.h/reshape.cc that expose the CUDA Reshape kernel to DistributedReshape.
- For DistributedReshape, check its `ComputeInternal` for the 3 categories mentioned above.
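For orientation, here is a tiny numpy sketch of the two reshape patterns named above (made-up shapes; plain arrays stand in for the possibly sharded tensors):

```python
import numpy as np

# Made-up example shapes for [batch, seq, hidden].
batch, seq, hidden = 2, 3, 4
x = np.arange(batch * seq * hidden).reshape(batch, seq, hidden)

# Two-axis fusion: [batch, seq, hidden] -> [batch x seq, hidden]
fused = x.reshape(batch * seq, hidden)

# Two-axis decomposition: [batch x seq, hidden] -> [batch, seq, hidden]
decomposed = fused.reshape(batch, seq, hidden)

assert np.array_equal(decomposed, x)  # the two patterns are inverses
```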
kleiti pushed a commit to kleiti/onnxruntime that referenced this pull request on Mar 22, 2024
Since Reshape may change the device mesh from, e.g., [0, 1] to [0, 1, 0, 1], we can't assume the same device mesh per op. At each operator, we replace a single operator-level device mesh
- `device_mesh_shapes`
- `device_mesh_elements`

with per-tensor device meshes
- `input_device_mesh_shapes` (input_device_mesh_shapes[i] is the device mesh's shape for the i-th input, e.g., "[3]" for a 1-D mesh with 3 devices)
- `input_device_mesh_elements` (input_device_mesh_elements[i] is the flattened device mesh elements for the i-th input, e.g., "[0, 1, 2]" if you have 3 devices in that mesh)
- `output_device_mesh_shapes`
- `output_device_mesh_elements`

Check out the change in onnxruntime_test_distributed.py for examples. It's also heavily used in microsoft#18068's `onnxruntime_test_distributed.py` change.
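Going by the description above, a node carrying per-tensor meshes might be assembled like this with `onnx.helper`; the op type, `com.microsoft` domain, and shard-spec strings below are assumptions based on the referenced test file, not a verbatim copy of it (the authoritative examples live in onnxruntime_test_distributed.py):

```python
from onnx import helper

# Minimal sketch: each per-tensor mesh attribute is a list of strings,
# one entry per input/output, in the "[shape]" / "[elements]" encoding above.
node = helper.make_node(
    "DistributedReshape",                            # assumed op type
    inputs=["data", "shape"],
    outputs=["reshaped"],
    domain="com.microsoft",                          # assumed contrib-op domain
    input_shard_specs=["S[0]R", "R"],                # assumed spec format
    input_device_mesh_shapes=["[2]", "[2]"],         # 1-D mesh of 2 devices per input
    input_device_mesh_elements=["[0, 1]", "[0, 1]"],
    output_shard_specs=["RS[1]R"],
    output_device_mesh_shapes=["[4]"],               # output mesh may differ from inputs',
    output_device_mesh_elements=["[0, 1, 0, 1]"],    # e.g., [0, 1] -> [0, 1, 0, 1]
)
```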
kleiti pushed a commit to kleiti/onnxruntime that referenced this pull request on Mar 22, 2024, carrying the same DistributedReshape commit message as souptc's commit above (with the PR reference rewritten as microsoft#18025).