
Use blockwise_broadcast_reduce in reduction fusions. #1668

Merged
merged 54 commits into ROCm:develop on Oct 15, 2024

Conversation

manupak
Contributor

@manupak manupak commented Oct 3, 2024

[TODO]: add more e2e tests and unit tests for passes.

Currently, we handle reductions that are fused with gemm-like operations by using atomic stores to the destination buffer. This can be cripplingly slow when most of the output is being reduced, as evidenced in layer_norm cases.

This PR adds the ability to run blockwise_broadcast_reduce on the block sub-tiles of the gemm output. However, in order to do that we need to make sure the reduction dimension is uniformly distributed across the blocks. This is achieved by:

  1. Firstly, this PR introduces a utility that, for a given set of upper dimensions, traverses a transform stack and produces, for each lower dimension, the list of sub-dimensions to which the upper reduction axes are mapped.
  2. Then, this PR introduces the ShuffleGemmForReductions pass, which splits and transposes the parallel dimension of the gemm so that the reduction dimension is uniformly distributed across the blocks.
  3. Then, in the AlignTiling pass, we extract the block subtile when fusing in the rock.reduce operator and perform a blockwise_broadcast_reduce on that subtile.
  4. Since we only want to write the partial reductions per block, we pad out the broadcasted part of the subtile. (We rely on the fact that any block coordinate that falls in the padded region within the block will not be written out.)
  5. Then we need to recombine the modified sub-tile coordinate maps with the grid-only coordinate maps:
    a) Here, we drop all the upper dimensions except g_block, m_block and n_block to obtain the grid-only transform map stack.
    b) In parallel, we reuse the getLowerSubDimensions utility to figure out which sub-dimensions map to the above grid-only dimensions.
    c) Then we extract those sub-dimensions in a bottom-up fashion and stitch them onto the said grid-only transform map stack.
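As a toy illustration of the sub-dimension tracking in step 1 (the function name and the merge-only setup are illustrative, not the actual getLowerSubDimensions implementation): when several upper dimensions are merged into one lower dimension, each upper dimension occupies a (stride, size) sub-dimension of the merged dimension, recoverable from the upper sizes alone.

```python
# Toy model of tracking upper dims through a single "merge" transform:
# upper dims of sizes (10, 64, 64) merge into one lower dim of size
# 40960. Each upper dim then corresponds to a (stride, size)
# sub-dimension of the merged lower dim, so a reduction axis among the
# upper dims can be located inside the lower dim.
def merge_subdims(upper_sizes):
    subdims = []
    stride = 1
    for size in reversed(upper_sizes):  # innermost upper dim varies fastest
        subdims.append((stride, size))
        stride *= size
    return list(reversed(subdims))

# For sizes (10, 64, 64), the three upper dims land at strides
# 4096, 64 and 1 within the 40960-wide merged dimension.
print(merge_subdims([10, 64, 64]))  # [(4096, 10), (64, 64), (1, 64)]
```

The real utility has to walk a whole stack of such transforms (merges, pads, transposes), but the per-merge bookkeeping is of this shape.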

I'll try to create some slides to explain all of this.

In the cases I've tested, this yields two orders of magnitude (~100x) gains over the pure-atomics approach to reduction fusions.
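The gain over pure atomics can be seen with a toy NumPy model (illustrative sizes and names, not the actual pass): once each block reduces its own subtile, only one partial value per row per block column has to be combined in global memory, instead of every element.

```python
import numpy as np

# Hypothetical sizes: an (M, N) gemm output reduced over N, tiled into
# (BM, BN) block subtiles.
M, N, BM, BN = 8, 16, 4, 4
out = np.arange(M * N, dtype=np.float64).reshape(M, N)

# Pure-atomics view: every element of a row is atomically added into the
# single destination slot for that row -> M*N (128) atomic stores.
reference = out.sum(axis=1)

# Blockwise view: each (BM, BN) block first reduces its own subtile
# along the local n axis; only one partial per row per block column is
# then combined -> M * (N // BN) (32) atomic stores.
tiles = out.reshape(M // BM, BM, N // BN, BN)  # (m_block, m_local, n_block, n_local)
partials = tiles.sum(axis=3)                   # per-block partial reduce
blockwise = partials.sum(axis=2).reshape(M)    # final combine across block columns

assert np.allclose(blockwise, reference)
```

The shuffle pass exists to make such a clean per-block split possible, by ensuring each block's tile holds an equal slice of the reduction dimension.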

manupak added 30 commits August 20, 2024 12:02
TODO : from creating a new view to represent
partial reduction in a global view.
a blockwise_reduction to reduce the number
of atomic_add s issued.
There is a fractional dim being created due to reduction
being applied on a partial dimension
* add asserts to where fractional dims are created when removing
  upperDims
TODO : support cases where subdim is removed in pad
* Currently it has code to obtain all
  views leading to gemm.
  back to sub dimensions from upper to lower.
Next: output transposes
but I still see blocking non dividability.
we need to re-think transposes
* abort if reduce to gemm is not invertible
* return failure for removeUpperDims divisibility cases
  so that blockwise reductions are not used.
* make the shuffle pass run on the largest reduction
renamed pass
TODO : fix recombine logic
* added another iface for getLowerSubDims
mlir/lib/Dialect/Rock/Transforms/AlignTiling.cpp (outdated review thread, resolved)
LLVM_DEBUG(llvm::dbgs()
<< "readOperand = " << readOperand->get() << "\n");
// Test against the write operand to guard against [MemRead, MemWrite]
if (readOperand && readOperand != writerOperand &&
Collaborator

Doesn't BufferDependencyAnalysis do this?

Collaborator

Or have this cached?

Contributor Author

This is heavily inspired by your code here:

// Test against the write operand to guard against [MemRead, MemWrite]
if (maybeRecursiveReadOperand &&
    maybeRecursiveReadOperand != writeOperand &&
    isa<MemoryEffects::Read>(effect.getEffect())) {
  collectInputFusionWriteOperands(maybeRecursiveReadOperand, bufferDeps,
                                  state);
}
}

I think it's safer to have it here as well.

mlir/lib/Dialect/Rock/utility/transformMapUtils.cpp (outdated review thread, resolved)
mlir/lib/Dialect/Rock/utility/transformMapUtils.cpp (outdated review thread, resolved)
in AlignTiling -- and allow it to fail rather than
asserting if something is not as expected.
@manupak manupak changed the title [PROTOTYPE] Use blockwise_broadcast_reduce in reduction fusions. Use blockwise_broadcast_reduce in reduction fusions. Oct 10, 2024
@manupak manupak marked this pull request as ready for review October 10, 2024 14:15
@manupak manupak requested a review from dhernandez0 October 10, 2024 14:15
@manupak
Contributor Author

manupak commented Oct 10, 2024

Thanks @krzysz00 for all the reviews on this big PR. I really appreciate it.
I have addressed all of them.

I have not added the unit tests for the passes and align-tiling yet... which I will do next.
@dhernandez0 would you be able to take a look here?

%11 = migraphx.mul %4, %4 : <2x32x10x64x64xf32, 1310720x40960x4096x64x1>, <2x32x10x64x64xf32, 1310720x40960x4096x64x1> -> <2x32x10x64x64xf32, 1310720x40960x4096x64x1>
%12 = migraphx.mul %11, %10 : <2x32x10x64x64xf32, 1310720x40960x4096x64x1>, <2x32x10x64x64xf32, 0x0x0x0x0> -> <2x32x10x64x64xf32, 1310720x40960x4096x64x1>
%13 = migraphx.reshape %12 {dims = [2, 32, 40960]} : <2x32x10x64x64xf32, 1310720x40960x4096x64x1> -> <2x32x40960xf32, 1310720x40960x1>
%14 = migraphx.reduce_sum %13 {axes = [2]} : <2x32x40960xf32, 1310720x40960x1> -> <2x32x1xf32, 32x1x1>
Contributor

@dhernandez0 dhernandez0 Oct 11, 2024

nit: add tests where the axis of reduction is not 2. Also, tests where m/n is not reduced at all. This is to test the fixes introduced today.

Contributor Author

Thanks, will add them.
I also need to fix/add unit tests for align-tiling/rock-shuffle-gemm-for-reductions.
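For reference, the migraphx sequence quoted above (square, multiply by an all-zero-stride operand, reshape, reduce_sum over axis 2) can be mirrored in NumPy. Treating the zero-stride operand %10 as a single broadcast scalar is an assumption of this sketch, as is its placeholder value.

```python
import numpy as np

# Mirrors the quoted ops: %11 = mul(x, x); %12 = mul(%11, b);
# %13 = reshape to (2, 32, 40960); %14 = reduce_sum over axis 2.
# The 0x0x0x0x0 strides on %10 are modeled as a scalar broadcast
# (an assumption); 0.5 is a placeholder value.
x = np.random.default_rng(0).standard_normal((2, 32, 10, 64, 64)).astype(np.float32)
b = np.float32(0.5)

sq = x * x                                 # %11
scaled = sq * b                            # %12
flat = scaled.reshape(2, 32, 40960)        # %13 (10 * 64 * 64 == 40960)
reduced = flat.sum(axis=2, keepdims=True)  # %14 -> shape (2, 32, 1)

assert reduced.shape == (2, 32, 1)
```

This also shows why the reduction here is a good stress test for the PR: the reduced extent (40960) spans what were the m/n dimensions of the underlying gemm output.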

@manupak
Contributor Author

manupak commented Oct 14, 2024

@dhernandez0 I've added the promised tests now!

@dhernandez0
Contributor

dhernandez0 commented Oct 14, 2024

@dhernandez0 I've added the promised tests now!

Thanks! I was wondering if we should add a test where the reduction occurs along the G axis only for completeness. I understand it would only involve atomics, but it might be good to include it just in case. I've already approved the PR, so whatever you decide.

@manupak manupak merged commit 1dea35b into ROCm:develop Oct 15, 2024
20 checks passed
dhernandez0 pushed a commit that referenced this pull request Oct 29, 2024