Use blockwise_broadcast_reduce in reduction fusions. #1668
Conversation
* TODO: from creating a new view to represent partial reduction in a global view; use a blockwise_reduction to reduce the number of atomic_adds issued.
* Fixed blocksize.
* There is a fractional dim being created due to reduction being applied on a partial dimension.
* Add asserts where fractional dims are created when removing upperDims.
* TODO: support cases where a subdim is removed in pad.
* …MLIR into fuse-reduction-blockwise-reduce
* Currently it has code to obtain all views leading to gemm, back to sub-dimensions from upper to lower, with DPerBlock.
* Next: output transposes. I still see blocking non-divisibility; need to re-think transposes.
* Abort if reduce-to-gemm is not invertible; return failure for removeUpperDims divisibility cases so that blockwise reductions are not used.
* Renamed pass. TODO: fix recombine logic.
* Added another iface for getLowerSubDims, with additional views.
LLVM_DEBUG(llvm::dbgs()
           << "readOperand = " << readOperand->get() << "\n");
// Test against the write operand to guard against [MemRead, MemWrite]
if (readOperand && readOperand != writerOperand &&
Doesn't BufferDependencyAnalysis do this?
Or have this cached?
This is heavily inspired by your code here:
rocMLIR/mlir/lib/Dialect/Rock/Transforms/Regularize.cpp, lines 282 to 290 in 82378ac
  // Test against the write operand to guard against [MemRead, MemWrite]
  if (maybeRecursiveReadOperand &&
      maybeRecursiveReadOperand != writeOperand &&
      isa<MemoryEffects::Read>(effect.getEffect())) {
    collectInputFusionWriteOperands(maybeRecursiveReadOperand, bufferDeps,
                                    state);
  }
}
}
I think it's safer to have it here in AlignTiling as well, and allow it to fail rather than asserting if something is not as expected.
Thanks @krzysz00 for all the reviews on this big PR. I really appreciate it. I have not added the unit tests for the passes and align-tiling yet, which I will do next.
%11 = migraphx.mul %4, %4 : <2x32x10x64x64xf32, 1310720x40960x4096x64x1>, <2x32x10x64x64xf32, 1310720x40960x4096x64x1> -> <2x32x10x64x64xf32, 1310720x40960x4096x64x1>
%12 = migraphx.mul %11, %10 : <2x32x10x64x64xf32, 1310720x40960x4096x64x1>, <2x32x10x64x64xf32, 0x0x0x0x0> -> <2x32x10x64x64xf32, 1310720x40960x4096x64x1>
%13 = migraphx.reshape %12 {dims = [2, 32, 40960]} : <2x32x10x64x64xf32, 1310720x40960x4096x64x1> -> <2x32x40960xf32, 1310720x40960x1>
%14 = migraphx.reduce_sum %13 {axes = [2]} : <2x32x40960xf32, 1310720x40960x1> -> <2x32x1xf32, 32x1x1>
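The quoted IR squares a tensor, multiplies by a broadcast operand, merges the last three dims via reshape, and reduces over the merged axis. A rough NumPy model of this test case (shapes taken from the snippet; the all-zero-stride operand %10 is modeled as a scalar broadcast, and the data is arbitrary) is:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((2, 32, 10, 64, 64), dtype=np.float32)  # %4
b = np.float32(0.5)                                    # %10 (0x0x0x0x0 strides = broadcast)

sq = x * x                             # %11 = migraphx.mul %4, %4
prod = sq * b                          # %12 = migraphx.mul %11, %10
flat = prod.reshape(2, 32, 40960)      # %13 = migraphx.reshape {dims = [2, 32, 40960]}
red = flat.sum(axis=2, keepdims=True)  # %14 = migraphx.reduce_sum {axes = [2]}
print(red.shape)                       # (2, 32, 1)
```

Note that the reduction axis (2) of the reshaped tensor covers three original dimensions at once, which is exactly the "upper axes map to sub-dimensions of a lower dimension" situation the PR's utility has to untangle.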
nit: add tests where the axis of reduction is not 2. Also, tests where m/n is not reduced at all. This is to test the fixes introduced today.
Thanks, will add them. I also need to fix/add unit tests for align-tiling/rock-shuffle-gemm-for-reductions.
Add n dimension only reduction test case
@dhernandez0 I've added the promised tests now!
Thanks! I was wondering if we should add a test where the reduction occurs along the G axis only, for completeness. I understand it would only involve atomics, but it might be good to include it just in case. I've already approved the PR, so whatever you decide.
Currently, we handle reductions that are fused with gemm-like operations by issuing atomic stores to the destination buffer. This can be cripplingly slow when most of the output is being reduced, as evidenced in layer_norm cases. This PR adds the ability to perform a blockwise_broadcast_reduce on the block sub-tiles of the gemm output. However, in order to do that, we need to make sure the reduction dimension is uniformly distributed across the blocks. This is achieved as follows:

* First, this PR introduces a utility that, for a given set of upper dimensions, traverses a transform stack and produces, per lower dimension, the list of sub-dimensions that the upper reduction axes map to.
* Then, this PR introduces the ShuffleGemmForReductions pass, which splits and transposes the parallel dimension of the gemm so that the reduction dimension is uniformly distributed across the blocks.
* Then, in the AlignTiling pass, we extract the block sub-tile when fusing in the rock.reduce operator and perform a blockwise_broadcast_reduce on it. Since we only want to write the partial reductions per block, we pad out the broadcasted part of the sub-tile. (We rely on the fact that any block coordinate that falls in the padded region within the block will not be written out.)
* Then we recombine the modified sub-tile coordinate maps with the grid-only coordinate maps:
  a) We drop all the upper dimensions except g_block, m_block and n_block to obtain the grid-only transform map stack.
  b) In parallel, we re-use the getLowerSubDimensions utility to figure out which sub-dimensions those grid-only dimensions map to.
  c) Then we extract those sub-dimensions in a bottom-up fashion and stitch them together with the said grid-only transform map stack.
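The core win described above can be illustrated with a toy model (illustrative only; the actual pass rewrites rock dialect transform maps, not arrays): if every block first reduces its own sub-tile locally, it only needs one atomic_add per block instead of one per element.

```python
import numpy as np

M_PER_BLOCK = 8                            # hypothetical block tile size
out = np.arange(64, dtype=np.float32)      # toy flattened gemm output, fully reduced
blocks = out.reshape(-1, M_PER_BLOCK)      # each row models one block's sub-tile

# Pure-atomics approach: one atomic_add per output element.
naive_atomics = out.size                   # 64

# Blockwise reduce first: each block reduces its sub-tile in LDS, then
# issues a single atomic_add with its partial sum.
partials = blocks.sum(axis=1)
blockwise_atomics = partials.size          # 8

print(naive_atomics, blockwise_atomics)    # 64 8
```

This only works if each block's sub-tile holds a full, evenly sized slice of the reduction axis, which is why the shuffle step must first distribute the reduction dimension uniformly across blocks.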
[TODO] : add more e2e tests and unit tests for passes.
I'll try to create some slides to explain all of this.
In the cases I've tested, this yields two orders of magnitude (~100x) gains over the pure-atomics approach to reduction fusions.
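To make the sub-dimension bookkeeping concrete, here is a hypothetical toy version of the idea behind the getLowerSubDimensions utility (not the actual rocMLIR API): when several upper dims are merged into one lower dim, each upper dim becomes a sub-dimension of the lower one with a stride equal to the product of the faster-varying merged sizes.

```python
def lower_sub_dims(merged_sizes, reduction_axes):
    """Toy model: merged_sizes lists the upper dims (slowest-varying first)
    merged into a single lower dim. Returns {upper_axis: (stride, size)}
    describing where each reduction axis lands inside the lower dim."""
    sub = {}
    stride = 1
    for axis in reversed(range(len(merged_sizes))):
        if axis in reduction_axes:
            sub[axis] = (stride, merged_sizes[axis])
        stride *= merged_sizes[axis]
    return sub

# Upper dims (10, 64, 64) merged into a lower dim of size 40960, as in the
# reshape from the test IR; suppose axes 1 and 2 are being reduced:
print(lower_sub_dims([10, 64, 64], {1, 2}))  # {2: (1, 64), 1: (64, 64)}
```

The real utility additionally walks a whole stack of transforms (pads, transposes, splits), but the per-merge arithmetic is the same.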