New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

[LLVMGPUVectorDistribute] Add general support for statically tiled codegen on dynamic shapes #19992

Closed

manupak wants to merge 8 commits into iree-org:main from manupak:distribute-mask-compute-v3

Contributor

manupak commented Feb 14, 2025 •

edited

Loading

This PR adds support to perform statically tiled codegen on dynamic shapes in vector distribute pipeline.
Basically, it could honor lowering configs on dynamic shapes using masking.

Some side-effect changes:

Currently block dynamic dimension pass, change the dimensionality of the generics without performing a projection of the lowering config that was provided higher up in the pipeline. Moreover, the requirement to do this becomes less now as we can tile generally on the dynamic dimension with the changes here -- unless Im missing something here.

This builds on the following PRs -- hence putting to draft:

future work:

(near future) Enable intrinsic distribution with selects.
(near future) Once [MLIR][Vector] Add support for inner-parallel masked multi-reductions llvm/llvm-project#126722 is merged and integrated, we can enable masking post-distribution.
[AMDGPU] once buffer descriptor based pointers are properly introduced, we could rewrite the masked loads to be just loads followed by selects. (This works with the prototype : [Codegen][ROCDL] Use buffer fat pointers where possible #19918)

manupak requested review from MaheshRavishankar, qedawkins, kuhar, Groverkss, antiagainst and hanhanW as code owners

February 14, 2025 16:03

manupak marked this pull request as draft

February 14, 2025 16:04

manupak changed the title ~~[LLVMGPUVectorDistribute] Add general support to statically tiled codegen on dynamic shapes~~ [LLVMGPUVectorDistribute] Add general support for statically tiled codegen on dynamic shapes

manupak force-pushed the distribute-mask-compute-v3 branch 3 times, most recently from e460bdb to afe0147 Compare

February 24, 2025 16:50

manupak added 6 commits

February 26, 2025 02:41


          * Add an option to enable BlockDynamicDimensions on attention

82e7916

  Also, keeping it disabled by default until lowering config
  projection is fixed.
* enable masking in generic vectorization
* add two runs of resolve type to fold tensor.dim in rank reducing
  type.

Signed-off-by: Manupa Karunaratne <[email protected]>


          Add CLI flag to enable distributed

b81b402

masked compute.

Signed-off-by: Manupa Karunaratne <[email protected]>


          add vector distribute pipeline tests for

76ccf17

masked cases.

Signed-off-by: Manupa Karunaratne <[email protected]>


          * enable block dynamic attention in the unit test

fefcfd1

* only enable masking in vectorization in vector distribute

Signed-off-by: Manupa Karunaratne <[email protected]>


          fix tests after fixing mask creation in the bottom most

8646e71

PR.

Signed-off-by: Manupa Karunaratne <[email protected]>


          Remove cli flag block dynamic dims for attention

876a774

and add code not to run on ops where lowering config
is set.

Signed-off-by: Manupa Karunaratne <[email protected]>

manupak force-pushed the distribute-mask-compute-v3 branch from afe0147 to b4c35c3 Compare

February 26, 2025 10:59

manupak marked this pull request as ready for review

February 26, 2025 11:00

manupak force-pushed the distribute-mask-compute-v3 branch from b4c35c3 to 526cfc1 Compare

February 26, 2025 11:01

Contributor Author

manupak commented Feb 26, 2025

@Groverkss this is ready for review


          cleanup

526cfc1

Signed-off-by: Manupa Karunaratne <[email protected]>

Groverkss reviewed

View reviewed changes

compiler/src/iree/compiler/Codegen/Common/BlockDynamicDimensions.cpp

Comment on lines +318 to +324

+                  // If lowering config is set, changing the dimensionality of
+                  // of the op will break the mapping. Therefore, skip operations
+                  // that has lowering config set.
+                  if (op->hasAttrOfType<IREE::Codegen::LoweringConfigAttrInterface>(
+                          "lowering_config")) {
+                    return success();
+                  }

Contributor

Groverkss Feb 26, 2025

I'm guessing this was used while debugging. We should remove this.

Contributor Author

manupak Feb 26, 2025

No ...
BlockDynamicDimension pass deletes the lowering config if it changes the linalg op.
Also the lowering config does not make sense after the dimensionality change.

compiler/src/iree/compiler/Codegen/LLVMGPU/test/ROCDL/pipeline_vector_distribute_gfx942.mlir

Comment on lines +632 to +636

+              // CHECK: %[[MASK:.+]] = vector.create_mask %c1, %{{.+}}, %{{.+}} : vector<1x1x2xi1>
               // CHECK:             vector.transfer_read
-              // CHECK-SAME:        in_bounds = [true, false, false]
-              // CHECK-SAME:        memref<1x?x24xf32
+              // CHECK-SAME:        %[[MASK]]
+              // CHECK-SAME:        in_bounds = [true, true, true]
+              // CHECK-SAME:        memref<196x24x24xf32

Contributor

Groverkss Feb 26, 2025

Interesting. So instead of in_bounds attr, we are relying on masking. Does it produce the same code?

Contributor Author

manupak Feb 26, 2025

they both produce conditional code.

compiler/src/iree/compiler/Codegen/LLVMGPU/test/ROCDL/pipeline_vector_distribute_gfx942.mlir

Comment on lines +1323 to +1329

+              // {indexing_maps = [
+              //   affine_map<(d0, d1, d2, d3, d4) -> (d0, d1, d2, d3)>,
+              //   affine_map<(d0, d1, d2, d3, d4) -> (d0, d1, d4, d3)>,
+              //   affine_map<(d0, d1, d2, d3, d4) -> (d0, d1, d2, d4)
+              // ],
+              // iterator_types = ["parallel", "parallel", "parallel", "reduction", "parallel"]
+              // }

Contributor

Groverkss Feb 26, 2025

Remove commented out code

Contributor Author

manupak Feb 26, 2025 •

edited

Loading

Hmmm I thought its useful to understand the lowering config dimensionality of
QK and PV matmul generics.
Otherwise it appears as a set of magic numbers that is not represented in linalg_ext.attention op.

compiler/src/iree/compiler/Codegen/LLVMGPU/test/ROCDL/pipeline_vector_distribute_gfx942.mlir Outdated

Comment on lines 1356 to 1357

		hal.executable.export public @attention_dynamic_masked ordinal(0) layout(#hal.pipeline.layout<constants = 6, bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly\|Indirect">, #hal.pipeline.binding<storage_buffer, "ReadOnly\|Indirect">, #hal.pipeline.binding<storage_buffer, "ReadOnly\|Indirect">, #hal.pipeline.binding<storage_buffer, "ReadOnly\|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) {
		^bb0(%arg0: !hal.device, %arg1: index, %arg2: index, %arg3: index):

Contributor

Groverkss Feb 26, 2025

nit: We don't need pipeline binding flags like "ReadOnly|Indirect" for tests. Check how other tests do it.

Contributor Author

manupak Feb 26, 2025

done.

compiler/src/iree/compiler/Codegen/LLVMGPU/test/ROCDL/pipeline_vector_distribute_gfx942.mlir

+              }
+              module {
+                hal.executable public @decode_attn_dispatch_0 {

Contributor

Groverkss Feb 26, 2025

This test should be in pipeline_vector_distribute_gfx942_reduction.mlir (or whever the reduction test file is called)

Contributor Author

manupak Feb 26, 2025

Why? all other attention tests are here...

compiler/src/iree/compiler/Codegen/LLVMGPU/test/ROCDL/pipeline_vector_distribute_gfx942.mlir

+              // -----
+              #translation = #iree_codegen.translation_info<pipeline = LLVMGPUVectorDistribute workgroup_size = [256, 1, 1] subgroup_size = 64>
+              #lowering_config = #iree_gpu.lowering_config<{reduction = [0, 0, 0, 0, 0, 512], workgroup = [1, 1, 1, 32, 0, 0]}>

Contributor

Groverkss Feb 26, 2025

Can you use partial_reduction instead of reduction here?

Contributor

Groverkss Feb 26, 2025

Also, we should be tiling the outer K2 dimension to number of warps?

Contributor Author

manupak Feb 26, 2025 •

edited

Loading

This is just an example.
Would you be able to spell out the config that you d like tested here ?

compiler/src/iree/compiler/Codegen/LLVMGPU/test/ROCDL/pipeline_vector_distribute_gfx942.mlir Outdated

Comment on lines 1397 to 1400

+                        %27 = hal.interface.binding.subspan layout(<constants = 6, bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(1) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : !flow.dispatch.tensor<readonly:tensor<4x32x?x128xf16>>{%24}
+                        %28 = hal.interface.binding.subspan layout(<constants = 6, bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(2) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : !flow.dispatch.tensor<readonly:tensor<4x32x128x?xf16>>{%25}
+                        %29 = hal.interface.binding.subspan layout(<constants = 6, bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(3) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : !flow.dispatch.tensor<readonly:tensor<4x32x1x?xf16>>{%26}
+                        %30 = flow.dispatch.tensor.load %22, offsets = [0, 0, 0, 0], sizes = [4, 32, 1, 128], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<4x32x1x128xf16>> -> tensor<4x32x1x128xf16>

Contributor

Groverkss Feb 26, 2025

Can we use a simpler test? I don't think we need all these pipeline.binding flags.

Contributor Author

manupak Feb 26, 2025

I can try; for some reason I thought its needed for the test as every other test in the file.

Contributor Author

manupak Feb 26, 2025

oh you mean remove the flags but keep the hal ?

Contributor Author

manupak Feb 26, 2025

ok Ive removed the flags and made it look more similiar to other tests.


          * cleanup tests

f373b7a

Signed-off-by: Manupa Karunaratne <[email protected]>

Contributor

Groverkss commented Mar 3, 2025

Moving PR to #20144 so I can land new changes to the PR. Thanks for the work @manupak !

Groverkss closed this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Reviewers

Groverkss Groverkss left review comments

MaheshRavishankar Awaiting requested review from MaheshRavishankar MaheshRavishankar is a code owner

qedawkins Awaiting requested review from qedawkins qedawkins is a code owner

kuhar Awaiting requested review from kuhar kuhar is a code owner

antiagainst Awaiting requested review from antiagainst

hanhanW Awaiting requested review from hanhanW hanhanW is a code owner

Labels

None yet