[Pipeliner] Multi-buffer TMA descriptors #5290
Conversation
PR chain (git-pr-chain, branch pipeliner_multi_buffer_tma_descriptors_59c9):
1. #5288 [NFC] Remove unused forOp argument from `setStageCluster`
2. #5289 [TESTING] Add golden sample test for pipelining matmul with descriptors
3. #5290 [Pipeliner] Multi-buffer TMA descriptors (this PR)
constexpr inline int TMA_ALIGN = 128;

template <typename BuilderT>
mlir::LogicalResult createTMADesc(mlir::Value tmaPtr,
This is simply factored out from TMALowering.cpp with minimal changes.
@@ -31,6 +31,7 @@ void init_triton_passes_common(py::module &&m) {
  ADD_PASS_WRAPPER_0("add_canonicalizer", createCanonicalizerPass);
  ADD_PASS_WRAPPER_0("add_cse", createCSEPass);
  ADD_PASS_WRAPPER_0("add_licm", createLoopInvariantCodeMotionPass);
  ADD_PASS_WRAPPER_0("print_ir", createPrintIRPass);
This is useful for debugging: it lets you print the IR at a specific point in the pipeline.
We are going to move the pipeliner to be an "uber pass", so unfortunately this won't help in those cases anymore.
It's also useful for generating the IR for golden sample tests. Just write a Python script and add the prints before the passes we care about.
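A rough sketch of that kind of script, assuming the new binding is exposed as passes.common.print_ir (matching the ADD_PASS_WRAPPER_0 line in this hunk); the pass list is illustrative and module construction is left to the caller.

from triton._C.libtriton import ir, passes

def dump_ir_before_licm(mod):
    # Build a pass manager over the module's MLIR context.
    pm = ir.pass_manager(mod.context)
    pm.enable_debug()
    passes.common.add_canonicalizer(pm)
    passes.common.add_cse(pm)
    # Dump the IR right before the pass we want a golden sample for.
    passes.common.print_ir(pm)
    passes.common.add_licm(pm)
    pm.run(mod)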
@@ -426,6 +427,9 @@ def matmul_kernel_device_tma_persistent(workspace_ptr, #
    num_pid_in_group = GROUP_SIZE_M * num_pid_n

    accumulator = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32)
    # Create an opaque value to prevent the descriptor creation from being
    # hoisted out of the loop
    zero = tl.inline_asm_elementwise("mov.b32 $0, 0;", "=r", [], dtype=tl.int32, is_pure=True, pack=1)
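For context, a hedged sketch of the situation the opaque value guards against; the rest of the kernel is not shown in this hunk, so the shapes, strides, and names below are assumed, and the descriptor API spelling (tl.make_tensor_descriptor) may differ from the version under review. With value-based descriptors, a creation whose operands are all loop-invariant can be hoisted out of the loop by LICM, and the `zero` above exists to prevent that so the pipeliner's in-loop descriptor path keeps being exercised.

import triton
import triton.language as tl

@triton.jit
def desc_created_per_iteration(a_ptr, M, K, BLOCK_M: tl.constexpr, BLOCK_K: tl.constexpr):
    for k in range(0, tl.cdiv(K, BLOCK_K)):
        # All operands here are loop-invariant, so LICM is free to hoist this
        # creation out of the loop; per the hunk above, the test mixes an
        # opaque value into the computation to keep the creation in the body.
        a_desc = tl.make_tensor_descriptor(a_ptr, shape=[M, K], strides=[K, 1],
                                           block_shape=[BLOCK_M, BLOCK_K])
        a = a_desc.load([0, k * BLOCK_K])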
One cool thing about tensor descriptors being IR values is that they can now be hoisted by LICM.
I agree, it is quite nice to have value-based tensor descriptors.
Why do we want to block LICM in this case?
Looks good to me. Added a couple of questions.
It would be good if @pawelszczerbuk could take a look at the pipelining as well.
// TODO peter: walk to loop yield to find the init value if this is a
// loop-carried value. That would save us from allocating another buffer
// just for the init value
Why do we allocate an extra buffer for the init value in this case? Wouldn't the buffer be allocated only at the place where we create a descriptor?
The initial loop value will be created outside the loop, most likely by a call to tt.make_tensor_descriptor. Since this only replaces descriptor creation inside the loop, the initial value will be missed and lowered by the fallback lowering instead.
Ah interesting, yes, the way the original loop is written is a bit weird, as the first tt.make_tensor_descriptor is outside the loop. I wonder if we should write it differently to make it more friendly to the pipeliner, even if that means having a separate if block originally. But that's a detail, probably not worth looking at now.
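To make the scenario in this thread concrete, a hedged sketch of a loop-carried descriptor (shapes, strides, and names assumed; the descriptor API spelling may differ from the version under review): the value feeding the loop is created outside it, so only the creation inside the loop is rewritten to the multi-buffered descriptor memory, while the init value goes through the fallback lowering and currently needs its own buffer.

import triton
import triton.language as tl

@triton.jit
def loop_carried_descriptor(a_ptr, M, K, BLOCK_M: tl.constexpr, BLOCK_K: tl.constexpr):
    # Init value of the loop-carried descriptor: created outside the loop, so
    # it is not covered by the in-loop rewrite and takes the fallback lowering.
    a_desc = tl.make_tensor_descriptor(a_ptr, shape=[M, K], strides=[K, 1],
                                       block_shape=[BLOCK_M, BLOCK_K])
    for k in range(0, tl.cdiv(K, BLOCK_K)):
        a = a_desc.load([0, k * BLOCK_K])
        # Re-created inside the loop: this is the creation the pass rewrites
        # into the multi-buffered TMA descriptor storage.
        a_desc = tl.make_tensor_descriptor(a_ptr, shape=[M, K], strides=[K, 1],
                                           block_shape=[BLOCK_M, BLOCK_K])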