[BACKEND] Linear Layout with stmatrix part 2: support stmatrix for `local_alloc` ops #4763

Jokeren · 2024-09-19T19:23:28Z

This PR enables the use of stmatrix for local_alloc ops through linear layout and removes the legacy code from the TargetInfo class.

jlebar

In general I don't really understand the new code, could use help by way of additional comments.
Have we run performance tests?
Perhaps we can involve someone else from the Triton team so they can start learning this stuff?

jlebar · 2024-09-26T16:16:39Z

include/triton/Dialect/TritonGPU/IR/LinearLayoutConversions.h

+// In the swizzled layout, the leading dimension (i.e., column dimension) is
+// strided by swizzleByteSize.  For example, in a matrix of size 128x128 with a
+// data type of f16, stored in shared memory using 128B-swizzle mode, the offset
+// of the element at index (1, 0) will be 72 due to the stride.  Without


If you have a 128x128 matrix with no swizzling, I would have thought that the element at index (1,0) would be at offset 128. How do we get 64 and 72?

(I also don't see how this thing has to do with swizzling. As described is it just the stride of dimension 1, measured in bytes?)

> If you have a 128x128 matrix with no swizzling, I would have thought that the element at index (1,0) would be at offset 128. How do we get 64 and 72?

My bad... I'll just remove the comment about the offset without swizzling.

72 should be corrected to 128 + 2 bytes/element * 8 (swizzled coordinate) = 144, if we use bytes as the unit of offset.

Jokeren · 2024-09-26T17:36:23Z

> Have we run performance tests?

Yeah, no regression found.

> Perhaps we can involve someone else from the Triton team so they can start learning this stuff?

I'll discuss this issue with Phil and Thomas next week. Sorry about consistently bothering you on reviewing these PRs.

Jokeren · 2024-09-26T17:38:22Z

> In general I don't really understand the new code, could use help by way of additional comments.

Might be better if you could point out to me which parts you're not clear about?

jlebar · 2024-09-27T01:35:13Z

> Sorry about consistently bothering you on reviewing these PRs.

No problem, it would just be good to reduce the bus factor here.

> Might be better if you could point out to me which parts you're not clear about?

Just the ones already pointed out.

Jokeren · 2024-09-27T01:49:26Z

> Just the ones already pointed out.

OK, let me revisit the implementation tomorrow and think about a better explanation.

Jokeren · 2024-09-29T01:23:53Z

Hi @jlebar, the comments have been updated.

ThomasRaoux

LGTM

jlebar · 2024-09-30T16:56:09Z

Sorry, it looks like I had review comments that never made it in. Feel free to ignore if you want, since you already have an LGTM from the Triton team.

jlebar · 2024-09-27T01:20:11Z

include/triton/Dialect/TritonGPU/IR/LinearLayoutConversions.h

+// data type of f16, stored in shared memory using 128B swizzle mode, the offset
+// of the element at index (1, 0) will be 128B + 2B * 8 (vector_width) = 144B
+// due to the stride.  However, if we apply swizzling without a leading offset,
+// the offset would be 2B * 128 (num_columns) + 2B * 8 (vector_width) = 272B.


I almost get it, but maybe if you wrote it as index (x,y) instead of (1,0) then it would be clear? Or is the formula too gross?

The comments have been refactored significantly. They should have the information you want now.
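For readers following the arithmetic in the revised comment, here is a minimal sketch that reproduces only the two byte offsets quoted for index (1, 0). The constant and function names are hypothetical and do not come from the PR. The values (2-byte f16 elements, 128 columns, 128B swizzle mode, a vector width of 8 elements) are taken from the comment text. This is not a general formula for index (x, y).

```cpp
#include <cassert>
#include <cstdint>

// Values quoted in the comment under review (128x128 f16, 128B swizzle mode).
// All names below are illustrative assumptions, not identifiers from Triton.
constexpr int64_t kElemBytes = 2;      // f16 is 2 bytes per element
constexpr int64_t kNumColumns = 128;   // columns in the example matrix
constexpr int64_t kSwizzleBytes = 128; // 128B swizzle mode
constexpr int64_t kVectorWidth = 8;    // elements, as stated in the comment

// Offset of element (1, 0) when the stride is swizzleByteSize:
// 128B + 2B * 8.
constexpr int64_t offsetWithLeadingOffset() {
  return kSwizzleBytes + kElemBytes * kVectorWidth;
}

// Offset of element (1, 0) when swizzling without a leading offset:
// 2B * 128 + 2B * 8.
constexpr int64_t offsetWithoutLeadingOffset() {
  return kElemBytes * kNumColumns + kElemBytes * kVectorWidth;
}

static_assert(offsetWithLeadingOffset() == 144, "matches the 144B in the comment");
static_assert(offsetWithoutLeadingOffset() == 272, "matches the 272B in the comment");
```

The `static_assert`s only confirm that the two sums in the comment add up as written.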

jlebar · 2024-09-27T01:23:19Z

lib/Dialect/TritonGPU/IR/LinearLayoutConversions.cpp

  return true;
 }

-} // anonymous namespace
+std::optional<LinearLayout> chooseStMatrixLayoutLeadingOffset(


Presumably this function should have some unit tests?

Yeah, I will add a test when investigating Peter's issue. It seems like there are still some problems.
#4727

jlebar · 2024-09-27T01:28:04Z

third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/TargetInfo.cpp

+  auto vals = unpackLLVector(loc, val, rewriter);
+  SmallVector<Value> inputs;
+  // Pack the input into 2xf16
+  Type packedTy = vec_ty(vals[0].getType(), 2);


Is `vals[0].getType()` the same as `elemTy`?

jlebar · 2024-09-27T01:28:26Z

third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/TargetInfo.cpp

-    return true;
+  auto vals = unpackLLVector(loc, val, rewriter);
+  SmallVector<Value> inputs;
+  // Pack the input into 2xf16


Do we want to assert that `elemTy` is f16 (or is a 16-bit scalar value or something)?

jlebar · 2024-09-27T01:28:53Z

third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/TargetInfo.cpp

+  for (int i = 0; i < 4; i++) {
+    Value input = undef(packedTy);
+    for (int j = 0; j < 2; j++) {
+      input = insert_element(packedTy, input, vals[i * 2 + j], i32_val(j));


Do we want to assert something about the size of `vals`? Otherwise this could silently read off the end of the array.
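To make the two concerns above concrete (element width and the length of `vals`), here is a standalone sketch of the same 4x2 packing loop with both guards in place. It uses plain `std::vector` and `std::array` instead of MLIR values. `packPairs` and its constants are hypothetical names, so this illustrates the kind of check being asked for, not the code that landed in `TargetInfo.cpp`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the packing loop in the diff: groups eight
// 16-bit values into four 2-element vectors.
template <typename T>
std::vector<std::array<T, 2>> packPairs(const std::vector<T> &vals) {
  // Stand-in for asserting that the element type is a 16-bit scalar.
  static_assert(sizeof(T) == 2, "expected a 16-bit element type");

  constexpr std::size_t kNumPacked = 4; // outer loop bound in the diff
  constexpr std::size_t kPerPack = 2;   // inner loop bound in the diff

  // The loops below read vals[0] through vals[7], so check the size up front
  // instead of silently reading past the end.
  assert(vals.size() == kNumPacked * kPerPack && "expected exactly 8 values");

  std::vector<std::array<T, 2>> inputs;
  inputs.reserve(kNumPacked);
  for (std::size_t i = 0; i < kNumPacked; ++i) {
    std::array<T, 2> packed{};
    for (std::size_t j = 0; j < kPerPack; ++j)
      packed[j] = vals[i * kPerPack + j];
    inputs.push_back(packed);
  }
  return inputs;
}
```

Calling `packPairs` on a vector of eight `uint16_t` values returns four pairs in order. A vector of any other length trips the assertion in a debug build.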

Jokeren · 2024-09-30T17:33:26Z

> I did some searching and I'm not finding where this was copy-pasted from (not sure which old code you're referring to), but yeah, in order to review for correctness I think I need to understand it.

Oh, to be clear, it's not a direct copy and paste.

It's based on the following lines:

triton/lib/Dialect/TritonGPU/IR/LinearLayoutConversions.cpp, line 397 in 1df64d1: `int tileWidthBytes;`

jlebar

Yes, this is much more helpful, thank you!!

Jokeren · 2024-10-01T00:52:44Z

Assertions have been added for the data type and number of elements. Merging this PR into main now.

Jokeren and others added 12 commits September 18, 2024 09:12

33a641b Update
8bd95d7 Update
d90e5d5 Update
98ab9c8 Update
54b3f67 Update
41a6905 Update
7527dbe Update
a72338b Add comments
d2f5de2 Update
4d2b66b Merge branch 'main' into keren/local-alloc
544d5c3 Update comment
f6735fa Merge branch 'keren/local-alloc' of github.com:openai/triton into keren/local-alloc

Jokeren changed the title ~~[DRAFT][BACKEND] Linear Layout with stmatrix part 2: support stmatrix for local_alloc ops~~ [BACKEND] Linear Layout with stmatrix part 2: support stmatrix for local_alloc ops Sep 25, 2024

Jokeren requested review from jlebar and ThomasRaoux September 25, 2024 02:11

37a4e62 Merge branch 'main' into keren/local-alloc

Jokeren marked this pull request as ready for review September 25, 2024 02:13

Jokeren requested review from antiagainst, zhanglx13 and ptillet as code owners September 25, 2024 02:13

jlebar reviewed Sep 26, 2024

db3381c Update

Jokeren added 3 commits September 28, 2024 21:16

7acfaee Update
ed974d0 Update
8f455af Update

Jokeren added 3 commits September 28, 2024 21:24

3c9f28b Update
ff8bf51 Remove iteration in comments
d14afef Update

ThomasRaoux approved these changes Sep 30, 2024

jlebar reviewed Sep 30, 2024

jlebar approved these changes Sep 30, 2024

Jokeren added 2 commits September 30, 2024 20:12

46b1e69 Update
d7be8ba Update

Jokeren merged commit 49266aa into main Oct 1, 2024
7 checks passed

Jokeren deleted the keren/local-alloc branch October 1, 2024 00:53