[BACKEND] Replace `isMmaToDotShortcut` with linear layout based logic #4951

Jokeren · 2024-10-18T20:07:04Z

This PR fixes the cvtReordersRegisters method, which previously could not return true for two layouts with different numbers of registers. With this update, we can remove the legacy isMmaToDotShortcut and its associated shortcut conversion.

Additionally, we store the dot operand results in the access order to improve code clarity.

Going forward, we intend to eliminate unnecessary shortcut conversions and replace them with the use of transferWithinThread.

Jokeren · 2024-10-21T02:43:30Z

@zhanglx13 and @antiagainst you may want to take a look as well.
From this PR onward, I think we can try to remove all shortcut functions, including the ones introduced by AMD.
As long as a layout has a correct and well-defined linear layout, I think cvtReordersRegisters is sufficient to determine if we can permute registers to perform a conversion without shared memory.
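To make the idea concrete, here is a minimal sketch of what "the conversion only reorders registers" means. It uses a toy layout in which the element owned by a (register, lane) pair is the XOR of basis vectors. `ToyLayout`, `elementsHeldBy`, and `toyCvtReordersRegisters` are hypothetical names for illustration only; this is a brute-force check, not Triton's `LinearLayout` or its `cvtReordersRegisters` implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Toy linear layout over GF(2): the element index owned by (reg, lane) is the
// XOR of the basis vectors selected by the set bits of reg and lane.
// Hypothetical stand-in for illustration, not Triton's LinearLayout class.
struct ToyLayout {
  std::vector<uint32_t> regBases;  // image of each register-index bit
  std::vector<uint32_t> laneBases; // image of each lane-index bit
};

// XOR together the basis vectors picked out by the set bits of `idx`.
uint32_t applyBases(const std::vector<uint32_t> &bases, uint32_t idx) {
  assert(bases.size() < 32 && "toy layout supports fewer than 32 index bits");
  uint32_t out = 0;
  for (std::size_t bit = 0; bit < bases.size(); ++bit)
    if (idx & (1u << bit))
      out ^= bases[bit];
  return out;
}

// Distinct elements that `lane` holds across all of its registers.
std::set<uint32_t> elementsHeldBy(const ToyLayout &layout, uint32_t lane) {
  std::set<uint32_t> elems;
  uint32_t laneOffset = applyBases(layout.laneBases, lane);
  for (uint32_t reg = 0; reg < (1u << layout.regBases.size()); ++reg)
    elems.insert(laneOffset ^ applyBases(layout.regBases, reg));
  return elems;
}

// True if every lane holds the same set of elements in both layouts, so the
// conversion can be done by moving values between registers of one thread.
// The two layouts may use different numbers of registers.
bool toyCvtReordersRegisters(const ToyLayout &src, const ToyLayout &dst) {
  if (src.laneBases.size() != dst.laneBases.size())
    return false;
  for (uint32_t lane = 0; lane < (1u << src.laneBases.size()); ++lane)
    if (elementsHeldBy(src, lane) != elementsHeldBy(dst, lane))
      return false;
  return true;
}
```

In this toy model, a layout with register bases `{1, 2, 0}` has eight registers but holds the same four elements per lane as one with bases `{1, 2}`, so the check still succeeds even though the register counts differ.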

Jokeren · 2024-10-21T02:47:38Z

@ThomasRaoux feel free to run a regression test on the PR.

I don't think there should be any issues since I only changed the register access order, but I just wanted to catch potential problems early.

lezcano · 2024-10-21T16:22:02Z

third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/UpcastMXFPToLLVM.cpp

@@ -80,19 +80,6 @@ class UpcastMXFPOpPattern : public ConvertOpToLLVMPattern<UpcastMXFPOp> {
        ret.push_back(v);
      }
    }
-    // FIXME [Dot LL]


lezcano · 2024-10-21T16:23:35Z

third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/DotOpToLLVM/MMAv2.cpp

@@ -75,9 +75,39 @@ ValueTableV2 getValuesFromDotOperandLayoutStruct(

    // For kWidth = 8, split the mma into 4 mmas with "stride 4" along K
    if (dot.getOpIdx() == 0) {
-      si = llvm::SmallVector<unsigned>{0, 8,  4, 12, 1, 9,  5, 13,
-                                       2, 10, 6, 14, 3, 11, 7, 15};
+      // Original register layout:


Thank you for making the comments more explicit!
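For readers unfamiliar with this kind of index vector, the following is a minimal sketch of how a vector like the removed `si` can reorder register values. It assumes the vector is applied as a gather (output slot `i` receives input `si[i]`), which the quoted hunk does not confirm. `reorderByIndex` and `kRemovedOrder` are hypothetical names, not part of Triton.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: gather `vals` through an index vector, so that output
// slot i receives vals[order[i]]. Whether Triton applied `si` as a gather or
// as a scatter is an assumption of this sketch.
template <typename T>
std::vector<T> reorderByIndex(const std::vector<T> &vals,
                              const std::vector<unsigned> &order) {
  assert(vals.size() == order.size() && "index vector must cover every value");
  std::vector<T> out;
  out.reserve(vals.size());
  for (unsigned src : order) {
    assert(src < vals.size() && "index out of range");
    out.push_back(vals[src]);
  }
  return out;
}

// The index vector from the removed lines in the hunk above
// (the dot.getOpIdx() == 0 branch).
const std::vector<unsigned> kRemovedOrder = {0, 8,  4, 12, 1, 9,  5, 13,
                                             2, 10, 6, 14, 3, 11, 7, 15};
```

Under the gather assumption, the first four output slots take inputs 0, 8, 4, and 12, the next four take 1, 9, 5, and 13, and so on.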

lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp

lezcano

After thinking a bit about it, I think I understand why padding fixes the issues we were seeing when the inputs and outputs have a different number of registers.
The issue stems from the function

triton/lib/Tools/LinearLayout.cpp

Line 119 in 1064b59

getInjectiveMat(const LinearLayout &layout) {

This function makes both matrices injective by extending their codomains. This is an issue if the inDims have different dimensions, because the extended codomains will then differ, whereas the Gaussian elimination only makes sense when both matrices share the same codomain!

The padding patch mitigates this in the cases we found in practice, as it so happens that padding matches perfectly all the free variables from the two matrices, so getInjectiveMat turns this transformation into the identity, which is perfect.

I don't think that this is the correct approach in general, but it's clearly an improvement over the previous state, so approving. I think I have a solution for the general problem, but I'll implement that at a later point.
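As a toy illustration of the padding idea, the sketch below assumes a free register is modeled as a zero basis vector, meaning a register-index bit that does not change which element is addressed. `Bases`, `padFreeRegisters`, and `numNonFree` are hypothetical names; this is not the code from the PR and it does not reproduce `getInjectiveMat`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Register basis vectors of a toy layout; a zero entry stands for a free
// register bit (an assumption of this sketch).
using Bases = std::vector<uint32_t>;

// Append zero (free) basis vectors to the shorter list so that both layouts
// end up with the same number of register-index bits.
void padFreeRegisters(Bases &src, Bases &dst) {
  std::size_t n = std::max(src.size(), dst.size());
  src.resize(n, 0);
  dst.resize(n, 0);
  assert(src.size() == dst.size());
}

// Count the register bits that actually select different elements.
std::size_t numNonFree(const Bases &bases) {
  return static_cast<std::size_t>(std::count_if(
      bases.begin(), bases.end(), [](uint32_t v) { return v != 0; }));
}
```

After padding, both lists have the same length and the same count of non-free bits, which is the situation the comment describes as the one where the padded free variables line up.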

Also, thank you @Jokeren for finding the adversarial examples and adding tests for them!

author Jokeren <[email protected]> 1729209426 -0400 committer Jokeren <[email protected]> 1729880369 -0400 Update Update Update Update Update Update Update Update Update Update Update Update Update Update Update Update Update Update Update Update Update Update Update

lezcano

The diff is a bit big with quite a few unnecessary changes (packI32 -> packI32s etc). It'd be nice to keep the diff small if possible, but perhaps it's fine for this PR given that we're going to keep editing this code and most of what this PR adds will eventually be removed.

Feel free to merge otherwise.

lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM.cpp

…triton-lang#4951) This PR removes the legacy `isMmaToDotShortcut` and its associated shortcut conversion.

…triton-lang#4951) This PR removes the legacy `isMmaToDotShortcut` and its associated shortcut conversion. (cherry picked from commit 1d5fdfe)

Cherry pick list:
- #4925
- #5053
- #5019
- #5002
- #4935 - required additional cherry picks #4991 and #4951
- #4998
- #4925
- #5281
- #5308
- All previous LLVM hash PRs before #5308

---------

Co-authored-by: Ilya V <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>
Co-authored-by: Lixun Zhang <[email protected]>
Co-authored-by: Keren Zhou <[email protected]>
Co-authored-by: Alexander Efimov <[email protected]>
Co-authored-by: Kyle Wang <[email protected]>
Co-authored-by: Jungwook Park <[email protected]>
Co-authored-by: peterbell10 <[email protected]>
Co-authored-by: Hongtao Yu <[email protected]>

* [BACKEND][NVIDIA] Add Lowering for Shared-to-MMAv3-DotOp Copy (triton-lang#5009)

  Allows for upcasting in DotOp encoding in RF. This lowering path is not currently in use; pending triton-lang#5003

  (cherry picked from commit cfddb09)

* [AMD] Add initial support for scaled_dot(mxfp8, fp8) (triton-lang#4994)

  This commit adds initial support for scaled_dot with mxfp8 LHS and fp8 RHS. It supports both mfma32 and mfma16 intrinsic variants. Right now we are missing software emulation for the `Float8E4M3FN` type, so this is only enabled for `Float8E5M2`.

  (cherry picked from commit 3549db8)

* [Frontend][Backend] Implement support for scale_dot(-, bf16) (triton-lang#4996)

  In passing we also improve a few other things:
  - Now `scaled_dot` accepts both uint8/uint16 fp8/bf16 as inputs (before you had to cast it to uint8, which was weird when extending it to bf16).
  - Add `scaled_dot` to the docs and improve the docs overall (have not rendered them, might need a few further tweaks)

  (cherry picked from commit 23c9ec1)

* [BACKEND] Improve detection of register to register conversion (triton-lang#4991)

  Specifically, it fixes problems when `srcLayout` and `dstLayout` have different numbers of registers but the same number of non-free registers. We solved the problem by padding free registers to either `srcLayout` or `dstLayout`, but this can be improved by fixing the `invertAndCompose` function.

  (cherry picked from commit 15c5e55)

* [BACKEND] Replace `isMmaToDotShortcut` with linear layout based logic (triton-lang#4951)

  This PR removes the legacy `isMmaToDotShortcut` and its associated shortcut conversion.

  (cherry picked from commit 1d5fdfe)

* [BACKEND]Fix DotOperand(Ampere) LinearLayoutConversion (triton-lang#5038)

  We also clean up `TritonGPU/IR/Dialect.cpp` a bit, using some auxiliary functions to make the intentions a bit clearer. We add a few asserts in the `LinearLayoutConversion` to make sure it's clear why we do certain things here and there. We also kill `getCvtOrder`, as it was not used anywhere.

  (cherry picked from commit 56584c4)

* [BACKEND] Fix uses of getOrder(DotOperand(Nvidia) and MMA(Nvidia)) (triton-lang#5055)

  We use `getOrder` very liberally throughout the codebase, when we really meant to use `getThreadOrder`. This is an issue when the input layout is a `DotOperand(mma(opIdx=1))`, where the thread order and the matrix order are opposite.

  Found this to be an issue when a PR changed the `getOrder` of `DotOperand(Hopper)` to an incorrect one and CI still passed! The issue here is that the LLVM lowering for wgmma and the LinearLayout do not use `getOrder`, but many other subsystems do, and many heuristics would be getting an incorrect order, and potentially be disabled. This is particularly problematic for `DotOperand(opIdx=1)` on NVIDIA hardware, as `getThreadOrder` and `getOrder` are different!

  While doing so we:
  - Audit most (all?) of the calls to `getOrder(dotOperand)`. It turns out that most of them really meant `getThreadOrder`
  - Fix the ordering methods of `SliceEncodingAttr` to be consistent
  - Move the implementation of `getWarpOrder` to the Attr classes, because of OOP

  The test strategy was to add `llvm::report_fatal_error("Testing");` within `getOrder(nvidiaMma)` and `getOrder(DotOperand(nvidiaMma))` and triage all errors that were raised in CI.

  (cherry picked from commit 38a11b8)

* [AMD] Reland instruction scheduling hint changes (triton-lang#4940)

  This commit relands triton-lang#4819 with the following fixes:
  * Changed to a better way to mark opIdx for loads
  * Replaced template-based `rewindUnaryOps` with regular for-loops. The new way is more robust and can handle other unary ops automatically.
  * Replaced `instr.sched.barriers` with the ones from the `rocdl` dialect in upstream MLIR
  * Extended lit tests

  (cherry picked from commit ee5876c)

* [AMD] Enable scaled_dot(-, bf16) (triton-lang#5029)

  (cherry picked from commit f062540)

* [AMD] Add support for scaled_dot(mxfp4, -) (triton-lang#5034)

  This commit adds support for an mxfp4 typed A tensor for scaled dot in the AMD backend. We moved the `convertMxfp4x2ToBf16x2` impl from the NVIDIA side to a common path for reuse.

  (cherry picked from commit edc5c5c)

* [BACKEND] Minor Bugfixes for SharedToDotOperand MMAv3 (triton-lang#5030)

  Two bugfixes following triton-lang#5009.
  - When `BLOCK_M=64` and `num_warps > 4`, the order of warps for DotOpEncoded tensor should be M-major instead of N-major, since WGMMA expects the 4 warps in each warp group to be stacked along the M dimension.
  - Should use `mmaBitwidth` instead of `bitwidth` when calculating `numRep` in `SharedToDotOperandMMAv2OrV3`. This was missed in a bad rebase.

  @lezcano I encountered these bugs when attempting to locally test the [DotOp hoisting PR](triton-lang#5003) after rebasing (they normally would be caught by `test_core.py` but that path was not yet enabled in the last PR). With these fixes added, I was able to successfully validate against pytorch.

  (cherry picked from commit e82dfd9)
  (cherry picked from commit 5287a68)

* [BACKEND] Get rid of unpack/pack I32 (triton-lang#5044)

  - Removed functions related to unpacking and packing I32 values.
  - Updated utilities to handle conversion of mxfp4 values without packing/unpacking I32.
  - Move the register value ordering logic from the element-wise operation lowering to the dot operation lowering.
  - Use linear layout to handle conversions between almost all distributed layouts.
  - Clean up data loading and mma computation involving `repN`, `repK`, and `repM`.

  (cherry picked from commit 1cf7b1b)
  (cherry picked from commit 376fe7e)

* Consolidate `getOrder` as "element order" and implement `getRepOrder` for general and NVIDIA layouts (triton-lang#5089)

  This partially reverts commit 38a11b8. Supersedes triton-lang#5085

  It also documents that we are implicitly choosing a way to tile a full tensor depending on the layout. See triton-lang#5085 (comment)

  (cherry picked from commit 57643b3)
  (cherry picked from commit ffb2032)

---------

Co-authored-by: Gary Geng <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>
Co-authored-by: Mario Lezcano Casado <[email protected]>
Co-authored-by: Keren Zhou <[email protected]>
Co-authored-by: ravil-mobile <[email protected]>

Jokeren marked this pull request as ready for review October 21, 2024 02:37

Jokeren requested a review from ptillet as a code owner October 21, 2024 02:37

Jokeren requested review from antiagainst, lezcano and ThomasRaoux October 21, 2024 02:40

Jokeren requested a review from zhanglx13 October 21, 2024 02:43

Jokeren mentioned this pull request Oct 21, 2024

[AMD] MFMA Dot operand to LinearLayout conversion #4961

Merged

lezcano reviewed Oct 21, 2024

lezcano approved these changes Oct 23, 2024

lezcano mentioned this pull request Oct 23, 2024

LHS Registers Part 1 - DotOp Hoisting and SMEM-RF Copy Lowering openxla/triton#18

Closed

parent 15c5e55

d04e872

Jokeren force-pushed the keren/ll-shortcut branch from 251d656 to d04e872 Compare October 25, 2024 18:23

Jokeren added 14 commits October 25, 2024 14:31

Update

a41d6c3

Update

0429cda

Update

9ec0f6d

Update

df4d7c8

Update

e498201

Update

6737e8b

Update

c67e094

Update

460545a

Update

41bb139

Update

10cc476

Update

a466171

Update

288848e

Update

fe496e8

Update

ac68b55

lezcano approved these changes Oct 28, 2024

lib/Conversion/TritonGPUToLLVM/ConvertLayoutOpToLLVM.cpp (review thread resolved)

Jokeren merged commit 1d5fdfe into main Oct 28, 2024
7 checks passed

Jokeren deleted the keren/ll-shortcut branch October 28, 2024 15:08

AlexAUT pushed a commit to AlexAUT/triton that referenced this pull request Oct 29, 2024

[BACKEND] Replace isMmaToDotShortcut with linear layout based logic (…

0ae4fc4

…triton-lang#4951) This PR removes the legacy `isMmaToDotShortcut` and its associated shortcut conversion.

Luosuu pushed a commit to Luosuu/triton that referenced this pull request Nov 13, 2024

[BACKEND] Replace isMmaToDotShortcut with linear layout based logic (…

e0a4680

…triton-lang#4951) This PR removes the legacy `isMmaToDotShortcut` and its associated shortcut conversion.

guacamoleo pushed a commit to guacamoleo/triton that referenced this pull request Nov 14, 2024

[BACKEND] Replace isMmaToDotShortcut with linear layout based logic (…

e51afd5

…triton-lang#4951) This PR removes the legacy `isMmaToDotShortcut` and its associated shortcut conversion.

jataylo mentioned this pull request Nov 19, 2024

[AMD] release/3.2.x AMD perf cherry picks #5191

Merged

jataylo mentioned this pull request Dec 13, 2024

[CP] Many layout cherry picks for gemm perf ROCm/triton#683

Merged
