Consolidate `getOrder` as "element order" and implement `getRepOrder` for general and NVIDIA layouts #5089

lezcano · 2024-11-06T23:31:23Z

This partially reverts commit 38a11b8.
Supersedes #5085

It also documents that we are implicitly choosing a way to tile a
full tensor depending on the layout. See #5085 (comment)

lezcano · 2024-11-06T23:35:58Z

include/triton/Dialect/TritonGPU/IR/TritonGPUAttrDefs.td

@@ -474,6 +474,9 @@ layout = [0  4  8  12]
         [3  7  11 15]

 For the Threads Per Warp and Values Per Thread level, the linear id distribution is variant for each sub-class encoding.
+
+If the layout does not completely cover the tensor, we tile it until we cover the entire tensor.
+The order of this tiling is, incidentally, the same as the order of the elements, although this may change in the future.


Is this true for AMD layouts @zhanglx13? If not, what can we say about the tiling order of distributed layouts in general?

zhanglx13 · 2024-11-07

Unfortunately, this is not the case for the AMDMfma layout when isTransposed=False.
How about we say "the order of this tiling is the same as the order of how warps are distributed in each CTA"?

lezcano · 2024-11-07 (edited)

Sadly, that wouldn't be correct for wgmma, where the warps are in M-major order (due to the warp-group requirement) and then the reps are packed in N-major order:

triton/third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/DotOpToLLVM/WGMMA.cpp, lines 423 to 424 in 1070ca2:

    loadReg(rewriter, loc, fc, (m * numRepN + n) * accSize, accSize,
            startSequence);

Also, it's not very clear what "the order" of the warps is in the DotOperandEncoding when they broadcast (you can define an order, but it's not the most intuitive notion in the world).

I think the best we can do here is to just add that AMDMfma(isTransposed=False) is an exception.
A better long-term solution would be to add a method getRepOrder (we use the term "rep" throughout the codebase, as in getRepForOperand) that returns this property. At the very least, we would use it when converting to LinearLayouts and assert that it has the expected value when lowering the mma* / mfma* instructions.
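To make the indexing in the quoted WGMMA.cpp line concrete, here is a minimal sketch of how a rep index can be linearized from an order that lists dimensions from fastest-varying to slowest-varying. The helpers `linearizeRep` and `accOffset` are hypothetical names for illustration and are not part of Triton; the only piece taken from the thread is the expression `(m * numRepN + n) * accSize`, which corresponds to N varying fastest.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not a Triton API): linearize the coordinates of a
// repetition ("rep") given the number of reps per dimension and an order
// that lists dimensions from fastest-varying to slowest-varying.
inline unsigned linearizeRep(const std::vector<unsigned> &coords,
                             const std::vector<unsigned> &numReps,
                             const std::vector<unsigned> &order) {
  assert(coords.size() == numReps.size() && coords.size() == order.size());
  unsigned linear = 0;
  unsigned stride = 1;
  for (unsigned dim : order) {
    assert(dim < coords.size() && coords[dim] < numReps[dim]);
    linear += coords[dim] * stride;
    stride *= numReps[dim];
  }
  return linear;
}

// Register offset of the accumulator chunk for rep (m, n). With
// order = {1, 0} (dimension 1, i.e. N, varies fastest) this reduces to
// (m * numRepN + n) * accSize, the expression in the quoted snippet.
inline unsigned accOffset(unsigned m, unsigned n, unsigned numRepM,
                          unsigned numRepN, unsigned accSize) {
  return linearizeRep({m, n}, {numRepM, numRepN}, {1, 0}) * accSize;
}
```

Swapping the order to `{0, 1}` would make M the fastest-varying dimension instead, which is why the rep order has to be stated separately from the warp order for wgmma.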

lezcano · 2024-11-07

Actually, I implemented this in 522bce0 for all but the AMD layouts. I'll let the AMD guys implement it for theirs if they deem this useful.

zhanglx13 · 2024-11-07

Agreed that getRepOrder is much more robust. We'll work on the implementation for the AMD layouts in a future PR. Thanks @lezcano


oplavsic · 2024-11-07T12:26:07Z

@lezcano Can you add a lit test from my PR here, to make sure store vectorization happens and so we don't break it again accidentally in the future?

zhanglx13

LGTM

lezcano · 2024-11-07T20:19:53Z

Added the lit test!


* [BACKEND][NVIDIA] Add Lowering for Shared-to-MMAv3-DotOp Copy (triton-lang#5009)

  Allows for upcasting in DotOp encoding in RF. This lowering path is not currently in use; pending triton-lang#5003

  (cherry picked from commit cfddb09)

* [AMD] Add initial support for scaled_dot(mxfp8, fp8) (triton-lang#4994)

  This commit adds initial support for scaled_dot with mxfp8 LHS and fp8 RHS. It supports both mfma32 and mfma16 intrinsic variants. Right now we are missing software emulation for the `Float8E4M3FN` type, so this is only enabled for `Float8E5M2`.

  (cherry picked from commit 3549db8)

* [Frontend][Backend] Implement support for scale_dot(-, bf16) (triton-lang#4996)

  In passing we also improve a few other things:
  - Now `scaled_dot` accepts both uint8/uint16 fp8/bf16 as inputs (before you had to cast it to uint8, which was weird when extending it to bf16).
  - Add `scaled_dot` to the docs and improve the docs overall (have not rendered them, might need a few further tweaks)

  (cherry picked from commit 23c9ec1)

* [BACKEND] Improve detection of register to register conversion (triton-lang#4991)

  Specifically, it fixes problems when `srcLayout` and `dstLayout` have a different number of registers but the same number of non-free registers. We solved the problem by padding free registers to either `srcLayout` or `dstLayout`, but this can be improved by fixing the `invertAndCompose` function.

  (cherry picked from commit 15c5e55)

* [BACKEND] Replace `isMmaToDotShortcut` with linear layout based logic (triton-lang#4951)

  This PR removes the legacy `isMmaToDotShortcut` and its associated shortcut conversion.

  (cherry picked from commit 1d5fdfe)

* [BACKEND]Fix DotOperand(Ampere) LinearLayoutConversion (triton-lang#5038)

  We also clean up `TritonGPU/IR/Dialect.cpp` a bit using some auxiliary functions to make the intentions a bit clearer. We add a few asserts in the `LinearLayoutConversion` to make sure it's clear why we do certain things here and there. We also kill `getCvtOrder`, as it was not used anywhere

  (cherry picked from commit 56584c4)

* [BACKEND] Fix uses of getOrder(DotOperand(Nvidia) and MMA(Nvidia)) (triton-lang#5055)

  We use `getOrder` very liberally throughout the codebase, when we really meant to use `getThreadOrder`. This is an issue when the input layout is a `DotOperand(mma(opIdx=1))`, where the thread order and the matrix order are opposite.

  Found this to be an issue when a PR changed the `getOrder` of `DotOperand(Hopper)` to an incorrect one and CI still passed! The issue here is that the LLVM lowering for wgmma and the LinearLayout do not use `getOrder`, but many other subsystems do, and many heuristics would be getting an incorrect order, and potentially be disabled. This is particularly problematic for `DotOperand(opIdx=1)` in nvidia hardware, as `getThreadOrder` and `getOrder` are different!

  While doing so we:
  - Audit most (all?) of the calls to `getOrder(dotOperand)`. It turns out that most of them really meant `getThreadOrder`
  - Fix the ordering methods of `SliceEncodingAttr` to be consistent
  - Move the implementation of `getWarpOrder` to the Attr classes, because of OOP

  The test strategy was to add `llvm::report_fatal_error("Testing");` within `getOrder(nvidiaMma)` and `getOrder(DotOperand(nvidiaMma))` and triage all errors that were raised in CI.

  (cherry picked from commit 38a11b8)

* [AMD] Reland instruction scheduling hint changes (triton-lang#4940)

  This commit relands triton-lang#4819 with the following fixes:
  * Changed to a better way to mark opIdx for loads
  * Replaced template-based `rewindUnaryOps` to use regular for-loops. The new way is more robust and can handle other unary ops automatically.
  * Replaced `instr.sched.barriers` using the ones from `rocdl` dialect from the MLIR upstream
  * Extended lit tests

  (cherry picked from commit ee5876c)

* [AMD] Enable scaled_dot(-, bf16) (triton-lang#5029)

  (cherry picked from commit f062540)

* [AMD] Add support for scaled_dot(mxfp4, -) (triton-lang#5034)

  This commit adds support for mxfp4 typed A tensor for scaled dot in the AMD backend. We moved the `convertMxfp4x2ToBf16x2` impl from NVIDIA side to a common path to reuse.

  (cherry picked from commit edc5c5c)

* [BACKEND] Minor Bugfixes for SharedToDotOperand MMAv3 (triton-lang#5030)

  Two bugfixes following triton-lang#5009.
  - When `BLOCK_M=64` and `num_warps > 4`, the order of warps for DotOpEncoded tensor should be M-major instead of N-major, since WGMMA expects the 4 warps in each warp group to be stacked along the M dimension.
  - Should use `mmaBitwidth` instead of `bitwidth` when calculating `numRep` in `SharedToDotOperandMMAv2OrV3`. This was missed in a bad rebase.

  @lezcano I encountered these bugs when attempting to locally test the [DotOp hoisting PR](triton-lang#5003) after rebasing (they normally would be caught by `test_core.py` but that path was not yet enabled in the last PR). With these fixes added, I was able to successfully validate against pytorch.

  (cherry picked from commit e82dfd9)
  (cherry picked from commit 5287a68)

* [BACKEND] Get rid of unpack/pack I32 (triton-lang#5044)

  - Removed functions related to unpacking and packing I32 values.
  - Updated utilities to handle conversion of mxfp4 values without packing/unpacking I32.
  - Move the register value ordering logic from the element-wise operation lowering to the dot operation lowering.
  - Use linear layout to handle conversions between almost all distributed layouts.
  - Clean up data loading and mma computation involving `repN`, `repK`, and `repM`.

  (cherry picked from commit 1cf7b1b)
  (cherry picked from commit 376fe7e)

* Consolidate `getOrder` as "element order" and implement `getRepOrder` for general and NVIDIA layouts (triton-lang#5089)

  This partially reverts commit 38a11b8.
  Supersedes triton-lang#5085

  It also documents that we are implicitly choosing a way to tile a full tensor depending on the layout. See triton-lang#5085 (comment)

  (cherry picked from commit 57643b3)
  (cherry picked from commit ffb2032)

---------

Co-authored-by: Gary Geng
Co-authored-by: Lei Zhang
Co-authored-by: Mario Lezcano Casado
Co-authored-by: Keren Zhou
Co-authored-by: ravil-mobile

lezcano requested review from Jokeren and ptillet as code owners November 6, 2024 23:31

lezcano mentioned this pull request Nov 6, 2024

Use getOrder instead of getThreadOrder in AxisInfo.cpp #5085

Closed


lezcano force-pushed the fix_order branch from 46a2c67 to 1f949b8 Compare November 7, 2024 08:40

Consolidate getOrder as "element order" and document implicit "tile order"

c726428

lezcano force-pushed the fix_order branch from 1f949b8 to c726428 Compare November 7, 2024 08:41

Implement getRepOrder

522bce0

lezcano changed the title ~~Consolidate getOrder as "element order" and document implicit "tile order"~~ Consolidate getOrder as "element order" and implement getRepOrder for general and NVIDIA layouts Nov 7, 2024

zhanglx13 approved these changes Nov 7, 2024


add lit test

2cfaae8

lezcano enabled auto-merge (squash) November 7, 2024 20:19

lezcano merged commit 57643b3 into main Nov 7, 2024
7 checks passed

lezcano deleted the fix_order branch November 7, 2024 20:36

jataylo mentioned this pull request Nov 8, 2024

[triton] Update pin for PyTorch 2.6/Triton 3.2 pytorch/pytorch#139206

Closed

jataylo mentioned this pull request Dec 13, 2024

[CP] Many layout cherry picks for gemm perf ROCm/triton#683

Merged
