[CP] Many layout cherry picks for gemm perf #683

jataylo · 2024-12-13T16:57:10Z

Picks:

[AMD] Reland instruction scheduling hint changes triton-lang/triton#4940
[BACKEND] Replace isMmaToDotShortcut with linear layout based logic triton-lang/triton#4951
[BACKEND] Improve detection of register to register conversion triton-lang/triton#4991
[AMD] Add initial support for scaled_dot(mxfp8, fp8) triton-lang/triton#4994
[Frontend][Backend] Implement support for scale_dot(-, bf16) triton-lang/triton#4996
[AMD] Enable scaled_dot(-, bf16) triton-lang/triton#5029
[BACKEND] Minor Bugfixes for SharedToDotOperand MMAv3 triton-lang/triton#5030
[AMD] Add support for scaled_dot(mxfp4, -) triton-lang/triton#5034
[BACKEND] Fix DotOperand(Ampere) LinearLayoutConversion triton-lang/triton#5038
[BACKEND] Get rid of unpack/pack I32 triton-lang/triton#5044
[BACKEND] Fix uses of getOrder(DotOperand(Nvidia) and MMA(Nvidia)) triton-lang/triton#5055
Consolidate getOrder as "element order" and implement getRepOrder for general and NVIDIA layouts triton-lang/triton#5089

…-lang#5009) Allows for upcasting in DotOp encoding in RF. This lowering path is not currently in use; pending triton-lang#5003 (cherry picked from commit cfddb09)

This commit adds initial support for scaled_dot with an mxfp8 LHS and an fp8 RHS. It supports both the mfma32 and mfma16 intrinsic variants. Software emulation for the `Float8E4M3FN` type is still missing, so this is enabled only for `Float8E5M2`. (cherry picked from commit 3549db8)

…lang#4996) In passing, we also improve a few other things:

- `scaled_dot` now accepts both uint8/uint16 fp8/bf16 as inputs (before, you had to cast the input to uint8, which was weird when extending it to bf16).
- Add `scaled_dot` to the docs and improve the docs overall (they have not been rendered yet, so they might need a few further tweaks).

(cherry picked from commit 23c9ec1)

…n-lang#4991) Specifically, it fixes problems when `srcLayout` and `dstLayout` have different numbers of registers but the same number of non-free registers. We solved the problem by padding free registers to either `srcLayout` or `dstLayout`, but this can be improved by fixing the `invertAndCompose` function. (cherry picked from commit 15c5e55)

…triton-lang#4951) This PR removes the legacy `isMmaToDotShortcut` and its associated shortcut conversion. (cherry picked from commit 1d5fdfe)

) We also clean up `TritonGPU/IR/Dialect.cpp` a bit, using some auxiliary functions to make the intent clearer. We add a few asserts in `LinearLayoutConversion` to make it clear why we do certain things here and there. We also kill `getCvtOrder`, as it was not used anywhere. (cherry picked from commit 56584c4)

…riton-lang#5055) We use `getOrder` very liberally throughout the codebase, when we really meant to use `getThreadOrder`. This is an issue when the input layout is a `DotOperand(mma(opIdx=1))`, where the thread order and the matrix order are opposite.

We found this to be an issue when a PR changed the `getOrder` of `DotOperand(Hopper)` to an incorrect one and CI still passed! The LLVM lowering for wgmma and the LinearLayout do not use `getOrder`, but many other subsystems do, so many heuristics would be getting an incorrect order and could potentially be disabled. This is particularly problematic for `DotOperand(opIdx=1)` on NVIDIA hardware, as `getThreadOrder` and `getOrder` are different!

While doing so we:

- Audit most (all?) of the calls to `getOrder(dotOperand)`. It turns out that most of them really meant `getThreadOrder`.
- Fix the ordering methods of `SliceEncodingAttr` to be consistent.
- Move the implementation of `getWarpOrder` to the Attr classes, because of OOP.

The test strategy was to add `llvm::report_fatal_error("Testing");` within `getOrder(nvidiaMma)` and `getOrder(DotOperand(nvidiaMma))` and to triage all errors that were raised in CI. (cherry picked from commit 38a11b8)

This commit relands triton-lang#4819 with the following fixes:

* Changed to a better way to mark opIdx for loads.
* Replaced the template-based `rewindUnaryOps` with regular for-loops. The new way is more robust and can handle other unary ops automatically.
* Replaced `instr.sched.barriers` with the ones from the `rocdl` dialect in upstream MLIR.
* Extended lit tests.

(cherry picked from commit ee5876c)

(cherry picked from commit f062540)

This commit adds support for an mxfp4-typed A tensor for scaled dot in the AMD backend. We moved the `convertMxfp4x2ToBf16x2` implementation from the NVIDIA side to a common path for reuse. (cherry picked from commit edc5c5c)
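For readers unfamiliar with the format, the following is a minimal numeric sketch of what decoding packed mxfp4 values involves. It is not the backend's `convertMxfp4x2ToBf16x2` lowering. The function names, the low-nibble-first packing order, and the use of Python floats in place of bf16 are all assumptions made for illustration, and the reserved NaN scale encoding is ignored.

```python
# Illustrative sketch only; not the Triton lowering.
# Assumptions: two E2M1 values per byte, low nibble first; one shared
# E8M0 scale per block; the reserved NaN scale (0xFF) is not handled.

# Magnitudes representable by the 3 non-sign bits of an E2M1 value.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1(nibble: int) -> float:
    """Decode one 4-bit E2M1 value (bit 3 is the sign)."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0x7]


def decode_mxfp4_block(packed: bytes, scale_e8m0: int) -> list:
    """Decode packed mxfp4 bytes that share one E8M0 scale."""
    scale = 2.0 ** (scale_e8m0 - 127)  # E8M0 is a biased power of two
    out = []
    for byte in packed:
        out.append(decode_e2m1(byte & 0xF) * scale)  # low nibble
        out.append(decode_e2m1(byte >> 4) * scale)   # high nibble
    return out
```

Each decoded value is a one-mantissa-bit number times a power of two, so it fits in bf16 without rounding as long as the scaled exponent stays within bf16's range.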

@lezcano

Two bugfixes following triton-lang#5009.

- When `BLOCK_M=64` and `num_warps > 4`, the order of warps for DotOpEncoded tensor should be M-major instead of N-major, since WGMMA expects the 4 warps in each warp group to be stacked along the M dimension.
- Should use `mmaBitwidth` instead of `bitwidth` when calculating `numRep` in `SharedToDotOperandMMAv2OrV3`. This was missed in a bad rebase.

@lezcano I encountered these bugs when attempting to locally test the [DotOp hoisting PR](triton-lang#5003) after rebasing (they normally would be caught by `test_core.py` but that path was not yet enabled in the last PR). With these fixes added, I was able to successfully validate against pytorch. (cherry picked from commit e82dfd9) (cherry picked from commit 5287a68)
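The difference between the two warp orders can be sketched with plain index arithmetic. This is an illustration, not code from the PR: `warp_coords` is a hypothetical helper, the `(4, 2)` warp grid is an assumed shape for an 8-warp kernel, and "M-major" is read here as "M is the fastest-varying dimension across consecutive warp ids".

```python
# Illustrative sketch only. `warp_coords` and the (4, 2) warp grid are
# hypothetical; dimension 0 is M and dimension 1 is N.

def warp_coords(warp_id, warps_per_cta, order):
    """Map a linear warp id to (m, n) coordinates in the warp grid.

    `order` lists dimensions from fastest-varying to slowest-varying.
    """
    coords = [0] * len(warps_per_cta)
    for dim in order:
        coords[dim] = warp_id % warps_per_cta[dim]
        warp_id //= warps_per_cta[dim]
    return tuple(coords)


warps_per_cta = (4, 2)  # assumed: 4 warps along M, 2 along N

# M varies fastest: warps 0..3 (one warp group) are stacked along M.
m_major = [warp_coords(w, warps_per_cta, order=(0, 1)) for w in range(8)]

# N varies fastest: the same warp group is spread over both dimensions.
n_major = [warp_coords(w, warps_per_cta, order=(1, 0)) for w in range(8)]
```

With the M-fastest order, the first four warps share a single N coordinate and cover four consecutive M positions, which matches the stacking the description says WGMMA expects.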

- Removed functions related to unpacking and packing I32 values.
- Updated utilities to handle conversion of mxfp4 values without packing/unpacking I32.
- Move the register value ordering logic from the element-wise operation lowering to the dot operation lowering.
- Use linear layout to handle conversions between almost all distributed layouts.
- Clean up data loading and mma computation involving `repN`, `repK`, and `repM`.

(cherry picked from commit 1cf7b1b) (cherry picked from commit 376fe7e)
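As background for the linear-layout item, a linear layout can be thought of as a map from the bits of a hardware index (register, lane, warp) to tensor coordinates, where each input bit contributes a basis vector and contributions combine with XOR. The sketch below illustrates that idea only; `apply_linear_layout` and the example bases are hypothetical and do not mirror Triton's `LinearLayout` class.

```python
# Illustrative sketch only; not Triton's LinearLayout API.

def apply_linear_layout(bases, index):
    """Map `index` to coordinates by XOR-ing the basis of each set bit.

    bases[i] is the coordinate tuple contributed by bit i of `index`.
    """
    if index >> len(bases):
        raise ValueError("index has more bits than the layout has bases")
    out = [0] * len(bases[0])
    for bit, basis in enumerate(bases):
        if (index >> bit) & 1:
            out = [o ^ b for o, b in zip(out, basis)]
    return tuple(out)


# Hypothetical register layout: 4 registers covering a 2x2 tile,
# with bit 0 stepping along columns and bit 1 stepping along rows.
row_major_regs = [(0, 1), (1, 0)]

# Hypothetical swizzled variant: bit 1 moves along both dimensions.
swizzled_regs = [(0, 1), (1, 1)]
```

Because both layouts are linear over XOR, converting between them reduces to composing one map with the inverse of the other, which is the kind of reasoning the item above relies on.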

… for general and NVIDIA layouts (triton-lang#5089) This partially reverts commit 38a11b8. Supersedes triton-lang#5085. It also documents that we are implicitly choosing a way to tile a full tensor depending on the layout. See triton-lang#5085 (comment) (cherry picked from commit 57643b3) (cherry picked from commit ffb2032)

ggengnv and others added 13 commits December 13, 2024 15:16

[BACKEND][NVIDIA] Add Lowering for Shared-to-MMAv3-DotOp Copy (triton…

f8c2c30

[BACKEND] Replace isMmaToDotShortcut with linear layout based logic (…

fc6d96b

[AMD] Enable scaled_dot(-, bf16) (triton-lang#5029)

ca75b5f

[AMD] Add support for scaled_dot(mxfp4, -) (triton-lang#5034)

ac9f0d0

jataylo marked this pull request as ready for review December 13, 2024 16:57

jataylo requested review from antiagainst and zhanglx13 as code owners December 13, 2024 16:57

jataylo merged commit 0b9b798 into ROCm:release/internal/3.2.x Dec 13, 2024
