[BACKEND][NVIDIA] Add kWidth support in MMAv3 #5009

ggengnv · 2024-10-29T20:54:49Z

Allows for upcasting in DotOp encoding in RF.
This lowering path is not currently in use; pending #5003

ggengnv · 2024-10-29T20:55:34Z

@lezcano I've split the PR per your request. Please review, ty

ThomasRaoux · 2024-10-29T21:07:38Z

can you run pre-commit hook

ggengnv · 2024-10-29T21:58:28Z

can you run pre-commit hook

Just ran precommit hook locally and pushed.

ThomasRaoux

LGTM

lezcano

Great, thank you!

lezcano · 2024-10-30T08:11:16Z

...party/nvidia/lib/TritonNVIDIAGPUToLLVM/ConvertLayoutOpToLLVM/SharedToDotOperandMMAv2OrV3.cpp

+std::vector<Value> unpackInt(const std::vector<Value> &inValues, Type elTy,
+                             ConversionPatternRewriter &rewriter, Location loc,
+                             const LLVMTypeConverter *typeConverter) {


Note to self: This one can be removed, as the same function is in Utility.cpp in this folder.
I'll clean it up in a later PR, no action needed here.

ThomasRaoux · 2024-11-01T05:07:01Z

While discussing with @Jokeren I realized kwidth for mmav3 dot_operand doesn't make a lot of sense, this was introduced for mmav2 to be able to efficiently load and upscale inputs by re-ordering elements along k dimension.
This is not possible for mmav3 since one of the element has to come from shared memory, therefore re-ordering is not an option.
@ggengnv what was the original motivation for adding kwidth to mmav3 dot_operand? It feels like we cannot handle upcasting in mmav3 the way we do on mmav2 so I'm wondering what is your plan here.

lezcano · 2024-11-01T09:23:12Z

You are right in that for fp8 * fp16 it's true that you can just load everything as kWidth=2 and simply upcast.

Now, for fp4 * fp16, you load fp4s as uint8 using kWidth=1, and then, when you unpack them, they become kWidth=2.

From reading @ggengnv's PR, I am tempted to guess that their main motivation was to reuse the legacy code SharedToDot, which requires a kWidth argument, but I agree with @ThomasRaoux and @Jokeren that we don't need this path once LLs are full supported.

ggengnv · 2024-11-01T16:17:46Z

@ThomasRaoux My understanding is that before, kWidth for Hopper could always be deduced as kWidth = 32 / elemBitWidth. We can no longer do that now because we're doing upcasting in DotOp layout in registers, and the bitwidth at loading may be different from the bitwidth at MMA.

So for Ampere fp8 * fp16, A and B will both have kWidth = 4, and we need to specify kWidth for the sake of B, which usually has kWidth = 32 / 16 = 2.
For Hopper fp8 * fp16, A and B will both have kWidth = 2, and we need to specify kWidth for the sake of A, which usually has kWidth = 32 / 8 = 4. This kWidth value is set in OptimizeDotOperands.

As @lezcano said, with kWidth correctly specified, I'm able to reuse the logic in SharedToDotOperandMMAv2, which depends on kWidth for address calculations. In other words I reused the existing MMAv2 mechanism to distinguish a "normal" DotOp layout (4xfp8) from an "upcast' DotOp layout (2xfp8).

Let me know if my understanding above is correct.

@lezcano

Two bugfixes following #5009. - When `BLOCK_M=64` and `num_warps > 4`, the order of warps for DotOpEncoded tensor should be M-major instead of N-major, since WGMMA expects the 4 warps in each warp group to be stacked along the M dimension. - Should use `mmaBitwidth` instead of `bitwidth` when calculating `numRep` in `SharedToDotOperandMMAv2OrV3`. This was missed in a bad rebase. @lezcano I encountered these bugs when attempting to locally test the [DotOp hoisting PR](#5003) after rebasing (they normally would be caught by `test_core.py` but that path was not yet enabled in the last PR). With these fixes added, I was able to successfully validate against pytorch.

…-lang#5009) Allows for upcasting in DotOp encoding in RF. This lowering path is not currently in use; pending triton-lang#5003

@lezcano

Two bugfixes following triton-lang#5009. - When `BLOCK_M=64` and `num_warps > 4`, the order of warps for DotOpEncoded tensor should be M-major instead of N-major, since WGMMA expects the 4 warps in each warp group to be stacked along the M dimension. - Should use `mmaBitwidth` instead of `bitwidth` when calculating `numRep` in `SharedToDotOperandMMAv2OrV3`. This was missed in a bad rebase. @lezcano I encountered these bugs when attempting to locally test the [DotOp hoisting PR](triton-lang#5003) after rebasing (they normally would be caught by `test_core.py` but that path was not yet enabled in the last PR). With these fixes added, I was able to successfully validate against pytorch.

…-lang#5009) Allows for upcasting in DotOp encoding in RF. This lowering path is not currently in use; pending triton-lang#5003

@lezcano

Two bugfixes following triton-lang#5009. - When `BLOCK_M=64` and `num_warps > 4`, the order of warps for DotOpEncoded tensor should be M-major instead of N-major, since WGMMA expects the 4 warps in each warp group to be stacked along the M dimension. - Should use `mmaBitwidth` instead of `bitwidth` when calculating `numRep` in `SharedToDotOperandMMAv2OrV3`. This was missed in a bad rebase. @lezcano I encountered these bugs when attempting to locally test the [DotOp hoisting PR](triton-lang#5003) after rebasing (they normally would be caught by `test_core.py` but that path was not yet enabled in the last PR). With these fixes added, I was able to successfully validate against pytorch.

…-lang#5009) Allows for upcasting in DotOp encoding in RF. This lowering path is not currently in use; pending triton-lang#5003 (cherry picked from commit cfddb09)

@lezcano

Two bugfixes following triton-lang#5009. - When `BLOCK_M=64` and `num_warps > 4`, the order of warps for DotOpEncoded tensor should be M-major instead of N-major, since WGMMA expects the 4 warps in each warp group to be stacked along the M dimension. - Should use `mmaBitwidth` instead of `bitwidth` when calculating `numRep` in `SharedToDotOperandMMAv2OrV3`. This was missed in a bad rebase. @lezcano I encountered these bugs when attempting to locally test the [DotOp hoisting PR](triton-lang#5003) after rebasing (they normally would be caught by `test_core.py` but that path was not yet enabled in the last PR). With these fixes added, I was able to successfully validate against pytorch. (cherry picked from commit e82dfd9)

…-lang#5009) Allows for upcasting in DotOp encoding in RF. This lowering path is not currently in use; pending triton-lang#5003 (cherry picked from commit cfddb09)

@lezcano

Two bugfixes following triton-lang#5009. - When `BLOCK_M=64` and `num_warps > 4`, the order of warps for DotOpEncoded tensor should be M-major instead of N-major, since WGMMA expects the 4 warps in each warp group to be stacked along the M dimension. - Should use `mmaBitwidth` instead of `bitwidth` when calculating `numRep` in `SharedToDotOperandMMAv2OrV3`. This was missed in a bad rebase. @lezcano I encountered these bugs when attempting to locally test the [DotOp hoisting PR](triton-lang#5003) after rebasing (they normally would be caught by `test_core.py` but that path was not yet enabled in the last PR). With these fixes added, I was able to successfully validate against pytorch. (cherry picked from commit e82dfd9)

@lezcano

Two bugfixes following triton-lang#5009. - When `BLOCK_M=64` and `num_warps > 4`, the order of warps for DotOpEncoded tensor should be M-major instead of N-major, since WGMMA expects the 4 warps in each warp group to be stacked along the M dimension. - Should use `mmaBitwidth` instead of `bitwidth` when calculating `numRep` in `SharedToDotOperandMMAv2OrV3`. This was missed in a bad rebase. @lezcano I encountered these bugs when attempting to locally test the [DotOp hoisting PR](triton-lang#5003) after rebasing (they normally would be caught by `test_core.py` but that path was not yet enabled in the last PR). With these fixes added, I was able to successfully validate against pytorch. (cherry picked from commit e82dfd9) (cherry picked from commit 5287a68)

@lezcano

* [BACKEND][NVIDIA] Add Lowering for Shared-to-MMAv3-DotOp Copy (triton-lang#5009) Allows for upcasting in DotOp encoding in RF. This lowering path is not currently in use; pending triton-lang#5003 (cherry picked from commit cfddb09) * [AMD] Add initial support for scaled_dot(mxfp8, fp8) (triton-lang#4994) This commit adds initial support for scaled_dot with mxfp8 LHS and fp8 RHS. It supports both mfma32 and mfma16 intrinsic variants. Right now we are missing software emulation for `Float8E4M3FN` type, so this only enables for `Float8E5M2`. (cherry picked from commit 3549db8) * [Frontend][Backend] Implement support for scale_dot(-, bf16) (triton-lang#4996) In the passing we also improve a few other things: - Now `scaled_dot` accepts both uint8/uint16 fp8/bf16 as inputs (before you had to cast it to uint8, which was weird when extending it to bf16). - Add `scaled_dot` to the docs and improve the docs overall (have not render them, might need a few further tweaks) (cherry picked from commit 23c9ec1) * [BACKEND] Improve detection of register to register conversion (triton-lang#4991) Specifically, it fixes problems when `srcLayout` and `dstLayout` have different number of registers but the same number of not free registers. We solved the problem by padding free registers to either `srcLayout` or `dstLayout`, but this can be improved by fixing the `invertAndCompose` function. (cherry picked from commit 15c5e55) * [BACKEND] Replace `isMmaToDotShortcut` with linear layout based logic (triton-lang#4951) This PR removes the legacy `isMmaToDotShortcut` and its associated shortcut conversion. (cherry picked from commit 1d5fdfe) * [BACKEND]Fix DotOperand(Ampere) LinearLayoutConversion (triton-lang#5038) We also clean a bit `TritonGPU/IR/Dialect.cpp` using some auxiliary functions to make the intentions a bit clearer. We add a few asserts in the `LinearLayoutConversion` to make sure it's clear why we do certain things here and there. We also kill `getCvtOrder`, as it was not used anywhere (cherry picked from commit 56584c4) * [BACKEND] Fix uses of getOrder(DotOperand(Nvidia) and MMA(Nvidia)) (triton-lang#5055) We use `getOrder` very liberally throughout the codebase, when we really meant to use `getThreadOrder`. This is an issue with the input layout is an `DotOperand(mma(opIdx=1))`, where the thread order and the matrix order are opposite. Found this to be an issue when a PR changed the `getOrder` of `DotOperand(Hopper)` to an incorrect one and CI still passed! The issue here is that the LLVM lowering for wgmma and the LinearLayout does not use `getOrder`, but there are many other subsystems do, and many heuristics would be getting an incorrect order, and potentially be disabled. This is particularly problematic for `DotOperand(opIdx=1)` in nvidia hardware, as `getThreadOrder` and `getOrder` are different! While doing so we: - Audit most (all?) the calls to `getOrder(dotOperand)`. It turns out that most of them really meant `getThreadOrder` - Fix the ordering methods of `SliceEncodingAttr` to be consistent - Move the implementation of `getWarpOrder` to the Attr classes, because of OOP The test strategy was to add `llvm::report_fatal_error("Testing");` within `getOrder(nvidiaMma)` and `getOrder(DotOperand(nvidiaMma))` and triaging all errors that were raised in CI. (cherry picked from commit 38a11b8) * [AMD] Reland instruction scheduling hint changes (triton-lang#4940) This commit relands triton-lang#4819 with the following fixes: * Changed to a better way to mark opIdx for loads * Replaced temlate-based `rewindUnaryOps` to use regular for-loops. The new way is more robust and can handle other unary ops automatically. * Replaced `instr.sched.barriers` using the ones from `rocdl` dialect from the MLIR upstream * Extended lit tests (cherry picked from commit ee5876c) * [AMD] Enable scaled_dot(-, bf16) (triton-lang#5029) (cherry picked from commit f062540) * [AMD] Add support for scaled_dot(mxfp4, -) (triton-lang#5034) This commit adds support for mxfp4 typed A tensor for sacled dot in the AMD backend. We moved the `convertMxfp4x2ToBf16x2` impl from NVIDIA side to a common path to reuse. (cherry picked from commit edc5c5c) * [BACKEND] Minor Bugfixes for SharedToDotOperand MMAv3 (triton-lang#5030) Two bugfixes following triton-lang#5009. - When `BLOCK_M=64` and `num_warps > 4`, the order of warps for DotOpEncoded tensor should be M-major instead of N-major, since WGMMA expects the 4 warps in each warp group to be stacked along the M dimension. - Should use `mmaBitwidth` instead of `bitwidth` when calculating `numRep` in `SharedToDotOperandMMAv2OrV3`. This was missed in a bad rebase. @lezcano I encountered these bugs when attempting to locally test the [DotOp hoisting PR](triton-lang#5003) after rebasing (they normally would be caught by `test_core.py` but that path was not yet enabled in the last PR). With these fixes added, I was able to successfully validate against pytorch. (cherry picked from commit e82dfd9) (cherry picked from commit 5287a68) * [BACKEND] Get rid of unpack/pack I32 (triton-lang#5044) - Removed functions related to unpacking and packing I32 values. - Updated utilities to handle conversion of mxfp4 values without packing/unpacking I32. - Move the register value ordering logic from the element-wise operation lowering to the dot operation lowering. - Use linear layout to handle conversions between almost all distributed layouts. - Clean up data loading and mma computation involving `repN`, `repK`, and `repM`. (cherry picked from commit 1cf7b1b) (cherry picked from commit 376fe7e) * Consolidate `getOrder` as "element order" and implement `getRepOrder` for general and NVIDIA layouts (triton-lang#5089) This partially reverts commit 38a11b8. Supersedes triton-lang#5085 It also documents that we are implicitly choosing a way to tile a full tensor depending on the layout. See triton-lang#5085 (comment) (cherry picked from commit 57643b3) (cherry picked from commit ffb2032) --------- Co-authored-by: Gary Geng <[email protected]> Co-authored-by: Lei Zhang <[email protected]> Co-authored-by: Mario Lezcano Casado <[email protected]> Co-authored-by: Keren Zhou <[email protected]> Co-authored-by: ravil-mobile <[email protected]>

ggengnv requested a review from ptillet as a code owner October 29, 2024 20:54

ggengnv mentioned this pull request Oct 29, 2024

[BACKEND][NVIDIA] Add DotOp Hoisting Pass for WGMMA and Add Lowering for SMEM-to-MMAv3 DotOp Copy #5003

Merged

ggengnv added 4 commits October 29, 2024 14:59

Add lowering logic for Shared-to-MMAv3-DotOp copy

0c4807b

Add lowering test for kWidth != 32 / elemBits

3262ce8

Move WGMMA documentation

a459de9

Run pre-commit hook

9f4883e

ggengnv force-pushed the hopper-kwidth branch from 9265991 to 9f4883e Compare October 29, 2024 21:59

ThomasRaoux approved these changes Oct 29, 2024

View reviewed changes

lezcano approved these changes Oct 30, 2024

View reviewed changes

lezcano merged commit cfddb09 into triton-lang:main Oct 30, 2024
7 checks passed

lezcano changed the title ~~[BACKEND][NVIDIA] Add Lowering for Shared-to-MMAv3-DotOp Copy~~ [BACKEND][NVIDIA] Add kWidth support in MMAv3 Oct 30, 2024

ggengnv mentioned this pull request Nov 1, 2024

[BACKEND] Minor Bugfixes for SharedToDotOperand MMAv3 #5030

Merged

gflegar mentioned this pull request Nov 8, 2024

warp_group_dot lowering crashes for specific instruction shape #5102

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BACKEND][NVIDIA] Add kWidth support in MMAv3 #5009

[BACKEND][NVIDIA] Add kWidth support in MMAv3 #5009

ggengnv commented Oct 29, 2024

ggengnv commented Oct 29, 2024

ThomasRaoux commented Oct 29, 2024

ggengnv commented Oct 29, 2024

ThomasRaoux left a comment

lezcano left a comment

lezcano Oct 30, 2024

ThomasRaoux commented Nov 1, 2024

lezcano commented Nov 1, 2024

ggengnv commented Nov 1, 2024

[BACKEND][NVIDIA] Add kWidth support in MMAv3 #5009

[BACKEND][NVIDIA] Add kWidth support in MMAv3 #5009

Conversation

ggengnv commented Oct 29, 2024

ggengnv commented Oct 29, 2024

ThomasRaoux commented Oct 29, 2024

ggengnv commented Oct 29, 2024

ThomasRaoux left a comment

Choose a reason for hiding this comment

lezcano left a comment

Choose a reason for hiding this comment

lezcano Oct 30, 2024

Choose a reason for hiding this comment

ThomasRaoux commented Nov 1, 2024

lezcano commented Nov 1, 2024

ggengnv commented Nov 1, 2024