
[BACKEND] Fix uses of getOrder(DotOperand(Nvidia) and MMA(Nvidia)) #5055

Merged: 7 commits merged into main from order-dot-hopper on Nov 5, 2024

Conversation

@lezcano (Contributor) commented Nov 4, 2024

We use `getOrder` very liberally throughout the codebase, when we really
meant to use `getThreadOrder`. This is an issue when the input layout is a
`DotOperand(mma(opIdx=1))`, where the thread order and the matrix order
are opposite.

Found this to be an issue when a PR changed the `getOrder` of `DotOperand(Hopper)` to an incorrect one and CI still passed! The issue here is that the LLVM lowering for wgmma and the LinearLayout do not use `getOrder`, but many other subsystems do, so many heuristics would have been getting an incorrect order and potentially been disabled.

This is particularly problematic for `DotOperand(opIdx=1)` on NVIDIA hardware, as `getThreadOrder` and `getOrder` are different!
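To make the disagreement concrete, here is a toy sketch; the two permutations below are illustrative assumptions (fastest-varying dimension first), not values queried from Triton:

```cpp
#include <array>
#include <cassert>

int main() {
  // Hypothetical rank-2 DotOperand(mma(opIdx=1)) layout: the element
  // (matrix) order and the thread order are opposite permutations.
  std::array<unsigned, 2> order = {0, 1};       // what getOrder would report
  std::array<unsigned, 2> threadOrder = {1, 0}; // what getThreadOrder would report
  // A heuristic that asks getOrder when it really needs the thread
  // traversal order sees the dimensions swapped and silently misbehaves.
  assert(order[0] != threadOrder[0]);
  return 0;
}
```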

While doing so we:

  • Audit most (all?) of the calls to `getOrder(dotOperand)`. It turns out that most of them really meant `getThreadOrder`
  • Fix the ordering methods of `SliceEncodingAttr` to be consistent
  • Move the implementation of `getWarpOrder` to the Attr classes, because of OOP

The test strategy was to add `llvm::report_fatal_error("Testing");` within `getOrder(nvidiaMma)` and `getOrder(DotOperand(nvidiaMma))` and triage all errors that were raised in CI.
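A minimal sketch of that instrumentation, assuming a stand-in accessor (the function name and signature here are hypothetical; the real accessors live in Triton's TritonGPU dialect sources):

```cpp
#include "llvm/ADT/SmallVector.h"
#include "llvm/Support/ErrorHandling.h"

// Hypothetical stand-in for the real order accessor: crash loudly so
// that any CI job still reaching this code path fails and can be triaged.
llvm::SmallVector<unsigned> getOrderForNvidiaMma() {
  // report_fatal_error is noreturn, so no return value is needed.
  llvm::report_fatal_error("Testing");
}
```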

@lezcano lezcano requested a review from ptillet as a code owner November 4, 2024 09:42
@lezcano lezcano changed the title from "Fix uses of getOrder" to "[BACKEND] Fix uses of getOrder" Nov 4, 2024
@lezcano lezcano marked this pull request as draft November 4, 2024 09:42
@lezcano lezcano changed the title from "[BACKEND] Fix uses of getOrder" to "[BACKEND] [TESTING] Fix uses of getOrder" Nov 4, 2024
@lezcano lezcano changed the title from "[BACKEND] [TESTING] Fix uses of getOrder" to "[BACKEND] [TESTING] Fix uses of getOrder(DotOperand(Hopper))" Nov 4, 2024
Commit message: "We use `getOrder` very liberally throughout the codebase when we really meant to use `getWarpOrder` or `getThreadOrder`. In this PR we aim to fix all these."
@lezcano lezcano closed this Nov 4, 2024
@lezcano lezcano reopened this Nov 4, 2024
@lezcano lezcano force-pushed the order-dot-hopper branch 3 times, most recently from bc55363 to f7a4ad7 Compare November 4, 2024 17:35
@lezcano lezcano changed the title from "[BACKEND] [TESTING] Fix uses of getOrder(DotOperand(Hopper))" to "[BACKEND] [TESTING] Fix uses of getOrder(DotOperand(Nvidia) and MMA(Nvidia))" Nov 4, 2024
@lezcano lezcano marked this pull request as ready for review November 4, 2024 19:48
@lezcano lezcano changed the title from "[BACKEND] [TESTING] Fix uses of getOrder(DotOperand(Nvidia) and MMA(Nvidia))" to "[BACKEND] Fix uses of getOrder(DotOperand(Nvidia) and MMA(Nvidia))" Nov 4, 2024
@lezcano (Contributor, Author) commented Nov 4, 2024

cc @antiagainst for visibility, as you may want to do something similar for AMD's backend.

```diff
@@ -40,15 +40,16 @@ Value redundantDataMask(Type valueTy, ConversionPatternRewriter &rewriter,
   auto sizePerThread = triton::gpu::getSizePerThread(layout);
   auto threadsPerWarp = triton::gpu::getThreadsPerWarp(layout);
   auto warpsPerCTA = triton::gpu::getWarpsPerCTA(layout);
-  auto order = triton::gpu::getOrder(layout);
+  auto threadOrder = triton::gpu::getThreadOrder(layout);
+  auto warpOrder = triton::gpu::getWarpOrder(layout);
```
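In this hunk the single `getOrder` query is replaced by explicit `getThreadOrder` and `getWarpOrder` queries, so the mask computation below uses the per-thread and per-warp permutations it actually needs.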
Contributor:
What if you call getWarpOrder(dot operand layout) here?

Contributor:
IMO it seems just fine to still have `getWarpOrder` defined for dot operand layouts.

But if it causes confusion, I will add a condition for dot operand layouts here.

Contributor Author:
In that case it will break hard. That is good because, incidentally, `delinearize` would not work as expected: before it would create incorrect results; now at least it breaks hard.

Note that there is no combination of `warpsPerCTA` and `warpOrder` that can represent the order of DotOperands, so I think it's better to break hard.
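For context, a sketch of the digit-by-digit decomposition that delinearization performs; the helper below is a hypothetical simplification, not the exact Triton signature:

```cpp
#include <cstdint>
#include <vector>

// Decompose a linear warp id into per-dimension coordinates. `order`
// lists dimensions fastest-varying first, so passing a wrong order
// yields wrong coordinates rather than any loud failure.
std::vector<int64_t> delinearize(int64_t linear,
                                 const std::vector<int64_t> &shape,
                                 const std::vector<unsigned> &order) {
  std::vector<int64_t> coords(shape.size());
  for (unsigned dim : order) {
    coords[dim] = linear % shape[dim];
    linear /= shape[dim];
  }
  return coords;
}
```

Since no `warpsPerCTA`/`warpOrder` pair can express the DotOperand tiling, this computation would silently produce wrong coordinates, which is why failing hard is preferable.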

@lezcano lezcano merged commit 38a11b8 into main Nov 5, 2024
7 checks passed
@lezcano lezcano deleted the order-dot-hopper branch November 5, 2024 08:29
@antiagainst (Collaborator) commented:
Still need more time to figure it out, but this seems to trigger a large regression for the AMD backend. Just an early heads-up.

antiagainst added a commit to antiagainst/triton that referenced this pull request Nov 5, 2024
@lezcano (Contributor, Author) commented Nov 5, 2024

It must be that either (or both):

  • some `getThreadOrder` is not correct for some layout
  • some `getOrder` must be changed into `getThreadOrder` in the duplicated AMD code

@lezcano (Contributor, Author) commented Nov 5, 2024

Also, as discussed in the OP, this is how I found all this: 2eb481d. On top of that, I had already made sure in a previous PR that `getThreadOrder` was correct for all NVIDIA layouts.

Luosuu pushed a commit to Luosuu/triton that referenced this pull request Nov 13, 2024
guacamoleo pushed a commit to guacamoleo/triton that referenced this pull request Nov 14, 2024
jataylo pushed a commit to jataylo/triton that referenced this pull request Nov 19, 2024
jataylo pushed a commit to jataylo/triton that referenced this pull request Nov 19, 2024
jataylo pushed a commit to jataylo/triton that referenced this pull request Dec 12, 2024
jataylo pushed a commit to jataylo/triton that referenced this pull request Dec 13, 2024
jataylo added a commit to ROCm/triton that referenced this pull request Dec 13, 2024
* [BACKEND][NVIDIA] Add Lowering for Shared-to-MMAv3-DotOp Copy (triton-lang#5009)

Allows for upcasting in DotOp encoding in RF.
This lowering path is not currently in use; pending
triton-lang#5003

(cherry picked from commit cfddb09)

* [AMD] Add initial support for scaled_dot(mxfp8, fp8) (triton-lang#4994)

This commit adds initial support for scaled_dot with
mxfp8 LHS and fp8 RHS. It supports both mfma32
and mfma16 intrinsic variants.

Right now we are missing software emulation for the
`Float8E4M3FN` type, so this is only enabled for
`Float8E5M2`.

(cherry picked from commit 3549db8)

* [Frontend][Backend] Implement support for scale_dot(-, bf16) (triton-lang#4996)

In passing we also improve a few other things:
- Now `scaled_dot` accepts both uint8/uint16 fp8/bf16 as inputs (before
you had to cast it to uint8, which was weird when extending it to bf16).
- Add `scaled_dot` to the docs and improve the docs overall (have not
rendered them; might need a few further tweaks)

(cherry picked from commit 23c9ec1)

* [BACKEND] Improve detection of register to register conversion (triton-lang#4991)

Specifically, it fixes problems when `srcLayout` and `dstLayout` have
different numbers of registers but the same number of non-free registers.
We solved the problem by padding free registers to either `srcLayout` or
`dstLayout`, but this can be improved by fixing the `invertAndCompose`
function.

(cherry picked from commit 15c5e55)

* [BACKEND] Replace `isMmaToDotShortcut` with linear layout based logic (triton-lang#4951)

This PR removes the legacy `isMmaToDotShortcut` and its associated shortcut conversion.

(cherry picked from commit 1d5fdfe)

* [BACKEND] Fix DotOperand(Ampere) LinearLayoutConversion (triton-lang#5038)

We also clean up `TritonGPU/IR/Dialect.cpp` a bit using some auxiliary
functions to make the intentions clearer.

We add a few asserts in the `LinearLayoutConversion` to make sure it's
clear why we do certain things here and there.

We also kill `getCvtOrder`, as it was not used anywhere

(cherry picked from commit 56584c4)

* [BACKEND] Fix uses of getOrder(DotOperand(Nvidia) and MMA(Nvidia)) (triton-lang#5055)

(cherry picked from commit 38a11b8)

* [AMD] Reland instruction scheduling hint changes (triton-lang#4940)

This commit relands triton-lang#4819
with the following fixes:

* Changed to a better way to mark opIdx for loads
* Replaced template-based `rewindUnaryOps` with regular
  for-loops. The new way is more robust and can handle other
  unary ops automatically.
* Replaced `instr.sched.barriers` with the ones from the
  `rocdl` dialect in upstream MLIR
* Extended lit tests

(cherry picked from commit ee5876c)

* [AMD] Enable scaled_dot(-, bf16) (triton-lang#5029)

(cherry picked from commit f062540)

* [AMD] Add support for scaled_dot(mxfp4, -) (triton-lang#5034)

This commit adds support for an mxfp4-typed A tensor
for scaled dot in the AMD backend.

We moved the `convertMxfp4x2ToBf16x2` impl
from the NVIDIA side to a common path for reuse.

(cherry picked from commit edc5c5c)

* [BACKEND] Minor Bugfixes for SharedToDotOperand MMAv3 (triton-lang#5030)

Two bugfixes following triton-lang#5009.

- When `BLOCK_M=64` and `num_warps > 4`, the order of warps for a
DotOpEncoded tensor should be M-major instead of N-major, since WGMMA
expects the 4 warps in each warp group to be stacked along the M
dimension.
- Should use `mmaBitwidth` instead of `bitwidth` when calculating
`numRep` in `SharedToDotOperandMMAv2OrV3`. This was missed in a bad
rebase.

@lezcano I encountered these bugs when attempting to locally test the
[DotOp hoisting PR](triton-lang#5003)
after rebasing (they normally would be caught by `test_core.py` but that
path was not yet enabled in the last PR). With these fixes added, I was
able to successfully validate against pytorch.

(cherry picked from commit e82dfd9)
(cherry picked from commit 5287a68)

* [BACKEND] Get rid of unpack/pack I32 (triton-lang#5044)

- Removed functions related to unpacking and packing I32 values.
- Updated utilities to handle conversion of mxfp4 values without
packing/unpacking I32.
- Move the register value ordering logic from the element-wise operation
lowering to the dot operation lowering.
- Use linear layout to handle conversions between almost all distributed
layouts.
- Clean up data loading and mma computation involving `repN`, `repK`,
and `repM`.

(cherry picked from commit 1cf7b1b)
(cherry picked from commit 376fe7e)

* Consolidate `getOrder` as "element order" and implement `getRepOrder` for general and NVIDIA layouts (triton-lang#5089)

This partially reverts commit 38a11b8.
Supersedes triton-lang#5085

It also documents that we are implicitly choosing a way to tile a
full tensor depending on the layout. See
triton-lang#5085 (comment)

(cherry picked from commit 57643b3)
(cherry picked from commit ffb2032)

---------

Co-authored-by: Gary Geng <[email protected]>
Co-authored-by: Lei Zhang <[email protected]>
Co-authored-by: Mario Lezcano Casado <[email protected]>
Co-authored-by: Keren Zhou <[email protected]>
Co-authored-by: ravil-mobile <[email protected]>