-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BACKEND][NVIDIA] Add DotOp Hoisting Pass for WGMMA and Add Lowering for SMEM-to-MMAv3 DotOp Copy #5003
Conversation
@lezcano I've transferred the previous PR here per your request. Also @Jokeren I've addressed some of your comments in the latest commit. The remaining comments I've responded to, asking for clarification or explaining why I think changes won't be needed. We can discuss further, and I'll add changes to this PR from now on. One thing that needs to be resolved before merge is that this PR is only the first part of my changes. The second PR in the XLA fork includes optimizations that may be necessary to see perf gains. In addition, even with the second PR's changes, we don't see perf gain for all shapes. It looks like that for some smaller shapes we currently see perf loss, and so further optimizations (or heuristics to enable/disable hoisting) may be necessary. I'm not sure what the merge strategy is for these kinds of larger changes, so please advise on this, thanks :) |
f47f5d6
to
a08b09b
Compare
Perf loss is concerning to me |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work.
One thing I wonder is how the wgmma pipelining will work in this case.
PTX spec says:
Accessing the accumulator register or the input register containing the fragments of matrix A of a wgmma.mma_async instruction without first performing a wgmma.wait_group instruction that waits on a wgmma-group including that wgmma.mma_async instruction is undefined behavior.
Therefore when we do wgmma pipelining with operand coming from register we are going to break this rule as we would keep re-using the same register for A operand.
I believe ptxas will see that and fallback but that is likely to cause significant performance problems.
Is this something we need to handle? Do you know how libraries handle it?
|
||
// Analog of canHoistDotOpEncV2, but for MMAv3 (WGMMA where operand A | ||
// is in registers). | ||
bool canHoistDotOpEncV3(Operation* op) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this any different than canHoistDotOpEncV2
? I would expect it to be the same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They're almost the same; the only difference is that MMAv3 hoisting doesn't support downcasting yet, because the lowering of shared-to-MMAv3-dotop-copy logic that I added doesn't yet support it.
For v3, I also added stricter checks like
// Must have exactly one result and at least one operand
if (op->getNumOperands() == 0 || op->getNumResults() != 1)
return false;
// Operands and results must be of RankedTensorType and Blocked or DotOp
if (!(all_of(op->getOperandTypes(), isBlockedOrDotOpRankedTensor) &&
all_of(op->getResultTypes(), isBlockedOrDotOpRankedTensor)))
return false;
but left the v2 one intact in case something breaks.
// First pass: clone ops; the result values are cloned as well, but the operands still | ||
// refer to the original result values | ||
for (Operation *op : slice) { | ||
auto newOp = rewriter.clone(*op); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should the clone op be inserted right before the old op?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you mean that I should use setInsertionPoint
before the clone
s?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah? To avoid pulling all the operations down to the dot
} | ||
|
||
// Step 5b: Change the result to have DotOp rather than Blocked encoding | ||
auto resTy = dyn_cast<RankedTensorType>(op->getResult(0).getType()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: use cast
when you know that the cast should succeeded
// In case LHS is in registers, don't pipeline for now TODO(ggengnv) is this necessary? | ||
auto op = *alloc->getUsers().begin(); | ||
if (auto localLoad = dyn_cast<ttg::LocalLoadOp>(op)) { | ||
auto resTy = cast<RankedTensorType>(localLoad->getResultTypes()[0]); | ||
if (!resTy || isa<ttg::DotOperandEncodingAttr>(resTy.getEncoding())) | ||
return false; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this sounds like it will be inefficient? Why do we need that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh this isn't necessary and I forgot to delete this TODO - I originally had pipelining logic enabled in this PR but there were concerns that this PR was getting too large, so I separated it into another PR.
If you think it's more natural, I can add back the pipelining logic into this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I just saw your other comment. I can keep the changes separate for now and combine them before merging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding splitting the PR, I think the PR could be split between the changes to kWidth
, which are simple and benevolent, and the hoisting + pipelining logic, which seem a bit trickier. @ggengnv how does that sound?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good to me.
by changes to kWidth
, you mean the lowering of shared to register copies for MMAv3, correct?
the changes should be able to be split cleanly. only thing is that it might be hard to test by itself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
everything but the hoisting pass, yes.
When it comes to testing, you can add a couple lit tests that exercise kWidth != 4 / elemSize
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but yeah, since these changes are orthogonal splitting it in two shouldn't be terribly difficult. Tag me as a reviewer once you put it up and I'll approve it. That should make this PR much more lean, and it should be then fine to merge this PR and the pipeline PR into a single manageable PR
// This ordering is decided when a tensor in DotOpEnc is lowered into llvm. | ||
// For WGMMA this happens in both SharedToDotOperand and MMAToDotOperand. | ||
// Thus, both lowerings must obey this above ordering for the below code to be correct. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The decision should be based on the layout definition rather than a convention between different lowering. This comment is a bit misleading and maybe we should more explicitly describe the layout instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's currently nothing in the dot operand layout attributes that would indicate the ordering of matM and matK though, so I assumed it was just implicit logic. I could move this comment to the definition of DotOpEncoding or perhaps remove it altogether to avoid confusion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes the layout is not well documented and/or defined but this is how it should work :) I think moving it to DotOpEncoding is good, this is still valuable in my opinion
What workload did you use to measure this? If we need both PRs to avoid perf loss we should merge them together, they can be reviewed separately but we should avoid creating perf regressions |
You're right -- I believe the pipelining logic that I currently have in the other PR suffers from this issue. I believe CUTLASS uses more than one RF "buffer" to ping-pong the loads, so that's something I should look into.
I did simple mixed-precision GEMMs on various shapes with one or more dimensions being small (<4k). But now that you bring up the issue with pipelining, I believe I should fix the pipelining logic first before benchmarking again. The perf loss might disappear then. |
elems.push_back(vals.at({b, 2 * m + 1, 2 * k + 1})); | ||
} | ||
for (int k = 0; k < n1; ++k) | ||
if (isHopper) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like it's built on top of bad base
Line 514 in 86a2ac7
for (int k = 0; k < n1; ++k) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, made a mistake there. Just reverted -- please ignore :)
interesting, I wonder how that's possible unless we unroll the loop? |
afaik cutlass subdivides each block ("ktiles" in cutlass terminology) into subblocks ("kblocks"). So while TMA operates on the granularity of blocks, each WGMMA instruction will handle only a subblock. The inner loop of iterating over the subblocks of each block is unrolled, and the shared-to-register copies and WGMMAs are interleaved. I'm not fully certain if it's a ping-pong buffer or something else -- in the process of confirming it |
interesting, thanks for the information. It would probably be tricky to implement it that way in triton. I think there are other strategy we can apply to respect ptx spec. We could have the wgmma_wait at the end of the loop or we could place it before the a operand is set and add some IR changes to make sure nobody re-orders things incorrectly. If this is the main blocker for this PR to land I would suggest having this pass turned off by default so that the infra can be pushed. Then we can work on wgmma pipelining changes to make this right and ideally performant. |
Per @lezcano's suggestion I've first split out the dotOp lowering changes (which is a substantial part of this PR): #5009 After that PR's merged, I think there's still an edge case I need to resolve. But sg -- I can turn off this pass by default after everything else's resolved. |
Allows for upcasting in DotOp encoding in RF. This lowering path is not currently in use; pending #5003
Two bugfixes following #5009. - When `BLOCK_M=64` and `num_warps > 4`, the order of warps for DotOpEncoded tensor should be M-major instead of N-major, since WGMMA expects the 4 warps in each warp group to be stacked along the M dimension. - Should use `mmaBitwidth` instead of `bitwidth` when calculating `numRep` in `SharedToDotOperandMMAv2OrV3`. This was missed in a bad rebase. @lezcano I encountered these bugs when attempting to locally test the [DotOp hoisting PR](#5003) after rebasing (they normally would be caught by `test_core.py` but that path was not yet enabled in the last PR). With these fixes added, I was able to successfully validate against pytorch.
dceb453
to
e9217d1
Compare
Update: cherrypicked pipelining changes into this PR.
|
// Hopper may not contain 32b along kWidth; Ampere always does | ||
int kBits = 8 * elemBytes * kWidth; | ||
assert(kBits == 32 || isHopper); | ||
int vecSize = kBits / canonBits; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for non-transposed case, vecSize = kBits / canonBits
was always 1, so the previous logic (isHopper ? 1 : ...)
worked.
for transposed case, vecSize
can be more than 1, since we're loading each value separately (i.e. canonBits = 8 * elemBytes
).
31b233e
to
b5c407f
Compare
@ThomasRaoux I have obtained some benchmark results. cc @lezcano All GEMMs are bf16 x int8, both operands k-major. The numbers are in us
You can notice that for the last three rows as well as some other small shapes (i.e. all shapes with small M), it's faster to just do AxB, i.e. not swap operands. These cases just use Ampere MMAv2 and the LHS optimization wouldn't apply. |
you meant A and B are swapped right? |
oh yes, that was a typo. |
why are the result different for those 3 rows between BxA and BxA RS? I thought when using MMAv2 the PR would have no effect? |
BxA and BxA RS are still with the operands swapped, so the PR would apply. AxB is the one without swapped; hence there's only one column for AxB. AxB is faster but I included BxA and BxA RS for reference anyway. |
But it wouldn't use wgmma even with the operands swapped? I thought the code in the PR would not change this case |
717941d
to
20f9ba0
Compare
…-lang#5009) Allows for upcasting in DotOp encoding in RF. This lowering path is not currently in use; pending triton-lang#5003
Two bugfixes following triton-lang#5009. - When `BLOCK_M=64` and `num_warps > 4`, the order of warps for DotOpEncoded tensor should be M-major instead of N-major, since WGMMA expects the 4 warps in each warp group to be stacked along the M dimension. - Should use `mmaBitwidth` instead of `bitwidth` when calculating `numRep` in `SharedToDotOperandMMAv2OrV3`. This was missed in a bad rebase. @lezcano I encountered these bugs when attempting to locally test the [DotOp hoisting PR](triton-lang#5003) after rebasing (they normally would be caught by `test_core.py` but that path was not yet enabled in the last PR). With these fixes added, I was able to successfully validate against pytorch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
few more mostly minor comments
patterns.add<SwizzleShmemConvert>(context); | ||
if (this->hoistLayoutConversion.getValue()) | ||
if (this->hoistLayoutConversion.getValue()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: following MLIR style we usually don't have braces here
auto convertBlockLayout = [&](Value val, ttg::BlockedEncodingAttr enc) { | ||
auto ty = cast<RankedTensorType>(val.getType()); | ||
auto newTy = RankedTensorType::get(ty.getShape(), ty.getElementType(), enc); | ||
auto cvt = builder.create<ttg::ConvertLayoutOp>(loc, newTy, val); | ||
return cvt.getResult(); | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can you merge this helper with the one line 137
// If the following are true... | ||
// 1) Operand A is for WGMMA and is to be loaded in registers | ||
// 2) We upcast operand A in registers before the WGMMA | ||
// (downcasting is not yet supported) | ||
// | ||
// ...then the SharedEncoding vec will be less than BlockedEncoding's | ||
// sizePerThread for k-dim. E.g. if shared vec is 8 and sizePerThread | ||
// for k is 16, then AsyncCopyGlobalToLocal will generate two 8B-LDGSTS's | ||
// for each contiguous 16B global data owned by each thread. This breaks | ||
// coalescing (i.e. results 2x the minimum required transactions) | ||
// | ||
// The fix is to clip the BlockedEnc's sizePerThread using SharedEnc's vec. | ||
auto tensorTy = cast<RankedTensorType>(src.getType()); | ||
auto blockEnc = cast<ttg::BlockedEncodingAttr>(tensorTy.getEncoding()); | ||
auto sharedEnc = cast<ttg::SharedEncodingAttr>(allocTy.getEncoding()); | ||
auto sharedVec = sharedEnc.getVec(); | ||
|
||
SmallVector<unsigned> newSizePerThread; | ||
llvm::transform(blockEnc.getSizePerThread(), | ||
std::back_inserter(newSizePerThread), | ||
[&](auto size) { return std::min(size, sharedVec); }); | ||
|
||
if (newSizePerThread != blockEnc.getSizePerThread()) { | ||
auto mod = loadOp->getParentOfType<ModuleOp>(); | ||
int numWarps = ttg::TritonGPUDialect::getNumWarps(mod); | ||
int threadsPerWarp = ttg::TritonGPUDialect::getThreadsPerWarp(mod); | ||
auto newBlockEnc = ttg::BlockedEncodingAttr::get( | ||
loadOp.getContext(), tensorTy.getShape(), newSizePerThread, | ||
blockEnc.getOrder(), numWarps, threadsPerWarp, | ||
blockEnc.getCTALayout()); | ||
|
||
src = convertBlockLayout(src, newBlockEnc); | ||
if (mask) | ||
mask = convertBlockLayout(mask, newBlockEnc); | ||
if (other) | ||
other = convertBlockLayout(other, newBlockEnc); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that looks like an overfit on a special cases, there are many cases where this would not be beneficial. This is probably some optimization we want to do later in the flow with a more robust heuristic. Can this be done in a separate PR and as a separate optimization?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UPDATE - I copied the wrong row shapes for the below table. The last two rows should be 4096, 4096, 4096 and 8192, 8192, 8192
Sure, I've removed this for now. I did tentative measurements for a few shapes without this above optimization:
(in microseconds)
Shape | Main | Current PR w/ this optimizations | Current PR w/o this optimization |
---|---|---|---|
8, 8192, 8192 | 31 | 28 | 29 |
2048, 2048, 2048 | 340 | 290 | 332 |
4096, 4096, 4096 | 2626 | 2459 | 2625 |
So while this optimization is helpful, without it, we still see slight gains over baseline.
I can start working on a separate PR, although I'm trying to think how I can make this logic more general. If shared vec < blocked sizePerThread, I'm not sure if there's a way to fix coalescing other than reducing sizePerThread for the blocked encoding. Do you have specific cases in mind where this would be harmful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes I think reducing sizePerThread is fine, I think what we want to avoid is, doing to much in the pipeliner and making assumptions on how the IR is just based on the fact that the loads feed into mma through register.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, makes sense. in terms of implementation, should I create a separate pass after the pipeliner? (I'm not sure it makes sense to be put into any of the existing passes)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a friendly ping for advice on where to best place this optimization
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry for the delay.
This problem already exists if we don't pipeline and don't use async.cp right? If that's the case I think the best place would be to do it as an improvement of the coalescing pass
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've only observed this with async copy actually.
In the int8 example where shared-vec == 8 and blocked-sizePerThread == 16, if pipelining is disabled, then loading is done with LDG.128
(128 = 16 * 8 bits), the data is cast to f16, and storing is done with STS.128
(128 = 8 * 16bits).
If pipelining is enabled, the LDG
and STS
are coupled. Meaning we can't load 16 int8's in one 128-bit instruction. Two LDGSTS.64
are generated for the contiguous 128 bits, which breaks coalescing.
Hence, I was thinking this might make sense as a new pass somewhere after the pipeliner?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, I agree it should go in a pass after pipelining then. I can't think of a good place for this right now so I think making a new pass makes sense and we can move it around over time.
…-lang#5009) Allows for upcasting in DotOp encoding in RF. This lowering path is not currently in use; pending triton-lang#5003
Two bugfixes following triton-lang#5009. - When `BLOCK_M=64` and `num_warps > 4`, the order of warps for DotOpEncoded tensor should be M-major instead of N-major, since WGMMA expects the 4 warps in each warp group to be stacked along the M dimension. - Should use `mmaBitwidth` instead of `bitwidth` when calculating `numRep` in `SharedToDotOperandMMAv2OrV3`. This was missed in a bad rebase. @lezcano I encountered these bugs when attempting to locally test the [DotOp hoisting PR](triton-lang#5003) after rebasing (they normally would be caught by `test_core.py` but that path was not yet enabled in the last PR). With these fixes added, I was able to successfully validate against pytorch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
please run the pre-commit hook |
oh oops - just did. |
could this be a spurious error in the a100 integration test?
|
You can just rerun the test |
looks like there is some flakyness I'm trying to fix |
…for SMEM-to-MMAv3 DotOp Copy (triton-lang#5003) Hopper has two kinds of WGMMAs, "SS" (both operands in shmem) and "RS" (LHS operand A in registers). In cases where we apply elementwise operations on A before WGMMA, Triton previously will copy A from global memory (GMEM) into registers (RF), perform the elementwise ops, and then copy to shared memory (SMEM) to perform SS WGMMA. This PR adds an optimization for the case above to use RS GEMM. This requires the following changes: - In TritonGPU OptimizeDotOperands pass, add optimizations to change SS GEMM into RS GEMM. - Add TritonGPU -> LLVM lowering for copying from SMEM to RF in MMA v3 dotOperand layout. NOTE: This may not see perf gain, and may even see perf loss, for certain shapes (e.g. small-K), and additional optimizations are in a separate [PR](openxla#19) (still more optimizations are WIP). Please advise on the merging strategy.
…-lang#5009) Allows for upcasting in DotOp encoding in RF. This lowering path is not currently in use; pending triton-lang#5003 (cherry picked from commit cfddb09)
#5222) This is a follow-up to the dotOp hoisting optimization for WGMMA (MMAv3). See #5003 (comment) In short, when upcasting operand A in registers prior to WGMMA and when pipelining is enabled, `AsyncCopyGLobalToLocal`'s src gmem blocked encoding will have `sizePerThread` > smem view's `vec` (along the contiguous dimension). This will resulting in multiple `cp.async` instructions being generated for a contiguous global data segment, resulting in uncoalesced loads. This was previously confirmed in ncu. See above comment for an example. I've added a generalized fix in a new pass after the pipeliner. I've reused the logic in the LLVM lowering for `AsyncCopyGlobalToLocal` to calculate the max contiguous copy size. I compare that to the blockEnc's `sizePerThread` along the inner (contiguous) dimension. If the former is less than latter, I set the latter to former. When A is k-major, can verify a small perf improvement and that ncu no longer reports uncoalesced loads. When A is m-major, this pass is a no-op because `copy size == sizePerThread == 16` ptal, thanks @ThomasRaoux
Update Update Update Update Update Use pytest' `tmp_path` in `test_irsource.py` (#5145) Signed-off-by: Anatoly Myachev <[email protected]> [TEST] Make mixed matmul test deterministic (#5151) This prevents surprises when some value may go above the tolerance threshold Fix `gtest_discover_tests` timeout argument (#5149) `gtest_discover_tests` runs the built unittest executable to create a distinct CMake target for every individual unittest in each executable. However, this was previously noted to time out on MacOS frequently (because MacOS scans newly built executables for viruses, or something...) but the timeout argument was incorrectly specified. [Triton] Remove upstream bug workaround (NFC) (#5152) Upstream handling of splatted bools in `DenseElementsAttr` was fixed, so the workaround can be removed when lowering `arith.constant` to TritonGPU. Co-authored-by: peterbell10 <[email protected]> [Triton] Generate local MLIR reproducers when possible (#5155) By setting a reproducer path, the pass manager will dump a standard MLIR reproducer before each pass manager invocation. This PR also enables additional local crash reproducer generation (to the same path set through the env var), which tries to narrow down the specific pass that failed, if the pass pipeline fails at any point. Revert "[AMD][Pipeliner] Improve clustering and add prefetch (#4881)" (#5157) This reverts commit cc25374 due to perf regressions. [IR] Add typing for tensor descriptor types (#5147) Currently tensor descriptors are just typed as `!tt.ptr<i8>` which is exposing the assumption it's using a TMA descriptor. This changes it to a custom type `!tt.tensordesc<tensor<...>>` which is lowered to a pointer type in the LLVM IR. I also add two new IR Ops which are used to cast between pointers and tensordesc objects. ```mlir tt.reinterpret_tensor_descriptor %ptr : !tt.ptr<i8> to !tt.tensordesc<...> triton_nvidia_gpu.tensor_desc_to_tma_ptr %desc : !tt.tensordesc<...> -> !tt.ptr<i8> ``` Really both of these should be nvidia-specific but the first is exposed in the triton IR to keep support for the by-value TMA descriptor API around while we figure out if it's possible to update to the new style. Load backend dialects in `IRSource` to make sure `parse_mlir_module` works for third_party backends (#5146) The changes from #4924 do not take into account the situation when `ttgir` level contains dialects defined in third_party plugins (at least that's my understanding). I'd also like to point out that the second use of `parse_mlir_module` function (via `parse` function call) happens after the dialects are loaded for the backend as well, which is why I thought my changes make sense. I hope this implementation will suit Triton, or maybe one can suggest other options. --------- Signed-off-by: Anatoly Myachev <[email protected]> [BACKEND][NVIDIA] Add DotOp Hoisting Pass for WGMMA and Add Lowering for SMEM-to-MMAv3 DotOp Copy (#5003) Hopper has two kinds of WGMMAs, "SS" (both operands in shmem) and "RS" (LHS operand A in registers). In cases where we apply elementwise operations on A before WGMMA, Triton previously will copy A from global memory (GMEM) into registers (RF), perform the elementwise ops, and then copy to shared memory (SMEM) to perform SS WGMMA. This PR adds an optimization for the case above to use RS GEMM. This requires the following changes: - In TritonGPU OptimizeDotOperands pass, add optimizations to change SS GEMM into RS GEMM. - Add TritonGPU -> LLVM lowering for copying from SMEM to RF in MMA v3 dotOperand layout. NOTE: This may not see perf gain, and may even see perf loss, for certain shapes (e.g. small-K), and additional optimizations are in a separate [PR](openxla#19) (still more optimizations are WIP). Please advise on the merging strategy. Restore the CentOS 7 build (#5158) We likely need it for the PyTorch 2.6 release [BACKEND] Add folder for `addptr(ptr, 0) -> ptr` (#5166) I noticed this rather obvious pattern was missing. It might come up for example if you have an expression like: ```python ptrs = ptr + y_stride * tl.arange(0, YBLOCK)[:, None] ``` and the `YBLOCK` is set to 1 during autotuning. [TritonGPU] Fix incorrect mask operand used in for loop pipeliner (#5161) When the OOB values for a `tt.load` are non-zero, the for loop pipeliner needs to generate an `arith.select` to mask the loaded values with the default OOB value. However, if the load memory requires a layout change, the wrong mask operand was being passed to the `arith.select`, causing a shape mismatch. The fix is to just use the same mask operand of the origianl `tt.load` op. Fixes #4739 [BACKEND] Cleanup redundant broadcast combine pattern (#5167) Summary of changes: - Remove `broadcast(cst) -> cst` from the triton-combine pass since it's redundant with the existing folder. - Reorder the triton-combine pass to come after the canonicalize pass, to simplify pattern matching - Cleanup patterns in triton-reorder-broadcast that called `Op::canonicalize` in favor of `Op::getCanonicalizationPatterns`. [AMD] NFC: Drop duplicated moveUpTranspose (#5168) It was duplicated due to resolving merge conflicts. [Triton] Default diagnostic handler only filters for errors (#5173) A regular SourceMgrDiagnosticHandler is causing all remarks to be emitted even if the user doesn't ask for it! [AMD] Refactor instruction scheduling hints (#5144) - Renamed instruction scheduling variants - Enabled `buffer-ops` for `local-prefetch` - Added documentation regarding current variants --------- Co-authored-by: Lei Zhang <[email protected]> [AMD] Enable mixed precision matmul test (#5177) This commit enables mixed precision matmul test for AMD backend. For FP8 E4M3, we test `fp8e4m3fnuz` given that's natively supported on MI300 series. Update to llvm/llvm-project@bd9145c8c213 (#5180) This pulls in llvm/llvm-project@bd9145c8c213 to enable ASan on AMD backend. [AMD] Implement RepOrder for AMD MMA layouts (#5126) Implement RepOrder methods for MFMA and WMMA layouts. Both layouts have row major rep layout. Also, isTranspose flag in MFMA layout does not affect RepOrder, meaning RepOrder is row major in both cases. Co-authored-by: Ognjen Plavsic <[email protected]> [BACKEND] Fix ProgramPoint passing in AxisInfoAnalysis (#5181) Fixes #5122. The `ProgramPoint` [here](https://github.com/triton-lang/triton/blob/0bd30a2f3192204c5a50d5ffde27ad8493f6c026/lib/Analysis/AxisInfo.cpp#L1087) is created on the stack. Then its address is [passed](https://github.com/triton-lang/triton/blob/0bd30a2f3192204c5a50d5ffde27ad8493f6c026/lib/Analysis/AxisInfo.cpp#L1088-L1089) to the MLIR `SparseAnalysis` code, where it is [added as a dependency](https://github.com/llvm/llvm-project/blob/33ff9e43b4c5bdc3da31c6b11ad51d35a69bec5f/mlir/lib/Analysis/DataFlow/SparseAnalysis.cpp#L311) and later [dereferenced](https://github.com/llvm/llvm-project/blob/33ff9e43b4c5bdc3da31c6b11ad51d35a69bec5f/mlir/lib/Analysis/DataFlow/SparseAnalysis.cpp#L90). By the time the `ProramPoint` is dereferenced in the `AbstractSparseForwardDataFlowAnalysis::visit`, the `AxisInfoAnalysis::visitForOpInductionVar` will have finished and the `ProgramPoint` stack variable destroyed. This leads to a segfault (which can be reproed on the base rev with the lit test added in this PR). The code modified in this PR was originally added in #4927, in conjunction with updating the `llvm-project` hash to `b5cc222d7429`. However, as noted in llvm/llvm-project#110344 (the `llvm-project` PR that has made the refactoring prompting the `AxisInfo.cpp` change in #4927): > For dense forward data-flow analysis and other analysis (except dense backward data-flow analysis), the program point corresponding to the original operation can be obtained by `getProgramPointAfter(op)` As the `AxisInfoAnalysis` (in Triton) inherits from `SparseForwardDataFlowAnalysis` (in MLIR), in this PR we follow the above which resolves the segfault issue (as the `ProgramPoint` is now stored in the instance-level state of the pass). P.S. The lit test added in this PR is not exactly minimal. However, I did my best to minimize it starting from the 400-line repro TTGIR in [INTERPRETER] Fix argument passing for internal parameters in function declarations (#5169) [NFC] Use reference instead of copies in few places (#5118) Apply fixes suggested by coverity static analysis. Signed-off-by: Anatoly Myachev <[email protected]> [BACKEND] Add missing precondition in optimize acc init (#5184) We need scalar select to be able to do this optimization. [BACKEND] Fix accumulator init optimization for integer matmuls (#5192) [AMD][Pipeliner] Reland "Improve clustering and add prefetch" (#5175) This unreverts commit 38c6284 to reland #4881 with the following fixes: * Still keep `scheduleGlobalLoadLocalStore` as original--it turns to be not totally ready to replace yet. Further iteration on it needed. * Turn on `TRITON_HIP_STREAM_PREFETCH` if the instruction scheduling variant is `local-prefetch`, given it's needed there. --------- Co-authored-by: Lei Zhang <[email protected]> [AMD] Define an extract slice operation (#4804) This commit introduces an extract_slice operation for AMD backend to enable extracting slice of a tensor in registers without data exchange. It enables breaking down large tiles of tensors into smaller ones for better instruction interleaving and scheduling. This can be useful for hiding global memory latency when a global load/store can be efficiently split into several loads/stores to be overlapped with compute fo attention. [BACKEND] Fix getElemsPerThread for mmav3 dot operand (#5189) In mmav3 case the number of elements per threads should be independent of the element type, we should only consider kWidth. TODO: it should also be true for MMAv2 but the logic is a bit more complicated. Also enable larger block_m in mixed mode tests to exercise MMAv3 case [INTERPRETER][NFC] Rename `tensor_shape` -> `block_shape` in interpreter (#5195) `tensor_shape` is a confusing name and doesn't match block pointer's semantic. `block_shape` is much clearer. [LAYOUTS] Implement LL conversion for DotOperand(Hopper) (#5193) We also rewrite the way we implement DotOperand(Ampere) and mma Ampere to promote code reusing. I also started using what I believe is a rather compact pattern to write these things, where you first call `identiyND` with the `repOrder`, which gives you an LL with the dims in the correct order, and then you construct the final layout by specifying the tiles by multiplying `identity1D` maps. Using this allowed me to heavily simplify the handling of the `warps` of `DotOperand` which used to be a tad messy. Update README.md to remove triton conference (#5198) It happened two months ago [PROTON] Add `proton.state` utility (#5110) `state` is different from `scope` in several ways: 1. State is not recursive; each operation can have only a single state. Inner most state will overwrite the outer most state. 2. A states is a suffix, meaning that the original call path will append a state above the name of each kernel. 3. State is compatible with both Python and shadow contexts. [CI] remove unused inductor workflows (#5073) These tests have completely offloaded torch inductor tests to Meta a few months ago. They are currently disabled on GitHub. Signed-off-by: Sébastien Han <[email protected]> [INTERPRETER] Fix lower bound check for block pointers (#5201) We forgot to check `offset >= 0` previously. Now that it should match the semantic in the GPU backend https://github.com/triton-lang/triton/blob/7bce3613755e26953518962d02315dfd343dc50c/lib/Dialect/Triton/Transforms/RewriteTensorPointer.cpp#L136 [IR] Remove memdesc from `tt.trans` and implements `ttg.memdesc_trans` (#5194) [LLs] [BE] Simplify identityND (#5199) The auxiliary function `identityND` used to take an `order` parameter, that comes from triton, and a set of dimensions. Now, the order in triton is defined wrt. `dim0..dim<rank-1>`, so the dimension arg was redundant. This was quite confusing. We see that in all the uses of `identiyND`, we would pass the canonical dimensions, other than in one that we simply remove as it was not necessary. We remove the dims arg and simply return a layout with output dims `dim0..dim<rank-1>`. [MXFP] Fix packing for mxfp4 type (#5197) When packing we should have element 0 in the lower bits, until this PR it was in higher bits. [LAYOUTS] Unify the implementation of getShapePerCTA (#5183) We unify it and simplify its API (it was taking an unused `shape` parameter). While doing this, we found that the previous implementation was incorrect at least for `AMDWmmaEncodingAttr`, as this layout was using the shape parameter. Interestingly enough the doc in the header file for this function noted that the function is indeed independent of the tensor shape, even though the function does take a shape as an input! https://github.com/triton-lang/triton/blob/0bd30a2f3192204c5a50d5ffde27ad8493f6c026/include/triton/Dialect/TritonGPU/IR/Dialect.h#L113-L114 [BACKEND] Use the LL API to replace the using of legacy layout attribute API. (#5196) The util function `getDistributedLayoutStr` uses the `DistributedLayout` attribute interface, which is not flexible for third-party extensions. Use the `getInDimSize` of the `LinearLayout`, which is better since the legacy layout has been converted to the `LinearLayout`. There is no new test case since it is only a change in API usage. [CI] Fix ccache cache restoration to improve build times (#5202) This improves a warm-cache macOS build from ~25 mins to 2 mins. [CI] Fix `du` failling if cache restore fails (#5206) Follow up to #5202 It's currently failing with the error ``` du: /Users/runner/.triton/**: No such file or directory Error: Process completed with exit code 1. ``` which happens because even though the `.triton` directory exists, it is empty. This instead uses du on `.triton` with a depth of 1. [BACKEND][LAYOUT] Use LL for AMDMfma related layout conversions (#5210) [BUILD] Add option to limit number of parallel link jobs (#5212) [CI] Fix cache not saving (#5213) 1. [CI] Fix cache not saving Re-using the output of the cache restore step was recommended by the `actons/cache` docs, but it doesn't work here because we actually start from a clean cache when we run save so there is no output available to read. The annoyances of testing in the PR but main being a different environment. 2. Bump macOS timeout [LAYOUTS] Implement IR support for LinearLayouts (#5170) We also exercise this in scale_dot, where we enable support for warps of arbitrary shape (before we just allowed `[num_warps, 1]`). With this infra in place, it should be rather easy to move from the legacy layouts to using LLs to represent all of our layouts. Something I'm concerned about is the amount of recomputation that happens when calling methods like `getSizePerThread` and the like, where we keep recomputing the result. There might be an optimisation opportunity here where we cache the result of all these functions. We choose the IR representation of an LL via its canonical form + a `repOrder` for several reasons: - It's generally more compact - It's easier to CSE, so it's easier to see when two layouts are in fact the same. - A technical reason: the `toLinearLayout` function returns a tensor with dimensions `dim0, ..., dim<rank-1>`, in other words, it "forgets" the repetition order. Without the repetition order, we cannot recover the tile size of the argument. In particular, we cannot recover `getSizePerThread`. There is an argument to be made about whether `getSizePerThread` is useful on its own, or whether it is `getElemsPerThread` the real useful abstraction here, but for now, we keep both for BC. [CI] Run tests when CI is manually triggered (#5216) Currently you can manually call a workflow dispatch, but it won't actually run the tests because the variable enable_integration isn't set. [PROTON] Introduce the Proton dialect as a third-party plugin for intra-kernel perf tooling (#5119) This PR introduces the `Proton Dialect` to enable intra kernel profiling and tooling for Triton. As a third-party dialect, it serves as the building blocks to create 3rd-party perf tools (e.g., profilers, analysis, modeling) for Triton compiler developers in a compiler-centric way, such as an intra-kernel latency profiler to understand software pipelining, warp specialization, and CTA fine-grained orchestration (e.g., cuda core, tensor core, TMA). Future developments would integrate this dialect with the existing Proton backend profiling infrastructure to make it a powerful and general perf tool utility. As a first step, this PR adds some basic boilerplate code and mechanics, and the `proton.record` op for the `Proton Dialect`. --------- Co-authored-by: Yuanwei Fang <[email protected]> Co-authored-by: Keren Zhou <[email protected]> [DRAFT] Completely remove `MemDesc` from the Triton dialect (#5208) After this PR, `MemDesc` will be a type only in the TritonGPU dialect, as will the `TensorOrMemDesc` interface. [AMD] Prevent wrong reordering of scf operations (#5203) The pass was reordering scf.if operations without checking the extra dependencies coming from the region. For now just prevent this case although this part of the code might still be fragile. [AMD] Cover default case in MfmaGroup (#5218) If you build using the `CMakeLists.txt` and not `setup.py` and you build in `Release` then you get ``` /__w/triton/triton/third_party/amd/lib/TritonAMDGPUTransforms/MfmaGroup.cpp: In function ‘std::pair<mlir::Type, mlir::Type> mlir::TypesFromMfmaId(MLIRContext*, MfmaTypeId)’: Warning: /__w/triton/triton/third_party/amd/lib/TritonAMDGPUTransforms/MfmaGroup.cpp:240:1: warning: control reaches end of non-void function [-Wreturn-type] ``` Allow Layouts to propogate to local_load (#5219) While working on some higher dimension tensor kernels, I noticed poor performance due to the fact that layouts wouldn't propagate to local loads. Since we do allow layout folding with local store and local alloc, this seems like a bit of an oversight. The change gives a 40% speed improvement on certain kernels for NVidia GPUs. This also removes asserts in lowering for higher dimensional kernels. As far as I can tell, those restrictions aren't required in practice. - [x] I am not making a trivial change, such as fixing a typo in a comment. - [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how). - [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`. - [x] I have added tests. - [x] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices) [BACKEND] Fix transpose optimization missed during refactor (#5226) [AMD] Use warp shuffle for fp8 MFMA to dot operand layout conversion (#5139) Adding a shortcut case for fp8 MFMA to dot operand layout conversion that avoids using shared memory, to speed up FP8 attention kernels. [LAYOUTS] [BE] Simplify Ampere/Hopper paths introduced in #5189 (#5200) We simplify the implementation of `getElemsPerThread` and strengthen the preconditions of `getRepForOperand`. More generally, we should try to minimise the calls to `isAmpere` and `isHopper` throughout the codebase. I'll do a pass fixing many of these once we land LLs for `ldmatrix` and Hopper. [BACKEND] Use LL to simplify redundant elements check and fix related issues (#5225) Make TMA tests compatible with older CUDA toolchains (#5221) TMA fences require CUDA toolchain 12.3 or greater, but current gating does not check the CUDA toolchain version. This causes `test_experimental_tma.py` to fail when run with older CUDA toolchains. With cuda-12.0: ``` 55 failed, 9 passed in 18.11s ``` With cuda-12.4: ``` 64 passed in 11.99s ``` With cuda-12.0: ``` 9 passed, 55 skipped in 4.26s ``` With cuda-12.4: ``` 64 passed in 11.96s ``` [CMake] Add C as project language (#5217) If you build with `-DTRITON_BUILD_UT=OFF` on Mac you will get something like ``` -- Looking for histedit.h CMake Error at /opt/homebrew/Cellar/cmake/3.30.5/share/cmake/Modules/CheckIncludeFile.cmake:90 (try_compile): Unknown extension ".c" for file -- Looking for histedit.h - not found /Users/runner/work/triton/triton/triton-build/CMakeFiles/CMakeScratch/TryCompile-QA06d6/CheckIncludeFile.c try_compile() works only for enabled languages. Currently these are: CXX See project() command to enable other languages. Call Stack (most recent call first): llvm-bd9145c8-macos-arm64/lib/cmake/llvm/FindLibEdit.cmake:28 (check_include_file) llvm-bd9145c8-macos-arm64/lib/cmake/llvm/LLVMConfig.cmake:177 (find_package) llvm-bd9145c8-macos-arm64/lib/cmake/mlir/MLIRConfig.cmake:10 (find_package) ``` because `C` isn't an enabled project language. [AMD] Fix slow compilation due to inlining print calls (#5153) This PR disables inline of print related functions, which speeds up compilation of test_scan_layouts dramatically. --------- Co-authored-by: Lei Zhang <[email protected]> [AMD] Re-enable overflow test in test_reduce_layouts (#5233) #5153 fixed the issue; but we missed enabling one of the disabled case. [BACKEND] Fix a missed transpose optimization during refactor (#5236) Revert "Allow Layouts to propogate to local_load" (#5237) This is causing some performance regression. I'll investigate and reland it. Reverts #5219 Revert "[AMD] Use warp shuffle for MFMA to Dot operand layout conversion (FP8)" (#5240) It is causing performance regression, revert until it can be investigated Reverts #5139 Updated README.md to show the steps for overriding kernel's IR (#5239) Ensure device context before launching kernel (#3731) If a kernel is launched on a thread which has not initialized a CUDA context (as can happen in the linked issue), it will throw an error. A simple fix is to call `cudaFree(0)` to establish a device context. Fixes #3729 [LLVM] Update to llvm-project@86b69c3 (#5242) This includes llvm/llvm-project#115627 [BUILD] Add a stable symlink to llvm in the triton cache (#5234) Currently the llvm path changes every time the pin updates which makes it annoying to use the included tools. e.g. I use the tablegen language server, but currently need to update my editor config every time the llvm pin changes. This adds a stable symlink which for me is `~/.triton/llvm/llvm-macos-x64`. This will always point to the most recent version of llvm used to build triton. As a bonus this also refactors the symlink update code which was copy-pasted a few times. [PIPELINER] tweak pipeline heuristic (#5247) Don't pipeline the dot accumulator in the default heuristic. In the finer grain control will allow user to decide. Allow Layouts to propogate to local_load (#5219) (#5249) recommit of #5219 While working on some higher dimension tensor kernels, I noticed poor performance due to the fact that layouts wouldn't propagate to local loads. Since we do allow layout folding with local store and local alloc, this seems like a bit of an oversight. The change gives a 40% speed improvement on certain kernels for NVidia GPUs. This also removes asserts in lowering for higher dimensional kernels. As far as I can tell, those restrictions aren't required in practice. - [x] I am not making a trivial change, such as fixing a typo in a comment. - [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how). - [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`. - [x] I have added tests. - [x] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices) <!--- The core Triton is a small number of people, and we receive many PRs (thank you!). To help us review your code more quickly, **if you are a new contributor (less than 3 PRs merged) we ask that you complete the following tasks and include the filled-out checklist in your PR description.** Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them. --> - [ ] I am not making a trivial change, such as fixing a typo in a comment. - [ ] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how). - [ ] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`. - Select one of the following. - [ ] I have added tests. - `/test` for `lit` tests - `/unittest` for C++ tests - `/python/test` for end-to-end tests - [ ] This PR does not need a test because `FILL THIS IN`. - Select one of the following. - [ ] I have not added any `lit` tests. - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.) Co-authored-by: Matthew Brookhart <[email protected]> Windows related changes in `CMakeLists.txt` (#5186) Upstreaming some of our Windows related changes assuming that there is interest in this #5094 (comment) and hoping that it will not make it much more difficult to support this CMake file. --------- Signed-off-by: Anatoly Myachev <[email protected]> [AMD] NFC: Unified header guard in third_party/amd (#5244) This commit unified the names of header guards in third_party/amd. [AMD] NFC: Drop v2 Suffix from Stream Pipeline (#5251) Since StreamPipelineV2 has been the default for a while, this commit promoted StreamPipelineV2 to the general StreamPipeline by removing 'v2' suffix. [NFC] Cleanup references to unused index dialect (#5257) Also cleans up some includes clang thinks are unused. [BUILD] Ensure parent directory exists before creating symlinks (#5258) Fixes #5256 Tmp [BACKEND] Fold transpose(splat_const) (#5259) Add folding for a transpose of a splat constant. --------- Co-authored-by: peterbell10 <[email protected]> [LAYOUTS] Use LLs for Hopper whenever we wouldn't use ldmatrix (#5235) The legacy path has some bugs for cases like `kWidth=1`. I'm starting to port Hopper to use LLs to try to isolate them. [AMD] NFC: Cleanup namespace hierachy (#5246) Refactored namespace hierarchy by squeezing separate namespace hierarchy together. [AMD] Fix unhandled profile event in RoctracerProfiler (#5252) Fixes proton unit tests when upgrading to ROCm 6.2 by adding missing event handlers. Magic number is replaced with the corresponding enum value which was added by upgrading the HIP headers #5077. Fix Blocked FMA path in isLayoutOK (#5260) Fixes https://github.com/triton-lang/triton/pull/5235/files/de18e21ddf5bf03f17f779fef032d53ea87a53a0#r1858955613 [Tutorial] Remove incorrect caching from softmax tutorial (#5162) The fused softmax implementation in the tutorial precompiles the kernel to query the register usage of the kernel, based on the parameters used to specialize the kernel. On top of this, it implements a simple caching system for this step based on just the block size. As noted in #4739, this caching is incorrect, because it's also not keyed on the `num_stages` constexpr argument or the shapes of the tensors. Since triton already has its own JIT compilation cache, and this caching bit is not really relevant to the tutorial, just remove it to get rid of the footgun. [INSTRUMENTATION] Generalize code in `test_gpuhello.py` (#5263) Signed-off-by: Anatoly Myachev <[email protected]> Create an aggregate `check-triton-unit` target (#5150) This adds a CMake target `check-triton-unit` that builds an runs all Triton unittests written in gtest. This makes it more conveninent to rebuild and run all unittests at once with finer granularity (instead of `ninja; ctest`). [NFC] Add `test_bessel` into `test_libdevice.py` (#5261) Just a port of one of our tests. I didn't find any similar ones in Triton itself, this should increase the test coverage. Signed-off-by: Anatoly Myachev <[email protected]> [NFC] Add functional regression test for cummax with bool type (#5264) This kernel was obtained using PyTorch inductor some time ago. Signed-off-by: Anatoly Myachev <[email protected]> [AMD] NFC: Unified comment style (#5248) Script: egrep -nrI --exclude-dir "backend" "^\s*/\*+" third_party/amd [AMD] Upgrade AMD CI docker image (#5230) This commits updates the CI to use a new docker image that contains ROCm 6.2.2 with ASan support and PyTorch 2.5.1. This also switches to ubuntu's default clang toolchain instead of using the one which comes with ROCm. Implement `dot_scaled(mmav3)` (#5269) As per title [BUILD] Some CMake cleanup/modernisation (#5271) - Prefer `find_package` over ad-hoc variable passing - Prefer `target_` api vs global `_directories` apis - Use `target_link_options` to specify link options instead of `target_link_libraries` Closes #5270 [DIALECT] Rename `triton_gpu` to `ttg` and `triton_nvidia_gpu` to `ttng` (#5266) It may cause changes for downstream tasks but we think it's beneficial to shorten dialect name and make them consistent. That is, we are using `tt` to represent the `triton` dialect. [BACKEND] Fix inline asm bug for multiple packed <32bit output (#5273) Resolves #5272 - Fixes logic for walking result struct from LLVM InlineAsm in case of multiple sub-32bit results - Adds lit test [NVIDIA][Backend] Add CoalesceAsyncCopy Pass for in-DotOpEnc Upcasting (#5222) This is a follow-up to the dotOp hoisting optimization for WGMMA (MMAv3). See #5003 (comment) In short, when upcasting operand A in registers prior to WGMMA and when pipelining is enabled, `AsyncCopyGLobalToLocal`'s src gmem blocked encoding will have `sizePerThread` > smem view's `vec` (along the contiguous dimension). This will resulting in multiple `cp.async` instructions being generated for a contiguous global data segment, resulting in uncoalesced loads. This was previously confirmed in ncu. See above comment for an example. I've added a generalized fix in a new pass after the pipeliner. I've reused the logic in the LLVM lowering for `AsyncCopyGlobalToLocal` to calculate the max contiguous copy size. I compare that to the blockEnc's `sizePerThread` along the inner (contiguous) dimension. If the former is less than latter, I set the latter to former. When A is k-major, can verify a small perf improvement and that ncu no longer reports uncoalesced loads. When A is m-major, this pass is a no-op because `copy size == sizePerThread == 16` ptal, thanks @ThomasRaoux [Triton] Add `tl.gather` with a naive codegen implementation (#5262) This PR adds a `tl.gather` builtin that implements a local gather along a single axis, with semantics matching `torch.gather`. `tl.gather` generates a `tt.gather` op, which is piped through the compiler mostly untouched at the moment, since the codegen is very naive. The `tt.gather` is implemented by writing the source tensor into shared memory and then performing a gather out of shared memory, thus it requires scratch space to be allocated. In a follow-up, I will implement an optimized layout rule for the op that ensures the gather axis fits into a single warp, allowing the gather to be implemented using warp shuffles. There are other avenues for optimization as well: `tt.gather(tt.load)` where the load only has one use can be lowered into a DMA from global memory to shared, and then gather directly from shared. [NVIDIA][Launcher] Ensure device context is valid before calling getPointer (#5276) [CMAKE] Add `triton-tensor-layout` dep to lit tests (#5275) Noticed this when `triton_gpu` was renamed to `ttg`. [BACKEND] Fix and document logic for creating warp shapes in MMAv3 (#5277) [NFC] Remove dead code for python<3.8 (#5280) Signed-off-by: Anatoly Myachev <[email protected]> [NFC] Remove `CMAKE_VERBOSE_MAKEFILE` var (#5282) Warning: ```bash CMake Warning: Manually-specified variables were not used by the project: CMAKE_VERBOSE_MAKEFILE ``` Signed-off-by: Anatoly Myachev <[email protected]> [AMD] Use Linear Layout convertions for AMDWmma (#5255) Enable LL conwertions for WMMA as well as for MFMA layouts. See also: #5210 Signed-off-by: Ilya Veselov <[email protected]> Add tests for 3D local_load local_alloc and relax asserts (#5285) Also switch 3D dot_operand cases to use linear layout path, This may be suboptimal in some cases but that solves the functionality problems which is more important. There is ongoing work from Mario that should get the code quality to be good again soon. [Build] Don't require Development.Embed python component (#5287) This component is missing from the wheel building image, so we need to make the requirement more specific. https://github.com/triton-lang/triton/actions/runs/12081047335/job/33689420657#step:6:332 [NFC] Remove unused forOp argument from `setStageCluster` (#5288) <git-pr-chain> [NFC] Remove unused forOp argument from `setStageCluster` 1. 👉 #5288 👈 **YOU ARE HERE** 1. #5289 1. #5290 </git-pr-chain> [PROTON] Don't use designated initializers in `CuptiPCSampling.cpp` as it relates to c++20 (#5291) Hi @Jokeren, these changes relates to your PR: #4674, so I would like to ask if this was done on purpose? (considering that the project declares support for the c++17 standard). I discovered this while trying to compile proton using MSVC. It looked like this: `\CuptiPCSampling.cpp(18): error C7555: use of designated initializers requires at least '/std:c++20'`. This might also be a good opportunity to ask you about your plans to transition Triton to `с++20`. --------- Signed-off-by: Anatoly Myachev <[email protected]> Add back missing check Replace triton_gpu with ttg Update Update Update Define `pytest-forked` and `pytest-xdist` as `tests` target deps (#5292) This way, the dependencies needed for testing are localized in one place - `setup.py` (instead of several), which makes maintenance easier. Signed-off-by: Anatoly Myachev <[email protected]> [BUILD] Skip installing test related python packages (#5294) #5292 failed because of macOS build. Since we don’t run any tests on macOS anyway, it’s fine to simply skip them. Update Update [TESTING] Add golden sample test for pipelining matmul with descriptors (#5289) <git-pr-chain> [TESTING] Add golden sample test for pipelining matmul with descriptors 1. #5288 1. 👉 #5289 👈 **YOU ARE HERE** 1. #5290⚠️ ⚠️ Please **do not click the green "merge" button** unless you know what you're doing. This PR is part of a chain of PRs, and clicking the merge button will not merge it into master.⚠️ ⚠️ </git-pr-chain> Specify in `setup.py` that `setuptools>=40.8.0` is a required dependency (#5293) Closes #5090 vancoykendall is right that the dependency is used not only during build. However, for now I added it to `setup.py`, since the migration of dependencies to `pyproject.toml` has not yet occurred. Signed-off-by: Anatoly Myachev <[email protected]> [TOOLS] Improve `generate-test-checks.py` (#5300) - Format the doc string using the `reStructuredText` format. - Lift the example instructions from the `.mlir.in` file to the docstring. Previously we matched the `module` keyword twice and encountered errors such as `assert len(output_segments) == len(source_segments),`. It's also fine to update the regex to something like `\bmodule` to solve the problem, but I think lifting it from the input file is just simpler. [NFC][DIALECT] Remove dependency on `mlir::tensor::TensorDialect` (#5303) [IR] Improve `ttg.memdesc` (#5296) - Add an `allocShape` field to denote the shape a memory descriptor when it's allocated. The value will be propagated to all its descendants created through `subview` ops. - Make `encoding` and `memorySpace` fields required instead of optional. - Implement the `getAlias` function for `#ttg.shared_memory` to shorten its length in `.mlir` files Update Update [Pipeliner] Handle masking for atomic_rmw (#5231) This commit is to support atomic_rmw in the function predicateOp to mask operations during scheduling. [TESTS] Forward fix for CI break (#5323) PR #5231 was authored before the `triton_gpu` -> `ttg` rename and CI is currently broken. Search for `ptxas` only for cuda backend in `supports_tma` function (#5314) For other backends, `ptxas` may not be installed. Signed-off-by: Anatoly Myachev <[email protected]> [LLVM] Update to llvm/llvm-project@1f20eee6dc36 (#5308) This pulls in the AMDGPU backend support for the gfx950 target. We need to fix the rewrites in `Combine.td` given that llvm/llvm-project#112700 adds a new attribute for denorm mode for `arith.addf`. --------- Co-authored-by: Lei Zhang <[email protected]> [AMD][BACKEND] Add gfx950 target definitions. (#5281) Enable new arch target since backend support has been added. [AMD] Adjust local_store and global_load ordering (#5254) This commit adjusts local store and global load ordering to let local store be ahead of global load when they are not in the same stage. It should help GEMM kernel performance. Re-align main and llvm-head (#5334) We have a couple of PRs that landed in the `llvm-head` branch that are not in `main`. Merging those into `main` to prevent further divergence between branches. --------- Co-authored-by: Won-Kyu Park <[email protected]> Co-authored-by: Lei Zhang <[email protected]> [PIPELINER] Cleanup of LoopScheduling.cpp, introduction of AssignLatencies (#5176) This change breaks down LoopScheduling into two sub-passes: latency assignment and actual scheduling. Latency assignment is a transformation that analyzes the loop and based on the requested number of stages it assigns "latencies" to the ops that are going to be converted to async ops by the pipeliner. Latencies are expressed in terms of number of iterations of the loop and can be thought as per-operation num_stages. Scheduling transformation takes these latencies and builds a pipeliner schedule based on it. The process of building a schedule was slightly rewritten to simplify the code and cleanup the logic that was no longer needed after recent refactoring. Breaking down the schedule into latency assignment and proper scheduling has number of purposes: 1. Code became more modular, with cleaner interfaces that helps with maintanance 2. Both parts can be tested in separation, I have added lit tests for both pieces. We can finally test our pipeliner infrastructure in manageable chunks 3. It opens up opportunity to expose per-op "latencies" to the frontend, enabling creating user-defined schedules right from the language level Next step in the cleanup process is to clearly separate lowering and pipelining phases. Update Update Update Update Update Update Update Update
…-lang#5009) Allows for upcasting in DotOp encoding in RF. This lowering path is not currently in use; pending triton-lang#5003 (cherry picked from commit cfddb09)
Two bugfixes following triton-lang#5009. - When `BLOCK_M=64` and `num_warps > 4`, the order of warps for DotOpEncoded tensor should be M-major instead of N-major, since WGMMA expects the 4 warps in each warp group to be stacked along the M dimension. - Should use `mmaBitwidth` instead of `bitwidth` when calculating `numRep` in `SharedToDotOperandMMAv2OrV3`. This was missed in a bad rebase. @lezcano I encountered these bugs when attempting to locally test the [DotOp hoisting PR](triton-lang#5003) after rebasing (they normally would be caught by `test_core.py` but that path was not yet enabled in the last PR). With these fixes added, I was able to successfully validate against pytorch. (cherry picked from commit e82dfd9)
…-lang#5009) Allows for upcasting in DotOp encoding in RF. This lowering path is not currently in use; pending triton-lang#5003 (cherry picked from commit cfddb09)
Two bugfixes following triton-lang#5009. - When `BLOCK_M=64` and `num_warps > 4`, the order of warps for DotOpEncoded tensor should be M-major instead of N-major, since WGMMA expects the 4 warps in each warp group to be stacked along the M dimension. - Should use `mmaBitwidth` instead of `bitwidth` when calculating `numRep` in `SharedToDotOperandMMAv2OrV3`. This was missed in a bad rebase. @lezcano I encountered these bugs when attempting to locally test the [DotOp hoisting PR](triton-lang#5003) after rebasing (they normally would be caught by `test_core.py` but that path was not yet enabled in the last PR). With these fixes added, I was able to successfully validate against pytorch. (cherry picked from commit e82dfd9)
Two bugfixes following triton-lang#5009. - When `BLOCK_M=64` and `num_warps > 4`, the order of warps for DotOpEncoded tensor should be M-major instead of N-major, since WGMMA expects the 4 warps in each warp group to be stacked along the M dimension. - Should use `mmaBitwidth` instead of `bitwidth` when calculating `numRep` in `SharedToDotOperandMMAv2OrV3`. This was missed in a bad rebase. @lezcano I encountered these bugs when attempting to locally test the [DotOp hoisting PR](triton-lang#5003) after rebasing (they normally would be caught by `test_core.py` but that path was not yet enabled in the last PR). With these fixes added, I was able to successfully validate against pytorch. (cherry picked from commit e82dfd9) (cherry picked from commit 5287a68)
* [BACKEND][NVIDIA] Add Lowering for Shared-to-MMAv3-DotOp Copy (triton-lang#5009) Allows for upcasting in DotOp encoding in RF. This lowering path is not currently in use; pending triton-lang#5003 (cherry picked from commit cfddb09) * [AMD] Add initial support for scaled_dot(mxfp8, fp8) (triton-lang#4994) This commit adds initial support for scaled_dot with mxfp8 LHS and fp8 RHS. It supports both mfma32 and mfma16 intrinsic variants. Right now we are missing software emulation for `Float8E4M3FN` type, so this only enables for `Float8E5M2`. (cherry picked from commit 3549db8) * [Frontend][Backend] Implement support for scale_dot(-, bf16) (triton-lang#4996) In the passing we also improve a few other things: - Now `scaled_dot` accepts both uint8/uint16 fp8/bf16 as inputs (before you had to cast it to uint8, which was weird when extending it to bf16). - Add `scaled_dot` to the docs and improve the docs overall (have not render them, might need a few further tweaks) (cherry picked from commit 23c9ec1) * [BACKEND] Improve detection of register to register conversion (triton-lang#4991) Specifically, it fixes problems when `srcLayout` and `dstLayout` have different number of registers but the same number of not free registers. We solved the problem by padding free registers to either `srcLayout` or `dstLayout`, but this can be improved by fixing the `invertAndCompose` function. (cherry picked from commit 15c5e55) * [BACKEND] Replace `isMmaToDotShortcut` with linear layout based logic (triton-lang#4951) This PR removes the legacy `isMmaToDotShortcut` and its associated shortcut conversion. (cherry picked from commit 1d5fdfe) * [BACKEND]Fix DotOperand(Ampere) LinearLayoutConversion (triton-lang#5038) We also clean a bit `TritonGPU/IR/Dialect.cpp` using some auxiliary functions to make the intentions a bit clearer. We add a few asserts in the `LinearLayoutConversion` to make sure it's clear why we do certain things here and there. We also kill `getCvtOrder`, as it was not used anywhere (cherry picked from commit 56584c4) * [BACKEND] Fix uses of getOrder(DotOperand(Nvidia) and MMA(Nvidia)) (triton-lang#5055) We use `getOrder` very liberally throughout the codebase, when we really meant to use `getThreadOrder`. This is an issue with the input layout is an `DotOperand(mma(opIdx=1))`, where the thread order and the matrix order are opposite. Found this to be an issue when a PR changed the `getOrder` of `DotOperand(Hopper)` to an incorrect one and CI still passed! The issue here is that the LLVM lowering for wgmma and the LinearLayout does not use `getOrder`, but there are many other subsystems do, and many heuristics would be getting an incorrect order, and potentially be disabled. This is particularly problematic for `DotOperand(opIdx=1)` in nvidia hardware, as `getThreadOrder` and `getOrder` are different! While doing so we: - Audit most (all?) the calls to `getOrder(dotOperand)`. It turns out that most of them really meant `getThreadOrder` - Fix the ordering methods of `SliceEncodingAttr` to be consistent - Move the implementation of `getWarpOrder` to the Attr classes, because of OOP The test strategy was to add `llvm::report_fatal_error("Testing");` within `getOrder(nvidiaMma)` and `getOrder(DotOperand(nvidiaMma))` and triaging all errors that were raised in CI. (cherry picked from commit 38a11b8) * [AMD] Reland instruction scheduling hint changes (triton-lang#4940) This commit relands triton-lang#4819 with the following fixes: * Changed to a better way to mark opIdx for loads * Replaced temlate-based `rewindUnaryOps` to use regular for-loops. The new way is more robust and can handle other unary ops automatically. * Replaced `instr.sched.barriers` using the ones from `rocdl` dialect from the MLIR upstream * Extended lit tests (cherry picked from commit ee5876c) * [AMD] Enable scaled_dot(-, bf16) (triton-lang#5029) (cherry picked from commit f062540) * [AMD] Add support for scaled_dot(mxfp4, -) (triton-lang#5034) This commit adds support for mxfp4 typed A tensor for sacled dot in the AMD backend. We moved the `convertMxfp4x2ToBf16x2` impl from NVIDIA side to a common path to reuse. (cherry picked from commit edc5c5c) * [BACKEND] Minor Bugfixes for SharedToDotOperand MMAv3 (triton-lang#5030) Two bugfixes following triton-lang#5009. - When `BLOCK_M=64` and `num_warps > 4`, the order of warps for DotOpEncoded tensor should be M-major instead of N-major, since WGMMA expects the 4 warps in each warp group to be stacked along the M dimension. - Should use `mmaBitwidth` instead of `bitwidth` when calculating `numRep` in `SharedToDotOperandMMAv2OrV3`. This was missed in a bad rebase. @lezcano I encountered these bugs when attempting to locally test the [DotOp hoisting PR](triton-lang#5003) after rebasing (they normally would be caught by `test_core.py` but that path was not yet enabled in the last PR). With these fixes added, I was able to successfully validate against pytorch. (cherry picked from commit e82dfd9) (cherry picked from commit 5287a68) * [BACKEND] Get rid of unpack/pack I32 (triton-lang#5044) - Removed functions related to unpacking and packing I32 values. - Updated utilities to handle conversion of mxfp4 values without packing/unpacking I32. - Move the register value ordering logic from the element-wise operation lowering to the dot operation lowering. - Use linear layout to handle conversions between almost all distributed layouts. - Clean up data loading and mma computation involving `repN`, `repK`, and `repM`. (cherry picked from commit 1cf7b1b) (cherry picked from commit 376fe7e) * Consolidate `getOrder` as "element order" and implement `getRepOrder` for general and NVIDIA layouts (triton-lang#5089) This partially reverts commit 38a11b8. Supersedes triton-lang#5085 It also documents that we are implicitly choosing a way to tile a full tensor depending on the layout. See triton-lang#5085 (comment) (cherry picked from commit 57643b3) (cherry picked from commit ffb2032) --------- Co-authored-by: Gary Geng <[email protected]> Co-authored-by: Lei Zhang <[email protected]> Co-authored-by: Mario Lezcano Casado <[email protected]> Co-authored-by: Keren Zhou <[email protected]> Co-authored-by: ravil-mobile <[email protected]>
Hopper has two kinds of WGMMAs, "SS" (both operands in shmem) and "RS" (LHS operand A in registers).
In cases where we apply elementwise operations on A before WGMMA, Triton previously will copy A from global memory (GMEM) into registers (RF), perform the elementwise ops, and then copy to shared memory (SMEM) to perform SS WGMMA.
This PR adds an optimization for the case above to use RS GEMM. This requires the following changes:
NOTE: This may not see perf gain, and may even see perf loss, for certain shapes (e.g. small-K), and additional optimizations are in a separate PR (still more optimizations are WIP). Please advise on the merging strategy.