
[Triton-MLIR] Remaining issues in migration #673

Closed · 45 of 64 tasks
goostavz opened this issue Sep 19, 2022 · 12 comments

goostavz (Collaborator) commented Sep 19, 2022

Updated 2022-11-19: remaining issues from the backend's point of view

  • Dot
    • correctness of mma v2 when a/b is not in the default order @Superjomn
    • correctness of mma v2 when a/b is transposed @Jokeren @ptillet
    • correctness of mma v1 @Superjomn
    • correctness with prefetch pass @Jokeren
    • correctness with fmadot in several cases/environments @Jokeren
    • AllowTF32 is associated with ops rather than layouts (for FMADot) @Jokeren (see the allow_tf32 sketch after this list)
      • Pass information from the frontend
      • Handle AllowTF32 in the backend
    • test_gemm[32-32-64-4-32-32-32-False-False] and test_gemm(*[16, 16, 64, 4, 16, 16, 16, False, False]) fail with the pipeline pass on.
    • test_gemm[16-16-64-4-8-8-8-False-False] fails whether the pipeline pass is on or off
      • fallback of insert_slice_async @Jokeren
      • correctness of dot & layout conversion when threads are wrapped (on hold due to low priority)
    • insert_slice_async when other is not zero @Jokeren (postponed until after the merge)
    • convert_layout mma -> dot_operand when register reuse is not possible @Jokeren
  • Pipeline
    • A bug in collectDeps(yieldOp->getOperand(arg.getArgNumber() - 1)) when arg.getArgNumber() is 0. @Jokeren
  • Atomic
    • AtomicRMW @donproc
    • Fix unexpected IR in the optimizer (accum of warpsPerThread is not 32) @Jokeren
    • AtomicCAS @donproc
    • frontend bug (control flow related) in the atomicCAS tests in test_core.py, which also blocks the layernorm tutorial.
    • make optimizer passes such as coalesce properly support atomics @ptillet
  • Reduce
    • On the backend side there is still one extra barrier compared with the legacy code or a manual reduction. Performance results suggest this is fine, so it is not planned for the moment.
    • gpucombine optimization in case of multiple reductions (softmax & layer norm) @ptillet
    • argmin and argmax (ongoing, @qliu93)
  • others
    • value cache in emitIndices() @goostavz (will continue after switching the branch)
  • Performance regression
    • mma v2 w/ and w/o a/b transposed @Jokeren (could be postponed until after the merge)
      • epilogue fp16 conversion
      • epilogue shared memory conflicts
      • register spills
    • mma v1
    • fma
    • dot with epilogue fusion
    • memory bound patterns with atomic (layernorm)
    • attention
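
As context for the AllowTF32 item above, a minimal sketch of how the flag travels from the frontend: the allow_tf32 keyword on tl.dot ends up as the {allowTF32 = true} attribute on tt.dot (as in the IR snippet later in this thread), which the backend then has to honor when choosing between mma and fma lowering. The kernel name, shapes, and launch code are illustrative, not taken from this issue:

 import torch
 import triton
 import triton.language as tl


 @triton.jit
 def dot_tf32_kernel(a_ptr, b_ptr, c_ptr, M: tl.constexpr, N: tl.constexpr, K: tl.constexpr):
     # Single program instance: load full MxK and KxN row-major tiles.
     offs_m = tl.arange(0, M)
     offs_n = tl.arange(0, N)
     offs_k = tl.arange(0, K)
     a = tl.load(a_ptr + offs_m[:, None] * K + offs_k[None, :])
     b = tl.load(b_ptr + offs_k[:, None] * N + offs_n[None, :])
     # allow_tf32 is attached to the dot op by the frontend and should appear
     # as {allowTF32 = true} on tt.dot; the backend decides how to honor it.
     c = tl.dot(a, b, allow_tf32=True)
     tl.store(c_ptr + offs_m[:, None] * N + offs_n[None, :], c)


 a = torch.randn((16, 16), device="cuda")
 b = torch.randn((16, 16), device="cuda")
 c = torch.empty((16, 16), device="cuda")
 dot_tf32_kernel[(1,)](a, b, c, M=16, N=16, K=16)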

TritonGPUToLLVM Pass:

Basic features

  • Index calculation for blocked_layout
    • more UTs for corner-case verification: higher ranks, reversed order, etc.
  • Basic op support: Load/Store, GEP, Splat, Constant, Elementwise, Broadcast
  • VecAdd correctness verified in python e2e flow
  • add scf support in backend
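
For reference, the VecAdd e2e check above corresponds to a kernel along these lines (a minimal sketch following the standard Triton vector-add tutorial; names and the block size are illustrative). It exercises Splat, GEP, Elementwise, Broadcast, and masked Load/Store lowering in the new pass:

 import torch
 import triton
 import triton.language as tl


 @triton.jit
 def vec_add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
     pid = tl.program_id(axis=0)
     offsets = pid * BLOCK + tl.arange(0, BLOCK)    # splat + elementwise add
     mask = offsets < n_elements                    # elementwise compare
     x = tl.load(x_ptr + offsets, mask=mask)        # GEP + masked load
     y = tl.load(y_ptr + offsets, mask=mask)
     tl.store(out_ptr + offsets, x + y, mask=mask)  # masked store


 n = 98432
 x = torch.rand(n, device="cuda")
 y = torch.rand(n, device="cuda")
 out = torch.empty_like(x)
 grid = lambda meta: (triton.cdiv(n, meta["BLOCK"]),)
 vec_add_kernel[grid](x, y, out, n, BLOCK=1024)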

Load/StoreOp related:

  • Refactoring of LoadOp with PtxInstr abstraction
  • vectorization support with AxisInfo
    • mask support in load/store vectorization
  • gep + load/store fold optimization (check whether ptxas already does this first: https://github.com/openai/triton-mlir/discussions/76)
  • verification of L1 eviction policy for load/store (lower)
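
As a concrete target for the vectorization items above, a sketch of a copy kernel whose pointer arithmetic AxisInfo can analyze; the eviction_policy keyword is my assumption about the then-current frontend and may not match it exactly:

 import torch
 import triton
 import triton.language as tl


 @triton.jit
 def copy_kernel(src_ptr, dst_ptr, n_elements, BLOCK: tl.constexpr):
     pid = tl.program_id(axis=0)
     # Contiguous, aligned offsets: AxisInfo can prove contiguity/divisibility here,
     # which is what allows the load/store to be vectorized (e.g. 128-bit accesses).
     offsets = pid * BLOCK + tl.arange(0, BLOCK)
     mask = offsets < n_elements  # the mask must also be handled by the vectorizer
     x = tl.load(src_ptr + offsets, mask=mask, other=0.0,
                 eviction_policy="evict_last")  # assumed keyword for the L1 eviction hint
     tl.store(dst_ptr + offsets, x, mask=mask)


 src = torch.rand(1 << 20, device="cuda")
 dst = torch.empty_like(src)
 copy_kernel[(triton.cdiv(src.numel(), 1024),)](src, dst, src.numel(), BLOCK=1024)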

Layout Conversion Related

  • Shared memory initialization in TritonGPUToLLVM from the results of Allocation/Alias
  • ConvertLayoutOp support (higher priority)
    • blocked -> blocked
    • blocked -> shared
    • shared -> blocked (low priority)
    • shared -> mma
    • mma -> blocked
  • sliced_layout & transpose kernel
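
For the sliced_layout / transpose item, a sketch of the kind of kernel involved (illustrative names, no boundary masks, M and N assumed divisible by the block sizes). The load and the store want different layouts, so lowering it goes through the layout-conversion paths listed above:

 import torch
 import triton
 import triton.language as tl


 @triton.jit
 def transpose_kernel(x_ptr, y_ptr, M, N, BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
     pid_m = tl.program_id(0)
     pid_n = tl.program_id(1)
     offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
     offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
     # Row-major load of a BLOCK_M x BLOCK_N tile of x (shape M x N) ...
     tile = tl.load(x_ptr + offs_m[:, None] * N + offs_n[None, :])
     # ... stored with swapped strides into y (shape N x M). The mismatch between
     # a coalesced load layout and a coalesced store layout is what the
     # blocked/shared conversions above have to handle efficiently.
     tl.store(y_ptr + offs_n[None, :] * M + offs_m[:, None], tile)


 x = torch.rand((1024, 512), device="cuda")
 y = torch.empty((512, 1024), device="cuda")
 transpose_kernel[(1024 // 64, 512 // 64)](x, y, 1024, 512, BLOCK_M=64, BLOCK_N=64)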

MMA related:

Basic Dot codegen

  • Codegen for mma dot
    • correctness verification of fp16 mma
    • correctness verification of int8 mma

Dot with n_buffer/prefetch optimization

  • alloc_tensor/insert_slice_async/extract_slice support
  • e2e of optimized gemm with double_buffer/N_buffer
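
A sketch of the end-to-end target for this section: a tiled matmul whose K loop the pipeline pass can turn into the double-/N-buffered form (alloc_tensor / insert_slice_async / extract_slice). Shapes, names, and the row-major stride assumptions are illustrative, and boundary masks are omitted:

 import torch
 import triton
 import triton.language as tl


 @triton.jit
 def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                   BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
     pid_m = tl.program_id(0)
     pid_n = tl.program_id(1)
     offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
     offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
     offs_k = tl.arange(0, BLOCK_K)
     a_ptrs = a_ptr + offs_m[:, None] * K + offs_k[None, :]
     b_ptrs = b_ptr + offs_k[:, None] * N + offs_n[None, :]
     acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
     for _ in range(0, K, BLOCK_K):
         a = tl.load(a_ptrs)   # tile loads that the pipeline pass turns into
         b = tl.load(b_ptrs)   # insert_slice_async / double-buffered copies
         acc += tl.dot(a, b)   # lowered to mma (v1/v2) or fma depending on arch/dtype
         a_ptrs += BLOCK_K
         b_ptrs += BLOCK_K * N
     tl.store(c_ptr + offs_m[:, None] * N + offs_n[None, :], acc)


 M, N, K = 256, 256, 128
 a = torch.randn((M, K), device="cuda", dtype=torch.float16)
 b = torch.randn((K, N), device="cuda", dtype=torch.float16)
 c = torch.empty((M, N), device="cuda", dtype=torch.float32)
 # num_stages > 1 is what exercises the software-pipelining (N-buffer) path.
 matmul_kernel[(M // 64, N // 64)](a, b, c, M, N, K,
                                   BLOCK_M=64, BLOCK_N=64, BLOCK_K=32,
                                   num_stages=3, num_warps=4)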

Completeness of op coverage

  • Completeness of Elementwise Ops
  • Reduce Ops
  • Atomic Ops (ongoing, @dongdong)
  • Handle bfloat16 in test_core.py
  • Handle % in test_core.py (semantics of frem changed between NVPTX11 and NVPTX14; Triton no longer behaves correctly).
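
For the atomic-coverage items, a sketch combining an AtomicRMW (tl.atomic_add scatter into a histogram) with the AtomicCAS spin-lock pattern used by the layernorm tutorial, i.e. the control-flow shape behind the frontend bug mentioned earlier. Names and sizes are illustrative:

 import torch
 import triton
 import triton.language as tl


 @triton.jit
 def atomic_demo_kernel(x_ptr, hist_ptr, lock_ptr, count_ptr, n_elements, BLOCK: tl.constexpr):
     pid = tl.program_id(axis=0)
     offsets = pid * BLOCK + tl.arange(0, BLOCK)
     mask = offsets < n_elements
     x = tl.load(x_ptr + offsets, mask=mask, other=0)
     ones = tl.zeros([BLOCK], dtype=tl.int32) + 1
     # AtomicRMW: scatter-add into a shared histogram.
     tl.atomic_add(hist_ptr + x, ones, mask=mask)
     # AtomicCAS as a spin lock (same control-flow pattern as the layernorm tutorial).
     while tl.atomic_cas(lock_ptr, 0, 1) == 1:
         pass
     count = tl.load(count_ptr)
     tl.store(count_ptr, count + 1)  # critical section: count launched programs
     tl.atomic_xchg(lock_ptr, 0)     # release the lock


 n = 4096
 x = torch.randint(0, 128, (n,), device="cuda", dtype=torch.int32)
 hist = torch.zeros(128, device="cuda", dtype=torch.int32)
 lock = torch.zeros(1, device="cuda", dtype=torch.int32)
 count = torch.zeros(1, device="cuda", dtype=torch.int32)
 atomic_demo_kernel[(triton.cdiv(n, 1024),)](x, hist, lock, count, n, BLOCK=1024)
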
Jokeren (Contributor) commented Sep 19, 2022

FYI, I plan to take a look at shared → mma this week. Hope it doesn't conflict with your plan.

goostavz (Collaborator, Author) commented Sep 19, 2022

> FYI, I plan to take a look at shared → mma this week. Hope it doesn't conflict with your plan.

Chunwei is now working on the current "mma codegen" (including loading A/B from shared memory and the mma itself; in the current mma IR, mma takes its inputs from the shared layout, which will be decomposed into convert_layout(shared->mma) + mma(mma in / mma out) in the future). Support for convert_layout(shared->mma) should just be some refactoring of the code Chunwei is developing now. So in my understanding there is some overlap between your plan and the work Chunwei is doing. @Jokeren @chunwei

BTW, I noticed in https://github.com/openai/triton-mlir/discussions/107 that the IR is made of load + convert_layout(blocked->shared); in my understanding we need to convert this into tensor.insert_slice_async in order to support LDGSTS. Is this already done? We may need to support convert_layout(blocked->shared) if this will not be done in the short term; otherwise, I think supporting tensor.insert_slice_async -> LDGSTS should be the higher priority.

goostavz (Collaborator, Author) commented:

> FYI, I plan to take a look at shared → mma this week. Hope it doesn't conflict with your plan.

BTW, I need to confirm first: do you mean support for convert_layout(shared->mma) in TritonGPUToLLVM, or decomposing dot(shared->mma) into convert_layout(shared->mma) + dot(mma->mma) in the optimizer?

Superjomn (Collaborator) commented:

> FYI, I plan to take a look at shared → mma this week. Hope it doesn't conflict with your plan.

Yeah, I am porting the dot codegen to MLIR, and here is the WIP PR. I've almost finished the mma16816-related logic; mma884 and fmadot are still WIP.

According to the discussion:

 %83 = triton_gpu.convert_layout %81 : (tensor<64x64xf32, #blocked1>) -> tensor<64x64xf32, #shared>
 %84 = triton_gpu.convert_layout %82 : (tensor<64x64xf32, #blocked1>) -> tensor<64x64xf32, #shared>
 %85 = tt.dot %83, %84, %cst_1 {allowTF32 = true} : tensor<64x64xf32, #shared> * tensor<64x64xf32, #shared> -> tensor<64x64xf32, #mma>

I plan to port the original visit_dot_instr to mlir. This function includes both the A&B loading from smem to registers and the codegen of MMA instruction.
In my understanding, the former is the main logic of convert_layout(shared -> MMA).
Please correct me if I am wrong.

@Jokeren

ptillet (Collaborator) commented Sep 19, 2022

Let me summarize the plans, just for the sake of clarity:

On our side:

  • @daadaada is working on https://github.com/openai/triton-mlir/discussions/104, that is: (1) adding MatmulDotOperand; (2) changing the input layout requirements of mma_inst to take MatmulDotOperand<OutputLayout> as inputs; (3) decomposing the mma instructions in pipelined matmul loops into two mma instructions, and computing + yielding the operands of the first one at the end of the previous iteration.
  • @ptillet is still working on the runtime. The new runtime is merged into the master branch, and now I will backport it to the triton-mlir branch. This will make merging triton-mlir into master a breeze when the time comes. After this is done, I will start working on getting all non-matmul/non-reductions unit tests to pass.
  • @Jokeren was working last week on investigating bufferization. This is a longer term plan, but I'm happy we were able to isolate the potential problems. For now getting matmuls to work is probably the highest priority (on our side we'd like to get it working by October 1st), so I'm sure he'd also be happy to assist with any help you need. Maybe the reduction codegen would be better than our initial plan? @Jokeren is already familiar with visit_reduce_inst in the master branch

On your side:

  • @Superjomn is working on matmul codegen.
  • @goostavz Since the PR for blocked -> blocked conversion has been merged (🥳) and there are no more PRs open, I don't really know the plan.

yangjunpro (Collaborator) commented Sep 19, 2022

@ptillet For reduction codegen, another NVIDIA colleague is working on that; he should be able to submit the PR in one or two weeks, which is why it is marked as "ongoing" in the original post. @goostavz knows the detailed plan better than I do, since he is helping coordinate the related work on my team.

Jokeren (Contributor) commented Sep 19, 2022

> > FYI, I plan to take a look at shared → mma this week. Hope it doesn't conflict with your plan.
>
> Yeah, I am porting the dot codegen to MLIR, and here is the WIP PR. I've almost finished the mma16816-related logic; mma884 and fmadot are still WIP.
>
> According to the discussion:
>
>  %83 = triton_gpu.convert_layout %81 : (tensor<64x64xf32, #blocked1>) -> tensor<64x64xf32, #shared>
>  %84 = triton_gpu.convert_layout %82 : (tensor<64x64xf32, #blocked1>) -> tensor<64x64xf32, #shared>
>  %85 = tt.dot %83, %84, %cst_1 {allowTF32 = true} : tensor<64x64xf32, #shared> * tensor<64x64xf32, #shared> -> tensor<64x64xf32, #mma>
>
> I plan to port the original visit_dot_instr to mlir. This function includes both the A&B loading from smem to registers and the codegen of MMA instruction. In my understanding, the former is the main logic of convert_layout(shared -> MMA). Please correct me if I am wrong.
>
> @Jokeren

@Superjomn Thanks for the heads up. Yeah, I believe you are right. I'll go read your code for now without applying changes.

To accommodate the prefetch optimization, we will take the shared-memory load logic out of triton.dot, but the code generation part should be mostly the same.

goostavz (Collaborator, Author) commented Sep 20, 2022

> Let me summarize the plans, just for the sake of clarity:
>
> On our side:
>
>   • @daadaada is working on pre-fetching shared memory into registers triton-mlir#104, that is: (1) adding MatmulDotOperand; (2) changing the input layout requirements of mma_inst to take MatmulDotOperand<OutputLayout> as inputs; (3) decomposing the mma instructions in pipelined matmul loops into two mma instructions, and computing + yielding the operands of the first one at the end of the previous iteration.
>   • @ptillet is still working on the runtime. The new runtime is merged into the master branch, and now I will backport it to the triton-mlir branch. This will make merging triton-mlir into master a breeze when the time comes. After this is done, I will start working on getting all non-matmul/non-reductions unit tests to pass.
>   • @Jokeren was working last week on investigating bufferization. This is a longer term plan, but I'm happy we were able to isolate the potential problems. For now getting matmuls to work is probably the highest priority (on our side we'd like to get it working by October 1st), so I'm sure he'd also be happy to assist with any help you need. Maybe the reduction codegen would be better than our initial plan? @Jokeren is already familiar with visit_reduce_inst in the master branch
>
> On your side:
>
>   • @Superjomn is working on matmul codegen.
>   • @goostavz Since the PR for blocked -> blocked conversion has been merged (🥳) and there are no more PRs open, I don't really know the plan.

I'm working on the rest of the layout conversions now. I think these two are necessary for our next milestone (basic gemm without the nbuffer/prefetch optimization):
(1) blocked->shared
(2) mma->blocked
In my understanding (1) will not be that important in the longer term (it should be replaced with tensor.insert_slice_async to support LDGSTS), but it is necessary for the moment.
Also, scf lowering in the backend is needed, though I don't expect it to take much actual effort.

Considering how Keren can best help us: the first choice I can think of is (1) to guarantee that the pipeline/prefetch optimization in the optimizer (including allocation) is OK. If you still have free bandwidth, (2) the codegen of tensor.slice/extract is another choice. And (3) layout_conversion(blocked -> shared) is a third choice, which I haven't actually started yet. (3) will require heavy co-debugging with the dot codegen, so I personally think (2) is a better choice than (3). What do you guys think?

Jokeren (Contributor) commented Sep 20, 2022

> (2) mma->blocked

@goostavz How is this related to LDGSTS?

> (1) to guarantee that the pipeline/prefetch optimization in the optimizer (including allocation) is OK

I think @daadaada will take care of the prefetch optimization.

> (2) the codegen of tensor.slice/extract is another choice.

Sounds good. tensor.extract_slice is straightforward but will eventually be replaced by memref.subview. Since tensor.insert_slice_async will use the async modifier in the load operation, I think a bunch of code will be similar to LoadOpConversion.

goostavz (Collaborator, Author) commented Sep 21, 2022

> @goostavz How is this related to LDGSTS?

Sorry, that was a typo. (1), not (2), is related to LDGSTS.

In this thread, loading the A/B tiles into shared memory is represented via a load plus a convert_layout(blocked->shared). In my understanding this should be replaced with tensor.insert_slice_async in order to support LDGSTS. Please correct me if I'm wrong.

ptillet (Collaborator) commented Sep 21, 2022

Yes, there are two ways to write the matmul loop so that the tiles of A and B end up being loaded from DRAM to shared memory:

  1. load + convert_layout(blocked->shared). Necessary for V100.
  2. tensor.insert_slice_async. Eventually gets lowered to LDGSTS.

On architectures that have LDGSTS, the pipeline pass converts method 1 into method 2.

Jokeren pinned this issue Nov 19, 2022
goostavz changed the title from [Triton-MLIR][Backend] TritonGPUToLLVM to [Triton-MLIR] Remaining issues in migration Nov 28, 2022
yangjunpro unpinned this issue Feb 5, 2023
ptillet (Collaborator) commented Feb 23, 2023

Closing this as the MLIR rewrite is officially complete

ptillet closed this as completed on Feb 23, 2023
brunomazzottiamd pushed a commit to brunomazzottiamd/triton that referenced this issue Jan 29, 2025