
[Triton-MLIR] Remaining issues in migration #673

Closed · 45 of 64 tasks
goostavz opened this issue Sep 19, 2022 · 12 comments

goostavz (Collaborator) commented Sep 19, 2022

Updated 2022-11-19: remaining issues from the backend's point of view

  • Dot
    • correctness of mma v2 when a/b is not in the default order @Superjomn
    • correctness of mma v2 when a/b is transposed @Jokeren @ptillet
    • correctness of mma v1 @Superjomn
    • correctness with prefetch pass @Jokeren
    • correctness with fmadot in several cases/environments @Jokeren
    • AllowTF32 is associated with ops rather than layouts (for FMADot) @Jokeren (see the allow_tf32 sketch after this list)
      • Pass information from the frontend
      • Handle AllowTF32 in the backend
    • test_gemm[32-32-64-4-32-32-32-False-False] and test_gemm(*[16, 16, 64, 4, 16, 16, 16, False, False]) fail with the pipeline pass on.
    • test_gemm[16-16-64-4-8-8-8-False-False] fails whether the pipeline pass is on or off
      • fallback of insert_slice_async @Jokeren
      • correctness of dot & layout conversion when threads are wrapped (on hold due to low priority)
    • insert_slice_async when other is not zero @Jokeren (postponed until after the merge)
    • convert_layout mma -> dot_operand when register reuse is not possible @Jokeren
  • Pipeline
    • A bug in collectDeps(yieldOp->getOperand(arg.getArgNumber() - 1)) when arg.getArgNumber() is 0. @Jokeren
  • Atomic
    • AtomicRMW @donproc
    • Fix unexpected IR in the optimizer (accum of warpsPerThread is not 32) @Jokeren
    • AtomicCAS @donproc
    • frontend bug (control flow related) in the atomicCAS tests in test_core.py, which also blocks the layernorm tutorial.
    • make optimizer passes such as coalesce properly support atomics @ptillet
  • Reduce
    • On the backend side there is still one extra barrier compared with the legacy code or a manual reduction. Performance results suggest this is fine, so it is not planned for the moment.
    • gpucombine optimization in case of multiple reductions (softmax & layer norm) @ptillet
    • argmin and argmax (ongoing, @qliu93)
  • others
    • value cache in emitIndices() @goostavz (will continue after switching the branch)
  • Performance regression
    • mma v2 w/ and w/o a/b transposed @Jokeren (could be postponed until after the merge)
      • epilogue fp16 conversion
      • epilogue shared memory conflicts
      • register spills
    • mma v1
    • fma
    • dot with epilogue fusion
    • memory bound patterns with atomic (layernorm)
    • attention
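
As context for the AllowTF32 item above, a minimal sketch of how the flag travels from the frontend: the allow_tf32 keyword on tl.dot ends up as the {allowTF32 = true} attribute on tt.dot (as in the IR snippet later in this thread), which the backend then has to honor when choosing between mma and fma lowering. The kernel name, shapes, and launch code are illustrative, not taken from this issue:

 import torch
 import triton
 import triton.language as tl


 @triton.jit
 def dot_tf32_kernel(a_ptr, b_ptr, c_ptr, M: tl.constexpr, N: tl.constexpr, K: tl.constexpr):
     # Single program instance: load full MxK and KxN row-major tiles.
     offs_m = tl.arange(0, M)
     offs_n = tl.arange(0, N)
     offs_k = tl.arange(0, K)
     a = tl.load(a_ptr + offs_m[:, None] * K + offs_k[None, :])
     b = tl.load(b_ptr + offs_k[:, None] * N + offs_n[None, :])
     # allow_tf32 is attached to the dot op by the frontend and should appear
     # as {allowTF32 = true} on tt.dot; the backend decides how to honor it.
     c = tl.dot(a, b, allow_tf32=True)
     tl.store(c_ptr + offs_m[:, None] * N + offs_n[None, :], c)


 a = torch.randn((16, 16), device="cuda")
 b = torch.randn((16, 16), device="cuda")
 c = torch.empty((16, 16), device="cuda")
 dot_tf32_kernel[(1,)](a, b, c, M=16, N=16, K=16)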

TritonGPUToLLVM Pass:

Basic features

  • Index calculation for blocked_layout
    • more UTs for corner-case verification: higher ranks, reversed order, etc.
  • Basic op support: Load/Store, GEP, Splat, Constant, Elementwise, Broadcast
  • VecAdd correctness verified in python e2e flow
  • add scf support in backend
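
For reference, the VecAdd e2e check above corresponds to a kernel along these lines (a minimal sketch following the standard Triton vector-add tutorial; names and the block size are illustrative). It exercises Splat, GEP, Elementwise, Broadcast, and masked Load/Store lowering in the new pass:

 import torch
 import triton
 import triton.language as tl


 @triton.jit
 def vec_add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
     pid = tl.program_id(axis=0)
     offsets = pid * BLOCK + tl.arange(0, BLOCK)    # splat + elementwise add
     mask = offsets < n_elements                    # elementwise compare
     x = tl.load(x_ptr + offsets, mask=mask)        # GEP + masked load
     y = tl.load(y_ptr + offsets, mask=mask)
     tl.store(out_ptr + offsets, x + y, mask=mask)  # masked store


 n = 98432
 x = torch.rand(n, device="cuda")
 y = torch.rand(n, device="cuda")
 out = torch.empty_like(x)
 grid = lambda meta: (triton.cdiv(n, meta["BLOCK"]),)
 vec_add_kernel[grid](x, y, out, n, BLOCK=1024)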

Load/StoreOp related:

  • Refactoring of LoadOp with PtxInstr abstraction
  • vectorization support with AxisInfo
    • mask support in load/store vectorization
  • gep + load/store fold optimization (check whether ptxas already does this first: https://github.com/openai/triton-mlir/discussions/76)
  • verification of L1 eviction policy for load/store (lower)
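
As a concrete target for the vectorization items above, a sketch of a copy kernel whose pointer arithmetic AxisInfo can analyze; the eviction_policy keyword is my assumption about the then-current frontend and may not match it exactly:

 import torch
 import triton
 import triton.language as tl


 @triton.jit
 def copy_kernel(src_ptr, dst_ptr, n_elements, BLOCK: tl.constexpr):
     pid = tl.program_id(axis=0)
     # Contiguous, aligned offsets: AxisInfo can prove contiguity/divisibility here,
     # which is what allows the load/store to be vectorized (e.g. 128-bit accesses).
     offsets = pid * BLOCK + tl.arange(0, BLOCK)
     mask = offsets < n_elements  # the mask must also be handled by the vectorizer
     x = tl.load(src_ptr + offsets, mask=mask, other=0.0,
                 eviction_policy="evict_last")  # assumed keyword for the L1 eviction hint
     tl.store(dst_ptr + offsets, x, mask=mask)


 src = torch.rand(1 << 20, device="cuda")
 dst = torch.empty_like(src)
 copy_kernel[(triton.cdiv(src.numel(), 1024),)](src, dst, src.numel(), BLOCK=1024)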

Layout Conversion Related

  • Shared memory initialization in TritonGPUToLLVM from the results of Allocation/Alias
  • ConvertLayoutOp support (higher priority)
    • blocked -> blocked
    • blocked -> shared
    • shared -> blocked (low priority)
    • shared -> mma
    • mma -> blocked
  • sliced_layout & transpose kernel
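
For the sliced_layout / transpose item, a sketch of the kind of kernel involved (illustrative names, no boundary masks, M and N assumed divisible by the block sizes). The load and the store want different layouts, so lowering it goes through the layout-conversion paths listed above:

 import torch
 import triton
 import triton.language as tl


 @triton.jit
 def transpose_kernel(x_ptr, y_ptr, M, N, BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
     pid_m = tl.program_id(0)
     pid_n = tl.program_id(1)
     offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
     offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
     # Row-major load of a BLOCK_M x BLOCK_N tile of x (shape M x N) ...
     tile = tl.load(x_ptr + offs_m[:, None] * N + offs_n[None, :])
     # ... stored with swapped strides into y (shape N x M). The mismatch between
     # a coalesced load layout and a coalesced store layout is what the
     # blocked/shared conversions above have to handle efficiently.
     tl.store(y_ptr + offs_n[None, :] * M + offs_m[:, None], tile)


 x = torch.rand((1024, 512), device="cuda")
 y = torch.empty((512, 1024), device="cuda")
 transpose_kernel[(1024 // 64, 512 // 64)](x, y, 1024, 512, BLOCK_M=64, BLOCK_N=64)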

MMA related:

Basic Dot codegen

  • Codegen for mma dot
    • correctness verification of fp16 mma
    • correctness verification of int8 mma

Dot with n_buffer/prefetch optimization

  • alloc_tensor/insert_slice_async/extract_slice support
  • e2e of optimized gemm with double_buffer/N_buffer
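
A sketch of the end-to-end target for this section: a tiled matmul whose K loop the pipeline pass can turn into the double-/N-buffered form (alloc_tensor / insert_slice_async / extract_slice). Shapes, names, and the row-major stride assumptions are illustrative, and boundary masks are omitted:

 import torch
 import triton
 import triton.language as tl


 @triton.jit
 def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                   BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
     pid_m = tl.program_id(0)
     pid_n = tl.program_id(1)
     offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
     offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
     offs_k = tl.arange(0, BLOCK_K)
     a_ptrs = a_ptr + offs_m[:, None] * K + offs_k[None, :]
     b_ptrs = b_ptr + offs_k[:, None] * N + offs_n[None, :]
     acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
     for _ in range(0, K, BLOCK_K):
         a = tl.load(a_ptrs)   # tile loads that the pipeline pass turns into
         b = tl.load(b_ptrs)   # insert_slice_async / double-buffered copies
         acc += tl.dot(a, b)   # lowered to mma (v1/v2) or fma depending on arch/dtype
         a_ptrs += BLOCK_K
         b_ptrs += BLOCK_K * N
     tl.store(c_ptr + offs_m[:, None] * N + offs_n[None, :], acc)


 M, N, K = 256, 256, 128
 a = torch.randn((M, K), device="cuda", dtype=torch.float16)
 b = torch.randn((K, N), device="cuda", dtype=torch.float16)
 c = torch.empty((M, N), device="cuda", dtype=torch.float32)
 # num_stages > 1 is what exercises the software-pipelining (N-buffer) path.
 matmul_kernel[(M // 64, N // 64)](a, b, c, M, N, K,
                                   BLOCK_M=64, BLOCK_N=64, BLOCK_K=32,
                                   num_stages=3, num_warps=4)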

Completeness of op coverage

  • Completeness of Elementwise Ops
  • Reduce Ops
  • Atomic Ops (ongoing, @dongdong)
  • Handle bfloat16 in test_core.py
  • Handle % in test_core.py (semantics of frem changed between NVPTX11 and NVPTX14; Triton no longer behaves correctly).
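
For the atomic-coverage items, a sketch combining an AtomicRMW (tl.atomic_add scatter into a histogram) with the AtomicCAS spin-lock pattern used by the layernorm tutorial, i.e. the control-flow shape behind the frontend bug mentioned earlier. Names and sizes are illustrative:

 import torch
 import triton
 import triton.language as tl


 @triton.jit
 def atomic_demo_kernel(x_ptr, hist_ptr, lock_ptr, count_ptr, n_elements, BLOCK: tl.constexpr):
     pid = tl.program_id(axis=0)
     offsets = pid * BLOCK + tl.arange(0, BLOCK)
     mask = offsets < n_elements
     x = tl.load(x_ptr + offsets, mask=mask, other=0)
     ones = tl.zeros([BLOCK], dtype=tl.int32) + 1
     # AtomicRMW: scatter-add into a shared histogram.
     tl.atomic_add(hist_ptr + x, ones, mask=mask)
     # AtomicCAS as a spin lock (same control-flow pattern as the layernorm tutorial).
     while tl.atomic_cas(lock_ptr, 0, 1) == 1:
         pass
     count = tl.load(count_ptr)
     tl.store(count_ptr, count + 1)  # critical section: count launched programs
     tl.atomic_xchg(lock_ptr, 0)     # release the lock


 n = 4096
 x = torch.randint(0, 128, (n,), device="cuda", dtype=torch.int32)
 hist = torch.zeros(128, device="cuda", dtype=torch.int32)
 lock = torch.zeros(1, device="cuda", dtype=torch.int32)
 count = torch.zeros(1, device="cuda", dtype=torch.int32)
 atomic_demo_kernel[(triton.cdiv(n, 1024),)](x, hist, lock, count, n, BLOCK=1024)
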
Jokeren (Contributor) commented Sep 19, 2022

FYI, I plan to take a look at shared → mma this week. Hope it doesn't conflict with your plan.

goostavz (Collaborator, Author) commented Sep 19, 2022

> FYI, I plan to take a look at shared → mma this week. Hope it doesn't conflict with your plan.

Chunwei is now working on the current "mma codegen" (including loading A/B from shared memory and the mma itself; in the current mma IR, mma takes its inputs from the shared layout, which will be decomposed into convert_layout(shared->mma) + mma(mma in / mma out) in the future). Support for convert_layout(shared->mma) should just be some refactoring of the code Chunwei is developing now. So in my understanding there is some overlap between your plan and the work Chunwei is doing. @Jokeren @chunwei

BTW, I noticed in https://github.com/openai/triton-mlir/discussions/107 that the IR is made of load + convert_layout(blocked->shared); in my understanding we need to convert this into tensor.insert_slice_async in order to support LDGSTS. Is this already done? We may need to support convert_layout(blocked->shared) if this will not be done in the short term; otherwise, I think supporting tensor.insert_slice_async -> LDGSTS should be the higher priority.

goostavz (Collaborator, Author) commented:

> FYI, I plan to take a look at shared → mma this week. Hope it doesn't conflict with your plan.

BTW, I need to confirm first: do you mean support for convert_layout(shared->mma) in TritonGPUToLLVM, or decomposing dot(shared->mma) into convert_layout(shared->mma) + dot(mma->mma) in the optimizer?

Superjomn (Collaborator) commented:

> FYI, I plan to take a look at shared → mma this week. Hope it doesn't conflict with your plan.

Yeah, I am porting the dot codegen to MLIR, and here is the WIP PR. I've almost finished the mma16816-related logic; mma884 and fmadot are still WIP.

According to the discussion:

 %83 = triton_gpu.convert_layout %81 : (tensor<64x64xf32, #blocked1>) -> tensor<64x64xf32, #shared>
 %84 = triton_gpu.convert_layout %82 : (tensor<64x64xf32, #blocked1>) -> tensor<64x64xf32, #shared>
 %85 = tt.dot %83, %84, %cst_1 {allowTF32 = true} : tensor<64x64xf32, #shared> * tensor<64x64xf32, #shared> -> tensor<64x64xf32, #mma>

I plan to port the original visit_dot_instr to mlir. This function includes both the A&B loading from smem to registers and the codegen of MMA instruction.
In my understanding, the former is the main logic of convert_layout(shared -> MMA).
Please correct me if I am wrong.

@Jokeren

ptillet (Collaborator) commented Sep 19, 2022

Let me summarize the plans, just for the sake of clarity:

On our side:

  • @daadaada is working on https://github.com/openai/triton-mlir/discussions/104, that is: (1) adding MatmulDotOperand; (2) changing the input layout requirements of mma_inst to take MatmulDotOperand<OutputLayout> as inputs; (3) decomposing the mma instructions in pipelined matmul loops into two mma instructions, and computing + yielding the operands of the first one at the end of the previous iteration.
  • @ptillet is still working on the runtime. The new runtime is merged into the master branch, and now I will backport it to the triton-mlir branch. This will make merging triton-mlir into master a breeze when the time comes. After this is done, I will start working on getting all non-matmul/non-reductions unit tests to pass.
  • @Jokeren was working last week on investigating bufferization. This is a longer term plan, but I'm happy we were able to isolate the potential problems. For now getting matmuls to work is probably the highest priority (on our side we'd like to get it working by October 1st), so I'm sure he'd also be happy to assist with any help you need. Maybe the reduction codegen would be better than our initial plan? @Jokeren is already familiar with visit_reduce_inst in the master branch

On your side:

  • @Superjomn is working on matmul codegen.
  • @goostavz Since the PR for blocked -> blocked conversion has been merged (🥳) and there are no more PRs open, I don't really know the plan.

yangjunpro (Collaborator) commented Sep 19, 2022

@ptillet For reduction codegen, another NVIDIA colleague is working on that; he should be able to submit the PR in one or two weeks, which is why it is marked as "ongoing" in the original post. @goostavz knows the detailed plan better than I do, since he is helping coordinate the related work on my team.

Jokeren (Contributor) commented Sep 19, 2022

> > FYI, I plan to take a look at shared → mma this week. Hope it doesn't conflict with your plan.
>
> Yeah, I am porting the dot codegen to MLIR, and here is the WIP PR. I've almost finished the mma16816-related logic; mma884 and fmadot are still WIP.
>
> According to the discussion:
>
>  %83 = triton_gpu.convert_layout %81 : (tensor<64x64xf32, #blocked1>) -> tensor<64x64xf32, #shared>
>  %84 = triton_gpu.convert_layout %82 : (tensor<64x64xf32, #blocked1>) -> tensor<64x64xf32, #shared>
>  %85 = tt.dot %83, %84, %cst_1 {allowTF32 = true} : tensor<64x64xf32, #shared> * tensor<64x64xf32, #shared> -> tensor<64x64xf32, #mma>
>
> I plan to port the original visit_dot_instr to mlir. This function includes both the A&B loading from smem to registers and the codegen of MMA instruction. In my understanding, the former is the main logic of convert_layout(shared -> MMA). Please correct me if I am wrong.
>
> @Jokeren

@Superjomn Thanks for the heads up. Yeah, I believe you are right. I'll go read your code for now without applying changes.

To accommodate the prefetch optimization, we will take the shared-memory load logic out of triton.dot, but the code generation part should be mostly the same.

goostavz (Collaborator, Author) commented Sep 20, 2022

> Let me summarize the plans, just for the sake of clarity:
>
> On our side:
>
>   • @daadaada is working on pre-fetching shared memory into registers triton-mlir#104, that is: (1) adding MatmulDotOperand; (2) changing the input layout requirements of mma_inst to take MatmulDotOperand<OutputLayout> as inputs; (3) decomposing the mma instructions in pipelined matmul loops into two mma instructions, and computing + yielding the operands of the first one at the end of the previous iteration.
>   • @ptillet is still working on the runtime. The new runtime is merged into the master branch, and now I will backport it to the triton-mlir branch. This will make merging triton-mlir into master a breeze when the time comes. After this is done, I will start working on getting all non-matmul/non-reductions unit tests to pass.
>   • @Jokeren was working last week on investigating bufferization. This is a longer term plan, but I'm happy we were able to isolate the potential problems. For now getting matmuls to work is probably the highest priority (on our side we'd like to get it working by October 1st), so I'm sure he'd also be happy to assist with any help you need. Maybe the reduction codegen would be better than our initial plan? @Jokeren is already familiar with visit_reduce_inst in the master branch
>
> On your side:
>
>   • @Superjomn is working on matmul codegen.
>   • @goostavz Since the PR for blocked -> blocked conversion has been merged (🥳) and there are no more PRs open, I don't really know the plan.

I'm working on the rest of the layout conversions now. I think these two are necessary for our next milestone (basic gemm without the nbuffer/prefetch optimization):
(1) blocked->shared
(2) mma->blocked
In my understanding (1) will not be that important in the longer term (it should be replaced with tensor.insert_slice_async to support LDGSTS), but it is necessary for the moment.
Also, scf lowering in the backend is needed, though I don't expect it to take much actual effort.

Considering how Keren can best help us: the first choice I can think of is (1) to guarantee that the pipeline/prefetch optimization in the optimizer (including allocation) is OK. If you still have free bandwidth, (2) the codegen of tensor.slice/extract is another choice. And (3) layout_conversion(blocked -> shared) is a third choice, which I haven't actually started yet. (3) will require heavy co-debugging with the dot codegen, so I personally think (2) is a better choice than (3). What do you guys think?

Jokeren (Contributor) commented Sep 20, 2022

> (2) mma->blocked

@goostavz How is this related to LDGSTS?

> (1) to guarantee that the pipeline/prefetch optimization in the optimizer (including allocation) is OK

I think @daadaada will take care of the prefetch optimization.

> (2) the codegen of tensor.slice/extract is another choice.

Sounds good. tensor.extract_slice is straightforward but will eventually be replaced by memref.subview. Since tensor.insert_slice_async will use the async modifier in the load operation, I think a bunch of code will be similar to LoadOpConversion.

goostavz (Collaborator, Author) commented Sep 21, 2022

> @goostavz How is this related to LDGSTS?

Sorry, that was a typo. (1), not (2), is related to LDGSTS.

In this thread, loading the A/B tiles into shared memory is represented via a load plus a convert_layout(blocked->shared). In my understanding this should be replaced with tensor.insert_slice_async in order to support LDGSTS. Please correct me if I'm wrong.

ptillet (Collaborator) commented Sep 21, 2022

Yes, there are two ways to write the matmul loop so that the tiles of A and B end up being loaded from DRAM to shared memory:

  1. load + convert_layout(blocked->shared). Necessary for V100.
  2. tensor.insert_slice_async. Eventually gets lowered to LDGSTS.

On architectures that have LDGSTS, the pipeline pass converts method 1 into method 2.

Jokeren pinned this issue Nov 19, 2022
goostavz changed the title from [Triton-MLIR][Backend] TritonGPUToLLVM to [Triton-MLIR] Remaining issues in migration Nov 28, 2022
yangjunpro unpinned this issue Feb 5, 2023
ptillet (Collaborator) commented Feb 23, 2023

Closing this as the MLIR rewrite is officially complete

ptillet closed this as completed on Feb 23, 2023
brunomazzottiamd pushed a commit to brunomazzottiamd/triton that referenced this issue Jan 29, 2025