[Triton-MLIR] Remaining issues in migration #673
FYI, I plan to take a look at this.
Chunwei is now working on the current mma codegen (including loading A/B from shared memory and the mma itself; in the current IR, mma takes its input from the shared layout, which will later be decomposed into convert_layout(shared->mma) + mma(mma in / mma out)). Support for convert_layout(shared->mma) should just be a refactor of the code Chunwei is developing now, so in my understanding there is some overlap with the work you are doing. @Jokeren @chunwei

BTW, I noticed in https://github.com/openai/triton-mlir/discussions/107 that the IR is made of load + convert_layout(blocked->shared). In my understanding we need to convert it into tensor.insert_slice_async in order to support LDGSTS; is this already done? We may need to support convert_layout(blocked->shared) if that won't happen in the short term; otherwise I think supporting tensor.insert_slice_async -> LDGSTS should be the higher priority.
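For concreteness, a hand-written sketch of the decomposition described above, reusing the layout names from the IR snippet quoted later in this thread; the types and exact syntax are illustrative, not verified output of any pass:

```mlir
// Today: tt.dot consumes #shared operands directly, so the shared-memory
// load is hidden inside dot codegen.
%d0 = tt.dot %a_shared, %b_shared, %acc {allowTF32 = true} : tensor<64x64xf32, #shared> * tensor<64x64xf32, #shared> -> tensor<64x64xf32, #mma>

// After the decomposition: an explicit shared->mma conversion feeds a dot
// whose inputs and output are all #mma, making the shared-memory load
// visible to the optimizer (e.g. for prefetching).
%a_mma = triton_gpu.convert_layout %a_shared : (tensor<64x64xf32, #shared>) -> tensor<64x64xf32, #mma>
%b_mma = triton_gpu.convert_layout %b_shared : (tensor<64x64xf32, #shared>) -> tensor<64x64xf32, #mma>
%d1 = tt.dot %a_mma, %b_mma, %acc {allowTF32 = true} : tensor<64x64xf32, #mma> * tensor<64x64xf32, #mma> -> tensor<64x64xf32, #mma>
```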
BTW, I may need to confirm this first: do you mean support for convert_layout(shared->mma) in the TritonGPUToLLVM pass, or decomposing dot(shared->mma) into convert_layout(shared->mma) + dot(mma->mma) in the optimizer?
Yeah, I am porting the dot codegen to MLIR, and here is the WIP PR. I've almost finished the mma16816-related logic, leaving mma884 and fmadot WIP. According to the discussion:

```mlir
%83 = triton_gpu.convert_layout %81 : (tensor<64x64xf32, #blocked1>) -> tensor<64x64xf32, #shared>
%84 = triton_gpu.convert_layout %82 : (tensor<64x64xf32, #blocked1>) -> tensor<64x64xf32, #shared>
%85 = tt.dot %83, %84, %cst_1 {allowTF32 = true} : tensor<64x64xf32, #shared> * tensor<64x64xf32, #shared> -> tensor<64x64xf32, #mma>
```

I plan to port the original …
Let me summarize the plans, just for the sake of clarity:

On our side: …

On your side: …
@ptillet For reduction codegen, another NVIDIA colleague is working on that; he should be able to submit the PR in one or two weeks, which is why it is marked as "ongoing" in the original post. @goostavz knows more than me about the detailed plan, since he is helping coordinate the related work on my team.
@Superjomn Thanks for the heads up. Yeah, I believe you're right. I'll go read your code for now without applying changes. To accommodate the prefetch optimization, we will take the shared-memory load logic out of triton.dot, but the code generation part should be mostly the same.
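A rough sketch of why pulling the load out of triton.dot enables prefetching; the loop structure and names below are assumptions for illustration, not actual pass output (the computation of %a_shared_next, the k+1-th tile, is elided):

```mlir
// Once the shared->mma conversion is a separate op, it can be software-
// pipelined: iteration k's dot overlaps with the conversion for k+1.
%res:2 = scf.for %k = %c0 to %K step %c1
    iter_args(%acc = %init, %a_cur = %a0)
    -> (tensor<64x64xf32, #mma>, tensor<64x64xf32, #mma>) {
  // Compute with the operand converted in the previous iteration...
  %d = tt.dot %a_cur, %b_mma, %acc {allowTF32 = true} : tensor<64x64xf32, #mma> * tensor<64x64xf32, #mma> -> tensor<64x64xf32, #mma>
  // ...while converting (prefetching) the next tile from shared memory.
  %a_next = triton_gpu.convert_layout %a_shared_next : (tensor<64x64xf32, #shared>) -> tensor<64x64xf32, #mma>
  scf.yield %d, %a_next : tensor<64x64xf32, #mma>, tensor<64x64xf32, #mma>
}
```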
I'm working on the rest of the layout conversions now. I think these two are necessary for our next milestone (basic GEMM without n_buffer/prefetch optimization). Considering how Keren can best help us: the first choice I can think of is (1) guaranteeing that the pipeline/prefetch optimization in the optimizer (including allocation) is OK. If you still have free bandwidth, (2) the codegen of tensor.slice/extract is another choice. And (3) layout conversion (blocked -> shared) is a third choice, which I haven't actually started yet. (3) will require heavy co-debugging with the dot codegen, so I personally think (2) is a better choice than (3). What do you guys think?
@goostavz How is this related to LDGSTS?
I think @daadaada will take care of the prefetch optimization.
Sounds good.
Sorry, it was a typo: 1 (not 2) is related to LDGSTS. In this thread:
Yes, there are two ways to write the matmul loop so that the tiles of A and B end up being loaded from DRAM to shared memory:
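Based on the two ops named earlier in the thread (load + convert_layout(blocked->shared) versus tensor.insert_slice_async), a hedged sketch of the two shapes; the operand list of tensor.insert_slice_async is an assumption, not the verified op signature:

```mlir
// Way 1: synchronous load into registers (#blocked1), then a copy into
// shared memory via an explicit layout conversion.
%t = tt.load %ptrs, %mask : tensor<64x64xf32, #blocked1>
%s = triton_gpu.convert_layout %t : (tensor<64x64xf32, #blocked1>) -> tensor<64x64xf32, #shared>

// Way 2: an asynchronous global->shared copy into one slice of a
// multi-buffered shared tensor, which can lower to LDGSTS/cp.async.
%buf1 = tensor.insert_slice_async %ptrs, %buf0, %index, %mask : tensor<64x64x!tt.ptr<f32>, #blocked1> -> tensor<3x64x64xf32, #shared>
```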
Closing this as the MLIR rewrite is officially complete |
Update on 2022-11-19: remaining issues from the backend's point of view
- `collectDeps(yieldOp->getOperand(arg.getArgNumber() - 1))` misindexes when `arg.getArgNumber() = 0` (the operand index underflows to -1). @Jokeren
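A minimal sketch of why argument number 0 is special (standard scf.for semantics; the types are placeholders):

```mlir
// Block argument 0 of the loop body is the induction variable %k; it has no
// matching scf.yield operand. Yield operand i feeds iter_arg i, which is
// block argument i+1 -- hence the `argNumber - 1` indexing, and the
// underflow when argNumber == 0.
%r = scf.for %k = %lb to %ub step %c1 iter_args(%acc = %init) -> (tensor<64x64xf32>) {
  %next = arith.addf %acc, %acc : tensor<64x64xf32>
  scf.yield %next : tensor<64x64xf32>
}
```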
TritonGPUToLLVM Pass:
- Basic features
  - Load/StoreOp related
  - Layout conversion related
  - MMA related:
    - basic Dot codegen
    - Dot with n_buffer/prefetch optimization
- Completeness of op coverage
  - `%` in test_core.py (the semantics of `frem` changed between NVPTX11 and NVPTX14; Triton no longer behaves correctly).