[AMD] Add missing dependency to TritonAMDGPUIR #5053

antiagainst · 2024-11-03T06:01:45Z

TritonAMDGPUTransforms now depends on it.

Cherry pick list:
- #4925
- #5053
- #5019
- #5002
- #4935
  - required additional cherry picks #4991 and #4951
- #4998
- #4925
- #5281
- #5308
- All previous LLVM hash PRs before #5308

---------

Co-authored-by: Ilya V
Co-authored-by: Lei Zhang
Co-authored-by: Lixun Zhang
Co-authored-by: Keren Zhou
Co-authored-by: Alexander Efimov
Co-authored-by: Kyle Wang
Co-authored-by: Jungwook Park
Co-authored-by: peterbell10
Co-authored-by: Hongtao Yu

Reverts #5191 due to some MLIR errors in PyTorch unit tests.

Smaller set of cherry picks:
- #5308 (and previous LLVM upgrades)
- #5281
- #4925
- #5053
- #5019
- #4998

---------

Co-authored-by: Jungwook Park
Co-authored-by: peterbell10
Co-authored-by: Hongtao Yu
Co-authored-by: Lei Zhang
Co-authored-by: Ilya V
Co-authored-by: Kyle Wang

* [AMD] Emit vectorized 16-bit float LLVM atomic ops (triton-lang#4925)

  For 16-bit float operands of tt::AtomicRMWOp, construct only one LLVM::AtomicRMWOp but use a vector of elements. This approach allows generating packed intrinsics and processing 2 elements at once. Added a lit test for the f16 vectorized case.

  (cherry picked from commit 78c8054)

* [AMD] Restructure ReorderInstructions pass (triton-lang#4998)

  (cherry picked from commit 86a2ac7)

* [AMD] Support warp-level reduction with DPP (triton-lang#5019)

  This commit adds support for warp-level reduction with DPP instructions, which can improve performance. See https://gpuopen.com/learn/amd-gcn-assembly-cross-lane-operations/

  (cherry picked from commit 21119e3)

* [AMD] Add missing dependency to TritonAMDGPUIR (triton-lang#5053)

  TritonAMDGPUTransforms now depends on it.

  (cherry picked from commit 0b443ce)

* [AMD] Use DPP to accelerate 16-bit floats (triton-lang#5072)

  In the case of unpaired f16 elements, utilize DPP instructions to accelerate atomics. The algorithm for lowering `tt::atomicRmwOp(%ptr, %val, %mask)` is:

  0. Group threads into pairs. The master thread is the one with (tid % 2 == 0);
  1. All threads send `%val` to thread `(tid - 1)` via `dppUpdateOp shl`, so every master receives the value from its secondary thread;
  2. Take into account parity in the `%mask` value and build CF structures according to it;
  3. Generate `llvm::atomicRmwOp` in the threads enabled by the `%mask` value;
  4. All threads send the result of the generated operation to thread `(tid + 1)` via `dppUpdateOp shl`, so every secondary thread also receives its result.

  The DPP approach gives a ~5% perf improvement, so use it when the target arch supports DPP.

  Signed-off-by: Ilya Veselov

  (cherry picked from commit bab3470)

* [AMD] Reland sinking the 2nd tt.load after local_load's (triton-lang#4935)

  This PR adds more restrictions on when the sched-load optimizations should be applied and un-reverts triton-lang#4823. The optimization is applied only when all of the following are satisfied:

  1. pureMatmulProblem, i.e. 1 `tt.dot` in the main loop
  2. two `tt.load`s in the main loop
  3. 2nd `tt.load` is ahead of the `tt.dot`
  4. 1st user of 2nd `tt.load` is after the `tt.dot`
  5. tile size is large enough, i.e. nonKDim >= 128 and kDim >= 64

  (cherry picked from commit 4f6f768)

---------

Co-authored-by: Ilya V
Co-authored-by: Lei Zhang
Co-authored-by: Kyle Wang
Co-authored-by: Lixun Zhang
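The thread-pairing scheme in the triton-lang#5072 entry above can be modeled on the host as a rough sketch. This is not the Triton lowering itself. The function name `paired_atomic_add`, the one-slot-per-lane memory layout, and the fallback to a scalar update when only one lane of a pair is enabled are all assumptions made for illustration. The commit message only says that control flow is built from the parity of `%mask`.

```python
from typing import List, Optional, Tuple


def paired_atomic_add(
    memory: List[float], vals: List[float], mask: List[bool]
) -> Tuple[List[Optional[float]], int]:
    """Host-side model of pairing lanes for an atomic add (hypothetical helper).

    Lane `tid` adds vals[tid] to memory[tid] when mask[tid] is set.
    Returns the old value seen by each lane (None if masked off) and the
    number of atomic operations that were issued.
    """
    n = len(vals)
    if n % 2 != 0 or len(mask) != n or len(memory) != n:
        raise ValueError("expected an even lane count and equal-length inputs")

    # Step 1: every lane shifts its value to lane tid - 1, so each master
    # (even lane) holds the value of its secondary (odd) partner.
    received = [vals[tid + 1] if tid + 1 < n else None for tid in range(n)]

    old: List[Optional[float]] = [None] * n
    issued = 0
    for master in range(0, n, 2):  # Step 0: lanes are grouped in pairs
        partner = master + 1
        # Step 2: branch on which lanes of the pair are enabled.
        if mask[master] and mask[partner]:
            # Step 3: the master issues one packed operation for both elements.
            old_pair = (memory[master], memory[partner])
            memory[master] += vals[master]
            memory[partner] += received[master]
            issued += 1
            # Step 4: the master keeps its result and forwards the partner's.
            old[master] = old_pair[0]
            old[partner] = old_pair[1]
        else:
            # Assumed fallback: an enabled lane without an enabled partner
            # issues its own scalar update.
            for lane in (master, partner):
                if mask[lane]:
                    old[lane] = memory[lane]
                    memory[lane] += vals[lane]
                    issued += 1
    return old, issued


if __name__ == "__main__":
    mem = [10.0, 20.0, 30.0, 40.0]
    old_vals, ops = paired_atomic_add(mem, [1.0, 2.0, 3.0, 4.0], [True, True, False, True])
    print(mem, old_vals, ops)
```

In this model a fully enabled pair costs one operation instead of two, which is the saving the packed f16 form is meant to capture.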

Cherry pick list:
- triton-lang#4925
- triton-lang#5053
- triton-lang#5019
- triton-lang#5002
- triton-lang#4935
  - required additional cherry picks triton-lang#4991 and triton-lang#4951
- triton-lang#4998
- triton-lang#4925
- triton-lang#5281
- triton-lang#5308
- All previous LLVM hash PRs before triton-lang#5308

---------

Co-authored-by: Ilya V
Co-authored-by: Lei Zhang
Co-authored-by: Lixun Zhang
Co-authored-by: Keren Zhou
Co-authored-by: Alexander Efimov
Co-authored-by: Kyle Wang
Co-authored-by: Jungwook Park
Co-authored-by: peterbell10
Co-authored-by: Hongtao Yu

(cherry picked from commit 2d8093c)

Reverts triton-lang#5191 due to some MLIR errors in PyTorch unit tests.

Smaller set of cherry picks:
- triton-lang#5308 (and previous LLVM upgrades)
- triton-lang#5281
- triton-lang#4925
- triton-lang#5053
- triton-lang#5019
- triton-lang#4998

---------

Co-authored-by: Jungwook Park
Co-authored-by: peterbell10
Co-authored-by: Hongtao Yu
Co-authored-by: Lei Zhang
Co-authored-by: Ilya V
Co-authored-by: Kyle Wang

(cherry picked from commit 7e401df)

[AMD] Add missing dependency

2fea11f

antiagainst force-pushed the amd-include branch from e42290a to 2fea11f November 3, 2024 06:22

antiagainst changed the title ~~[AMD] Prefix generated .inc files with third_party/ to be exact~~ [AMD] Add missing dependency to TritonAMDGPUIR Nov 3, 2024

antiagainst marked this pull request as ready for review November 3, 2024 06:23

antiagainst requested a review from zhanglx13 as a code owner November 3, 2024 06:23

ThomasRaoux approved these changes Nov 3, 2024

antiagainst enabled auto-merge (squash) November 3, 2024 06:27

antiagainst merged commit 0b443ce into triton-lang:main Nov 3, 2024
6 of 7 checks passed

Luosuu pushed a commit to Luosuu/triton that referenced this pull request Nov 13, 2024

[AMD] Add missing dependency to TritonAMDGPUIR (triton-lang#5053)

64f7d2b

TritonAMDGPUTransforms now depends on it.

guacamoleo pushed a commit to guacamoleo/triton that referenced this pull request Nov 14, 2024

[AMD] Add missing dependency to TritonAMDGPUIR (triton-lang#5053)

86ccfbc

TritonAMDGPUTransforms now depends on it.

jataylo pushed a commit to jataylo/triton that referenced this pull request Nov 18, 2024

[AMD] Add missing dependency to TritonAMDGPUIR (triton-lang#5053)

73cb81c

TritonAMDGPUTransforms now depends on it. (cherry picked from commit 0b443ce)

jataylo pushed a commit to jataylo/triton that referenced this pull request Nov 18, 2024

[AMD] Add missing dependency to TritonAMDGPUIR (triton-lang#5053)

2527a67

TritonAMDGPUTransforms now depends on it. (cherry picked from commit 0b443ce)

jataylo mentioned this pull request Nov 19, 2024

[AMD] release/3.2.x AMD perf cherry picks #5191

Merged

jataylo pushed a commit to jataylo/triton that referenced this pull request Dec 5, 2024

[AMD] Add missing dependency to TritonAMDGPUIR (triton-lang#5053)

383054b

TritonAMDGPUTransforms now depends on it. (cherry picked from commit 0b443ce)

jataylo mentioned this pull request Dec 5, 2024

[AMD] rc/3.2.x cherry picks #5347

Merged

jataylo pushed a commit to jataylo/triton that referenced this pull request Dec 11, 2024

[AMD] Add missing dependency to TritonAMDGPUIR (triton-lang#5053)

a04867a

TritonAMDGPUTransforms now depends on it. (cherry picked from commit 0b443ce)

jataylo mentioned this pull request Dec 12, 2024

[Release/3.2.x] AMD Cherry Picks #5413

Closed

jataylo pushed a commit to jataylo/triton that referenced this pull request Dec 13, 2024

[AMD] Add missing dependency to TritonAMDGPUIR (triton-lang#5053)

37cec47

TritonAMDGPUTransforms now depends on it. (cherry picked from commit 0b443ce)

jataylo mentioned this pull request Dec 13, 2024

[CP] AMD Performance cherry picks ROCm/triton#682

Merged
