[Codegen] Add vector transfer + slice foldings in GenericVectorization #17613
Conversation
Needs tests, but I wanted to run benchmarks. EDIT: I see this is breaking several tests. I'll need to look into this more.
Abbreviated Benchmark Summary @ commit 780cfc42654569975d94f85536a0f58c0de7742a (no previous benchmark results to compare)
Data-Tiling Comparison Table
Raw Latencies [Top 3 out of 92 results shown]
No improved or regressed compilation metrics 🏖️
This needs to wait on a bug fix in llvm/llvm-project#95020.
LGTM. We have to be careful about folding tensor.extract_slice and tensor.insert_slice ops away because some transformations rely heavily on these ops. It looks good on the CPU side because there is no further tiling after vectorization; the rest is vector lowering.
I'm not sure how it works on the GPU side w.r.t. scf.forall lowering. I think it is okay because you provide an option to turn it off.
(cc @banach-space because of the changes to vectorize_with_masking_and_hoist.mlir.)
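To make the trade-off concrete, here is a minimal sketch of what the option toggles (made-up shapes and value names, not taken from the patch): with the folding enabled, the slice ops are absorbed into the vector transfers, so the transfers survive, but any later transformation that looks for tensor.extract_slice / tensor.insert_slice no longer finds them.

```
// Input after vectorizing a copy (illustrative; assumes %src, %dst, %i, %c0 and %pad are defined).
%src_slice = tensor.extract_slice %src[%i, 0] [1, 8] [1, 1] : tensor<16x8xf32> to tensor<1x8xf32>
%dst_slice = tensor.extract_slice %dst[%i, 0] [1, 8] [1, 1] : tensor<16x8xf32> to tensor<1x8xf32>
%v         = vector.transfer_read %src_slice[%c0, %c0], %pad {in_bounds = [true, true]} : tensor<1x8xf32>, vector<1x8xf32>
%w         = vector.transfer_write %v, %dst_slice[%c0, %c0] {in_bounds = [true, true]} : vector<1x8xf32>, tensor<1x8xf32>
%result    = tensor.insert_slice %w into %dst[%i, 0] [1, 8] [1, 1] : tensor<1x8xf32> into tensor<16x8xf32>

// With the folding enabled: the slice offsets fold into the transfers and no slice ops remain.
%v2      = vector.transfer_read %src[%i, %c0], %pad {in_bounds = [true, true]} : tensor<16x8xf32>, vector<1x8xf32>
%result2 = vector.transfer_write %v2, %dst[%i, %c0] {in_bounds = [true, true]} : vector<1x8xf32>, tensor<16x8xf32>
```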
cc @benmxwl-arm since I just had to update a …
Sorry, wrong account. @MacDue
Looks harmless (it still infers the right vector size :))
Heads up - I see test failures in these modified tests after merge:
Postsubmit CI jobs are running behind a 12h queue right now, so I kicked off a presubmit run at https://github.com/iree-org/iree/actions/runs/10014109365/job/27683126524?pr=17971 on #17971 to see if the CI machines can repro. Local logs: https://gist.github.com/ScottTodd/cfceb22a41ca80257918d1d468b05ddb
Thanks for the heads-up! I'm taking a look and will provide a fix.
Linux CI seemed to pass... I can try my local Windows at an earlier commit 🤔 Do the logs give any clues?
That's weird. I think it should generate deterministic IR... compiling IREE on Linux now.
I don't see issues; I need to repro it on my local machine.
On Windows/MSVC, those tests passed at 4a13331, so something in this range caused them to fail: 4a13331...5b112cb. Ben suggests …
Weeeeeird. I can't repro now, after trying to bisect through that history. Friday afternoon doing Friday afternoon things... Sorry for the noise :P
I've yet to track down exactly what's wrong, but this change breaks one of our internal tests, resulting in an output that is all NaNs. |
Update: …
Say that the original input is:

```
%extract = tensor.extract_slice %source
%read = vector.transfer_read %extract
%write = vector.transfer_write %read, %dest
%insert = tensor.insert_slice %write into %dest
```

Without the change, it becomes a `memref.copy` after bufferization:

```
%src = memref.subview (...)
%dest = memref.subview (...)
memref.copy (...)
```

There is no vector code, which leads to scalar loads/stores. Since we don't rely on the LLVM vectorizer, these stay scalar loads/stores without additional vectorization at the buffer level. The trend is to move vectorization to the tensor world, so we want to get:

```
%read = vector.transfer_read %src
vector.transfer_write %read, %dest
```

This preserves the vector code after bufferization:

```
%src = memref.subview (...)
%read = vector.transfer_read %src ...
%dest = memref.subview (...)
vector.transfer_write %read, %dest ...
```

Does it make sense? I don't remember the hoisting issues that we hit on the ARM path. Could you share the expected IR before and after the hoisting?
So before this change we'd get:

```
%15 = scf.for %arg4 = %c0 to %c352 step %c1 iter_args(%arg5 = %14) -> (tensor) {
  %extracted_slice_7 = tensor.extract_slice %extracted_slice[%arg4, 0] [1, %10] [1, 1] : tensor<352x?xf32> to tensor<1x?xf32>
  %extracted_slice_8 = tensor.extract_slice %extracted_slice_2[%arg4, 0] [1, %11] [1, 1] : tensor<352x?xf32> to tensor<1x?xf32>
  %extracted_slice_9 = tensor.extract_slice %arg5[0, 0] [%10, %11] [1, 1] : tensor to tensor
  %17 = vector.create_mask %c1, %10 : vector<1x[8]xi1>
  %18 = vector.transfer_read %extracted_slice_7[%c0, %c0], %cst_1, %17 {in_bounds = [true, true]} : tensor<1x?xf32>, vector<1x[8]xf32>
  %19 = vector.create_mask %c1, %11 : vector<1x[8]xi1>
  %20 = vector.transfer_read %extracted_slice_8[%c0, %c0], %cst_1, %19 {in_bounds = [true, true]} : tensor<1x?xf32>, vector<1x[8]xf32>
  %21 = vector.create_mask %10, %11 : vector<[8]x[8]xi1>
  %22 = vector.transfer_read %extracted_slice_9[%c0, %c0], %cst_1, %21 {in_bounds = [true, true]} : tensor, vector<[8]x[8]xf32>
  %23 = vector.create_mask %10, %11, %c1 : vector<[8]x[8]x1xi1>
  %24 = vector.mask %23 { vector.contract {indexing_maps = [affine_map<(d0, d1, d2) -> (d2, d0)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction"], kind = #vector.kind} %18, %20, %22 : vector<1x[8]xf32>, vector<1x[8]xf32> into vector<[8]x[8]xf32> } : vector<[8]x[8]x1xi1> -> vector<[8]x[8]xf32>
  %25 = vector.transfer_write %24, %extracted_slice_9[%c0, %c0], %21 {in_bounds = [true, true]} : vector<[8]x[8]xf32>, tensor
  %inserted_slice_10 = tensor.insert_slice %25 into %arg5[0, 0] [%10, %11] [1, 1] : tensor into tensor
  scf.yield %inserted_slice_10 : tensor
}
```

And after hoisting:

```
%21 = vector.transfer_read %extracted_slice_5[%c0, %c0], %cst_1, %19 {in_bounds = [true, true]} : tensor, vector<[8]x[8]xf32>
%22 = scf.for %arg4 = %c0 to %c352 step %c1 iter_args(%arg5 = %21) -> (vector<[8]x[8]xf32>) {
  %extracted_slice_10 = tensor.extract_slice %extracted_slice[%arg4, 0] [1, %14] [1, 1] : tensor<352x?xf32> to tensor<1x?xf32>
  %25 = vector.transfer_read %4[%arg4, %arg0], %cst_1 {in_bounds = [true, true]} : tensor<352x128xf32>, vector<1x[8]xf32>
  %26 = vector.transfer_read %extracted_slice_10[%c0, %c0], %cst_1, %18 {in_bounds = [true, true]} : tensor<1x?xf32>, vector<1x[8]xf32>
  %27 = vector.mask %20 { vector.contract {indexing_maps = [affine_map<(d0, d1, d2) -> (d2, d0)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction"], kind = #vector.kind} %25, %26, %arg5 : vector<1x[8]xf32>, vector<1x[8]xf32> into vector<[8]x[8]xf32> } : vector<[8]x[8]x1xi1> -> vector<[8]x[8]xf32>
  scf.yield %27 : vector<[8]x[8]xf32>
}
%23 = vector.transfer_write %22, %extracted_slice_5[%c0, %c0], %19 {in_bounds = [true, true]} : vector<[8]x[8]xf32>, tensor
```

Which is super nice, and eventually means the initial read and final write of the accumulator tile end up outside the loop. It would hoist the pair:

- `%22 = vector.transfer_read %extracted_slice_9 ...`
- `%25 = vector.transfer_write %24, %extracted_slice_9 ...`

Then the pair:

- `%extracted_slice_9 = tensor.extract_slice %arg5 ...`
- `%inserted_slice_10 = tensor.insert_slice %25 into %arg5 ...`

Note: These are both easy to hoist as they're matching pairs to the same value (`%extracted_slice_9`, then `%arg5`).

Now we get:

```
%16 = scf.for %arg4 = %c0 to %c352 step %c1 iter_args(%arg5 = %15) -> (tensor) {
  %extracted_slice_5 = tensor.extract_slice %arg5[0, 0] [%c8_vscale, %12] [1, 1] : tensor to tensor
  %18 = vector.transfer_read %4[%arg4, %arg0], %cst_1 {in_bounds = [true, true]} : tensor<352x128xf32>, vector<1x[8]xf32>
  %19 = vector.create_mask %c1, %12 : vector<1x[8]xi1>
  %20 = vector.transfer_read %5[%arg4, %arg2], %cst_1, %19 {in_bounds = [true, true]} : tensor<352x1xf32>, vector<1x[8]xf32>
  %21 = vector.create_mask %c8_vscale, %12 : vector<[8]x[8]xi1>
  %22 = vector.transfer_read %arg5[%c0, %c0], %cst_1, %21 {in_bounds = [true, true]} : tensor, vector<[8]x[8]xf32>
  %23 = vector.create_mask %c8_vscale, %12, %c1 : vector<[8]x[8]x1xi1>
  %24 = vector.mask %23 { vector.contract {indexing_maps = [affine_map<(d0, d1, d2) -> (d2, d0)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction"], kind = #vector.kind} %18, %20, %22 : vector<1x[8]xf32>, vector<1x[8]xf32> into vector<[8]x[8]xf32> } : vector<[8]x[8]x1xi1> -> vector<[8]x[8]xf32>
  %25 = vector.transfer_write %24, %extracted_slice_5[%c0, %c0], %21 {in_bounds = [true, true]} : vector<[8]x[8]xf32>, tensor
  %inserted_slice_6 = tensor.insert_slice %25 into %arg5[0, 0] [%c8_vscale, %12] [1, 1] : tensor into tensor
  scf.yield %inserted_slice_6 : tensor
}
```

Here there are no matching pairs: the read of the accumulator is from `%arg5`, but the write goes into the `tensor.extract_slice` of it. So after hoisting we still get:

```
%21 = scf.for %arg4 = %c0 to %c352 step %c1 iter_args(%arg5 = %17) -> (tensor) {
  %extracted_slice_7 = tensor.extract_slice %arg5[0, 0] [%c8_vscale, %14] [1, 1] : tensor to tensor
  %23 = vector.transfer_read %4[%arg4, %arg0], %cst_1 {in_bounds = [true, true]} : tensor<352x128xf32>, vector<1x[8]xf32>
  %24 = vector.transfer_read %5[%arg4, %arg2], %cst_1, %18 {in_bounds = [true, true]} : tensor<352x1xf32>, vector<1x[8]xf32>
  %25 = vector.transfer_read %arg5[%c0, %c0], %cst_1, %19 {in_bounds = [true, true]} : tensor, vector<[8]x[8]xf32>
  %26 = vector.mask %20 { vector.contract {indexing_maps = [affine_map<(d0, d1, d2) -> (d2, d0)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction"], kind = #vector.kind} %23, %24, %25 : vector<1x[8]xf32>, vector<1x[8]xf32> into vector<[8]x[8]xf32> } : vector<[8]x[8]x1xi1> -> vector<[8]x[8]xf32>
  %27 = vector.transfer_write %26, %extracted_slice_7[%c0, %c0], %19 {in_bounds = [true, true]} : vector<[8]x[8]xf32>, tensor
  %inserted_slice_8 = tensor.insert_slice %27 into %arg5[0, 0] [%c8_vscale, %14] [1, 1] : tensor into tensor
  scf.yield %inserted_slice_8 : tensor
}
```

This leaves very costly full-ZA tile reads/writes within the innermost loop of a matmul, which is really bad.
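To distill why this matters for hoisting, here is a simplified sketch with small static shapes (not the actual IR from the reproducer; %init, %n, %c0, %c1 and %pad are assumed to be defined): the hoisting only fires when the read and write form a matching pair on the same value, here the loop iter_arg. Once the write goes through a per-iteration tensor.extract_slice, there is no such pair and the transfers stay inside the loop.

```
// Hoistable: read and write both target the iter_arg %acc directly.
%r1 = scf.for %i = %c0 to %n step %c1 iter_args(%acc = %init) -> (tensor<8x8xf32>) {
  %v    = vector.transfer_read %acc[%c0, %c0], %pad {in_bounds = [true, true]} : tensor<8x8xf32>, vector<4x4xf32>
  %v2   = arith.addf %v, %v : vector<4x4xf32>
  %acc2 = vector.transfer_write %v2, %acc[%c0, %c0] {in_bounds = [true, true]} : vector<4x4xf32>, tensor<8x8xf32>
  scf.yield %acc2 : tensor<8x8xf32>
}

// Not hoistable: the read is from %acc, but the write goes into a fresh slice of it.
%r2 = scf.for %i = %c0 to %n step %c1 iter_args(%acc = %init) -> (tensor<8x8xf32>) {
  %s    = tensor.extract_slice %acc[0, 0] [4, 4] [1, 1] : tensor<8x8xf32> to tensor<4x4xf32>
  %v    = vector.transfer_read %acc[%c0, %c0], %pad {in_bounds = [true, true]} : tensor<8x8xf32>, vector<4x4xf32>
  %v2   = arith.addf %v, %v : vector<4x4xf32>
  %w    = vector.transfer_write %v2, %s[%c0, %c0] {in_bounds = [true, true]} : vector<4x4xf32>, tensor<4x4xf32>
  %acc2 = tensor.insert_slice %w into %acc[0, 0] [4, 4] [1, 1] : tensor<4x4xf32> into tensor<8x8xf32>
  scf.yield %acc2 : tensor<8x8xf32>
}
```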
Thanks for all the impressive digging, @MacDue! Sadly this is hurting SME pretty badly :( (sorry I didn't check earlier when @hanhanW pinged me). @Max191, is @tiled_linalg_copy a good representation of what you wanted to achieve? I see 3 options here:
Thanks! |
The root of the issue is that … With that said, it looks like there is just a missing pattern that is blocking hoisting in the above example. I think we should do both of the following:
1. …
2. Add a pattern so that the IR above becomes the easily hoistable (and potentially cleaner) IR.
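The exact IR in that list was lost, but one plausible reading of the missing pattern (an assumption on my part, not something stated in the thread) is folding a transfer_write that goes through an extract_slice of a loop iter_arg, and is immediately inserted back at the same offsets, into a direct write on the iter_arg. That would restore the matching read/write pair that the hoisting looks for.

```
// Before (mirrors the shape of the reproducer IR above; %acc is the loop iter_arg,
// and %vec, %mask, %m, %n, %c0 are assumed to be defined).
%s    = tensor.extract_slice %acc[0, 0] [%m, %n] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
%w    = vector.transfer_write %vec, %s[%c0, %c0], %mask {in_bounds = [true, true]} : vector<[8]x[8]xf32>, tensor<?x?xf32>
%acc2 = tensor.insert_slice %w into %acc[0, 0] [%m, %n] [1, 1] : tensor<?x?xf32> into tensor<?x?xf32>

// After (hypothetical rewrite): write directly into the iter_arg at the same offsets.
// (In general, the slice offsets would need to be added to the write indices.)
%acc2 = vector.transfer_write %vec, %acc[%c0, %c0], %mask {in_bounds = [true, true]} : vector<[8]x[8]xf32>, tensor<?x?xf32>
```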
Had a quick look at 2.: The …
I see, that makes sense. In that case I think the …
To give a little more context, for GPU at some point (post bufferization) we really want to end up with these transfers, in particular because any time there is a …
#17997 :) I know that Ben is having a quick look at the pattern. I'm mostly keen to get our CI to a happier state. From what you are saying, this PR was mostly to keep GPU and CPU paths consistent rather than fixing any specific CPU issue? |
IIUC this PR was to change the behavior on GPU and opted to change the behavior on CPU also. The red flag for me is that this is disabled for the vector distribution pipeline.
Revert "[Codegen] Add vector transfer + slice foldings in GenericVectorization (#17613)" (#17997): This reverts commit 8b83425. This change is hurting SVE+SME performance pretty badly. See #17613 for context. Signed-off-by: Andrzej Warzynski <[email protected]>
Btw, I forgot to mention, but when I took a look at the folds I spotted at least one upstream bug, which I reported here: llvm/llvm-project#101708
…torization (iree-org#17613)" (iree-org#17997) This reverts commit 10877f6.
Vectorizing a `linalg.copy` op can result in a sequence of

```
%extract = tensor.extract_slice %source
%read = vector.transfer_read %extract
%write = vector.transfer_write %read, %dest
%insert = tensor.insert_slice %write into %dest
```

This sequence is folded by the transfer_write folder into

```
%extract = tensor.extract_slice %source
%insert = tensor.insert_slice %extract into %dest
```
In order to preserve the vector transfers, this PR adds folding patterns for vector transfer ops acting on insert/extract slice ops. These fold the insert_slice into the transfer_write and the extract_slice into the transfer_read, so the vector transfers are not folded away.
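For illustration (placeholder shapes and names, not compiler output), the copy above then becomes a pair of transfers addressing the source and destination directly:

```
// Illustrative shapes; %i, %c0 and %pad are assumed to be defined.
%read  = vector.transfer_read %source[%i, %c0], %pad {in_bounds = [true, true]} : tensor<16x8xf32>, vector<1x8xf32>
%write = vector.transfer_write %read, %dest[%i, %c0] {in_bounds = [true, true]} : vector<1x8xf32>, tensor<16x8xf32>
```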
This is turned off for the vector distribution pipeline because it causes distribution to fail in some cases.
Also removes `Codegen/LLVMGPU/test/conv_pipeline_test_rocm.mlir`, since it completes a TODO to remove the test after eliminating some undesired extra buffers.