large vector sizes failure - cpu compilation - quantised models #18005
Comments
@pashu123 this seems like a codegen bug. Please take a look |
I took a look this morning, but I did not get time to write down my observation. I'll do it soon |
Here is the IR before vectorization. It is very similar to what @pashu123 and I saw in the broadcast + mmt4d fusion. The dequant op is not fused into the reduction loops, so it ends up with a large vector size and a large stack buffer. Based on one of the discussions we had a few weeks ago, it is worth fusing the dequant op into the reduction loop even if there is redundant computation. One of the goals is to have fewer memory loads/stores in the compute body, so we want to fuse them into the reduction loops. This is also what we did in the llama2 performance burn-down (i.e., the i16i4i32 ukernel). So I think the fix could be updating
// -----// IR Dump Before GenericVectorization (iree-codegen-generic-vectorization) //----- //
func.func @largeVectorMinRepro_dispatch_0_pooling_nchw_sum_1x320x1x1x65x65_f32() attributes {translation_info = #iree_codegen.translation_info<CPUConvTileAndDecomposeExpert>} {
%c5 = arith.constant 5 : index
%c1 = arith.constant 1 : index
%c65 = arith.constant 65 : index
%c16 = arith.constant 16 : index
%c32 = arith.constant 32 : index
%c320 = arith.constant 320 : index
%cst = arith.constant 1.250000e-01 : f32
%cst_0 = arith.constant 0.000000e+00 : f32
%c0 = arith.constant 0 : index
%0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<1x320x65x65xi8>>
%1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) : !flow.dispatch.tensor<writeonly:tensor<1x320x1x1xf32>>
%workgroup_id_x = hal.interface.workgroup.id[0] : index
%workgroup_count_x = hal.interface.workgroup.count[0] : index
%2 = affine.apply affine_map<()[s0] -> (s0 * 32)>()[%workgroup_id_x]
%3 = affine.apply affine_map<()[s0] -> (s0 * 32)>()[%workgroup_count_x]
scf.for %arg0 = %2 to %c320 step %3 {
%4 = flow.dispatch.tensor.load %1, offsets = [0, %arg0, 0, 0], sizes = [1, 32, 1, 1], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<writeonly:tensor<1x320x1x1xf32>> -> tensor<1x32x1x1xf32>
%5 = flow.dispatch.tensor.load %0, offsets = [0, %arg0, 0, 0], sizes = [1, 32, 65, 65], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<1x320x65x65xi8>> -> tensor<1x32x65x65xi8>
%6 = scf.for %arg1 = %c0 to %c32 step %c16 iter_args(%arg2 = %4) -> (tensor<1x32x1x1xf32>) {
%extracted_slice = tensor.extract_slice %5[0, %arg1, 0, 0] [1, 16, 65, 65] [1, 1, 1, 1] : tensor<1x32x65x65xi8> to tensor<1x16x65x65xi8>
%7 = tensor.empty() : tensor<1x16x65x65xf32>
%8 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%extracted_slice : tensor<1x16x65x65xi8>) outs(%7 : tensor<1x16x65x65xf32>) {
^bb0(%in: i8, %out: f32):
%11 = arith.extsi %in : i8 to i32
%12 = arith.sitofp %11 : i32 to f32
%13 = arith.mulf %12, %cst : f32
linalg.yield %13 : f32
} -> tensor<1x16x65x65xf32>
%extracted_slice_1 = tensor.extract_slice %arg2[0, %arg1, 0, 0] [1, 16, 1, 1] [1, 1, 1, 1] : tensor<1x32x1x1xf32> to tensor<1x16x1x1xf32>
%9 = linalg.fill ins(%cst_0 : f32) outs(%extracted_slice_1 : tensor<1x16x1x1xf32>) -> tensor<1x16x1x1xf32>
%10 = scf.for %arg3 = %c0 to %c65 step %c1 iter_args(%arg4 = %9) -> (tensor<1x16x1x1xf32>) {
%11 = scf.for %arg5 = %c0 to %c65 step %c5 iter_args(%arg6 = %arg4) -> (tensor<1x16x1x1xf32>) {
%extracted_slice_2 = tensor.extract_slice %8[0, 0, %arg3, %arg5] [1, 16, 1, 5] [1, 1, 1, 1] : tensor<1x16x65x65xf32> to tensor<1x16x1x5xf32>
%12 = tensor.empty() : tensor<1x5xf32>
%extracted_slice_3 = tensor.extract_slice %extracted_slice_2[0, 0, 0, 0] [1, 16, 1, 5] [1, 1, 1, 1] : tensor<1x16x1x5xf32> to tensor<1x16x5xf32>
%extracted_slice_4 = tensor.extract_slice %12[0, 0] [1, 5] [1, 1] : tensor<1x5xf32> to tensor<5xf32>
%extracted_slice_5 = tensor.extract_slice %arg6[0, 0, 0, 0] [1, 16, 1, 1] [1, 1, 1, 1] : tensor<1x16x1x1xf32> to tensor<1x16x1xf32>
%13 = linalg.pooling_ncw_sum {dilations = dense<1> : vector<1xi64>, strides = dense<1> : vector<1xi64>} ins(%extracted_slice_3, %extracted_slice_4 : tensor<1x16x5xf32>, tensor<5xf32>) outs(%extracted_slice_5 : tensor<1x16x1xf32>) -> tensor<1x16x1xf32>
%inserted_slice_6 = tensor.insert_slice %13 into %arg6[0, 0, 0, 0] [1, 16, 1, 1] [1, 1, 1, 1] : tensor<1x16x1xf32> into tensor<1x16x1x1xf32>
scf.yield %inserted_slice_6 : tensor<1x16x1x1xf32>
}
scf.yield %11 : tensor<1x16x1x1xf32>
}
%inserted_slice = tensor.insert_slice %10 into %arg2[0, %arg1, 0, 0] [1, 16, 1, 1] [1, 1, 1, 1] : tensor<1x16x1x1xf32> into tensor<1x32x1x1xf32>
scf.yield %inserted_slice : tensor<1x32x1x1xf32>
}
flow.dispatch.tensor.store %6, %1, offsets = [0, %arg0, 0, 0], sizes = [1, 32, 1, 1], strides = [1, 1, 1, 1] : tensor<1x32x1x1xf32> -> !flow.dispatch.tensor<writeonly:tensor<1x320x1x1xf32>>
}
return
} |
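For comparison, here is a hand-written sketch (illustrative only, not the output of any pass) of what the inner reduction loops could look like once the dequant is fused into them, reusing the value names from the dump above. The dequant then only materializes the 1x16x1x5 tile consumed by each iteration instead of the full 1x16x65x65 tensor, so no large vector or stack buffer is needed:

%11 = scf.for %arg5 = %c0 to %c65 step %c5 iter_args(%arg6 = %arg4) -> (tensor<1x16x1x1xf32>) {
  // Slice the raw i8 input instead of a pre-materialized 1x16x65x65xf32 tensor.
  %in_slice = tensor.extract_slice %extracted_slice[0, 0, %arg3, %arg5] [1, 16, 1, 5] [1, 1, 1, 1] : tensor<1x16x65x65xi8> to tensor<1x16x1x5xi8>
  %dq_init = tensor.empty() : tensor<1x16x1x5xf32>
  // The fused dequant now only produces the tile consumed by this iteration.
  %dq = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%in_slice : tensor<1x16x1x5xi8>) outs(%dq_init : tensor<1x16x1x5xf32>) {
  ^bb0(%in: i8, %out: f32):
    %e = arith.extsi %in : i8 to i32
    %f = arith.sitofp %e : i32 to f32
    %m = arith.mulf %f, %cst : f32
    linalg.yield %m : f32
  } -> tensor<1x16x1x5xf32>
  %input_3d = tensor.extract_slice %dq[0, 0, 0, 0] [1, 16, 1, 5] [1, 1, 1, 1] : tensor<1x16x1x5xf32> to tensor<1x16x5xf32>
  %win = tensor.empty() : tensor<5xf32>
  %acc = tensor.extract_slice %arg6[0, 0, 0, 0] [1, 16, 1, 1] [1, 1, 1, 1] : tensor<1x16x1x1xf32> to tensor<1x16x1xf32>
  %pool = linalg.pooling_ncw_sum {dilations = dense<1> : vector<1xi64>, strides = dense<1> : vector<1xi64>} ins(%input_3d, %win : tensor<1x16x5xf32>, tensor<5xf32>) outs(%acc : tensor<1x16x1xf32>) -> tensor<1x16x1xf32>
  %updated = tensor.insert_slice %pool into %arg6[0, 0, 0, 0] [1, 16, 1, 1] [1, 1, 1, 1] : tensor<1x16x1xf32> into tensor<1x16x1x1xf32>
  scf.yield %updated : tensor<1x16x1x1xf32>
}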
Why can't we just add
This is what I see before generic vectorization and the program successfully compiles:
|
I think there are numeric issues, because you would be initializing the accumulator inside the reduction loop. Let's take gemm as an example: Code1 is what you are doing when you fuse the fill op into the reduction loop, but what we want is Code2 (both are sketched after this comment). Does it make sense?
(You can try e2e tests with your suggestion; I think it will generate wrong outputs.) |
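The original Code1/Code2 snippets were not preserved in this thread. Below is a minimal hand-written sketch of the contrast being described, with made-up static shapes and assuming %A, %B, %out, %zero (a 0.0 constant), and the loop constants are defined elsewhere; Code1 fuses the fill into the reduction (k) loop, so the partial sums are thrown away on every iteration, while Code2 fills the accumulator once before the loop.

Code1 (incorrect: accumulator re-initialized every reduction step):
%r1 = scf.for %k = %c0 to %c128 step %c8 iter_args(%acc = %out) -> (tensor<16x16xf32>) {
  // Re-zeroing %acc here discards the partial sums from previous iterations.
  %zeroed = linalg.fill ins(%zero : f32) outs(%acc : tensor<16x16xf32>) -> tensor<16x16xf32>
  %lhs = tensor.extract_slice %A[0, %k] [16, 8] [1, 1] : tensor<16x128xf32> to tensor<16x8xf32>
  %rhs = tensor.extract_slice %B[%k, 0] [8, 16] [1, 1] : tensor<128x16xf32> to tensor<8x16xf32>
  %partial = linalg.matmul ins(%lhs, %rhs : tensor<16x8xf32>, tensor<8x16xf32>) outs(%zeroed : tensor<16x16xf32>) -> tensor<16x16xf32>
  scf.yield %partial : tensor<16x16xf32>
}

Code2 (what we want: fill once, then accumulate across the reduction loop):
%init = linalg.fill ins(%zero : f32) outs(%out : tensor<16x16xf32>) -> tensor<16x16xf32>
%r2 = scf.for %k = %c0 to %c128 step %c8 iter_args(%acc = %init) -> (tensor<16x16xf32>) {
  %lhs = tensor.extract_slice %A[0, %k] [16, 8] [1, 1] : tensor<16x128xf32> to tensor<16x8xf32>
  %rhs = tensor.extract_slice %B[%k, 0] [8, 16] [1, 1] : tensor<128x16xf32> to tensor<8x16xf32>
  %partial = linalg.matmul ins(%lhs, %rhs : tensor<16x8xf32>, tensor<8x16xf32>) outs(%acc : tensor<16x16xf32>) -> tensor<16x16xf32>
  scf.yield %partial : tensor<16x16xf32>
}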
Previously, we only tiled the reduction tile sizes and did not fuse them with the producers from the input operands. This led to transfer reads/writes with large vector sizes, since the dequant operation materialised its own tensor and wasn't fused inside the reduction loop. This pass tiles the reduction dimension and fuses the operations arising from the input operands of the already-tiled operation. Issue link: #18005 Most of the code is borrowed from https://github.com/iree-org/iree/blob/main/compiler/src/iree/compiler/Codegen/Common/GPU/GPUApplyTilingLevel.cpp Signed-off-by: hanhanW <[email protected]>
Got the same large vector sizes failure for the ONNX models dpn68_vaiq/dpn92_vaiq/dpn98_vaiq/dpn107_vaiq/dpn131_vaiq/skresnet34_vaiq/skresnet18_vaiq/DeepLabV3_resnet50_vaiq_int8/RAFT_vaiq_int8/U-2-Net_vaiq_int8 in public ONNX storage. Here is one of the detailed logs: dpn68_vaiq_iree_failed.log |
@AmosLewis Please check out the following commit and try b0b3dea |
There are no issues. You can use the latest commit as well. The previous one was more tested. Could you use |
@pashu123
module_main_graph_dispatch_47.mlir
|
Looking into the kernel, the two generics can be fused if the tensor.expand_shape is propagated upward or downward and the first operand of the second generic is either expanded or collapsed to 3D or 1D. @hanhanW, do you have any suggestions? My take is to enable linalg elementwise fusion on dispatches of this kind. |
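For illustration only, here is a hand-written sketch of the kind of pattern being described; the shapes, indexing maps, and op bodies are invented rather than taken from module_main_graph_dispatch_47.mlir, and %in, %empty1d, %empty2d, and %scale are assumed to be defined elsewhere. A dequant generic feeds a tensor.expand_shape, which feeds another elementwise generic; elementwise fusion cannot look through the reshape, but bubbling the expand_shape above the first generic puts both generics on the same iteration space so they can fuse.

%dq = linalg.generic {indexing_maps = [affine_map<(d0) -> (d0)>, affine_map<(d0) -> (d0)>], iterator_types = ["parallel"]} ins(%in : tensor<4096xi8>) outs(%empty1d : tensor<4096xf32>) {
^bb0(%a: i8, %b: f32):
  %e = arith.extsi %a : i8 to i32
  %f = arith.sitofp %e : i32 to f32
  %s = arith.mulf %f, %scale : f32
  linalg.yield %s : f32
} -> tensor<4096xf32>
// The reshape sits between producer and consumer, blocking elementwise fusion.
// Bubbling it up so it applies to %in (tensor<4096xi8> -> tensor<64x64xi8>) removes the barrier.
%expanded = tensor.expand_shape %dq [[0, 1]] output_shape [64, 64] : tensor<4096xf32> into tensor<64x64xf32>
%res = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%expanded : tensor<64x64xf32>) outs(%empty2d : tensor<64x64xf32>) {
^bb0(%x: f32, %y: f32):
  %r = arith.addf %x, %x : f32
  linalg.yield %r : f32
} -> tensor<64x64xf32>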
Looking into the failure https://gist.github.com/AmosLewis/35ed28904fd6e82de0c66546b18579df#file-dpn68_vaiq_iree_failed-log, the problem is with the fusion of the producer. |
I created a repro IR at https://gist.github.com/pashu123/d52f974975f0ebcfa6b131d076660e70, and it was successfully compiled. After bubbling up the tensor.expand op, elementwise fusion took place. |
@MaheshRavishankar @IanWood1 why do we have a tensor.expand_shape in between? I thought that the reshape ops become If it is expected, do we apply some cleanups (like what Prashant mentioned) at the flow level? That would be beneficial to all the backends. |
We shouldn't have them in between; they don't get cloned. It's possible that this came from I'm going to take a look at that. |
…onsumerProducer Pass (#18114) Previously, we only tiled the reduction tile sizes and did not fuse them with the producers from the input operands. This led to transfer reads/writes with large vector sizes, since the dequant operation materialised its own tensor and wasn't fused inside the reduction loop. Adds an `onlyFuseProducerInputOperands` option to the tile-root-and-fuse-consumer-producer-pass. If the option is set to true, it tiles the reduction dimension and fuses the operations arising from the input operands of the already-tiled operation. Issue link: #18005
There are a number of reasons why the backend might emit a
@hanhanW do we want to try to emit a more descriptive error? Or at least have a more explicit check for ops we know can't be fused? But maybe providing a descriptive error here is more difficult than I understand. |
My understanding is that all the compute ops in the SSA chain should implement TilingInterface. Otherwise, there is not much we can do in codegen. So (1), (2) and (3) are problematic codegen inputs to me. This does not only happen on the CPU backend; it happens on all the other backends as well. Basically, you won't be able to distribute the workload without modifying the graph. If the conclusion is that we always want to update the graph, then dispatch creation should generate such dispatches for the backends. Thus, I think we could add (I know that we have an option that fuses everything into a single dispatch, but it is not the default behavior. The CPU backend could handle the case, but in a nonsensical way that is very slow.) |
It's hard to write such verifiers because they depend on the implementation status of codegen and are not tied to any "real" constraints (like large vectors are bad, or large stack allocations are bad). IIUC this is a bug, and we will need to investigate the cause every time we hit this error (it's basically a catch-all for "something went off the rails"). Not sure we can really do better than that. |
What happened?
When compiling a model with int8 quantization, one of the dispatches fails to compile with the following error:
Min repro adapted from the failing dispatch:
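The min-repro file attached to the issue is not inlined here. For orientation only, a rough hand-written approximation reconstructed from the dispatch dump in the comments above (function name taken from the dispatch symbol; the attached IR may differ in details):

func.func @largeVectorMinRepro(%arg0: tensor<1x320x65x65xi8>) -> tensor<1x320x1x1xf32> {
  %scale = arith.constant 1.250000e-01 : f32
  %zero = arith.constant 0.000000e+00 : f32
  // Dequantize the int8 input to f32.
  %dq_init = tensor.empty() : tensor<1x320x65x65xf32>
  %dq = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%arg0 : tensor<1x320x65x65xi8>) outs(%dq_init : tensor<1x320x65x65xf32>) {
  ^bb0(%in: i8, %out: f32):
    %e = arith.extsi %in : i8 to i32
    %f = arith.sitofp %e : i32 to f32
    %m = arith.mulf %f, %scale : f32
    linalg.yield %m : f32
  } -> tensor<1x320x65x65xf32>
  // Global 65x65 sum pooling down to a 1x1 spatial output.
  %win = tensor.empty() : tensor<65x65xf32>
  %out_init = tensor.empty() : tensor<1x320x1x1xf32>
  %fill = linalg.fill ins(%zero : f32) outs(%out_init : tensor<1x320x1x1xf32>) -> tensor<1x320x1x1xf32>
  %pool = linalg.pooling_nchw_sum {dilations = dense<1> : vector<2xi64>, strides = dense<1> : vector<2xi64>} ins(%dq, %win : tensor<1x320x65x65xf32>, tensor<65x65xf32>) outs(%fill : tensor<1x320x1x1xf32>) -> tensor<1x320x1x1xf32>
  return %pool : tensor<1x320x1x1xf32>
}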
Compile command:
iree-compile --iree-input-demote-i64-to-i32 --iree-hal-target-backends=llvm-cpu largevectorissue.minrepro.mlir -o test.vmfb
host issue here
Steps to reproduce your issue
What component(s) does this issue relate to?
No response
Version information
No response
Additional context
No response