sub-optimal codegen for llvm.experimental.vector.reduce of <N x i1> #38188
Comments
assigned to @RKSimon
I don’t think I understand how any of these can be done with a single movmsk. Movmsk collects the sign bits of the elements into adjacent bits of a GPR. How is that equivalent to an i1 result?
Also, it’s a truncate to vXi1 in IR, which would only grab the LSB. Movmsk grabs the MSB. So without knowing that the value is in the sign bits, those wouldn't be interchangeable either.
Modified godbolt that uses comparisons: https://gcc.godbolt.org/z/ioTWEX
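The linked example isn't reproduced above; a minimal sketch of what a comparison-based variant might look like (the @and128_x4_cmp name is made up here, and the sign-bit compare is an assumption about what the modified example does) is:
declare i1 @llvm.experimental.vector.reduce.and.v4i1(<4 x i1>)
define i1 @and128_x4_cmp(<4 x i32> %v) {
  ; compare each lane against zero so the lane mask lives in the sign bits
  %m = icmp slt <4 x i32> %v, zeroinitializer
  %b = call i1 @llvm.experimental.vector.reduce.and.v4i1(<4 x i1> %m)
  ret i1 %b
}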
Oops, it isn't! The result of movmsk would need to be compared against an appropriate integer value (the value in which the first N bits are set). The result of that comparison is equivalent to the i1 that the vector.reduce.and intrinsic returns. That is, IIUC, the correct assembly for and128_x4 takes ~1.25 cycles, which is much better than what's currently being produced, which takes ~4 cycles on SKL.
Why not just put a bitcast and icmp in IR? The reduction intrinsics are not really intended for i1 elements.
I am not sure if the following is equivalent, but if the problem is i1 then I can also use i32: declare i32 @llvm.experimental.vector.reduce.and.v8i1(<8 x i32>); That code truncates to a vector of i1 and then sign-extends it back to i32, so that movmsk becomes valid IIUC. It still produces the same kind of sub-optimal code for and256_x8.
Or did you mean with the icmp that I should do something like this? declare i32 @llvm.experimental.vector.reduce.and.v8i1(<8 x i32>); That still produces similar assembly for and256_x8.
I meant don’t use the intrinsic at all. Truncate to vXi1. Bitcast to iX. icmp with 2^X-1. Sorry, I’m writing from my phone so it’s hard to write IR or test myself at the moment.
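Spelled out for the <8 x i32> case, that recipe might look roughly like the following (a sketch only; the @and256_x8_bitcast name is invented here):
define i1 @and256_x8_bitcast(<8 x i32> %v) {
  %t = trunc <8 x i32> %v to <8 x i1>   ; keep only the low bit of each lane
  %b = bitcast <8 x i1> %t to i8        ; pack the 8 lanes into a single i8
  %r = icmp eq i8 %b, -1                ; -1 is 2^8-1: true iff every lane was set
  ret i1 %r
}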
Don't worry! And sincere thanks for trying to help me get to the bottom of this! So let's see if I got this right: I wrote two versions of define i1 @and256_x8(<8 x i32>), where the second version is the result of passing the first one through opt -O3. They produce the following machine code (with a .LCPI0_0 constant-pool load), which isn't great either AFAICT.
I'm working on a patch that will turn the O3 form into better code for and256_x8_opt.
Most of this looks good in trunk now - test cases added at rL359385. Using @llvm.experimental.vector.reduce.* for and/or uses MOVMSK pretty efficiently, and the bitcast variants also optimize well. The only outstanding issue is the @llvm.experimental.vector.reduce.xor.vXi1 reduction, which isn't currently covered by combineHorizontalPredicateResult - but a MOVMSK followed by a parity check should work correctly for that.
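For reference, a minimal test case for that remaining xor case might look like this (a sketch mirroring the and/or cases above, not the exact test from the review below):
declare i1 @llvm.experimental.vector.reduce.xor.v4i1(<4 x i1>)
define i1 @xor128_x4(<4 x i32> %v) {
  %m = icmp slt <4 x i32> %v, zeroinitializer   ; lane mask in the sign bits
  %b = call i1 @llvm.experimental.vector.reduce.xor.v4i1(<4 x i1> %m)  ; parity of the set lanes
  ret i1 %b
}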
xor bool reduction: https://reviews.llvm.org/D61230
Committed at rL359396 |
mentioned in issue llvm/llvm-bugzilla-archive#51122 |
Extended Description
The llvm.experimental.vector.reduce.{and,or,xor} intrinsics produce very sub-optimal machine code with the x86 backend. See it live: https://gcc.godbolt.org/z/qIHi6D
LLVM-IR:
declare i1 @llvm.experimental.vector.reduce.and.v32i1(<32 x i1>);
declare i1 @llvm.experimental.vector.reduce.and.v8i1(<8 x i1>);
declare i1 @llvm.experimental.vector.reduce.and.v4i1(<4 x i1>);
declare i1 @llvm.experimental.vector.reduce.and.v2i1(<2 x i1>);
define i1 @and128_x2(<2 x i64>) {
%a = trunc <2 x i64> %0 to <2 x i1>
%b = call i1 @llvm.experimental.vector.reduce.and.v2i1(<2 x i1> %a)
ret i1 %b
}
define i1 @and128_x4(<4 x i32>) {
%a = trunc <4 x i32> %0 to <4 x i1>
%b = call i1 @llvm.experimental.vector.reduce.and.v4i1(<4 x i1> %a)
ret i1 %b
}
define i1 @and128_x8(<8 x i8>) {
%a = trunc <8 x i8> %0 to <8 x i1>
%b = call i1 @llvm.experimental.vector.reduce.and.v8i1(<8 x i1> %a)
ret i1 %b
}
define i1 @and256_x4(<4 x i64>) {
%a = trunc <4 x i64> %0 to <4 x i1>
%b = call i1 @llvm.experimental.vector.reduce.and.v4i1(<4 x i1> %a)
ret i1 %b
}
define i1 @and256_x8(<8 x i32>) {
%a = trunc <8 x i32> %0 to <8 x i1>
%b = call i1 @llvm.experimental.vector.reduce.and.v8i1(<8 x i1> %a)
ret i1 %b
}
define i1 @and256_x32(<32 x i8>) {
%a = trunc <32 x i8> %0 to <32 x i1>
%b = call i1 @llvm.experimental.vector.reduce.and.v32i1(<32 x i1> %a)
ret i1 %b
}
produces
and128_x2: # @and128_x2
pshufd $78, %xmm0, %xmm1 # xmm1 = xmm0[2,3,0,1]
pand %xmm0, %xmm1
movd %xmm1, %eax
retq
and128_x4: # @and128_x4
pshufd $78, %xmm0, %xmm1 # xmm1 = xmm0[2,3,0,1]
pand %xmm0, %xmm1
pshufd $229, %xmm1, %xmm0 # xmm0 = xmm1[1,1,2,3]
pand %xmm1, %xmm0
movd %xmm0, %eax
retq
and128_x8: # @and128_x8
pshufd $78, %xmm0, %xmm1 # xmm1 = xmm0[2,3,0,1]
pand %xmm0, %xmm1
pshufd $229, %xmm1, %xmm0 # xmm0 = xmm1[1,1,2,3]
pand %xmm1, %xmm0
movdqa %xmm0, %xmm1
psrld $16, %xmm1
pand %xmm0, %xmm1
movd %xmm1, %eax
retq
and256_x4: # @and256_x4
shufps $136, %xmm1, %xmm0 # xmm0 = xmm0[0,2],xmm1[0,2]
pshufd $78, %xmm0, %xmm1 # xmm1 = xmm0[2,3,0,1]
pand %xmm0, %xmm1
pshufd $229, %xmm1, %xmm0 # xmm0 = xmm1[1,1,2,3]
pand %xmm1, %xmm0
movd %xmm0, %eax
retq
and256_x8: # @and256_x8
pshuflw $232, %xmm1, %xmm1 # xmm1 = xmm1[0,2,2,3,4,5,6,7]
pshufhw $232, %xmm1, %xmm1 # xmm1 = xmm1[0,1,2,3,4,6,6,7]
pshufd $232, %xmm1, %xmm1 # xmm1 = xmm1[0,2,2,3]
pshuflw $232, %xmm0, %xmm0 # xmm0 = xmm0[0,2,2,3,4,5,6,7]
pshufhw $232, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2,3,4,6,6,7]
pshufd $232, %xmm0, %xmm0 # xmm0 = xmm0[0,2,2,3]
punpcklqdq %xmm1, %xmm0 # xmm0 = xmm0[0],xmm1[0]
pshufd $78, %xmm0, %xmm1 # xmm1 = xmm0[2,3,0,1]
pand %xmm0, %xmm1
pshufd $229, %xmm1, %xmm0 # xmm0 = xmm1[1,1,2,3]
pand %xmm1, %xmm0
movdqa %xmm0, %xmm1
psrld $16, %xmm1
pand %xmm0, %xmm1
movd %xmm1, %eax
retq
and256_x32: # @and256_x32
pand %xmm1, %xmm0
pshufd $78, %xmm0, %xmm1 # xmm1 = xmm0[2,3,0,1]
pand %xmm0, %xmm1
pshufd $229, %xmm1, %xmm0 # xmm0 = xmm1[1,1,2,3]
pand %xmm1, %xmm0
movdqa %xmm0, %xmm1
psrld $16, %xmm1
pand %xmm0, %xmm1
movdqa %xmm1, %xmm0
psrlw $8, %xmm0
pand %xmm1, %xmm0
movd %xmm0, %eax
retq
but these should all lower to a single movmsk instruction:
and128_x2:
movmskpd %xmm0, %eax
retq
and128_x4:
movmskps %xmm0, %eax
retq
and128_x8:
pmovmskb %xmm0, %eax
retq
and256_x4:
vmovmskpd %ymm0, %eax
vzeroupper
retq
and256_x8:
vmovmskps %ymm0, %eax
vzeroupper
retq
and256_x32:
vpmovmskb %ymm0, %eax
vzeroupper
retq
The llvm.experimental.vector.reduce.and intrinsic for <8 x i16>, <16 x i16>, <1 x i128>, <2 x i128>, etc. probably produces very sub-optimal machine code for i1 vectors as well.
The llvm.experimental.vector.reduce.or and llvm.experimental.vector.reduce.xor intrinsics probably produce very sub-optimal machine code for all these i1 vectors too.
These llvm intrinsics are critical for efficiently performing coherent control flow.