[SLPVectorizer] Unprofitable vectorization of i8 icmp eqs #59867
https://llvm.godbolt.org/z/hdnbn1EGq

This doesn't look like a profitable vectorization to me; the resulting codegen is shown at the Godbolt link above.
Comments
@llvm/issue-subscribers-backend-x86
The generated code depends on the exact target; the transform looks a lot more reasonable on targets with SSE4.1 available, for example:

```asm
test:                                   # @test
        pmovzxbq xmm0, word ptr [rdi]   # xmm0 = mem[0],zero,zero,zero,zero,zero,zero,zero,mem[1],zero,zero,zero,zero,zero,zero,zero
        pmovzxbq xmm1, word ptr [rsi]   # xmm1 = mem[0],zero,zero,zero,zero,zero,zero,zero,mem[1],zero,zero,zero,zero,zero,zero,zero
        psubq   xmm0, xmm1
        ptest   xmm0, xmm0
        sete    al
        ret
```
I'll take a look at this.
On AArch64, this also doesn't look profitable to me: https://godbolt.org/z/oMrd8zav5

It is not exactly the same case, but it is very similar to a regression I am looking at where SLP vectorisation causes a 5% slowdown overall compared to scalar code. As in this case, the problem is the overhead of the inserts and extracts combined with the small vectorisation factor. I was looking at the SLP vectoriser when I thought to check the GH issues and found this.

@RKSimon: would you mind adding me as a reviewer/subscriber to your fix? I would like to check whether it solves my case too, or whether there's more to do in this area.
It's looking more and more like I'm going to fix this in the DAG: the vectorizer reduction costs don't have good enough access to the source value to recognise the ALLOF/ANYOF + ICMP_EQ/NE pattern, but in the DAG it's going to be relatively trivial to bitcast the <2 x i8> values to i16 and just compare the scalars directly.
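To illustrate the idea (expressed as IR rather than DAG nodes, with hypothetical names; a sketch of the intent, not the actual patch):

```llvm
; Sketch: instead of a vector compare plus an all-of reduction, bitcast
; each <2 x i8> value to i16 and do a single scalar compare.
define i1 @bitcast_form(ptr %s1, ptr %s2) {
  %v1 = load <2 x i8>, ptr %s1, align 1
  %v2 = load <2 x i8>, ptr %s2, align 1
  %a = bitcast <2 x i8> %v1 to i16
  %b = bitcast <2 x i8> %v2 to i16
  %res = icmp eq i16 %a, %b
  ret i1 %res
}
```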
Could we just add an InstCombine peephole from:

```llvm
define i1 @src(ptr %s1, ptr %s2) {
  %1 = load <2 x i8>, ptr %s1, align 1
  %2 = load <2 x i8>, ptr %s2, align 1
  %3 = icmp eq <2 x i8> %1, %2
  %4 = extractelement <2 x i1> %3, i32 0
  %5 = extractelement <2 x i1> %3, i32 1
  %res = select i1 %4, i1 %5, i1 false
  ret i1 %res
}
```

to:

```llvm
define i1 @tgt(ptr %s1, ptr %s2) {
  %1 = load <2 x i8>, ptr %s1, align 1
  %2 = load <2 x i8>, ptr %s2, align 1
  %3 = icmp eq <2 x i8> %1, %2
  %4 = freeze <2 x i1> %3
  %5 = bitcast <2 x i1> %4 to i2
  %res = icmp eq i2 %5, -1
  ret i1 %res
}
```

(Presumably the freeze is needed because the select form never observes lane 1 when lane 0 is false, whereas the bitcast form reads both lanes, so a poison lane must be frozen to keep the rewrite sound.)
@sjoerdmeijer In the end the best solution was to increase the costs for subvector loads/stores less than 32 bits wide, since for most x86 targets that means we have to scalarize the access and then transfer to/from the FPU.
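For context, a minimal sketch of the kind of IR that should now stay scalar once those sub-32-bit vector memory costs go up; the function and value names are illustrative, not taken from the issue:

```llvm
; Two i8 compares ANDed together: with realistic costs for a 16-bit
; vector load, SLP should leave this scalar rather than build <2 x i8>.
define i1 @cmp2(ptr %s1, ptr %s2) {
  %a0 = load i8, ptr %s1, align 1
  %b0 = load i8, ptr %s2, align 1
  %c0 = icmp eq i8 %a0, %b0
  %p1 = getelementptr inbounds i8, ptr %s1, i64 1
  %q1 = getelementptr inbounds i8, ptr %s2, i64 1
  %a1 = load i8, ptr %p1, align 1
  %b1 = load i8, ptr %q1, align 1
  %c1 = icmp eq i8 %a1, %b1
  %res = and i1 %c0, %c1
  ret i1 %res
}
```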