[SLPVectorizer] Unprofitable vectorization of i8 icmp eqs #59867

Closed
nikic opened this issue Jan 6, 2023 · 7 comments
@nikic
Contributor

nikic commented Jan 6, 2023

https://llvm.godbolt.org/z/hdnbn1EGq

; RUN: opt -S -passes=slp-vectorizer < %s
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define i1 @test(ptr %s1, ptr %s2) {
  %v1.1 = load i8, ptr %s1, align 1
  %v2.1 = load i8, ptr %s2, align 1
  %c1 = icmp eq i8 %v1.1, %v2.1
  %s1.2 = getelementptr inbounds i8, ptr %s1, i64 1
  %v1.2 = load i8, ptr %s1.2, align 1
  %s2.2 = getelementptr inbounds i8, ptr %s2, i64 1
  %v2.2 = load i8, ptr %s2.2, align 1
  %c2 = icmp eq i8 %v1.2, %v2.2
  %res = select i1 %c1, i1 %c2, i1 false
  ret i1 %res
}

Results in:

define i1 @test(ptr %s1, ptr %s2) {
  %1 = load <2 x i8>, ptr %s1, align 1
  %2 = load <2 x i8>, ptr %s2, align 1
  %3 = icmp eq <2 x i8> %1, %2
  %4 = extractelement <2 x i1> %3, i32 0
  %5 = extractelement <2 x i1> %3, i32 1
  %res = select i1 %4, i1 %5, i1 false
  ret i1 %res
}

This doesn't look like a profitable vectorization to me. The resulting codegen (test is the scalar form, test2 the vectorized form) looks as follows:

test:                                   # @test
        movzx   eax, byte ptr [rdi]
        movzx   ecx, byte ptr [rdi + 1]
        xor     al, byte ptr [rsi]
        xor     cl, byte ptr [rsi + 1]
        or      cl, al
        sete    al
        ret
test2:                                  # @test2
        movzx   eax, word ptr [rdi]
        movd    xmm0, eax
        movzx   eax, word ptr [rsi]
        movd    xmm1, eax
        pcmpeqb xmm1, xmm0
        punpcklbw       xmm0, xmm1              # xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
        pshuflw xmm0, xmm0, 96                  # xmm0 = xmm0[0,0,2,1,4,5,6,7]
        pshufd  xmm0, xmm0, 80                  # xmm0 = xmm0[0,0,1,1]
        movmskpd        eax, xmm0
        cmp     al, 3
        sete    al
        ret
@llvmbot
Member

llvmbot commented Jan 6, 2023

@llvm/issue-subscribers-backend-x86

@efriedma-quic
Collaborator

efriedma-quic commented Jan 6, 2023

The generated code depends on the exact target; the transform looks a lot more reasonable with -mattr=+sse4.1:

test:                                   # @test
        pmovzxbq        xmm0, word ptr [rdi]            # xmm0 = mem[0],zero,zero,zero,zero,zero,zero,zero,mem[1],zero,zero,zero,zero,zero,zero,zero
        pmovzxbq        xmm1, word ptr [rsi]            # xmm1 = mem[0],zero,zero,zero,zero,zero,zero,zero,mem[1],zero,zero,zero,zero,zero,zero,zero
        psubq   xmm0, xmm1
        ptest   xmm0, xmm0
        sete    al
        ret

RKSimon self-assigned this Jan 6, 2023
@RKSimon
Collaborator

RKSimon commented Jan 6, 2023

I'll take a look at this

@sjoerdmeijer
Collaborator

On AArch64, this also doesn't look profitable to me:

https://godbolt.org/z/oMrd8zav5

It is not exactly the same case, but very similar to a regression that I am looking at where SLP vectorisation causes a 5% slowdown overall compared to scalar code. Like in this case, the problem is the overhead of the inserts and extracts and the small vectorisation factor.

I was looking at the SLP vectoriser when I thought to check the GH issues and found this one. @RKSimon: would you mind adding me as a reviewer/subscriber to your fix? I would like to check whether it solves my case too, or whether there's more to do in this area.

@RKSimon
Collaborator

RKSimon commented Feb 8, 2023

It's looking more and more like I'm going to fix this in the DAG - the vectorizer reduction costs don't have good enough access to the source value to recognise the ALLOF/ANYOF + ICMP_EQ/NE pattern, but in the DAG it's going to be relatively trivial to bitcast the <2 x i8> values to i16 and just compare the scalars directly.
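
A minimal sketch of the scalarized form described above, written as IR for illustration (the actual fold happens in SelectionDAG): lane-wise equality across all lanes of two <2 x i8> values is equivalent to equality of their i16 bit patterns.

define i1 @test(ptr %s1, ptr %s2) {
  %v1 = load <2 x i8>, ptr %s1, align 1
  %v2 = load <2 x i8>, ptr %s2, align 1
  ; reinterpret each vector as a scalar integer of the same width
  %a = bitcast <2 x i8> %v1 to i16
  %b = bitcast <2 x i8> %v2 to i16
  ; all_of(icmp eq) over the lanes == equality of the whole bit pattern
  %res = icmp eq i16 %a, %b
  ret i1 %res
}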

RKSimon added a commit that referenced this issue Feb 12, 2023
…mp_eq()) / any_of(icmp_ne()) to integers

Noticed while working on Issue #59867 and Issue #53419 - there's still more to do here, but for "all vector" comparisons, we should try to cast to a scalar integer for sub-128bit types
@RKSimon
Collaborator

RKSimon commented Apr 4, 2023

Could we just add an InstCombine peephole from:

define i1 @src(ptr %s1, ptr %s2) {
  %1 = load <2 x i8>, ptr %s1, align 1
  %2 = load <2 x i8>, ptr %s2, align 1
  %3 = icmp eq <2 x i8> %1, %2
  %4 = extractelement <2 x i1> %3, i32 0
  %5 = extractelement <2 x i1> %3, i32 1
  %res = select i1 %4, i1 %5, i1 false
  ret i1 %res
}

to

define i1 @tgt(ptr %s1, ptr %s2) {
  %1 = load <2 x i8>, ptr %s1, align 1
  %2 = load <2 x i8>, ptr %s2, align 1
  %3 = icmp eq <2 x i8> %1, %2
  %4 = freeze <2 x i1> %3
  %5 = bitcast <2 x i1> %4 to i2
  %res = icmp eq i2 %5, -1
  ret i1 %res
}
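
Presumably the freeze is what makes this sound: if %3 were, say, <i1 false, i1 poison>, the select in @src still returns the well-defined value false, since a false condition ignores its poison true-arm, whereas bitcasting the unfrozen vector to i2 would yield poison. After the freeze, any false lane already forces the icmp against -1 to return false, so the value chosen for the frozen lane doesn't matter.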

@RKSimon
Collaborator

RKSimon commented Apr 24, 2023

@sjoerdmeijer In the end the best solution was to increase the costs for subvector loads/stores less than 32 bits wide, as for most x86 targets that means we have to scalarize them and then transfer to/from the FPU.
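
With that change, vectorizing the pair of i8 loads in the original example should no longer look profitable to the SLP vectorizer, so it keeps the scalar form shown in the first codegen listing.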
