[SLPVectorizer] Unprofitable vectorization of i8 icmp eqs #59867

Closed
nikic opened this issue Jan 6, 2023 · 7 comments
@nikic
Contributor

nikic commented Jan 6, 2023

https://llvm.godbolt.org/z/hdnbn1EGq

; RUN: opt -S -passes=slp-vectorizer < %s
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define i1 @test(ptr %s1, ptr %s2) {
  %v1.1 = load i8, ptr %s1, align 1
  %v2.1 = load i8, ptr %s2, align 1
  %c1 = icmp eq i8 %v1.1, %v2.1
  %s1.2 = getelementptr inbounds i8, ptr %s1, i64 1
  %v1.2 = load i8, ptr %s1.2, align 1
  %s2.2 = getelementptr inbounds i8, ptr %s2, i64 1
  %v2.2 = load i8, ptr %s2.2, align 1
  %c2 = icmp eq i8 %v1.2, %v2.2
  %res = select i1 %c1, i1 %c2, i1 false
  ret i1 %res
}

Results in:

define i1 @test(ptr %s1, ptr %s2) {
  %1 = load <2 x i8>, ptr %s1, align 1
  %2 = load <2 x i8>, ptr %s2, align 1
  %3 = icmp eq <2 x i8> %1, %2
  %4 = extractelement <2 x i1> %3, i32 0
  %5 = extractelement <2 x i1> %3, i32 1
  %res = select i1 %4, i1 %5, i1 false
  ret i1 %res
}

This doesn't look like a profitable vectorization to me. The resulting codegen (test is the scalar form, test2 the vectorized form) looks as follows:

test:                                   # @test
        movzx   eax, byte ptr [rdi]
        movzx   ecx, byte ptr [rdi + 1]
        xor     al, byte ptr [rsi]
        xor     cl, byte ptr [rsi + 1]
        or      cl, al
        sete    al
        ret
test2:                                  # @test2
        movzx   eax, word ptr [rdi]
        movd    xmm0, eax
        movzx   eax, word ptr [rsi]
        movd    xmm1, eax
        pcmpeqb xmm1, xmm0
        punpcklbw       xmm0, xmm1              # xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
        pshuflw xmm0, xmm0, 96                  # xmm0 = xmm0[0,0,2,1,4,5,6,7]
        pshufd  xmm0, xmm0, 80                  # xmm0 = xmm0[0,0,1,1]
        movmskpd        eax, xmm0
        cmp     al, 3
        sete    al
        ret
@llvmbot
Member

llvmbot commented Jan 6, 2023

@llvm/issue-subscribers-backend-x86

@efriedma-quic
Collaborator

efriedma-quic commented Jan 6, 2023

The generated code depends on the exact target; the transform looks a lot more reasonable with -mattr=+sse4.1:

test:                                   # @test
        pmovzxbq        xmm0, word ptr [rdi]            # xmm0 = mem[0],zero,zero,zero,zero,zero,zero,zero,mem[1],zero,zero,zero,zero,zero,zero,zero
        pmovzxbq        xmm1, word ptr [rsi]            # xmm1 = mem[0],zero,zero,zero,zero,zero,zero,zero,mem[1],zero,zero,zero,zero,zero,zero,zero
        psubq   xmm0, xmm1
        ptest   xmm0, xmm0
        sete    al
        ret

RKSimon self-assigned this Jan 6, 2023
@RKSimon
Collaborator

RKSimon commented Jan 6, 2023

I'll take a look at this

@sjoerdmeijer
Collaborator

On AArch64, this also doesn't look profitable to me:

https://godbolt.org/z/oMrd8zav5

It is not exactly the same case, but very similar to a regression that I am looking at where SLP vectorisation causes a 5% slowdown overall compared to scalar code. Like in this case, the problem is the overhead of the inserts and extracts and the small vectorisation factor.

I was looking at the SLP vectoriser when I thought to check the GH issues and found this one. @RKSimon: would you mind adding me as a reviewer/subscriber to your fix? I would like to check whether it solves my case too, or whether there's more to do in this area.

@RKSimon
Collaborator

RKSimon commented Feb 8, 2023

It's looking more and more like I'm going to fix this in the DAG - the vectorizer reduction costs don't have good enough access to the source value to recognise the ALLOF/ANYOF + ICMP_EQ/NE pattern, but in the DAG it's going to be relatively trivial to bitcast the <2 x i8> values to i16 and just compare the scalars directly.
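
A minimal sketch of the scalarized form described above, written as IR for illustration (the actual fold happens in SelectionDAG): lane-wise equality across all lanes of two <2 x i8> values is equivalent to equality of their i16 bit patterns.

define i1 @test(ptr %s1, ptr %s2) {
  %v1 = load <2 x i8>, ptr %s1, align 1
  %v2 = load <2 x i8>, ptr %s2, align 1
  ; reinterpret each vector as a scalar integer of the same width
  %a = bitcast <2 x i8> %v1 to i16
  %b = bitcast <2 x i8> %v2 to i16
  ; all_of(icmp eq) over the lanes == equality of the whole bit pattern
  %res = icmp eq i16 %a, %b
  ret i1 %res
}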

RKSimon added a commit that referenced this issue Feb 12, 2023
…mp_eq()) / any_of(icmp_ne()) to integers

Noticed while working on Issue #59867 and Issue #53419 - there's still more to do here, but for "all vector" comparisons, we should try to cast to a scalar integer for sub-128bit types
@RKSimon
Collaborator

RKSimon commented Apr 4, 2023

Could we just add an InstCombine peephole from:

define i1 @src(ptr %s1, ptr %s2) {
  %1 = load <2 x i8>, ptr %s1, align 1
  %2 = load <2 x i8>, ptr %s2, align 1
  %3 = icmp eq <2 x i8> %1, %2
  %4 = extractelement <2 x i1> %3, i32 0
  %5 = extractelement <2 x i1> %3, i32 1
  %res = select i1 %4, i1 %5, i1 false
  ret i1 %res
}

to

define i1 @tgt(ptr %s1, ptr %s2) {
  %1 = load <2 x i8>, ptr %s1, align 1
  %2 = load <2 x i8>, ptr %s2, align 1
  %3 = icmp eq <2 x i8> %1, %2
  %4 = freeze <2 x i1> %3
  %5 = bitcast <2 x i1> %4 to i2
  %res = icmp eq i2 %5, -1
  ret i1 %res
}
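
Presumably the freeze is what makes this sound: if %3 were, say, <i1 false, i1 poison>, the select in @src still returns the well-defined value false, since a false condition ignores its poison true-arm, whereas bitcasting the unfrozen vector to i2 would yield poison. After the freeze, any false lane already forces the icmp against -1 to return false, so the value chosen for the frozen lane doesn't matter.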

@RKSimon
Collaborator

RKSimon commented Apr 24, 2023

@sjoerdmeijer In the end the best solution was to increase the costs for subvector loads/stores less than 32 bits wide, as for most x86 targets that means we have to scalarize them and then transfer to/from the FPU.
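
With that change, vectorizing the pair of i8 loads in the original example should no longer look profitable to the SLP vectorizer, so it keeps the scalar form shown in the first codegen listing.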
