[X86] Suboptimal lowering of short vectors equality check: could use scalar types instead #53419

xortator · 2022-01-26T10:23:39Z

Motivating case: https://godbolt.org/z/rbE3TzqdP

The original test

define i1 @vector_version(i8* align 1 %arg, i8* align 1 %arg1, i32 %arg2) {
bb:
  %ptr1 = bitcast i8* %arg1 to <4 x i8>*
  %ptr2 = bitcast i8* %arg to <4 x i8>*
  %lhs = load <4 x i8>, <4 x i8>* %ptr1, align 1
  %rhs = load <4 x i8>, <4 x i8>* %ptr2, align 1
  %any_ne = icmp ne <4 x i8> %lhs, %rhs
  %any_ne_scalar = bitcast <4 x i1> %any_ne to i4
  %all_eq = icmp eq i4 %any_ne_scalar, 0
  ret i1 %all_eq
}

reads two short vector values and effectively checks that they are equal. Codegen generates vector code from it:

vector_version:                         # @vector_version
        vpmovzxbd       (%rsi), %xmm0           # xmm0 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
        vpmovzxbd       (%rdi), %xmm1           # xmm1 = mem[0],zero,zero,zero,mem[1],zero,zero,zero,mem[2],zero,zero,zero,mem[3],zero,zero,zero
        vpsubd  %xmm1, %xmm0, %xmm0
        vptest  %xmm0, %xmm0
        sete    %al
        retq

This code is semantically equivalent to its scalar counterpart

define i1 @scalar_version(i8* align 1 %arg, i8* align 1 %arg1, i32 %arg2) {
bb:
  %ptr1 = bitcast i8* %arg1 to i32*
  %ptr2 = bitcast i8* %arg to i32*
  %lhs = load i32, i32* %ptr1, align 1
  %rhs = load i32, i32* %ptr2, align 1
  %all_eq = icmp eq i32 %lhs, %rhs
  ret i1 %all_eq
}

which produces neater asm:

scalar_version:                         # @scalar_version
        movl    (%rsi), %eax
        cmpl    (%rdi), %eax
        sete    %al
        retq
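
The claimed equivalence can be sketched as a small C++ model. This is only an illustration: `vector_version` and `scalar_version` below are hypothetical C++ stand-ins that mirror the two IR functions (lane-wise `icmp ne`, the `<4 x i1>` to `i4` bitcast, and the compare against zero), not code taken from LLVM.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Models the <4 x i8> IR: compare lane by lane, pack the "not equal" bits
// into a 4-bit mask (the bitcast <4 x i1> -> i4), then test the mask for zero.
bool vector_version(const std::uint8_t *arg, const std::uint8_t *arg1) {
  assert(arg != nullptr && arg1 != nullptr);
  unsigned any_ne = 0;
  for (int lane = 0; lane < 4; ++lane)
    any_ne |= static_cast<unsigned>(arg1[lane] != arg[lane]) << lane;
  return any_ne == 0;
}

// Models the i32 IR: one unaligned 32-bit load per side, then one compare.
bool scalar_version(const std::uint8_t *arg, const std::uint8_t *arg1) {
  assert(arg != nullptr && arg1 != nullptr);
  std::uint32_t lhs, rhs;
  std::memcpy(&lhs, arg1, sizeof lhs); // memcpy stands in for the align-1 load
  std::memcpy(&rhs, arg, sizeof rhs);
  return lhs == rhs;
}
```

Both functions return the same answer for any pair of 4-byte buffers, because the 4-bit mask is zero exactly when all 32 bits match.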

Unfortunately, we cannot use RM vector sub here, as stated in #53416, but it looks like we could avoid vector registers altogether.

I'm not sure which is the proper place for this: codegen or InstCombine.


xortator · 2022-01-26T10:36:42Z

Here is something in between: if we bitcast vector values to i32 after load, codegen can produce good code.
https://godbolt.org/z/bseaqYjjq
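
A rough C++ analogue of this intermediate form, assuming the linked example keeps the `<4 x i8>` loads and only bitcasts the loaded values to `i32` before comparing. All names below (`V4i8`, `load_v4i8`, `bitcast_to_i32`, `bitcast_version`) are hypothetical and not taken from the Godbolt link.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

using V4i8 = std::array<std::uint8_t, 4>; // stand-in for <4 x i8>

// Stand-in for "load <4 x i8>, align 1".
V4i8 load_v4i8(const std::uint8_t *p) {
  assert(p != nullptr);
  V4i8 v;
  std::memcpy(v.data(), p, v.size());
  return v;
}

// Stand-in for "bitcast <4 x i8> to i32": reinterpret the same 32 bits.
std::uint32_t bitcast_to_i32(const V4i8 &v) {
  std::uint32_t bits;
  std::memcpy(&bits, v.data(), sizeof bits);
  return bits;
}

// The loads are still vector loads, but the comparison is one scalar compare.
bool bitcast_version(const std::uint8_t *arg, const std::uint8_t *arg1) {
  return bitcast_to_i32(load_v4i8(arg1)) == bitcast_to_i32(load_v4i8(arg));
}
```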

llvmbot · 2022-01-26T11:00:52Z

@llvm/issue-subscribers-backend-x86

xortator · 2022-01-26T11:27:25Z

Tests added in 3271f43.

nunoplopes · 2022-01-26T12:33:50Z

> Here is something in between: if we bitcast vector values to i32 after load, codegen can produce good code. https://godbolt.org/z/bseaqYjjq

You can't do that transformation at IR level; it's not sound because of poison values. It has to be delayed until the backend.

nunoplopes · 2022-01-26T12:36:29Z

Ah, never mind, you are doing an and of all the comparisons. That's fine, yes.
(sorry, I misread it as doing 4 individual comparisons)

xortator · 2022-01-27T07:01:17Z

https://reviews.llvm.org/D118317 should address this pattern at the IR level, but I still believe there should also be a codegen solution.

RKSimon · 2022-01-30T12:35:34Z

Several things need to be addressed:

MatchVectorAllZeroTest needs to be extended to handle the (icmp bitcast(iN (icmp vNiX V, 0) ), 0) style reduction pattern that we canonicalize to - this will catch the (legal) integer size patterns.

Extend MatchVectorAllZeroTest and LowerVectorAllZero so that they handle 'VectorAllEqual' patterns - PTEST lowering will need to perform a SUB(X,Y), but the MOVMSK can still use PCMPEQB.

The ExpandReductions pass should fold the allof(icmp(vector)) -> (icmp bitcast(iN (icmp (vector)) ), 0) canonicalization. We currently perform this only inside InstCombine, so we might miss the pattern if anything else generates it.
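
The two lowering options named above for 'VectorAllEqual' patterns can be modelled in portable C++. This is an illustrative sketch only, not LLVM code: `V16i8` and the function names are hypothetical, and plain loops stand in for the SUB, PTEST, PCMPEQB, and MOVMSK steps.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using V16i8 = std::array<std::uint8_t, 16>; // stand-in for one XMM register

// PTEST-style: form the lane-wise difference (the SUB(X,Y) step), then ask
// whether every bit of the result is zero (what PTEST reports through ZF).
bool all_equal_ptest_style(const V16i8 &x, const V16i8 &y) {
  unsigned any_bits = 0;
  for (std::size_t i = 0; i < x.size(); ++i)
    any_bits |= static_cast<std::uint8_t>(x[i] - y[i]); // wraps modulo 256
  return any_bits == 0;
}

// MOVMSK-style: compare lanes for equality (the PCMPEQB step), gather one bit
// per lane into an integer mask, then require all 16 bits to be set.
bool all_equal_movmsk_style(const V16i8 &x, const V16i8 &y) {
  std::uint32_t mask = 0;
  for (std::size_t i = 0; i < x.size(); ++i)
    mask |= static_cast<std::uint32_t>(x[i] == y[i]) << i;
  return mask == 0xFFFFu;
}

// Either strategy may be used; the assert documents that they must agree.
bool all_equal(const V16i8 &x, const V16i8 &y) {
  const bool result = all_equal_movmsk_style(x, y);
  assert(result == all_equal_ptest_style(x, y));
  return result;
}
```

The model shows why the PTEST route needs an extra subtraction: PTEST can only report "all bits zero", so the inputs must first be turned into a value that is zero exactly when they match.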

…mp_eq()) / any_of(icmp_ne()) to integers Noticed while working on Issue #59867 and Issue #53419 - there's still more to do here, but for "all vector" comparisons, we should try to cast to a scalar integer for sub-128bit types

…fold to AVX512 targets Extends 1bb95a3 to combine on AVX512 targets where the vXi1 type is legal Continues work on addressing Issue #53419

…kortestw patterns Another step toward #53419 - this is also another step towards expanding MatchVectorAllZeroTest to match any pair of vectors and merge EmitAVX512Test into it.

…kortestw patterns (REAPPLIED) Another step toward #53419 - this is also another step towards expanding MatchVectorAllZeroTest to match any pair of vectors and merge EmitAVX512Test into it.

RKSimon · 2023-03-30T17:16:25Z

> Several things need to be addressed:
>
> MatchVectorAllZeroTest needs to be extended to handle the (icmp bitcast(iN (icmp vNiX V, 0) ), 0) style reduction pattern that we canonicalize to - this will catch the (legal) integer size patterns.

Candidate Patch: https://reviews.llvm.org/D147243

…,Y)),0) vector reduction patterns

Many allof/anyof/noneof reduction patterns are canonicalized by bitcasting a vXi1 vector comparison result to iN and comparing it against 0/-1. This patch adds support for recognizing an icmp_ne vector comparison against 0, which matches a 'whole vectors are equal' comparison pattern.

There are a few more steps to follow in future patches: we need to add support to MatchVectorAllZeroTest for comparing against -1 (in some cases), and this initial refactoring of LowerVectorAllZero to LowerVectorAllEqual needs to be extended so we can fully merge with the similar combineVectorSizedSetCCEquality code (which deals with scalar integer memcmp patterns).

Another step towards Issue #53419

Differential Revision: https://reviews.llvm.org/D147243
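
The canonical form described in this commit message can be illustrated with a small C++ model. It is a sketch under assumptions: the helper names are hypothetical, a 4-lane vector is used for brevity, and the constant `0xF` plays the role of -1 for an `i4`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using V4i8 = std::array<std::uint8_t, 4>; // stand-in for <4 x i8>

constexpr unsigned kAllOnes = 0xF; // the i4 value -1

// Stand-in for: bitcast (icmp ne <4 x i8> x, y) to i4.
unsigned ne_mask(const V4i8 &x, const V4i8 &y) {
  unsigned mask = 0;
  for (std::size_t i = 0; i < x.size(); ++i)
    mask |= static_cast<unsigned>(x[i] != y[i]) << i;
  assert(mask <= kAllOnes); // only the low four bits can be set
  return mask;
}

// anyof(ne): at least one lane differs, so the mask is non-zero.
bool any_ne(const V4i8 &x, const V4i8 &y) { return ne_mask(x, y) != 0; }

// noneof(ne): mask == 0, which is the "whole vectors are equal" pattern.
bool none_ne(const V4i8 &x, const V4i8 &y) { return ne_mask(x, y) == 0; }

// allof(ne): every lane differs, so the mask equals -1 (all ones).
bool all_ne(const V4i8 &x, const V4i8 &y) { return ne_mask(x, y) == kAllOnes; }
```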

RKSimon · 2023-04-03T16:42:14Z

Final Candidate Patch: https://reviews.llvm.org/D147452

… bitcast(<X x i1> V)) canonicalization

This already exists in InstCombine but was missing from the late stage ExpandReductions pass

Fixes llvm#53419
Fixes llvm#61923

Differential Revision: https://reviews.llvm.org/D147452

github-actions bot added the new issue label Jan 26, 2022

xortator changed the title ~~[X86] Suboptimal lowering of short vectors: could use scalar types instead~~ [X86] Suboptimal lowering of short vectors equality check: could use scalar types instead Jan 26, 2022

RKSimon added the backend:X86 label Jan 26, 2022

RKSimon self-assigned this Jan 26, 2022

EugeneZelenko removed the new issue label Jan 26, 2022

RKSimon added the confirmed Verified by a second party label Jan 30, 2022

RKSimon removed their assignment Jun 23, 2022

RKSimon added a commit that referenced this issue Mar 22, 2023

[X86] Extend all_of(icmp_eq()) / any_of(icmp_ne()) -> scalar integer …

ada0356

…fold to AVX512 targets Extends 1bb95a3 to combine on AVX512 targets where the vXi1 type is legal Continues work on addressing Issue #53419

RKSimon self-assigned this Mar 29, 2023

EugeneZelenko added the llvm:codegen label Apr 3, 2023

RKSimon closed this as completed in 00e3ae4 Apr 4, 2023

EugeneZelenko removed the backend:X86 label Apr 4, 2023
