
[VectorCombine][X86] Poor handling of compare-select patterns with AVX2 spoofing on AVX1 targets #67803

Closed · 3 tasks done
RKSimon opened this issue Sep 29, 2023 · 7 comments · Fixed by #87510

Comments

RKSimon (Collaborator) commented Sep 29, 2023

https://godbolt.org/z/Waonx44Mj

For AVX1-only targets we often encounter 'fake-AVX2' code for integer math, like:

#include <immintrin.h>

#if !defined(__AVX2__)
#define _mm256_cmpgt_epi32( a, b ) \
 _mm256_setr_m128i( \
	_mm_cmpgt_epi32( _mm256_extractf128_si256( (a), 0 ), _mm256_extractf128_si256( (b), 0 ) ), \
	_mm_cmpgt_epi32( _mm256_extractf128_si256( (a), 1 ), _mm256_extractf128_si256( (b), 1 ) ) )

#define _mm256_blendv_epi8( a, b, c ) \
 _mm256_setr_m128i( \
	_mm_blendv_epi8( _mm256_extractf128_si256( (a), 0 ), _mm256_extractf128_si256( (b), 0 ), _mm256_extractf128_si256( (c), 0 ) ), \
	_mm_blendv_epi8( _mm256_extractf128_si256( (a), 1 ), _mm256_extractf128_si256( (b), 1 ), _mm256_extractf128_si256( (c), 1 ) ) )
#endif

__m256i cmpsel_epi8(__m256i x, __m256i y, __m256i a, __m256i b) {
    __m256i cmp = _mm256_cmpgt_epi32(x,y);
    return _mm256_blendv_epi8(a,b,cmp);
}

This is really poorly optimized, mainly due to all the bitcasts to/from the __m128i (<2 x i64>) types.

In particular we see this pattern a lot:

  %3 = bitcast <4 x i32> %sext.i to <2 x i64>
  %4 = bitcast <4 x i32> %sext.i21 to <2 x i64>
  %shuffle.i.i = shufflevector <2 x i64> %3, <2 x i64> %4, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %7 = bitcast <4 x i64> %shuffle.i.i to <8 x i32>

We should be able to get VectorCombine to fold this to a <8 x i32> shufflevector instead. In fact, VectorCombine::foldBitcastShuf might handle this if we extend it to binary shuffles, with improved cost handling.
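
A minimal sketch of the folded form we'd like to see, reusing the value names from the snippet above for illustration:

  %concat = shufflevector <4 x i32> %sext.i, <4 x i32> %sext.i21, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>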

We also see:

  %2 = icmp sgt <8 x i32> %0, %1
  %cmp.i = shufflevector <8 x i1> %2, <8 x i1> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %sext.i = sext <4 x i1> %cmp.i to <4 x i32>
  %3 = bitcast <4 x i32> %sext.i to <2 x i64>
  %cmp.i20 = shufflevector <8 x i1> %2, <8 x i1> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %sext.i21 = sext <4 x i1> %cmp.i20 to <4 x i32>
  %4 = bitcast <4 x i32> %sext.i21 to <2 x i64>

We've managed to combine to a single <8 x i32> icmp, but failed to rejoin the compare result sign extensions. We should be able to handle this in VectorCombine if we handle concatenation of casts (based on what we do in VectorCombine::foldShuffleOfBinops).
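
A minimal sketch of the rejoined form, assuming the concat shuffle from the first snippet consumes both sign extensions (names illustrative):

  %2 = icmp sgt <8 x i32> %0, %1
  %sext = sext <8 x i1> %2 to <8 x i32>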

  • Extend VectorCombine::foldBitcastShuf to handle length changing shuffles
  • Extend VectorCombine::foldBitcastShuf to handle binary shuffles
  • Add a VectorCombine::foldShuffleOfCasts similar to VectorCombine::foldShuffleOfBinops

llvmbot (Member) commented Oct 2, 2023

@llvm/issue-subscribers-backend-x86

RKSimon added a commit that referenced this issue Oct 6, 2023
…ffles

Allow length-changing shuffle masks in the "bitcast (shuf V, MaskC) --> shuf (bitcast V), MaskC'" fold.

It also exposes some poor shuffle mask detection for extract/insert subvector cases inside improveShuffleKindFromMask.

First stage towards addressing Issue #67803
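
A hypothetical example of the length-changing case this enables (illustrative values, not taken from the patch's tests):

  ; before
  %shuf = shufflevector <4 x i32> %v, <4 x i32> poison, <2 x i32> <i32 0, i32 1>
  %bc = bitcast <2 x i32> %shuf to <1 x i64>
  ; after - bitcast the source first, then shuffle the wider elements
  %bc2 = bitcast <4 x i32> %v to <2 x i64>
  %shuf2 = shufflevector <2 x i64> %bc2, <2 x i64> poison, <1 x i32> <i32 0>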

nico (Contributor) commented Oct 6, 2023

This broke check-clang: http://45.33.8.238/linux/120114/step_7.txt

Please take a look and revert for now if it takes a while to fix.


RKSimon (Collaborator, Author) commented Oct 6, 2023

Should be fixed by 32a9c09

@RKSimon RKSimon added the good first issue (https://github.com/llvm/llvm-project/contribute) label Feb 21, 2024

llvmbot (Member) commented Feb 21, 2024

Hi!

This issue may be a good introductory issue for people new to working on LLVM. If you would like to work on this issue, your first steps are:

  1. In the comments of the issue, request that it be assigned to you.
  2. Fix the issue locally.
  3. Run the test suite locally. Remember that the subdirectories under test/ create fine-grained testing targets, so you can e.g. use make check-clang-ast to only run Clang's AST tests.
  4. Create a Git commit.
  5. Run git clang-format HEAD~1 to format your changes.
  6. Open a pull request to the upstream repository on GitHub. Detailed instructions can be found in GitHub's documentation.

If you have any further questions about this issue, don't hesitate to ask via a comment in the thread below.

llvmbot (Member) commented Feb 21, 2024

@llvm/issue-subscribers-good-first-issue

Author: Simon Pilgrim (RKSimon)

SahilPatidar (Contributor) commented:

@RKSimon I'm interested in taking on this task, if it's still available.

RKSimon (Collaborator, Author) commented Mar 13, 2024

I've been investigating this myself and it's a much bigger task than I initially thought, as the shuffle costs for length-changing shuffles are so poor. I need to split this further to show the yak shaving involved.

@RKSimon RKSimon removed the good first issue (https://github.com/llvm/llvm-project/contribute) label Mar 13, 2024
@RKSimon RKSimon self-assigned this Mar 20, 2024
RKSimon added a commit that referenced this issue Mar 20, 2024
Generalise fold to "bitcast (shuf V0, V1, MaskC) --> shuf (bitcast V0), (bitcast V1), MaskC'".

Further prep work for #67803
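
A hypothetical example of the two-source form (illustrative values, not taken from the patch's tests):

  ; before
  %shuf = shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 0, i32 1, i32 4, i32 5>
  %bc = bitcast <4 x i32> %shuf to <2 x i64>
  ; after - bitcast both sources, then shuffle with a rescaled mask
  %bc0 = bitcast <4 x i32> %a to <2 x i64>
  %bc1 = bitcast <4 x i32> %b to <2 x i64>
  %shuf2 = shufflevector <2 x i64> %bc0, <2 x i64> %bc1, <2 x i32> <i32 0, i32 2>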
RKSimon added a commit that referenced this issue Mar 20, 2024
…APPLIED)

Generalise fold to "bitcast (shuf V0, V1, MaskC) --> shuf (bitcast V0), (bitcast V1), MaskC'".

Reapplied with a clang codegen test fix.

Further prep work for #67803
RKSimon added a commit that referenced this issue Mar 21, 2024
…Subvector instead of PermuteTwoSrc

We don't have a concat_vector shuffle kind and improveShuffleKindFromMask won't alter the base type to match it as InsertSubvector.

But since this is how X86 will lower concat_vector anyhow, just recognise it explicitly.

Another step for #67803
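
For reference, a sketch of the mask shape involved (illustrative types):

  ; concatenation of two <4 x i32> vectors - there is no concat_vector
  ; shuffle kind, but X86 lowers this as an insert-subvector sequence
  %concat = shufflevector <4 x i32> %lo, <4 x i32> %hi, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>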
RKSimon added a commit to RKSimon/llvm-project that referenced this issue Mar 21, 2024
…ts before creating a new bitcast on top

Encountered while working on llvm#67803, this helps prevent cases where the bitcast chains aren't cleared out and we can't perform further combines until after InstCombine/InstSimplify has run.

I'm assuming we can't safely put this inside IRBuilderBase.CreateBitCast?
RKSimon added a commit that referenced this issue Apr 2, 2024
…ts before creating a new bitcast on top (#86119)

Encountered while working on #67803, wading through the chains of bitcasts that SSE intrinsics introduce - this patch helps prevent cases where the bitcast chains aren't cleared out and we can't perform further combines until after InstCombine/InstSimplify has run.
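
A minimal sketch of the chains being peeked through (illustrative values):

  ; instead of stacking another bitcast on top of %bc1 ...
  %bc0 = bitcast <8 x i32> %v to <4 x i64>
  %bc1 = bitcast <4 x i64> %bc0 to <16 x i16>
  ; ... peek through to the root value and create a single cast
  %bc = bitcast <8 x i32> %v to <16 x i16>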
RKSimon added a commit that referenced this issue Apr 3, 2024
RKSimon added a commit to RKSimon/llvm-project that referenced this issue Apr 3, 2024
…(y)) -> cast(shuffle(x,y)) iff cost efficient

Based off the existing foldShuffleOfBinops fold

Fixes llvm#67803
RKSimon added a commit that referenced this issue Apr 4, 2024
…(y)) -> cast(shuffle(x,y)) iff cost efficient (#87510)

Based off the existing foldShuffleOfBinops fold

Fixes #67803
RKSimon added a commit that referenced this issue Apr 4, 2024
…case llvm-mca numbers

We were using raw instruction counts, which overestimated the costs for #67803.
RKSimon added a commit that referenced this issue Apr 11, 2024
We are still missing a fold for shuffle(bitcast(sext(x)),bitcast(sext(y))) -> bitcast(sext(shuffle(x,y))), due to foldShuffleOfCastops failing to add new instructions back onto the worklist.
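
A minimal sketch of the still-missing fold (names illustrative):

  ; before
  %bx = bitcast <4 x i32> %sx to <2 x i64>    ; %sx = sext <4 x i1> %x to <4 x i32>
  %by = bitcast <4 x i32> %sy to <2 x i64>    ; %sy = sext <4 x i1> %y to <4 x i32>
  %shuf = shufflevector <2 x i64> %bx, <2 x i64> %by, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  ; after
  %cat = shufflevector <4 x i1> %x, <4 x i1> %y, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %sext = sext <8 x i1> %cat to <8 x i32>
  %bc = bitcast <8 x i32> %sext to <4 x i64>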