
[VectorCombine][X86] Poor handling of compare-select patterns with AVX2 spoofing on AVX1 targets #67803

Closed · 3 tasks done
RKSimon opened this issue Sep 29, 2023 · 7 comments · Fixed by #87510

Comments

RKSimon (Collaborator) commented Sep 29, 2023

https://godbolt.org/z/Waonx44Mj

For AVX1-only targets we often encounter 'fake-AVX2' code for integer math, like:

#include <immintrin.h>

#if !defined(__AVX2__)
#define _mm256_cmpgt_epi32( a, b ) \
 _mm256_setr_m128i( \
	_mm_cmpgt_epi32( _mm256_extractf128_si256( (a), 0 ), _mm256_extractf128_si256( (b), 0 ) ), \
	_mm_cmpgt_epi32( _mm256_extractf128_si256( (a), 1 ), _mm256_extractf128_si256( (b), 1 ) ) )

#define _mm256_blendv_epi8( a, b, c ) \
 _mm256_setr_m128i( \
	_mm_blendv_epi8( _mm256_extractf128_si256( (a), 0 ), _mm256_extractf128_si256( (b), 0 ), _mm256_extractf128_si256( (c), 0 ) ), \
	_mm_blendv_epi8( _mm256_extractf128_si256( (a), 1 ), _mm256_extractf128_si256( (b), 1 ), _mm256_extractf128_si256( (c), 1 ) ) )
#endif

__m256i cmpsel_epi8(__m256i x, __m256i y, __m256i a, __m256i b) {
    __m256i cmp = _mm256_cmpgt_epi32(x,y);
    return _mm256_blendv_epi8(a,b,cmp);
}

This is really poorly optimized, mainly due to all the bitcasts to/from the __m128i (<2 x i64>) types.

In particular we see this pattern a lot:

  %3 = bitcast <4 x i32> %sext.i to <2 x i64>
  %4 = bitcast <4 x i32> %sext.i21 to <2 x i64>
  %shuffle.i.i = shufflevector <2 x i64> %3, <2 x i64> %4, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %7 = bitcast <4 x i64> %shuffle.i.i to <8 x i32>

We should be able to get VectorCombine to fold this to a <8 x i32> shufflevector instead. In fact, VectorCombine::foldBitcastShuf might handle this if we extend it to binary shuffles, with improved cost handling.
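
A minimal sketch of the folded form we'd like to see, reusing the value names from the snippet above for illustration:

  %concat = shufflevector <4 x i32> %sext.i, <4 x i32> %sext.i21, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>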

We also see:

  %2 = icmp sgt <8 x i32> %0, %1
  %cmp.i = shufflevector <8 x i1> %2, <8 x i1> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %sext.i = sext <4 x i1> %cmp.i to <4 x i32>
  %3 = bitcast <4 x i32> %sext.i to <2 x i64>
  %cmp.i20 = shufflevector <8 x i1> %2, <8 x i1> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %sext.i21 = sext <4 x i1> %cmp.i20 to <4 x i32>
  %4 = bitcast <4 x i32> %sext.i21 to <2 x i64>

We've managed to combine to a single <8 x i32> icmp, but failed to rejoin the compare result sign extensions. We should be able to handle this in VectorCombine if we handle concatenation of casts (based on what we do in VectorCombine::foldShuffleOfBinops).
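
A minimal sketch of the rejoined form, assuming the concat shuffle from the first snippet consumes both sign extensions (names illustrative):

  %2 = icmp sgt <8 x i32> %0, %1
  %sext = sext <8 x i1> %2 to <8 x i32>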

  • Extend VectorCombine::foldBitcastShuf to handle length changing shuffles
  • Extend VectorCombine::foldBitcastShuf to handle binary shuffles
  • Add a VectorCombine::foldShuffleOfCasts similar to VectorCombine::foldShuffleOfBinops

llvmbot (Member) commented Oct 2, 2023

@llvm/issue-subscribers-backend-x86

RKSimon added a commit that referenced this issue Oct 6, 2023
…ffles

Allow length-changing shuffle masks in the "bitcast (shuf V, MaskC) --> shuf (bitcast V), MaskC'" fold.

It also exposes some poor shuffle mask detection for extract/insert subvector cases inside improveShuffleKindFromMask.

First stage towards addressing Issue #67803
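
A hypothetical example of the length-changing case this enables (illustrative values, not taken from the patch's tests):

  ; before
  %shuf = shufflevector <4 x i32> %v, <4 x i32> poison, <2 x i32> <i32 0, i32 1>
  %bc = bitcast <2 x i32> %shuf to <1 x i64>
  ; after - bitcast the source first, then shuffle the wider elements
  %bc2 = bitcast <4 x i32> %v to <2 x i64>
  %shuf2 = shufflevector <2 x i64> %bc2, <2 x i64> poison, <1 x i32> <i32 0>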

nico (Contributor) commented Oct 6, 2023

This broke check-clang: http://45.33.8.238/linux/120114/step_7.txt

Please take a look and revert for now if it takes a while to fix.


RKSimon (Collaborator, Author) commented Oct 6, 2023

Should be fixed by 32a9c09

@RKSimon RKSimon added the good first issue (https://github.com/llvm/llvm-project/contribute) label Feb 21, 2024

llvmbot (Member) commented Feb 21, 2024

Hi!

This issue may be a good introductory issue for people new to working on LLVM. If you would like to work on this issue, your first steps are:

  1. In the comments of the issue, request that it be assigned to you.
  2. Fix the issue locally.
  3. Run the test suite locally. Remember that the subdirectories under test/ create fine-grained testing targets, so you can e.g. use make check-clang-ast to only run Clang's AST tests.
  4. Create a Git commit.
  5. Run git clang-format HEAD~1 to format your changes.
  6. Open a pull request to the upstream repository on GitHub. Detailed instructions can be found in GitHub's documentation.

If you have any further questions about this issue, don't hesitate to ask via a comment in the thread below.

llvmbot (Member) commented Feb 21, 2024

@llvm/issue-subscribers-good-first-issue

Author: Simon Pilgrim (RKSimon)

SahilPatidar (Contributor) commented:

@RKSimon I'm interested in taking on this task, if it's still available.

RKSimon (Collaborator, Author) commented Mar 13, 2024

I've been investigating this myself and it's a much bigger task than I initially thought, as the shuffle costs for length-changing shuffles are so poor. I need to split this further to show the yak shaving involved.

@RKSimon RKSimon removed the good first issue (https://github.com/llvm/llvm-project/contribute) label Mar 13, 2024
@RKSimon RKSimon self-assigned this Mar 20, 2024
RKSimon added a commit that referenced this issue Mar 20, 2024
Generalise fold to "bitcast (shuf V0, V1, MaskC) --> shuf (bitcast V0), (bitcast V1), MaskC'".

Further prep work for #67803
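
A hypothetical example of the two-source form (illustrative values, not taken from the patch's tests):

  ; before
  %shuf = shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 0, i32 1, i32 4, i32 5>
  %bc = bitcast <4 x i32> %shuf to <2 x i64>
  ; after - bitcast both sources, then shuffle with a rescaled mask
  %bc0 = bitcast <4 x i32> %a to <2 x i64>
  %bc1 = bitcast <4 x i32> %b to <2 x i64>
  %shuf2 = shufflevector <2 x i64> %bc0, <2 x i64> %bc1, <2 x i32> <i32 0, i32 2>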
RKSimon added a commit that referenced this issue Mar 20, 2024
…APPLIED)

Generalise fold to "bitcast (shuf V0, V1, MaskC) --> shuf (bitcast V0), (bitcast V1), MaskC'".

Reapplied with a clang codegen test fix.

Further prep work for #67803
RKSimon added a commit that referenced this issue Mar 21, 2024
…Subvector instead of PermuteTwoSrc

We don't have a concat_vector shuffle kind and improveShuffleKindFromMask won't alter the base type to match it as InsertSubvector.

But since this is how X86 will lower concat_vector anyhow, just recognise it explicitly.

Another step for #67803
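
For reference, a sketch of the mask shape involved (illustrative types):

  ; concatenation of two <4 x i32> vectors - there is no concat_vector
  ; shuffle kind, but X86 lowers this as an insert-subvector sequence
  %concat = shufflevector <4 x i32> %lo, <4 x i32> %hi, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>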
RKSimon added a commit to RKSimon/llvm-project that referenced this issue Mar 21, 2024
…ts before creating a new bitcast on top

Encountered while working on llvm#67803, this helps prevent cases where the bitcast chains aren't cleared out and we can't perform further combines until after InstCombine/InstSimplify has run.

I'm assuming we can't safely put this inside IRBuilderBase.CreateBitCast?
RKSimon added a commit that referenced this issue Apr 2, 2024
…ts before creating a new bitcast on top (#86119)

Encountered while working on #67803, wading through the chains of bitcasts that SSE intrinsics introduce - this patch helps prevent cases where the bitcast chains aren't cleared out and we can't perform further combines until after InstCombine/InstSimplify has run.
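
A minimal sketch of the chains being peeked through (illustrative values):

  ; instead of stacking another bitcast on top of %bc1 ...
  %bc0 = bitcast <8 x i32> %v to <4 x i64>
  %bc1 = bitcast <4 x i64> %bc0 to <16 x i16>
  ; ... peek through to the root value and create a single cast
  %bc = bitcast <8 x i32> %v to <16 x i16>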
RKSimon added a commit that referenced this issue Apr 3, 2024
RKSimon added a commit to RKSimon/llvm-project that referenced this issue Apr 3, 2024
…(y)) -> cast(shuffle(x,y)) iff cost efficient

Based off the existing foldShuffleOfBinops fold

Fixes llvm#67803
RKSimon added a commit that referenced this issue Apr 4, 2024
…(y)) -> cast(shuffle(x,y)) iff cost efficient (#87510)

Based off the existing foldShuffleOfBinops fold

Fixes #67803
RKSimon added a commit that referenced this issue Apr 4, 2024
…case llvm-mca numbers

We were using raw instruction counts, which overestimated the costs for #67803.
RKSimon added a commit that referenced this issue Apr 11, 2024
We are still missing a fold for shuffle(bitcast(sext(x)),bitcast(sext(y))) -> bitcast(sext(shuffle(x,y))), due to foldShuffleOfCastops failing to add new instructions back onto the worklist.
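
A minimal sketch of the still-missing fold (names illustrative):

  ; before
  %bx = bitcast <4 x i32> %sx to <2 x i64>    ; %sx = sext <4 x i1> %x to <4 x i32>
  %by = bitcast <4 x i32> %sy to <2 x i64>    ; %sy = sext <4 x i1> %y to <4 x i32>
  %shuf = shufflevector <2 x i64> %bx, <2 x i64> %by, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  ; after
  %cat = shufflevector <4 x i1> %x, <4 x i1> %y, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %sext = sext <8 x i1> %cat to <8 x i32>
  %bc = bitcast <8 x i32> %sext to <4 x i64>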