[VectorCombine][X86] Poor handling of compare-select patterns with AVX2 spoofing on AVX1 targets #67803
@llvm/issue-subscribers-backend-x86
https://godbolt.org/z/Waonx44Mj
For AVX1-only targets we often encounter 'fake-AVX2' code for integer math like:

```c
#if !defined(__AVX2__)
#define _mm256_cmpgt_epi32( a, b ) \
  _mm256_setr_m128i( \
    _mm_cmpgt_epi32( _mm256_extractf128_si256( (a), 0 ), _mm256_extractf128_si256( (b), 0 ) ), \
    _mm_cmpgt_epi32( _mm256_extractf128_si256( (a), 1 ), _mm256_extractf128_si256( (b), 1 ) ) )
#define _mm256_blendv_epi8( a, b, c ) \
  _mm256_setr_m128i( \
    _mm_blendv_epi8( _mm256_extractf128_si256( (a), 0 ), _mm256_extractf128_si256( (b), 0 ), _mm256_extractf128_si256( (c), 0 ) ), \
    _mm_blendv_epi8( _mm256_extractf128_si256( (a), 1 ), _mm256_extractf128_si256( (b), 1 ), _mm256_extractf128_si256( (c), 1 ) ) )
#endif

__m256i cmpsel_epi8(__m256i x, __m256i y, __m256i a, __m256i b) {
  __m256i cmp = _mm256_cmpgt_epi32(x, y);
  return _mm256_blendv_epi8(a, b, cmp);
}
```

This is optimized really poorly, mainly due to all the bitcasts to/from the `__m128i` (`<2 x i64>`) types. In particular we see this pattern a lot:

```llvm
%3 = bitcast <4 x i32> %sext.i to <2 x i64>
%4 = bitcast <4 x i32> %sext.i21 to <2 x i64>
%shuffle.i.i = shufflevector <2 x i64> %3, <2 x i64> %4, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%7 = bitcast <4 x i64> %shuffle.i.i to <8 x i32>
```

We should be able to get VectorCombine to fold this to a `<8 x i32>` shufflevector instead; in fact, `VectorCombine::foldBitcastShuf` might handle this if we extend it to binary shuffles, with improved cost handling.

We also see:

```llvm
%2 = icmp sgt <8 x i32> %0, %1
%cmp.i = shufflevector <8 x i1> %2, <8 x i1> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%sext.i = sext <4 x i1> %cmp.i to <4 x i32>
%3 = bitcast <4 x i32> %sext.i to <2 x i64>
%cmp.i20 = shufflevector <8 x i1> %2, <8 x i1> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
%sext.i21 = sext <4 x i1> %cmp.i20 to <4 x i32>
%4 = bitcast <4 x i32> %sext.i21 to <2 x i64>
```

We've managed to combine to a single `<8 x i32>` icmp, but failed to rejoin the compare result sign extensions. We should be able to handle this in VectorCombine if we handle concatenation of casts (based off what we do in `VectorCombine::foldShuffleOfBinops`).
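For reference, a minimal sketch (assuming both folds described above land) of what the compare/extend portion could collapse to; the names here are illustrative, not actual compiler output:

```llvm
; One wide compare and one wide sign extension, with no splitting into
; 128-bit halves and no bitcast round trips between them.
%cmp = icmp sgt <8 x i32> %0, %1
%mask = sext <8 x i1> %cmp to <8 x i32>
```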
…ffles Allow length-changing shuffle masks in the "bitcast (shuf V, MaskC) --> shuf (bitcast V), MaskC'" fold. It also exposes some poor shuffle mask detection for extract/insert subvector cases inside improveShuffleKindFromMask. First stage towards addressing issue #67803.
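As a hypothetical illustration of the length-changing case (values and names invented for this sketch, not taken from the patch):

```llvm
; Before: extract the upper two i32 elements, then bitcast to a shorter vector.
%shuf = shufflevector <4 x i32> %v, <4 x i32> poison, <2 x i32> <i32 2, i32 3>
%cast = bitcast <2 x i32> %shuf to <1 x i64>

; After: bitcast the source first, then shuffle with a rescaled mask
; (i32 element index 2 becomes i64 element index 1).
%cast2 = bitcast <4 x i32> %v to <2 x i64>
%shuf2 = shufflevector <2 x i64> %cast2, <2 x i64> poison, <1 x i32> <i32 1>
```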
This broke check-clang: http://45.33.8.238/linux/120114/step_7.txt
Please take a look and revert for now if it takes a while to fix.
Should be fixed by 32a9c09
Hi! This issue may be a good introductory issue for people new to working on LLVM. If you would like to work on this issue, your first steps are:
If you have any further questions about this issue, don't hesitate to ask via a comment in the thread below.
@llvm/issue-subscribers-good-first-issue Author: Simon Pilgrim (RKSimon)
@RKSimon I'm interested in taking on this task, if it's still available.
I've been investigating this myself, and it's a much bigger task than I initially thought, as the costs for length-changing shuffles are so poor. I need to split this further to show the yak shaving involved.
Generalise fold to "bitcast (shuf V0, V1, MaskC) --> shuf (bitcast V0), (bitcast V1), MaskC'". Further prep work for #67803
…APPLIED) Generalise fold to "bitcast (shuf V0, V1, MaskC) --> shuf (bitcast V0), (bitcast V1), MaskC'". Reapplied with a clang codegen test fix. Further prep work for #67803
…Subvector instead of PermuteTwoSrc. We don't have a concat_vector shuffle kind, and improveShuffleKindFromMask won't alter the base type to match it as InsertSubvector. But since this is how X86 will lower concat_vector anyhow, just recognise it explicitly. Another step for #67803.
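For context, the mask shape in question is a plain concatenation, as in this hypothetical example, which X86 lowers with a 128-bit subvector insert (vinsertf128) rather than a generic two-source permute:

```llvm
; Concatenates two <2 x i64> halves into one <4 x i64> vector; costing this
; as InsertSubvector matches how X86 actually lowers it.
%concat = shufflevector <2 x i64> %lo, <2 x i64> %hi, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
```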
…ts before creating a new bitcast on top. Encountered while working on llvm#67803, this helps prevent cases where the bitcast chains aren't cleared out and we can't perform further combines until after InstCombine/InstSimplify has run. I'm assuming we can't safely put this inside IRBuilderBase.CreateBitCast?
…ts before creating a new bitcast on top (#86119) Encountered while working on #67803, wading through the chains of bitcasts that SSE intrinsics introduce. This patch helps prevent cases where the bitcast chains aren't cleared out and we can't perform further combines until after InstCombine/InstSimplify has run.
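A minimal sketch of the kind of chain being peeked through (values invented for illustration):

```llvm
; Before: a chain of bitcasts left behind by the 128-bit intrinsic wrappers.
%b0 = bitcast <8 x i32> %v to <4 x i64>
%b1 = bitcast <4 x i64> %b0 to <16 x i16>

; After peeking through %b0, the new cast is created directly from the root
; value, leaving the intermediate bitcast dead and trivial to clean up.
%b1.new = bitcast <8 x i32> %v to <16 x i16>
```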
…ast(x),cast(y)) -> cast(shuffle(x,y)). Part of #67803.
…(y)) -> cast(shuffle(x,y)) iff cost efficient. Based off the existing foldShuffleOfBinops fold. Fixes llvm#67803.
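A hypothetical before/after for this fold (sext chosen to match the motivating case; names invented):

```llvm
; Before: two narrow sign extensions feeding a concatenating shuffle.
%x = sext <4 x i1> %a to <4 x i32>
%y = sext <4 x i1> %b to <4 x i32>
%s = shufflevector <4 x i32> %x, <4 x i32> %y, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>

; After: shuffle the narrow sources first, then perform a single wide cast.
%s0 = shufflevector <4 x i1> %a, <4 x i1> %b, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%s.new = sext <8 x i1> %s0 to <8 x i32>
```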
…case llvm-mca numbers. We were using the raw instruction count, which overestimated the costs for #67803.
We are still missing a fold for shuffle(bitcast(sext(x)),bitcast(sext(y))) -> bitcast(sext(shuffle(x,y))), due to foldShuffleOfCastops failing to add new instructions back onto the worklist.
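A hypothetical instance of the still-missing pattern (names invented):

```llvm
; Before: each sext is hidden behind a bitcast, so the shuffle is never
; revisited once the inner casts have been simplified.
%sx = sext <4 x i1> %x to <4 x i32>
%bx = bitcast <4 x i32> %sx to <2 x i64>
%sy = sext <4 x i1> %y to <4 x i32>
%by = bitcast <4 x i32> %sy to <2 x i64>
%s  = shufflevector <2 x i64> %bx, <2 x i64> %by, <4 x i32> <i32 0, i32 1, i32 2, i32 3>

; Desired: shuffle the i1 sources, sign extend once, then bitcast the result.
%s0 = shufflevector <4 x i1> %x, <4 x i1> %y, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%sw = sext <8 x i1> %s0 to <8 x i32>
%s.new = bitcast <8 x i32> %sw to <4 x i64>
```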