Inefficient code for Nxi1 masked operations on non-avx512 target #53760
Comments
This seems to be happening in an optimization pass that transforms the … This LLVM IR generates optimized assembly using …
define void @add_masked_i32x8_bitcast_sext(<8 x i32>* noalias nocapture sret(<8 x i32>) dereferenceable(32) %0, <8 x i32>* noalias nocapture readonly dereferenceable(32) %a, <8 x i32>* noalias nocapture readonly dereferenceable(32) %b, i8 %bitmask) unnamed_addr #0 {
%_6 = load <8 x i32>, <8 x i32>* %a, align 32
%_7 = load <8 x i32>, <8 x i32>* %b, align 32
%2 = bitcast i8 %bitmask to <8 x i1>
%3 = sext <8 x i1> %2 to <8 x i32>
%4 = and <8 x i32> %3, %_7
%5 = add <8 x i32> %4, %_6
store <8 x i32> %5, <8 x i32>* %0, align 32
ret void
}
Optimizing using …
The optimized code passed to the backend is where it all falls apart: https://godbolt.org/z/zGaEK77b8
Replace the *_EXTEND node with the raw operands, this will make it easier to use combineToExtendBoolVectorInReg for any boolvec extension combine. Cleanup prep for Issue #53760
… NFC. Avoid the need for a forward declaration. Cleanup prep for Issue #53760
…t(vXi1),A,B) For pre-AVX512 targets, attempt to sign-extend a vXi1 condition mask to pass to a X86ISD::BLENDV node Fixes Issue #53760
@jhorstmann rust.godbolt.org seems to be taking forever to update the nightly build with my fix - can you confirm if it worked at all?
@RKSimon Thanks for the fast solution! I think in the last godbolt link you posted (https://godbolt.org/z/zGaEK77b8) I can already see the difference if I compare 13.0 and trunk. I'm not yet familiar with the LLVM dev process; should I just close the ticket now? Will the fix become part of release 14 or the release after that?
@jhorstmann Thanks for the confirmation! If you have a local trunk build of Rust in which you can see the fix, then please close this. Otherwise, let's leave it open for now to see if the nightly godbolt build catches up, just in case I've missed something.
@RKSimon I managed to build a rustc with your 3 commits included and can confirm this fixes the issue. My benchmark, which sums 16k f64 numbers if their corresponding bit in a vector of u64 is set, got a speedup of more than 2x on an i7-10510U. I think another small improvement might come from #53791. AFAIK Rust is currently tracking the …
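For readers who want a concrete picture of the benchmark kernel described above, here is a minimal scalar sketch in Rust. It is not the benchmark code from this thread: the function name `sum_selected` is hypothetical, and the bit layout (value `i` is selected by bit `i % 64` of word `i / 64`, least significant bit first) is an assumption.

```rust
/// Sums every value whose corresponding bit in `bits` is set.
/// Assumed layout: value i is governed by bit (i % 64) of bits[i / 64].
fn sum_selected(values: &[f64], bits: &[u64]) -> f64 {
    // Each u64 word covers 64 values, so make sure the mask is long enough.
    assert!(bits.len() * 64 >= values.len(), "bitmask too short");
    values
        .iter()
        .enumerate()
        .filter(|&(i, _)| (bits[i / 64] >> (i % 64)) & 1 == 1)
        .map(|(_, v)| *v)
        .sum()
}

fn main() {
    let values = [1.0, 2.0, 3.0, 4.0];
    // Bits 1 and 3 are set, so the second and fourth values are selected.
    let bits = [0b1010u64];
    println!("{}", sum_selected(&values, &bits));
}
```

A vectorized version of this loop has to turn each group of mask bits into a per-lane vector mask, which is the step this issue is about.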
When replacing all …
OK, I'll try to find a more general solution to this.
@lukaslihotzki please can you test against the latest trunk?
@RKSimon I don't have LLVM compilation set up locally. Can I use llc from https://alive2.llvm.org/ce/ for testing? How recent is llc there? This issue affects all non-x86 platforms I have tested (https://godbolt.org/z/n6d1fhE7d):
If you wait 24 hours then compiler explorer usually catches up to trunk. The fix is only for x86 - it's unlikely that we can generalize it to work on other targets.
The faster-than-scalar solution that can be generalized is: broadcast, bitwise AND, then compare with [1,2,4,8,16,32,64,128]. Currently, LLVM trunk uses this approach for … Can this general approach be implemented in LLVM in a target-independent way, while still allowing targets to use better solutions like AVX-512 mask registers or SVE predicate registers? If not, would it be helpful to create an issue for every target? Alternatively, should Rust portable_simd use this approach directly instead of …
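The broadcast / AND / compare sequence described above can be sketched lane by lane in plain Rust. This is only an illustration of the technique, not LLVM's or portable_simd's implementation; `expand_bitmask` is a hypothetical name, and the loop stands in for what would be three vector instructions.

```rust
/// Expands an 8-bit mask into eight i32 lanes: all-ones (-1) where the
/// corresponding bit is set, zero otherwise.
fn expand_bitmask(bitmask: u8) -> [i32; 8] {
    // Lane i holds the single bit 1 << i.
    const BITS: [i32; 8] = [1, 2, 4, 8, 16, 32, 64, 128];
    // Step 1: broadcast the (zero-extended) bitmask to every lane.
    let broadcast = [bitmask as i32; 8];
    let mut lanes = [0i32; 8];
    for i in 0..8 {
        // Step 2: bitwise AND isolates this lane's bit.
        let isolated = broadcast[i] & BITS[i];
        // Step 3: compare for equality; a vector compare yields all-ones or zero.
        lanes[i] = if isolated == BITS[i] { -1 } else { 0 };
    }
    lanes
}

fn main() {
    // Bits 0, 2 and 7 are set.
    println!("{:?}", expand_bitmask(0b1000_0101));
}
```

Because every lane runs the same three operations on constant data, the sequence needs only a broadcast, an AND and an integer compare, which most SIMD instruction sets provide.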
…neToExtendBoolVectorInReg before legalization. This replaces the attempt in 20af71f to use combineToExtendBoolVectorInReg to create X86ISD::BLENDV masks directly; instead we use it to canonicalize the iX bitcast to a sign-extended mask and then truncate it back to vXi1 prior to legalization breaking it apart. Fixes llvm#53760
Can confirm that mask generation now uses the optimized path for … My benchmark using portable_simd for reference, checked the generated code with …
I'll see if I can move any of the X86 implementation into TargetLowering so other targets can reuse it.
I noticed the following code generation issue when trying Rust's new portable_simd library, using a Rust nightly build. The full example can be seen in the godbolt compiler explorer.
The original Rust code adds two i32 vectors, where the second operand should be masked using a u8 bitmask.
This compiles to the following LLVM IR (using -C opt-level=3 --emit=llvm-ir):
With an AVX-512 capable target, the generated code looks good: the generated vector mask gets optimized away and replaced by a masked load using k registers.
With a non-AVX-512 target, generating a vector mask is optimized nicely by broadcasting the bitmask and comparing it against a constant containing the lane indices.
The masked addition should then be able to use the same code and blend using the generated vector mask. Instead, it generates quite inefficient code that tests each bit in the bitmask individually and inserts values into a vector register.
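As a reference for the intended semantics, the masked addition can be modelled in scalar Rust as below. This is an illustrative sketch, not the portable_simd code from the report: the name `add_masked` is hypothetical, and it assumes that bit i of the bitmask (least significant bit first) controls lane i, matching the bitcast of i8 to <8 x i1> on a little-endian target.

```rust
/// Scalar model of "a + (b masked by bitmask)" for eight i32 lanes.
/// Assumption: bit i of `bitmask` (LSB first) controls lane i.
fn add_masked(a: [i32; 8], b: [i32; 8], bitmask: u8) -> [i32; 8] {
    let mut out = [0i32; 8];
    for i in 0..8 {
        // Sign-extend bit i to a full lane: all-ones if set, zero otherwise.
        let lane_mask: i32 = if (bitmask >> i) & 1 == 1 { -1 } else { 0 };
        // AND keeps b[i] only in selected lanes; the add wraps like LLVM's `add`.
        out[i] = a[i].wrapping_add(b[i] & lane_mask);
    }
    out
}

fn main() {
    let a = [1; 8];
    let b = [10, 20, 30, 40, 50, 60, 70, 80];
    // Bits 0 and 2 are set, so only lanes 0 and 2 receive b's contribution.
    println!("{:?}", add_masked(a, b, 0b0000_0101));
}
```

Each lane does the same work independently, so a vector implementation only needs the per-lane mask plus one AND and one add.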