slow code for absolute value of int8 x 16 vector on POWER9 at -O3 #50249
Comments
It looks like there is some difference in how loop size is estimated between pwr8 and pwr9 when performing loop unrolling. If add |
The SLP vectorizer and the loop vectorizer have different capabilities, and which one is relevant depends on unrolling. Trying to solve this in the unroller would be tricky; it can't really predict whether SLP vectorization will trigger. You could just boost the unroll threshold in general, but that's not worthwhile just to solve this issue. We could enhance the loop vectorizer to analyze this sort of pattern, but it's not clear this is an important pattern in practice.
For abs specifically, from my perspective I've already worked around the issue in my code so I don't really care either way. I only filed the issue because I figured you would want to fix it. I can't really provide any insight into how many people are relying on autovectorization for this, but absolute value of bytes is an important enough pattern that at least x86 and Arm provide a single instruction for the ability, and AFAICT AltiVec has had an intrinsic since the original version. WebAssembly also provides an instruction for it (which is why I noticed this issue; i8x16.abs is the first function on all the lists at https://nemequ.github.io/waspr/, and the POWER8 version being better than POWER9 caught my eye). However, it looks like the problem is more extensive than just the absolute value. It looks like LLVM has problems with similar code on POWER which it is able to vectorize well on other platforms. For example (CE: https://godbolt.org/z/fqhfvaaY9):
On x86 and AArch64, the compiler has no problem, but on POWER it chokes (unless I add It's also worth noting that adding |
The part the loop vectorizer doesn't like is the type of the variable "r"; normally autovectorization is done over arrays, not vectors.
I imagine this is because we have added code in Power9 to make the vectorizer cost model less aggressive (since the dispatch throughput of vector code is half the width of scalar code).
I can confirm that as of today (2023-05-21) the slow code is still produced for POWER9 at -O3 on trunk clang.
I'll have a look at this one.
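To illustrate the distinction being drawn, here is a minimal sketch of the same lane-wise loop written two ways. This is not the code from the Compiler Explorer link above; the type name `i8x16`, the functions `min_vec` and `min_arr`, and the choice of a per-lane minimum are all hypothetical, and the sketch assumes the GCC/Clang `vector_size` extension. It makes no claim about which form vectorizes on a given target.

```c
#include <stdint.h>

/* Hypothetical 16 x int8 vector type, using the GCC/Clang vector extension. */
typedef int8_t i8x16 __attribute__((vector_size(16)));

/* Form 1: the result "r" is itself a vector-typed variable, and the loop
 * writes its lanes one at a time by subscripting. */
i8x16 min_vec(i8x16 a, i8x16 b) {
    i8x16 r;
    for (int i = 0; i < 16; i++) {
        r[i] = a[i] < b[i] ? a[i] : b[i];
    }
    return r;
}

/* Form 2: the same computation over plain arrays, which is the shape
 * loop vectorizers are usually written to recognize. */
void min_arr(const int8_t *restrict a, const int8_t *restrict b,
             int8_t *restrict r) {
    for (int i = 0; i < 16; i++) {
        r[i] = a[i] < b[i] ? a[i] : b[i];
    }
}
```

Both functions compute the same lanes; only the type of the destination differs, which is the property the comment above identifies.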
Extended Description
On POWER8 and below, this code generates the same code as vec_abs, but on POWER9 the generated code is quite terrible. Compile with -mcpu=power9 -O3. Example (or on Compiler Explorer: https://godbolt.org/z/avnTxh9M6):
LLVM-MCA says RThroughput is 8, vs 1.5 for the POWER8 version.
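For readers without access to the Compiler Explorer link, the following is a rough sketch of the kind of function the title describes: a lane-by-lane absolute value over a 16 x int8 vector. It is an assumption about the general shape, not the reporter's exact code; the names `i8x16` and `abs_i8x16` are hypothetical, and the vector type relies on the GCC/Clang `vector_size` extension.

```c
#include <stdint.h>

/* Hypothetical 16 x int8 vector type, using the GCC/Clang vector extension. */
typedef int8_t i8x16 __attribute__((vector_size(16)));

/* Lane-wise absolute value written as a scalar loop, leaving it to the
 * optimizer to turn the loop into vector instructions. */
i8x16 abs_i8x16(i8x16 v) {
    i8x16 r;
    for (int i = 0; i < 16; i++) {
        /* The negation happens in int after promotion; converting back to
         * int8_t is implementation-defined for INT8_MIN (GCC and Clang
         * wrap, so that lane stays INT8_MIN). */
        r[i] = (int8_t)(v[i] < 0 ? -v[i] : v[i]);
    }
    return r;
}
```

Compiling a function like this with the flags mentioned above and comparing the output for -mcpu=power8 and -mcpu=power9 is one way to check whether the difference still reproduces.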