Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

slow code for absolute value of int8 x 16 vector on POWER9 at -O3 #50249

Closed
llvmbot opened this issue Jun 27, 2021 · 7 comments · Fixed by #79092
Closed

slow code for absolute value of int8 x 16 vector on POWER9 at -O3 #50249

llvmbot opened this issue Jun 27, 2021 · 7 comments · Fixed by #79092
Assignees
Labels
backend:PowerPC bugzilla Issues migrated from bugzilla

Comments

@llvmbot
Copy link
Member

llvmbot commented Jun 27, 2021

Bugzilla Link 50905
Version trunk
OS Windows NT
Reporter LLVM Bugzilla Contributor
CC @efriedma-quic,@bzEq,@nemanjai

Extended Description

On POWER8 and below, this code generates the same code as vec_abs, but on POWER9 the generate code is quite terrible. Compile with -mcpu=power9 -O3. Example (or on Compiler Explorer: https://godbolt.org/z/avnTxh9M6):

#include <stdint.h>

typedef int8_t i8x16 __attribute__((__vector_size__(16)));

i8x16
i8x16_abs(i8x16 a) {
    i8x16 r;

    for (int i = 0 ; i < 16 ; i++) {
        r[i] = (a[i] < 0) ? -a[i] : a[i];
    }

    return r;
}

LLVM-MCA says RThroughput is 8, vs 1.5 for the POWER8 version.

@bzEq
Copy link
Collaborator

bzEq commented Jun 27, 2021

Looks there is some difference estimating loop size between pwr8 and pwr9 when performing loop unrolling. If add -mllvm -unroll-count=16, pwr9 is able generate the same code sequence as pwr8. Maybe someone has expertise at loop unroll can have a look at it.

https://godbolt.org/z/cdWzGEcrf

@efriedma-quic
Copy link
Collaborator

slp vectorizer and loop vectorizer have different capabilities, and which is relevant depends on unrolling. Trying to solve this in the unroller would be tricky; it can't really predict whether SLP vectorization will trigger. I mean, you could just boost the unroll threshold in general, but that's not worthwhile just to solve this issue.

We could enhance the loop vectorizer to analyze this sort of pattern, but it's not clear this is an important pattern in practice.

@llvmbot
Copy link
Member Author

llvmbot commented Jun 28, 2021

For abs specifically, from my perspective I've already worked around the issue in my code so I don't really care either way. I only filed the issue because I figured you would want to fix it.

I can't really provide any insight into how many people are relying on autovectorization for this, but absolute value of bytes is an important enough pattern that at least x86 and Arm provide a single instruction for the ability, and AFAICT AltiVec has had an intrinsic since the original version. WebAssembly also provides an instruction for it (which is why I noticed this issue; i8x16.abs is the first function on all the lists at https://nemequ.github.io/waspr/, and the POWER8 version being better than POWER9 caught my eye).

However, it looks like the problem is more extensive than just the absolute value. It looks like LLVM has problems with similar code on POWER which it is able to vectorize well on other platforms. For example (CE: https://godbolt.org/z/fqhfvaaY9):

#include <stdint.h>

typedef int8_t i8x16 __attribute__((__vector_size__(16)));

i8x16
i8x16_abs(i8x16 a, i8x16 b) {
    i8x16 r;

    for (int i = 0 ; i < 16 ; i++) {
        r[i] = ((a[i]) << 2) & b[i];
    }

    return r;
}

On x86 and AArch64, the compiler has no problem, but on POWER it chokes (unless I add -mllvm --unroll-count=16). Maybe this is a different issue, but it feels more like abs is the tip of the iceberg… If LLVM is unable to vectorize any operations on bytes by default, maybe bumping the unroll count would be worthwhile.

It's also worth noting that adding #pragma omp simd with -fopenmp-simd (or #pragma clang loop vectorize(enable)) doesn't do any good, either. This definitely seems like a bug to me.

@efriedma-quic
Copy link
Collaborator

The part the loop vectorizer doesn't like is the type of the variable "r"; normally autovectorization is done over arrays, not vectors.

@nemanjai
Copy link
Member

I imagine this is because we have added code in Power9 to make the vectorizer cost model less aggressive (since the dispatch throughput of vector code is half the width of scalar code).

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 11, 2021
@mr-c
Copy link

mr-c commented May 21, 2023

I can confirm that as of today (2023-05-21) the slow code is still produced for POWER9 at -O3 on trunk clang

@chenzheng1030 chenzheng1030 self-assigned this Jan 8, 2024
@chenzheng1030
Copy link
Collaborator

I'll have a look at this one

chenzheng1030 pushed a commit that referenced this issue Feb 19, 2024
…79092)

P9 has vxform `Vector Extract Element Instructions` like `vextuwrx` and
P10 has vxform `Vector Insert Element instructions` like `vinsd`. Update
the instruction cost reflecting these instructions.

Fixes #50249
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend:PowerPC bugzilla Issues migrated from bugzilla
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants