slow code for absolute value of int8 x 16 vector on POWER9 at -O3 #50249

llvmbot · 2021-06-27T00:57:29Z


Bugzilla Link	50905
Version	trunk
OS	Windows NT
Reporter	LLVM Bugzilla Contributor
CC	@efriedma-quic,@bzEq,@nemanjai

Extended Description

On POWER8 and below, this code generates the same code as vec_abs, but on POWER9 the generate code is quite terrible. Compile with -mcpu=power9 -O3. Example (or on Compiler Explorer: https://godbolt.org/z/avnTxh9M6):

#include <stdint.h>

typedef int8_t i8x16 __attribute__((__vector_size__(16)));

i8x16
i8x16_abs(i8x16 a) {
    i8x16 r;

    for (int i = 0 ; i < 16 ; i++) {
        r[i] = (a[i] < 0) ? -a[i] : a[i];
    }

    return r;
}

LLVM-MCA says RThroughput is 8, vs 1.5 for the POWER8 version.

The text was updated successfully, but these errors were encountered:

bzEq · 2021-06-27T05:13:06Z

Looks there is some difference estimating loop size between pwr8 and pwr9 when performing loop unrolling. If add -mllvm -unroll-count=16, pwr9 is able generate the same code sequence as pwr8. Maybe someone has expertise at loop unroll can have a look at it.

https://godbolt.org/z/cdWzGEcrf

efriedma-quic · 2021-06-28T20:55:10Z

slp vectorizer and loop vectorizer have different capabilities, and which is relevant depends on unrolling. Trying to solve this in the unroller would be tricky; it can't really predict whether SLP vectorization will trigger. I mean, you could just boost the unroll threshold in general, but that's not worthwhile just to solve this issue.

We could enhance the loop vectorizer to analyze this sort of pattern, but it's not clear this is an important pattern in practice.

llvmbot · 2021-06-28T22:57:33Z

For abs specifically, from my perspective I've already worked around the issue in my code so I don't really care either way. I only filed the issue because I figured you would want to fix it.

I can't really provide any insight into how many people are relying on autovectorization for this, but absolute value of bytes is an important enough pattern that at least x86 and Arm provide a single instruction for the ability, and AFAICT AltiVec has had an intrinsic since the original version. WebAssembly also provides an instruction for it (which is why I noticed this issue; i8x16.abs is the first function on all the lists at https://nemequ.github.io/waspr/, and the POWER8 version being better than POWER9 caught my eye).

However, it looks like the problem is more extensive than just the absolute value. It looks like LLVM has problems with similar code on POWER which it is able to vectorize well on other platforms. For example (CE: https://godbolt.org/z/fqhfvaaY9):

#include <stdint.h>

typedef int8_t i8x16 __attribute__((__vector_size__(16)));

i8x16
i8x16_abs(i8x16 a, i8x16 b) {
    i8x16 r;

    for (int i = 0 ; i < 16 ; i++) {
        r[i] = ((a[i]) << 2) & b[i];
    }

    return r;
}

On x86 and AArch64, the compiler has no problem, but on POWER it chokes (unless I add -mllvm --unroll-count=16). Maybe this is a different issue, but it feels more like abs is the tip of the iceberg… If LLVM is unable to vectorize any operations on bytes by default, maybe bumping the unroll count would be worthwhile.

It's also worth noting that adding #pragma omp simd with -fopenmp-simd (or #pragma clang loop vectorize(enable)) doesn't do any good, either. This definitely seems like a bug to me.

efriedma-quic · 2021-06-28T23:19:42Z

The part the loop vectorizer doesn't like is the type of the variable "r"; normally autovectorization is done over arrays, not vectors.

nemanjai · 2021-07-30T15:44:06Z

I imagine this is because we have added code in Power9 to make the vectorizer cost model less aggressive (since the dispatch throughput of vector code is half the width of scalar code).

mr-c · 2023-05-21T11:56:26Z

I can confirm that as of today (2023-05-21) the slow code is still produced for POWER9 at -O3 on trunk clang

chenzheng1030 · 2024-01-08T06:55:52Z

I'll have a look at this one

…79092) P9 has vxform `Vector Extract Element Instructions` like `vextuwrx` and P10 has vxform `Vector Insert Element instructions` like `vinsd`. Update the instruction cost reflecting these instructions. Fixes #50249

llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 11, 2021

chenzheng1030 self-assigned this Jan 8, 2024

chenzheng1030 mentioned this issue Jan 23, 2024

[PowerPC] adjust cost for vector insert/extract with non const index #79092

Merged

chenzheng1030 closed this as completed in #79092 Feb 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

slow code for absolute value of int8 x 16 vector on POWER9 at -O3 #50249

slow code for absolute value of int8 x 16 vector on POWER9 at -O3 #50249

llvmbot commented Jun 27, 2021 •

edited by RKSimon

Loading

bzEq commented Jun 27, 2021

efriedma-quic commented Jun 28, 2021

llvmbot commented Jun 28, 2021 •

edited by RKSimon

Loading

efriedma-quic commented Jun 28, 2021

nemanjai commented Jul 30, 2021

mr-c commented May 21, 2023

chenzheng1030 commented Jan 8, 2024

slow code for absolute value of int8 x 16 vector on POWER9 at -O3 #50249

slow code for absolute value of int8 x 16 vector on POWER9 at -O3 #50249

Comments

llvmbot commented Jun 27, 2021 • edited by RKSimon Loading

Extended Description

bzEq commented Jun 27, 2021

efriedma-quic commented Jun 28, 2021

llvmbot commented Jun 28, 2021 • edited by RKSimon Loading

efriedma-quic commented Jun 28, 2021

nemanjai commented Jul 30, 2021

mr-c commented May 21, 2023

chenzheng1030 commented Jan 8, 2024

llvmbot commented Jun 27, 2021 •

edited by RKSimon

Loading

llvmbot commented Jun 28, 2021 •

edited by RKSimon

Loading