Complex division is not optimised with -ffast-math #31220

lesshaste · 2017-02-05T23:52:51Z


Bugzilla Link	31872
Version	trunk
OS	Linux
CC	@hyp,@compnerd,@lesshaste,@hfinkel,@joker-eph,@RKSimon,@rotateright,@TNorthover

Extended Description

Consider:

#include <complex.h>
complex float f(complex float x, complex float y) {
return x/y;
}

clang trunk with -O3 -march=core-avx2 but with or without -ffast-math gives:

f: # @f
vmovaps xmm2, xmm1
vmovshdup xmm1, xmm0 # xmm1 = xmm0[1,1,3,3]
vmovshdup xmm3, xmm2 # xmm3 = xmm2[1,1,3,3]
jmp __divsc3 # TAILCALL

However both gcc and ICC attempt to optimise this code when -ffast-math (or equivalent) is enabled.

ICC appears to give the fastest code which is:

f:
vcvtps2pd xmm2, xmm1 #3.12
vcvtps2pd xmm4, xmm0 #3.12
vmulpd xmm8, xmm2, xmm2 #3.12
vunpckhpd xmm3, xmm2, xmm2 #3.12
vmulpd xmm6, xmm3, xmm4 #3.12
vmovddup xmm7, xmm2 #3.12
vshufpd xmm5, xmm4, xmm4, 1 #3.12
vshufpd xmm9, xmm8, xmm8, 1 #3.12
vfmaddsub213pd xmm7, xmm5, xmm6 #3.12
vaddpd xmm11, xmm8, xmm9 #3.12
vshufpd xmm10, xmm7, xmm7, 1 #3.12
vdivpd xmm12, xmm10, xmm11 #3.12
vcvtpd2ps xmm0, xmm12 #3.12
ret

hfinkel · 2017-02-06T18:24:48Z

As a note, when Alex worked on the flang project, he had to experiment with different division algorithms to find one that was sufficiently numerically stable (so that the BLAS regressions tests would pass, as I recall). This may or may not be relevant to fast-math complication, but just in case, see:
https://github.com/hyp/flang/blob/master/lib/CodeGen/CGExprComplex.cpp

hyp · 2017-02-06T19:21:47Z

It looks like ICC doesn't use the Smith's version, it looks more like the naive division, i.e. "(a+ib) / (c+id) = ((ac+bd)/(cc+dd)) + i((bc-ad)/(cc+dd))". But they promote the floats to doubles, so I guess that makes the precision of the naive algorithm better.

What does ICC do for complex double with -ffast-math?

lesshaste · 2017-02-06T21:29:17Z

ICC appears to have a bug/feature when you change the type to complex double where you have to set -fp-model fast=2 to get anything sensible (with fast=1 you get x87 code!). In the fast=2 case you get:

f:
vunpcklpd xmm4, xmm2, xmm3 #2.54
vunpcklpd xmm6, xmm0, xmm1 #2.54
vunpckhpd xmm5, xmm4, xmm4 #3.12
vmulpd xmm10, xmm4, xmm4 #3.12
vmulpd xmm8, xmm5, xmm6 #3.12
vmovddup xmm9, xmm4 #3.12
vshufpd xmm7, xmm6, xmm6, 1 #3.12
vshufpd xmm11, xmm10, xmm10, 1 #3.12
vfmaddsub213pd xmm9, xmm7, xmm8 #3.12
vaddpd xmm13, xmm10, xmm11 #3.12
vshufpd xmm12, xmm9, xmm9, 1 #3.12
vdivpd xmm0, xmm12, xmm13 #3.12
vunpckhpd xmm1, xmm0, xmm0 #3.12
ret #3.12

gcc -O3 -ffast-math -march=core-avx2 for the complex double code on the other hand gives:

f:
vmulsd xmm4, xmm1, xmm3
vmovapd xmm6, xmm0
vmulsd xmm5, xmm3, xmm3
vmulsd xmm6, xmm6, xmm3
vfmadd231sd xmm4, xmm0, xmm2
vfmadd231sd xmm5, xmm2, xmm2
vfmsub132sd xmm1, xmm6, xmm2
vdivsd xmm0, xmm4, xmm5
vdivsd xmm1, xmm1, xmm5
ret

This may be better code, I am not expert enough to tell.

hyp · 2017-02-07T10:45:11Z

I think it could be a feature in ICC. With '-ffast-math' ICC promotes complex floats to doubles, do it's reasonable to assume that it would promote complex doubles to the 80 bit long doubles. That means it can't use SSE/AVX, and it has to emit the x87 FPU code.

What does ICC do for complex float and '-fp-model fast=2'? I suspect it doesn't promote them to doubles.

Looking at comparison between ICC and GCC it's interesting how ICC leverage the *pd instructions to reduce the number of arithmetic instructions in the code. I'm not sure if it's better than GCC's version though in terms of performance. It definitely looks worse from the code-size perspective.

lesshaste · 2017-02-07T11:01:20Z

I looked at ICC and this code as requested:

#include <complex.h>
complex float f(complex float x, complex float y) {
return x/y;
}

Using '-fp-model fast=2 -march=core-avx2 -O3' you get

f:
vmovlhps xmm2, xmm1, xmm1 #3.12
vmulps xmm8, xmm2, xmm2 #3.12
vshufps xmm9, xmm8, xmm8, 177 #3.12
vmovlhps xmm4, xmm0, xmm0 #3.12
vaddps xmm10, xmm8, xmm9 #3.12
vrcpps xmm11, xmm10 #3.12
vmovshdup xmm3, xmm2 #3.12
vaddps xmm12, xmm11, xmm11 #3.12
vmulps xmm6, xmm4, xmm3 #3.12
vmulps xmm14, xmm11, xmm10 #3.12
vmovsldup xmm7, xmm2 #3.12
vshufps xmm5, xmm4, xmm4, 177 #3.12
vfmaddsub213ps xmm7, xmm5, xmm6 #3.12
vfnmadd213ps xmm14, xmm11, xmm12 #3.12
vshufps xmm13, xmm7, xmm7, 177 #3.12
vmulps xmm0, xmm13, xmm14 #3.12
ret

This is slightly longer code than you get for fast=1 performing 4 multiplications and one reciprocal.

It might be worth mentioning that http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf discusses vectorized complex arithmetic, for example in 6.6.1.1 which covers multiplication and division using SSE3. I don't know if it's helpful here.

lesshaste · 2017-02-07T11:20:28Z

For completeness:

#include <complex.h>
complex long double f(complex long double x, complex long double y) {
return x/y;
}

In clang with -march=core-avx2 -Ofast you get

f: # @f
sub rsp, 72
fld xword ptr [rsp + 80]
fld xword ptr [rsp + 96]
fld xword ptr [rsp + 112]
fld xword ptr [rsp + 128]
fstp xword ptr [rsp + 48]
fstp xword ptr [rsp + 32]
fstp xword ptr [rsp + 16]
fstp xword ptr [rsp]
call __divxc3
add rsp, 72
ret

In gcc and ICC the __divxc3 code is optimised (very differently) but still using x87 instructions. The gcc code is much shorter (and I suspect better) than the ICC code and is:

f:
fld TBYTE PTR [rsp+8]
fld TBYTE PTR [rsp+24]
fld TBYTE PTR [rsp+40]
fld TBYTE PTR [rsp+56]
fld st(1)
fmul st, st(2)
fld st(1)
fmul st, st(2)
faddp st(1), st
fld st(4)
fmul st, st(3)
fld st(4)
fmul st, st(3)
faddp st(1), st
fdiv st, st(1)
fxch st(4)
fmulp st(3), st
fxch st(4)
fmulp st(1), st
fsubp st(1), st
fdivrp st(2), st
ret

hyp · 2017-02-07T12:29:58Z

Thanks for collecting all of this information! It looks like I was right about ICC not promoting complex float with '-fp-model fast=2'.

I wonder if clang/llvm should follow ICC and try to promote the floating point types with '-ffast-math' or should it just use the original type like GCC seems to do even though the numerical stability might be effected.

jcranmer-intel · 2024-07-12T15:57:36Z

As of #70244, -ffast-math implies limited range division, and #81514 adds some more fine-grained control over which division algorithm is selected. I think this bug can be closed now as fixed.

llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021

Endilll added missed-optimization floating-point Floating-point math labels Jul 5, 2024

jcranmer-intel closed this as completed Jul 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Complex division is not optimised with -ffast-math #31220

Complex division is not optimised with -ffast-math #31220

lesshaste mannequin commented Feb 5, 2017

hfinkel commented Feb 6, 2017

hyp commented Feb 6, 2017

lesshaste mannequin commented Feb 6, 2017

hyp commented Feb 7, 2017

lesshaste mannequin commented Feb 7, 2017

lesshaste mannequin commented Feb 7, 2017

hyp commented Feb 7, 2017

jcranmer-intel commented Jul 12, 2024

Complex division is not optimised with -ffast-math #31220

Complex division is not optimised with -ffast-math #31220

Comments

lesshaste mannequin commented Feb 5, 2017

Extended Description

hfinkel commented Feb 6, 2017

hyp commented Feb 6, 2017

lesshaste mannequin commented Feb 6, 2017

hyp commented Feb 7, 2017

lesshaste mannequin commented Feb 7, 2017

lesshaste mannequin commented Feb 7, 2017

hyp commented Feb 7, 2017

jcranmer-intel commented Jul 12, 2024