Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Complex division is not optimised with -ffast-math #31220

Closed
lesshaste mannequin opened this issue Feb 5, 2017 · 8 comments
Closed

Complex division is not optimised with -ffast-math #31220

lesshaste mannequin opened this issue Feb 5, 2017 · 8 comments
Labels
bugzilla Issues migrated from bugzilla floating-point Floating-point math missed-optimization

Comments

@lesshaste
Copy link
Mannequin

lesshaste mannequin commented Feb 5, 2017

Bugzilla Link 31872
Version trunk
OS Linux
CC @hyp,@compnerd,@lesshaste,@hfinkel,@joker-eph,@RKSimon,@rotateright,@TNorthover

Extended Description

Consider:

#include <complex.h>
complex float f(complex float x, complex float y) {
return x/y;
}

clang trunk with -O3 -march=core-avx2 but with or without -ffast-math gives:

f: # @​f
vmovaps xmm2, xmm1
vmovshdup xmm1, xmm0 # xmm1 = xmm0[1,1,3,3]
vmovshdup xmm3, xmm2 # xmm3 = xmm2[1,1,3,3]
jmp __divsc3 # TAILCALL

However both gcc and ICC attempt to optimise this code when -ffast-math (or equivalent) is enabled.

ICC appears to give the fastest code which is:

f:
vcvtps2pd xmm2, xmm1 #​3.12
vcvtps2pd xmm4, xmm0 #​3.12
vmulpd xmm8, xmm2, xmm2 #​3.12
vunpckhpd xmm3, xmm2, xmm2 #​3.12
vmulpd xmm6, xmm3, xmm4 #​3.12
vmovddup xmm7, xmm2 #​3.12
vshufpd xmm5, xmm4, xmm4, 1 #​3.12
vshufpd xmm9, xmm8, xmm8, 1 #​3.12
vfmaddsub213pd xmm7, xmm5, xmm6 #​3.12
vaddpd xmm11, xmm8, xmm9 #​3.12
vshufpd xmm10, xmm7, xmm7, 1 #​3.12
vdivpd xmm12, xmm10, xmm11 #​3.12
vcvtpd2ps xmm0, xmm12 #​3.12
ret

@hfinkel
Copy link
Collaborator

hfinkel commented Feb 6, 2017

As a note, when Alex worked on the flang project, he had to experiment with different division algorithms to find one that was sufficiently numerically stable (so that the BLAS regressions tests would pass, as I recall). This may or may not be relevant to fast-math complication, but just in case, see:
https://github.com/hyp/flang/blob/master/lib/CodeGen/CGExprComplex.cpp

@hyp
Copy link
Member

hyp commented Feb 6, 2017

It looks like ICC doesn't use the Smith's version, it looks more like the naive division, i.e. "(a+ib) / (c+id) = ((ac+bd)/(cc+dd)) + i((bc-ad)/(cc+dd))". But they promote the floats to doubles, so I guess that makes the precision of the naive algorithm better.

What does ICC do for complex double with -ffast-math?

@lesshaste
Copy link
Mannequin Author

lesshaste mannequin commented Feb 6, 2017

ICC appears to have a bug/feature when you change the type to complex double where you have to set -fp-model fast=2 to get anything sensible (with fast=1 you get x87 code!). In the fast=2 case you get:

f:
vunpcklpd xmm4, xmm2, xmm3 #​2.54
vunpcklpd xmm6, xmm0, xmm1 #​2.54
vunpckhpd xmm5, xmm4, xmm4 #​3.12
vmulpd xmm10, xmm4, xmm4 #​3.12
vmulpd xmm8, xmm5, xmm6 #​3.12
vmovddup xmm9, xmm4 #​3.12
vshufpd xmm7, xmm6, xmm6, 1 #​3.12
vshufpd xmm11, xmm10, xmm10, 1 #​3.12
vfmaddsub213pd xmm9, xmm7, xmm8 #​3.12
vaddpd xmm13, xmm10, xmm11 #​3.12
vshufpd xmm12, xmm9, xmm9, 1 #​3.12
vdivpd xmm0, xmm12, xmm13 #​3.12
vunpckhpd xmm1, xmm0, xmm0 #​3.12
ret #​3.12

gcc -O3 -ffast-math -march=core-avx2 for the complex double code on the other hand gives:

f:
vmulsd xmm4, xmm1, xmm3
vmovapd xmm6, xmm0
vmulsd xmm5, xmm3, xmm3
vmulsd xmm6, xmm6, xmm3
vfmadd231sd xmm4, xmm0, xmm2
vfmadd231sd xmm5, xmm2, xmm2
vfmsub132sd xmm1, xmm6, xmm2
vdivsd xmm0, xmm4, xmm5
vdivsd xmm1, xmm1, xmm5
ret

This may be better code, I am not expert enough to tell.

@hyp
Copy link
Member

hyp commented Feb 7, 2017

I think it could be a feature in ICC. With '-ffast-math' ICC promotes complex floats to doubles, do it's reasonable to assume that it would promote complex doubles to the 80 bit long doubles. That means it can't use SSE/AVX, and it has to emit the x87 FPU code.

What does ICC do for complex float and '-fp-model fast=2'? I suspect it doesn't promote them to doubles.

Looking at comparison between ICC and GCC it's interesting how ICC leverage the *pd instructions to reduce the number of arithmetic instructions in the code. I'm not sure if it's better than GCC's version though in terms of performance. It definitely looks worse from the code-size perspective.

@lesshaste
Copy link
Mannequin Author

lesshaste mannequin commented Feb 7, 2017

I looked at ICC and this code as requested:

#include <complex.h>
complex float f(complex float x, complex float y) {
return x/y;
}

Using '-fp-model fast=2 -march=core-avx2 -O3' you get

f:
vmovlhps xmm2, xmm1, xmm1 #​3.12
vmulps xmm8, xmm2, xmm2 #​3.12
vshufps xmm9, xmm8, xmm8, 177 #​3.12
vmovlhps xmm4, xmm0, xmm0 #​3.12
vaddps xmm10, xmm8, xmm9 #​3.12
vrcpps xmm11, xmm10 #​3.12
vmovshdup xmm3, xmm2 #​3.12
vaddps xmm12, xmm11, xmm11 #​3.12
vmulps xmm6, xmm4, xmm3 #​3.12
vmulps xmm14, xmm11, xmm10 #​3.12
vmovsldup xmm7, xmm2 #​3.12
vshufps xmm5, xmm4, xmm4, 177 #​3.12
vfmaddsub213ps xmm7, xmm5, xmm6 #​3.12
vfnmadd213ps xmm14, xmm11, xmm12 #​3.12
vshufps xmm13, xmm7, xmm7, 177 #​3.12
vmulps xmm0, xmm13, xmm14 #​3.12
ret

This is slightly longer code than you get for fast=1 performing 4 multiplications and one reciprocal.

It might be worth mentioning that http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf discusses vectorized complex arithmetic, for example in 6.6.1.1 which covers multiplication and division using SSE3. I don't know if it's helpful here.

@lesshaste
Copy link
Mannequin Author

lesshaste mannequin commented Feb 7, 2017

For completeness:

#include <complex.h>
complex long double f(complex long double x, complex long double y) {
return x/y;
}

In clang with -march=core-avx2 -Ofast you get

f: # @​f
sub rsp, 72
fld xword ptr [rsp + 80]
fld xword ptr [rsp + 96]
fld xword ptr [rsp + 112]
fld xword ptr [rsp + 128]
fstp xword ptr [rsp + 48]
fstp xword ptr [rsp + 32]
fstp xword ptr [rsp + 16]
fstp xword ptr [rsp]
call __divxc3
add rsp, 72
ret

In gcc and ICC the __divxc3 code is optimised (very differently) but still using x87 instructions. The gcc code is much shorter (and I suspect better) than the ICC code and is:

f:
fld TBYTE PTR [rsp+8]
fld TBYTE PTR [rsp+24]
fld TBYTE PTR [rsp+40]
fld TBYTE PTR [rsp+56]
fld st(1)
fmul st, st(2)
fld st(1)
fmul st, st(2)
faddp st(1), st
fld st(4)
fmul st, st(3)
fld st(4)
fmul st, st(3)
faddp st(1), st
fdiv st, st(1)
fxch st(4)
fmulp st(3), st
fxch st(4)
fmulp st(1), st
fsubp st(1), st
fdivrp st(2), st
ret

@hyp
Copy link
Member

hyp commented Feb 7, 2017

Thanks for collecting all of this information! It looks like I was right about ICC not promoting complex float with '-fp-model fast=2'.

I wonder if clang/llvm should follow ICC and try to promote the floating point types with '-ffast-math' or should it just use the original type like GCC seems to do even though the numerical stability might be effected.

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021
@jcranmer-intel
Copy link
Contributor

As of #70244, -ffast-math implies limited range division, and #81514 adds some more fine-grained control over which division algorithm is selected. I think this bug can be closed now as fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugzilla Issues migrated from bugzilla floating-point Floating-point math missed-optimization
Projects
None yet
Development

No branches or pull requests

4 participants