Complex division is not optimised with -ffast-math #31220
Comments
As a note, when Alex worked on the flang project, he had to experiment with different division algorithms to find one that was sufficiently numerically stable (so that the BLAS regression tests would pass, as I recall). This may or may not be relevant to fast-math complication, but just in case, see:
It looks like ICC doesn't use Smith's version; it looks more like the naive division, i.e. "(a+ib) / (c+id) = ((ac+bd)/(cc+dd)) + i((bc-ad)/(cc+dd))". But they promote the floats to doubles, so I guess that makes the precision of the naive algorithm better. What does ICC do for
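For reference, here is a minimal C sketch of Smith's algorithm as it is commonly described, for comparison with the naive formula quoted above. The function name `div_smith` is hypothetical, and this is not taken from flang, compiler-rt, or ICC; it only illustrates the scaling idea that avoids forming cc+dd directly.

```c
#include <complex.h>
#include <math.h>

/* Smith's algorithm for (a + ib) / (c + id).
   Divide through by the larger of |c| and |d| so that c*c + d*d is never
   computed, which avoids overflow/underflow in the denominator.
   Illustrative sketch only; no special handling of zero, inf, or NaN. */
static float complex div_smith(float complex x, float complex y) {
    float a = crealf(x), b = cimagf(x);
    float c = crealf(y), d = cimagf(y);
    float re, im;
    if (fabsf(c) >= fabsf(d)) {
        float r = d / c;          /* |r| <= 1 */
        float den = c + d * r;    /* equals (c*c + d*d) / c */
        re = (a + b * r) / den;
        im = (b - a * r) / den;
    } else {
        float r = c / d;          /* |r| < 1 */
        float den = c * r + d;    /* equals (c*c + d*d) / d */
        re = (a * r + b) / den;
        im = (b * r - a) / den;
    }
    return re + im * I;
}
```

The cost relative to the naive formula is a data-dependent branch and extra divisions, which is presumably why a fast-math lowering might prefer the naive form in a wider type.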
ICC appears to have a bug/feature when you change the type to complex double where you have to set -fp-model fast=2 to get anything sensible (with fast=1 you get x87 code!). In the fast=2 case you get: f: gcc -O3 -ffast-math -march=core-avx2 for the complex double code on the other hand gives: f: This may be better code, I am not expert enough to tell.
I think it could be a feature in ICC. With '-ffast-math' ICC promotes complex floats to doubles, so it's reasonable to assume that it would promote complex doubles to the 80-bit long doubles. That means it can't use SSE/AVX, and it has to emit the x87 FPU code. What does ICC do for Looking at the comparison between ICC and GCC, it's interesting how ICC leverages the *pd instructions to reduce the number of arithmetic instructions in the code. I'm not sure if it's better than GCC's version in terms of performance, though. It definitely looks worse from the code-size perspective.
I looked at ICC and this code as requested: #include <complex.h> Using '-fp-model fast=2 -march=core-avx2 -O3' you get f: This is slightly longer code than you get for fast=1, performing 4 multiplications and one reciprocal. It might be worth mentioning that http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf discusses vectorized complex arithmetic, for example in 6.6.1.1 which covers multiplication and division using SSE3. I don't know if it's helpful here.
For completeness: #include <complex.h> In clang with -march=core-avx2 -Ofast you get f: # @f In gcc and ICC the __divxc3 code is optimised (very differently) but still using x87 instructions. The gcc code is much shorter (and I suspect better) than the ICC code and is: f:
Thanks for collecting all of this information! It looks like I was right about ICC not promoting I wonder if clang/llvm should follow ICC and try to promote the floating point types with '-ffast-math', or should it just use the original type like GCC seems to do, even though the numerical stability might be affected.
Extended Description
Consider:
#include <complex.h>
complex float f(complex float x, complex float y) {
return x/y;
}
clang trunk with -O3 -march=core-avx2, with or without -ffast-math, gives:
f: # @f
vmovaps xmm2, xmm1
vmovshdup xmm1, xmm0 # xmm1 = xmm0[1,1,3,3]
vmovshdup xmm3, xmm2 # xmm3 = xmm2[1,1,3,3]
jmp __divsc3 # TAILCALL
However, both gcc and ICC attempt to optimise this code when -ffast-math (or equivalent) is enabled.
ICC appears to give the fastest code, which is:
f:
vcvtps2pd xmm2, xmm1 #3.12
vcvtps2pd xmm4, xmm0 #3.12
vmulpd xmm8, xmm2, xmm2 #3.12
vunpckhpd xmm3, xmm2, xmm2 #3.12
vmulpd xmm6, xmm3, xmm4 #3.12
vmovddup xmm7, xmm2 #3.12
vshufpd xmm5, xmm4, xmm4, 1 #3.12
vshufpd xmm9, xmm8, xmm8, 1 #3.12
vfmaddsub213pd xmm7, xmm5, xmm6 #3.12
vaddpd xmm11, xmm8, xmm9 #3.12
vshufpd xmm10, xmm7, xmm7, 1 #3.12
vdivpd xmm12, xmm10, xmm11 #3.12
vcvtpd2ps xmm0, xmm12 #3.12
ret
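Reading the ICC listing above, it appears to widen both operands to double (vcvtps2pd), evaluate the naive quotient formula with packed-double multiplies and a single vdivpd, and narrow the result back to float (vcvtpd2ps). A rough C equivalent, as a sketch of that reading rather than ICC's actual source-level transformation, could look like the following; the function name `div_naive_promoted` is hypothetical.

```c
#include <complex.h>

/* Naive complex division carried out in double, then rounded to float.
   For x = a + ib and y = c + id:
     real = (a*c + b*d) / (c*c + d*d)
     imag = (b*c - a*d) / (c*c + d*d)
   Widening to double means c*c + d*d cannot overflow for any finite
   float inputs. Illustrative sketch only; no inf/NaN recovery as done
   by __divsc3. */
static float complex div_naive_promoted(float complex x, float complex y) {
    double a = crealf(x), b = cimagf(x);
    double c = crealf(y), d = cimagf(y);
    double den = c * c + d * d;
    float re = (float)((a * c + b * d) / den);
    float im = (float)((b * c - a * d) / den);
    return re + im * I;
}
```

Unlike the call to __divsc3 that clang emits, this form has no branches and no library call, so it is straightforward to vectorise.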