
Performance of sin, cos #12859

Closed
dressel opened this issue Aug 28, 2015 · 21 comments
Labels
performance Must go faster

Comments

@dressel

dressel commented Aug 28, 2015

I implemented some code in C and Julia and was surprised by how much slower Julia was. I suspected an inner loop that called cos, so I timed repeated calls to cos in both C and Julia with the following script:

# Using Julia
function cos_j(start_val::Float64, end_val::Float64)
    step = 1e-6
    for t = start_val:step:end_val
        for i = 1:200
            cos(t)
        end
    end
end

# Using C
function cos_c(start_val::Float64, end_val::Float64)
    run(`./ct $start_val $end_val`)
end

# Force functions to compile first
cos_j(0.0, 1.0)
run(`gcc ct.c -o ct`)
cos_c(0.0, 1.0)
time()

# Julia
t1 = time()
cos_j(0.0, 1.0)
t2 = time()
println("\ncos_j(0,1) completes in ", round(t2 - t1, 3), " sec")
t1 = time()
cos_j(1.0, 2.0)
t2 = time()
println("cos_j(1,2) completes in ", round(t2 - t1, 3), " sec")

# C
t1 = time()
cos_c(0.0, 1.0)
t2 = time()
println("cos_c(0,1) completes in ", round(t2 - t1, 3), " sec")
t1 = time()
cos_c(1.0, 2.0)
t2 = time()
println("cos_c(1,2) completes in ", round(t2 - t1, 3), " sec")

The C code:

/* ct.c */
#include <stdio.h>
#include <stdlib.h>    // for atof
#include <math.h>      // for cos

int main(int argc, char *argv[])
{
    double t;
    double step = 1e-6;
    double startval = atof(argv[1]);
    double endval = atof(argv[2]);
    int i;
    for (t = startval; t < endval; t += step)
    {
        for (i = 0; i < 200; i++)
            cos(t);
    }
    return 0;
}

The output from the Julia script:

cos_j (0, 1) completes in 1.267 sec
cos_j (1, 2) completes in 1.917 sec
cos_c (0, 1) completes in 0.388 sec
cos_c (1, 2) completes in 0.384 sec

Not only was the Julia code significantly slower, its execution time varied with the input range, whereas the C code's execution time was roughly constant. I'm not sure how the processor computes cosine, but I assume it involves something like a Taylor expansion; expanding over the range (0,1) might be easier.

I assumed that Julia's calls to sin and cos would be the same as C's. Is it possible Julia's call to cos asks for higher precision? Or am I making some mistake in the Julia code (type stability, memory issues, etc.)?

I also tried using ccall, but the results were actually slower than either of these. I didn't include it for brevity, but I can put it up if it helps understand the issue. Any ideas what this could be?

@KristofferC
Member

Looking at the compiled code with @code_native might be useful.

@pao
Member

pao commented Aug 28, 2015

Please also provide the output of versioninfo() from the Julia console.

@JeffBezanson JeffBezanson added the performance Must go faster label Aug 28, 2015
@dressel
Author

dressel commented Aug 28, 2015

julia> versioninfo()
Julia Version 0.3.10
Commit c8ceeef (2015-06-24 13:54 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4600U CPU @ 2.10GHz
  WORD_SIZE: 64
  BLAS: libopenblas (NO_LAPACK NO_LAPACKE DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: liblapack.so.3
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

@yuyichao
Contributor

Very likely dup of #12830 or #9942

@JeffBezanson
Member

@yuyichao From the run times I wouldn't guess the C compiler is deleting or folding the loop.

@yuyichao
Contributor

With GCC at the default optimization level, the assembly dump doesn't call the cos function at all (and with -Wall you get a warning saying the cos(t); line does nothing).

@yuyichao
Contributor

@JeffBezanson I don't understand why gcc decides to remove the call to cos but leaves the loop in place.

Here's the disassembly at the default optimization level if you want to have a look.

0000000000000660 <cos_c>:
 660:   55                      push   %rbp
 661:   48 89 e5                mov    %rsp,%rbp
 664:   f2 0f 11 45 d8          movsd  %xmm0,-0x28(%rbp)
 669:   f2 0f 11 4d d0          movsd  %xmm1,-0x30(%rbp)
 66e:   48 89 7d c8             mov    %rdi,-0x38(%rbp)
 672:   f2 0f 10 05 56 00 00    movsd  0x56(%rip),%xmm0        # 6d0 <_fini+0xc>
 679:   00 
 67a:   f2 0f 11 45 e8          movsd  %xmm0,-0x18(%rbp)
 67f:   f2 0f 10 45 d8          movsd  -0x28(%rbp),%xmm0
 684:   f2 0f 11 45 f8          movsd  %xmm0,-0x8(%rbp)
 689:   eb 28                   jmp    6b3 <cos_c+0x53>
 68b:   48 c7 45 f0 00 00 00    movq   $0x0,-0x10(%rbp)
 692:   00 
 693:   eb 05                   jmp    69a <cos_c+0x3a>
 695:   48 83 45 f0 01          addq   $0x1,-0x10(%rbp)
 69a:   48 8b 45 f0             mov    -0x10(%rbp),%rax
 69e:   48 3b 45 c8             cmp    -0x38(%rbp),%rax
 6a2:   7c f1                   jl     695 <cos_c+0x35>
 6a4:   f2 0f 10 45 f8          movsd  -0x8(%rbp),%xmm0
 6a9:   f2 0f 58 45 e8          addsd  -0x18(%rbp),%xmm0
 6ae:   f2 0f 11 45 f8          movsd  %xmm0,-0x8(%rbp)
 6b3:   f2 0f 10 45 d0          movsd  -0x30(%rbp),%xmm0
 6b8:   66 0f 2e 45 f8          ucomisd -0x8(%rbp),%xmm0
 6bd:   77 cc                   ja     68b <cos_c+0x2b>
 6bf:   90                      nop
 6c0:   5d                      pop    %rbp
 6c1:   c3                      retq   

With gcc -O2 most of the loop body is gone (I guess the floating-point increment is kept because the loop could be infinite?).

0000000000000660 <cos_c>:
 660:   66 0f 2e c8             ucomisd %xmm0,%xmm1
 664:   f2 0f 10 15 1c 00 00    movsd  0x1c(%rip),%xmm2        # 688 <_fini+0xc>
 66b:   00 
 66c:   76 0c                   jbe    67a <cos_c+0x1a>
 66e:   66 90                   xchg   %ax,%ax
 670:   f2 0f 58 c2             addsd  %xmm2,%xmm0
 674:   66 0f 2e c8             ucomisd %xmm0,%xmm1
 678:   77 f6                   ja     670 <cos_c+0x10>
 67a:   f3 c3                   repz retq 

@JeffBezanson
Member

Well, how about that. Could you try with -O0?

@yuyichao
Contributor

On the other hand, clang 3.6.2 is not able to remove the cos call even with -O3. It doesn't give a warning that cos(t) is a no-op either.

Also, FWIW, with gcc (5.2.0), putting the call to cos in a noinline function disables this optimization and yields the same performance as Julia:

double __attribute__((noinline))
call_cos(double t)
{
    return cos(t);
}

@yuyichao
Contributor

Hmm, gcc warns about the no-op and removes the call to cos even at -O0. (asm below in case I missed the call)

 6ef:   55                      push   %rbp
 6f0:   48 89 e5                mov    %rsp,%rbp
 6f3:   f2 0f 11 45 d8          movsd  %xmm0,-0x28(%rbp)
 6f8:   f2 0f 11 4d d0          movsd  %xmm1,-0x30(%rbp)
 6fd:   48 89 7d c8             mov    %rdi,-0x38(%rbp)
 701:   f2 0f 10 05 57 00 00    movsd  0x57(%rip),%xmm0        # 760 <_fini+0xc>
 708:   00 
 709:   f2 0f 11 45 e8          movsd  %xmm0,-0x18(%rbp)
 70e:   f2 0f 10 45 d8          movsd  -0x28(%rbp),%xmm0
 713:   f2 0f 11 45 f8          movsd  %xmm0,-0x8(%rbp)
 718:   eb 28                   jmp    742 <cos_c+0x53>
 71a:   48 c7 45 f0 00 00 00    movq   $0x0,-0x10(%rbp)
 721:   00 
 722:   eb 05                   jmp    729 <cos_c+0x3a>
 724:   48 83 45 f0 01          addq   $0x1,-0x10(%rbp)
 729:   48 8b 45 f0             mov    -0x10(%rbp),%rax
 72d:   48 3b 45 c8             cmp    -0x38(%rbp),%rax
 731:   7c f1                   jl     724 <cos_c+0x35>
 733:   f2 0f 10 45 f8          movsd  -0x8(%rbp),%xmm0
 738:   f2 0f 58 45 e8          addsd  -0x18(%rbp),%xmm0
 73d:   f2 0f 11 45 f8          movsd  %xmm0,-0x8(%rbp)
 742:   f2 0f 10 45 d0          movsd  -0x30(%rbp),%xmm0
 747:   66 0f 2e 45 f8          ucomisd -0x8(%rbp),%xmm0
 74c:   77 cc                   ja     71a <cos_c+0x2b>
 74e:   90                      nop
 74f:   5d                      pop    %rbp
 750:   c3                      retq   

@JeffBezanson
Member

How does clang's performance compare to Julia's then?
Sorry, didn't see that you mentioned gcc's time when actually calling cos.

@yuyichao
Contributor

Clang performs the same as Julia here. (And I'm using the system libm for both.)

@JeffBezanson
Member

@dressel I guess the performance gap in your application must have a different root cause?

@yuyichao
Contributor

GCC also performs the same when that optimization is forced off, as mentioned above. I think there's another option to tell gcc not to optimize out standard library calls, but I can't find it right now....

@yuyichao
Contributor

Found it: -fno-builtin disables this optimization for GCC.

yuyichao added a commit to yuyichao/explore that referenced this issue Aug 28, 2015
@dressel
Author

dressel commented Aug 28, 2015

Yikes, that was it. When I modified the C file to do something with the output of the cos call (like summing the results), gcc reported an undefined reference to cos; I needed to add -lm during compilation to link the math library. I suppose it didn't complain before because it simply dropped the call to cos. I'm sorry about that, and thank you all for your help.

@JeffBezanson Yeah, it must be something else. I suspected cos because the time range affected the Julia execution time more than the C time (in my application, I do use the output of cos).

@yuyichao
Contributor

@dressel It's probably better to use ccall when comparing performance with C, so that you don't pay the overhead of building a command line and launching an external program. See yuyichao/explore@f97ea4b for my test scripts. (It also avoids the link error because the symbol can be loaded from julia....)

Edit: P.S. The reason using ccall to call cos is slower in Julia is very likely that you are calling the system libm version instead of openlibm (or whatever faster version Julia uses by default).

@dressel
Author

dressel commented Aug 28, 2015

@yuyichao Ahh, that is very cool. Using libopenlibm gives roughly the same performance as Julia. Thanks for your tips!

@eschnett
Contributor

Incidentally, #9942 and #12830 (work in progress) address this issue. They should allow LLVM to constant-fold and thus either optimize or completely eliminate these loops. This optimization is also the reason behind the significant speedup discussed in http://www.johnmyleswhite.com/notebook/2013/12/06/writing-type-stable-code-in-julia/.

@yuyichao
Contributor

@eschnett #12859 (comment)

@SimonEnsemble

Is this still the case? I spent hours optimizing a function with 10 lines of heavy array computations, and a cosine is still 90% of the computation time.

7 participants