
Performance of sin, cos #12859

Closed
dressel opened this issue Aug 28, 2015 · 21 comments
Labels
performance Must go faster

Comments

@dressel

dressel commented Aug 28, 2015

I implemented some code in C and Julia and was surprised by how much slower Julia was. I suspected an inner loop that called cos, so I timed repeated calls to cos in both C and Julia with the following script:

# Using Julia
function cos_j(start_val::Float64, end_val::Float64)
    step = 1e-6
    for t = start_val:step:end_val
        for i = 1:200
            cos(t)
        end
    end
end

# Using C
function cos_c(start_val::Float64, end_val::Float64)
    run(`./ct $start_val $end_val`)
end

# Force functions to compile first
cos_j(0.0, 1.0)
run(`gcc ct.c -o ct`)
cos_c(0.0, 1.0)
time()

# Julia
t1 = time()
cos_j(0.0, 1.0)
t2 = time()
println("\ncos_j(0,1) completes in ", round(t2 - t1, 3), " sec")
t1 = time()
cos_j(1.0, 2.0)
t2 = time()
println("cos_j(1,2) completes in ", round(t2 - t1, 3), " sec")

# C
t1 = time()
cos_c(0.0, 1.0)
t2 = time()
println("cos_c(0,1) completes in ", round(t2 - t1, 3), " sec")
t1 = time()
cos_c(1.0, 2.0)
t2 = time()
println("cos_c(1,2) completes in ", round(t2 - t1, 3), " sec")

The C code:

/* ct.c */
#include <stdio.h>
#include <stdlib.h>    // for atof
#include <math.h>      // for cos

int main(int argc, char *argv[])
{
    double t;
    double step = 1e-6;
    double startval = atof(argv[1]);
    double endval = atof(argv[2]);
    int i;
    for (t = startval; t < endval; t += step)
    {
        for (i = 0; i < 200; i++)
            cos(t);
    }
    return 0;
}

The output from the Julia script:

cos_j (0, 1) completes in 1.267 sec
cos_j (1, 2) completes in 1.917 sec
cos_c (0, 1) completes in 0.388 sec
cos_c (1, 2) completes in 0.384 sec

Not only was the Julia code significantly slower, its execution time varied with the input range, whereas the C code's execution time was roughly constant. I'm not sure how the processor computes cosine, but I assume it involves something like a Taylor expansion; expanding over the range (0,1) might be easier.

I assumed that Julia's calls to sin and cos would be the same as C's. Is it possible Julia's call to cos asks for higher precision? Or am I making some mistake in the Julia code (type stability, memory issues, etc.)?

I also tried using ccall, but the results were actually slower than either of these. I didn't include it for brevity, but I can put it up if it helps understand the issue. Any ideas what this could be?

@KristofferC
Member

Looking at the compiled code with @code_native might be useful.

@pao
Member

pao commented Aug 28, 2015

Please also provide the output of versioninfo() from the Julia console.

@JeffBezanson JeffBezanson added the performance Must go faster label Aug 28, 2015
@dressel
Author

dressel commented Aug 28, 2015

julia> versioninfo()
Julia Version 0.3.10
Commit c8ceeef (2015-06-24 13:54 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-4600U CPU @ 2.10GHz
  WORD_SIZE: 64
  BLAS: libopenblas (NO_LAPACK NO_LAPACKE DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: liblapack.so.3
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

@yuyichao
Contributor

Very likely dup of #12830 or #9942

@JeffBezanson
Member

@yuyichao From the run times I wouldn't guess the C compiler is deleting or folding the loop.

@yuyichao
Contributor

With GCC at the default optimization level, the assembly dump doesn't call the cos function at all (and with -Wall you get a warning saying the cos(t); line does nothing).

@yuyichao
Contributor

@JeffBezanson I don't understand why gcc decides to remove the call to cos but leaves the loop in place.

Here's the disassembly at the default optimization level if you want to have a look.

0000000000000660 <cos_c>:
 660:   55                      push   %rbp
 661:   48 89 e5                mov    %rsp,%rbp
 664:   f2 0f 11 45 d8          movsd  %xmm0,-0x28(%rbp)
 669:   f2 0f 11 4d d0          movsd  %xmm1,-0x30(%rbp)
 66e:   48 89 7d c8             mov    %rdi,-0x38(%rbp)
 672:   f2 0f 10 05 56 00 00    movsd  0x56(%rip),%xmm0        # 6d0 <_fini+0xc>
 679:   00 
 67a:   f2 0f 11 45 e8          movsd  %xmm0,-0x18(%rbp)
 67f:   f2 0f 10 45 d8          movsd  -0x28(%rbp),%xmm0
 684:   f2 0f 11 45 f8          movsd  %xmm0,-0x8(%rbp)
 689:   eb 28                   jmp    6b3 <cos_c+0x53>
 68b:   48 c7 45 f0 00 00 00    movq   $0x0,-0x10(%rbp)
 692:   00 
 693:   eb 05                   jmp    69a <cos_c+0x3a>
 695:   48 83 45 f0 01          addq   $0x1,-0x10(%rbp)
 69a:   48 8b 45 f0             mov    -0x10(%rbp),%rax
 69e:   48 3b 45 c8             cmp    -0x38(%rbp),%rax
 6a2:   7c f1                   jl     695 <cos_c+0x35>
 6a4:   f2 0f 10 45 f8          movsd  -0x8(%rbp),%xmm0
 6a9:   f2 0f 58 45 e8          addsd  -0x18(%rbp),%xmm0
 6ae:   f2 0f 11 45 f8          movsd  %xmm0,-0x8(%rbp)
 6b3:   f2 0f 10 45 d0          movsd  -0x30(%rbp),%xmm0
 6b8:   66 0f 2e 45 f8          ucomisd -0x8(%rbp),%xmm0
 6bd:   77 cc                   ja     68b <cos_c+0x2b>
 6bf:   90                      nop
 6c0:   5d                      pop    %rbp
 6c1:   c3                      retq   

With gcc -O2 most of the loop body is gone (I guess the floating-point increment is kept because the loop could be infinite?).

0000000000000660 <cos_c>:
 660:   66 0f 2e c8             ucomisd %xmm0,%xmm1
 664:   f2 0f 10 15 1c 00 00    movsd  0x1c(%rip),%xmm2        # 688 <_fini+0xc>
 66b:   00 
 66c:   76 0c                   jbe    67a <cos_c+0x1a>
 66e:   66 90                   xchg   %ax,%ax
 670:   f2 0f 58 c2             addsd  %xmm2,%xmm0
 674:   66 0f 2e c8             ucomisd %xmm0,%xmm1
 678:   77 f6                   ja     670 <cos_c+0x10>
 67a:   f3 c3                   repz retq 

@JeffBezanson
Member

Well, how about that. Could you try with -O0?

@yuyichao
Contributor

On the other hand, clang 3.6.2 is not able to remove the cos call even with -O3. It doesn't give a warning that cos(t) is a no-op either.

Also, FWIW, with gcc (5.2.0), putting the call to cos in a noinline function disables this optimization and yields the same performance as Julia:

double __attribute__((noinline))
call_cos(double t)
{
    return cos(t);
}

@yuyichao
Contributor

Hmm, gcc warns about the no-op and removes the call to cos even at -O0. (asm below in case I missed the call)

 6ef:   55                      push   %rbp
 6f0:   48 89 e5                mov    %rsp,%rbp
 6f3:   f2 0f 11 45 d8          movsd  %xmm0,-0x28(%rbp)
 6f8:   f2 0f 11 4d d0          movsd  %xmm1,-0x30(%rbp)
 6fd:   48 89 7d c8             mov    %rdi,-0x38(%rbp)
 701:   f2 0f 10 05 57 00 00    movsd  0x57(%rip),%xmm0        # 760 <_fini+0xc>
 708:   00 
 709:   f2 0f 11 45 e8          movsd  %xmm0,-0x18(%rbp)
 70e:   f2 0f 10 45 d8          movsd  -0x28(%rbp),%xmm0
 713:   f2 0f 11 45 f8          movsd  %xmm0,-0x8(%rbp)
 718:   eb 28                   jmp    742 <cos_c+0x53>
 71a:   48 c7 45 f0 00 00 00    movq   $0x0,-0x10(%rbp)
 721:   00 
 722:   eb 05                   jmp    729 <cos_c+0x3a>
 724:   48 83 45 f0 01          addq   $0x1,-0x10(%rbp)
 729:   48 8b 45 f0             mov    -0x10(%rbp),%rax
 72d:   48 3b 45 c8             cmp    -0x38(%rbp),%rax
 731:   7c f1                   jl     724 <cos_c+0x35>
 733:   f2 0f 10 45 f8          movsd  -0x8(%rbp),%xmm0
 738:   f2 0f 58 45 e8          addsd  -0x18(%rbp),%xmm0
 73d:   f2 0f 11 45 f8          movsd  %xmm0,-0x8(%rbp)
 742:   f2 0f 10 45 d0          movsd  -0x30(%rbp),%xmm0
 747:   66 0f 2e 45 f8          ucomisd -0x8(%rbp),%xmm0
 74c:   77 cc                   ja     71a <cos_c+0x2b>
 74e:   90                      nop
 74f:   5d                      pop    %rbp
 750:   c3                      retq   

@JeffBezanson
Member

How does clang's performance compare to Julia's then?
Sorry, didn't see that you mentioned gcc's time when actually calling cos.

@yuyichao
Contributor

Clang performs the same as Julia here. (And I'm using the system libm for both.)

@JeffBezanson
Member

@dressel I guess the performance gap in your application must have a different root cause?

@yuyichao
Contributor

GCC also performs the same when that optimization is forced off, as mentioned above. I think there's another option to tell gcc not to optimize out standard library calls, but I can't find it right now....

@yuyichao
Contributor

Found it: -fno-builtin disables this optimization for GCC.

yuyichao added a commit to yuyichao/explore that referenced this issue Aug 28, 2015
@dressel
Author

dressel commented Aug 28, 2015

Yikes, that was it. When I modified the C file to do something with the output of the cos call (like summing the results), gcc reported an undefined reference to cos; I needed to add -lm during compilation to link the math library. I suppose it didn't complain before because it simply dropped the call to cos. I'm sorry about that, and thank you all for your help.

@JeffBezanson Yeah, it must be something else. I suspected cos because the time range affected the Julia execution time more than the C time (in my application, I do use the output of cos).

@yuyichao
Contributor

@dressel It's probably better to use ccall when comparing performance with C, so that you don't pay the overhead of building a command line and launching an external program. See yuyichao/explore@f97ea4b for my test scripts. (It also avoids the link error because the symbol can be loaded from julia....)

Edit: P.S. The reason using ccall to call cos is slower in Julia is very likely that you are calling the system libm version instead of openlibm (or whatever faster version Julia uses by default).

@dressel
Author

dressel commented Aug 28, 2015

@yuyichao Ahh, that is very cool. Using libopenlibm gives roughly the same performance as Julia. Thanks for your tips!

@eschnett
Contributor

Incidentally, #9942 and #12830 (work in progress) address this issue. They should allow LLVM to constant-fold and thus either optimize or completely eliminate these loops. This optimization is also the reason behind the significant speedup discussed in http://www.johnmyleswhite.com/notebook/2013/12/06/writing-type-stable-code-in-julia/.

@yuyichao
Contributor

@eschnett #12859 (comment)

@SimonEnsemble

Is this still the case? I spent hours optimizing a function with 10 lines of heavy array computations, and a cosine is still 90% of the computation time.

7 participants