Performance of sin, cos #12859
Comments
Looking at the compiled code with |
Please also provide the output of |
|
@yuyichao From the run times I wouldn't guess the C compiler is deleting or folding the loop. |
With GCC at the default optimization level, the assembly dump doesn't call the cos function at all (and you get a warning saying the |
@JeffBezanson I don't understand why gcc decides to remove the call to Here's the disassembly with the default optimization level if you want to have a look.
0000000000000660 <cos_c>:
660: 55 push %rbp
661: 48 89 e5 mov %rsp,%rbp
664: f2 0f 11 45 d8 movsd %xmm0,-0x28(%rbp)
669: f2 0f 11 4d d0 movsd %xmm1,-0x30(%rbp)
66e: 48 89 7d c8 mov %rdi,-0x38(%rbp)
672: f2 0f 10 05 56 00 00 movsd 0x56(%rip),%xmm0 # 6d0 <_fini+0xc>
679: 00
67a: f2 0f 11 45 e8 movsd %xmm0,-0x18(%rbp)
67f: f2 0f 10 45 d8 movsd -0x28(%rbp),%xmm0
684: f2 0f 11 45 f8 movsd %xmm0,-0x8(%rbp)
689: eb 28 jmp 6b3 <cos_c+0x53>
68b: 48 c7 45 f0 00 00 00 movq $0x0,-0x10(%rbp)
692: 00
693: eb 05 jmp 69a <cos_c+0x3a>
695: 48 83 45 f0 01 addq $0x1,-0x10(%rbp)
69a: 48 8b 45 f0 mov -0x10(%rbp),%rax
69e: 48 3b 45 c8 cmp -0x38(%rbp),%rax
6a2: 7c f1 jl 695 <cos_c+0x35>
6a4: f2 0f 10 45 f8 movsd -0x8(%rbp),%xmm0
6a9: f2 0f 58 45 e8 addsd -0x18(%rbp),%xmm0
6ae: f2 0f 11 45 f8 movsd %xmm0,-0x8(%rbp)
6b3: f2 0f 10 45 d0 movsd -0x30(%rbp),%xmm0
6b8: 66 0f 2e 45 f8 ucomisd -0x8(%rbp),%xmm0
6bd: 77 cc ja 68b <cos_c+0x2b>
6bf: 90 nop
6c0: 5d pop %rbp
6c1: c3 retq
With
0000000000000660 <cos_c>:
660: 66 0f 2e c8 ucomisd %xmm0,%xmm1
664: f2 0f 10 15 1c 00 00 movsd 0x1c(%rip),%xmm2 # 688 <_fini+0xc>
66b: 00
66c: 76 0c jbe 67a <cos_c+0x1a>
66e: 66 90 xchg %ax,%ax
670: f2 0f 58 c2 addsd %xmm2,%xmm0
674: 66 0f 2e c8 ucomisd %xmm0,%xmm1
678: 77 f6 ja 670 <cos_c+0x10>
67a: f3 c3 repz retq |
Well, how about that. Could you try with |
On the other hand, Also, FWIW, with gcc (
double __attribute__((noinline))
call_cos(double t)
{
    return cos(t);
} |
Hmm, gcc warns about a no-op and removes the call to
6ef: 55 push %rbp
6f0: 48 89 e5 mov %rsp,%rbp
6f3: f2 0f 11 45 d8 movsd %xmm0,-0x28(%rbp)
6f8: f2 0f 11 4d d0 movsd %xmm1,-0x30(%rbp)
6fd: 48 89 7d c8 mov %rdi,-0x38(%rbp)
701: f2 0f 10 05 57 00 00 movsd 0x57(%rip),%xmm0 # 760 <_fini+0xc>
708: 00
709: f2 0f 11 45 e8 movsd %xmm0,-0x18(%rbp)
70e: f2 0f 10 45 d8 movsd -0x28(%rbp),%xmm0
713: f2 0f 11 45 f8 movsd %xmm0,-0x8(%rbp)
718: eb 28 jmp 742 <cos_c+0x53>
71a: 48 c7 45 f0 00 00 00 movq $0x0,-0x10(%rbp)
721: 00
722: eb 05 jmp 729 <cos_c+0x3a>
724: 48 83 45 f0 01 addq $0x1,-0x10(%rbp)
729: 48 8b 45 f0 mov -0x10(%rbp),%rax
72d: 48 3b 45 c8 cmp -0x38(%rbp),%rax
731: 7c f1 jl 724 <cos_c+0x35>
733: f2 0f 10 45 f8 movsd -0x8(%rbp),%xmm0
738: f2 0f 58 45 e8 addsd -0x18(%rbp),%xmm0
73d: f2 0f 11 45 f8 movsd %xmm0,-0x8(%rbp)
742: f2 0f 10 45 d0 movsd -0x30(%rbp),%xmm0
747: 66 0f 2e 45 f8 ucomisd -0x8(%rbp),%xmm0
74c: 77 cc ja 71a <cos_c+0x2b>
74e: 90 nop
74f: 5d pop %rbp
750: c3 retq |
|
Clang performs the same as Julia here. (And I'm using the system libm for both.) |
@dressel I guess the performance gap in your application must have a different root cause? |
GCC also performs the same when that optimization is forced off, as mentioned above. I think there's another optimization option to tell GCC not to optimize out standard library calls, but I can't find it right now. |
Found it, |
Yikes, that was it. When I modify the C file to do something with the output of the cos call (like summing the values), gcc tells me I have an undefined reference to cos. I need to pass "-lm" when compiling to link the math library. I suppose it didn't tell me before because it just ignored the call to cos. I'm sorry about that, and thank you all for your help. @JeffBezanson Yeah, it must be something else. I suspected cos because the time range affected the Julia execution time more than it did in C (in my application, I do use the output of cos). |
@dressel It's probably better to use Edit: P.S. the reason using |
@yuyichao Ahh, that is very cool. Using |
Incidentally, there are #9942 and #12830 (work in progress) addressing this issue. This should allow LLVM to constant-fold these calls and thus either optimize or completely eliminate these loops. This optimization is also the reason behind the significant speedup discussed in http://www.johnmyleswhite.com/notebook/2013/12/06/writing-type-stable-code-in-julia/. |
Is this still the case? I spent hours optimizing a function with 10 lines of heavy array computations, and a cosine is still 90% of the computation time. |
I implemented some code in C and Julia, and was surprised about how much slower Julia was. I suspected an inner loop that called cos, so I ran repeated calls to cos in both C and Julia with the following script:
The C code:
The output from the Julia script:
Not only was the Julia code significantly slower, its execution time varied with the time range, whereas the C code's execution time was relatively constant. I'm not sure how the processor computes cosine, but I assume it involves a Taylor expansion--expanding for the time range (0,1) might be easier.
I assumed that Julia's calls to sin and cos would be the same as C's. Is it possible Julia's call to cos asks for higher precision? Or am I making some mistake in the Julia code (type stability, memory issues, etc.)?
I also tried using ccall, but the results were actually slower than either of these. I didn't include it for brevity, but I can put it up if it helps understand the issue. Any ideas what this could be?