Use @simd for sum #6928
Conversation
Can you compare the native code that these two implementations generate? |
Output of |
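For reference, one way to dump the generated native code in Julia is `code_native`. A minimal sketch, assuming a hypothetical reduction kernel `mysum` (a stand-in for illustration, not the PR's actual code):

```julia
# Hypothetical kernel for illustration only; not the code in this PR.
function mysum(a::Vector{Float64})
    s = 0.0
    @simd for i in 1:length(a)
        @inbounds s += a[i]
    end
    return s
end

# Dump the machine code generated for this method signature:
code_native(mysum, (Vector{Float64},))
```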
This could also have to do with LLVM's AVX bug, see #6430 (comment) |
I noticed that […] The PR should remove the comment, since it's not always true with […]. I'll poke around with Amplifier (Intel VTune) to see if it can tell me something about the performance difference. |
@simonster - what model machine generated your example? That is, the n in Intel's "nth generation", or the codename (Sandy Bridge, Ivy Bridge, Haswell)? |
It's a Sandy Bridge-E system. In case the suboptimal performance was related to the LLVM version, I also tried compiling with LLVM SVN, but then the loop was not vectorized at all and performance was even worse. Thanks for looking into this! |
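As an aside, a quick way to check whether a loop was vectorized is to inspect the LLVM IR rather than the native code; a sketch, reusing the hypothetical `mysum` from above:

```julia
# Print the LLVM IR for this signature. A vectorized loop contains
# packed vector types such as <4 x double> (e.g. `fadd <4 x double>`);
# their absence suggests the loop was left scalar.
code_llvm(mysum, (Vector{Float64},))
```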
Thanks for the info. I was curious because I'm on a pre-production Haswell and am not seeing any AVX instructions. I'll try a Sandy Bridge; if I get AVX instructions there, I need to track down why they are not showing up for Haswell. So far, I'm seeing a weird "de-evolution" of LLVM for the […] |
Should we be filing an issue against LLVM as a regression? |
It's such a trivial but important case that I suspect that the problem is on our end, probably in the switch-over to MCJIT. I need to poke around some more with the LLVM trunk version. |
Just a heads-up on LLVM trunk: I'm currently in the process of fixing a bunch of LLVM bugs on trunk that are causing our tests to fail, so don't worry about those. |
It seems LLVM trunk has changed the rules for when use-def information is available. I've tracked the problem down to odd behavior for this LLVM construct in `src/llvm-simdloop.cpp`: […] In prior versions of LLVM, […]. Unless someone knows the solution, I'll start poking through other LLVM trunk passes that employ […] |
#8452 may have helped here. Without this PR:

```julia
julia> a = randn(10^9);

julia> @time Base.sum(a);
elapsed time: 0.558682776 seconds (820384 bytes allocated)

julia> @time Base.sum(a);
elapsed time: 0.55152656 seconds (96 bytes allocated)

julia> @time Base.sum(a);
elapsed time: 0.553031988 seconds (96 bytes allocated)
```

and with this PR:

```julia
julia> @time Base.sum(a);
elapsed time: 0.469242632 seconds (96 bytes allocated)

julia> @time Base.sum(a);
elapsed time: 0.46907222 seconds (96 bytes allocated)

julia> @time Base.sum(a);
elapsed time: 0.469068448 seconds (96 bytes allocated)
```
|
Will merge later today unless someone tells me not to. The logic seems sound as long as rearranging […] |
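(The caveat presumably concerns reassociating the floating-point additions, which `@simd` explicitly permits. A small illustration of why that matters, reusing the hypothetical `mysum` from above:)

```julia
a = rand(10^6)

serial     = foldl(+, a)  # strict left-to-right summation
vectorized = mysum(a)     # @simd lets LLVM reassociate the additions

# Reassociation changes the rounding of intermediate results, so the
# two sums are not guaranteed to be bit-identical:
serial == vectorized
abs(serial - vectorized)  # typically a tiny discrepancy, if any
```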
👍 |
Force-pushed from b0c677b to 972ac5a (commit note: Conflicts: base/reduce.jl).
Force-pushed from 972ac5a to 091c6db.
Thanks to @ArchRobison's work in #6926, this version of `sum_seq` gets auto-vectorized and partially unrolled, but @lindahua's code still beats LLVM on my system (a Core i7-3930K with AVX). With master, I see: […] With this PR, I see: […]

I would expect `@simd` to be at least slightly faster, but instead it is slightly slower. I tried adjusting `PAIRWISE_SUM_BLOCKSIZE`, but it makes no difference in performance; @lindahua's code is faster even for naive summation. What gives?
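For readers without the diff at hand, the shape of the change is roughly the following. This is a sketch only: `sum_seq_sketch`, `sum_pairwise_sketch`, and `BLOCKSIZE_SKETCH` are illustrative stand-ins for the PR's `sum_seq` and `PAIRWISE_SUM_BLOCKSIZE`, and the blocking logic is simplified.

```julia
# Sequential kernel: @simd asserts that iterations may be reordered
# and vectorized; @inbounds elides bounds checks.
function sum_seq_sketch(a, ifirst::Int, ilast::Int)
    s = zero(eltype(a))
    @simd for i in ifirst:ilast
        @inbounds s += a[i]
    end
    return s
end

# Pairwise driver: split recursively for accuracy, then hand small
# blocks to the sequential kernel. The cutoff plays the role of
# PAIRWISE_SUM_BLOCKSIZE.
const BLOCKSIZE_SKETCH = 1024

function sum_pairwise_sketch(a, ifirst::Int, ilast::Int)
    if ilast - ifirst < BLOCKSIZE_SKETCH
        return sum_seq_sketch(a, ifirst, ilast)
    else
        imid = (ifirst + ilast) >> 1
        return sum_pairwise_sketch(a, ifirst, imid) +
               sum_pairwise_sketch(a, imid + 1, ilast)
    end
end

sum_pairwise_sketch(randn(10^6), 1, 10^6)  # usage example
```

Raising the block size trades pairwise-summation accuracy for longer vectorized runs, which is why `PAIRWISE_SUM_BLOCKSIZE` was a natural knob to try when chasing the performance gap.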