
Use atsign-simd for sum #6928

Merged
merged 2 commits into from
Sep 26, 2014
Conversation

simonster
Copy link
Member

Thanks to @ArchRobison's work in #6926, this version of sum_seq gets auto-vectorized and partially unrolled, but @lindahua's code still beats LLVM on my system (a Core i7-3930K with AVX). With master, I see:

julia> a = randn(1000000000);

julia> @time sum(a)
elapsed time: 0.527515614 seconds (64 bytes allocated)

With this PR, I see:

julia> @time sum(a)
elapsed time: 0.660926643 seconds (64 bytes allocated)

I would expect @simd to be at least slightly faster, but instead, it is slightly slower. I tried adjusting PAIRWISE_SUM_BLOCKSIZE but it makes no difference in performance; @lindahua's code is faster even for naive summation. What gives?
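For reference, the `@simd` version under discussion has roughly this shape (a minimal sketch in the spirit of the PR's `sum_seq`, not the exact Base implementation):

```julia
# Sketch of a @simd reduction loop: @simd tells LLVM it may vectorize
# and reassociate the floating-point additions in this loop.
function simd_sum(a::Vector{Float64})
    s = 0.0
    @simd for i in eachindex(a)
        @inbounds s += a[i]
    end
    return s
end
```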

@simonster simonster changed the title Use atsign-simd for sum WIP: Use atsign-simd for sum May 23, 2014
@lindahua
Copy link
Contributor

Can you compare the native codes that these two implementations generate?

@simonster
Copy link
Member Author

Output of code_native and code_llvm for master and this PR is here.

@cbecker
Copy link
Contributor

cbecker commented May 23, 2014

This could also have to do with LLVM's AVX bug, see #6430 (comment)

@ArchRobison
Copy link
Contributor

I noticed that sum_seq has a comment

# a fast implementation of sum in sequential order (from left to right).

The PR should remove the comment since it's not always true with @simd.

I'll poke around with Amplifier to see if it can tell me something about the performance difference.
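(Why the "sequential order" comment no longer holds: `@simd` permits LLVM to reassociate the floating-point additions, and reassociation can change the computed sum. A minimal illustration:)

```julia
# Floating-point addition is not associative, so reordering a reduction
# (as @simd allows) can change the result.
a, b, c = 1.0e16, -1.0e16, 1.0
left  = (a + b) + c   # cancellation first, then add 1.0 -> 1.0
right = a + (b + c)   # 1.0 is absorbed into -1.0e16 -> 0.0
```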

@ArchRobison
Copy link
Contributor

@simonster - what model machine generated your example? I.e. the n in Intel "nth" generation or codename (Sandy Bridge, Ivy Bridge, Haswell)?

@simonster
Copy link
Member Author

It's a Sandy Bridge-E system. In case the suboptimal performance was related to the LLVM version, I also tried compiling with LLVM SVN, but then the loop was not vectorized at all and performance was even worse. Thanks for looking into this!

@ArchRobison
Copy link
Contributor

Thanks for the info. I was curious because I'm on a pre-production Haswell and am not seeing any AVX instructions. I'll try a Sandy Bridge. If I get AVX instructions there, I need to track down why they are not showing up for Haswell.

So far, I'm seeing a weird "de-evolution" of LLVM for the @simd version of sum_seq:

  • LLVM 3.3 makes a total mess.
  • LLVM 3.4 generates clean "vector" code, but it's summing four scalar streams, like the old sum_seq does.
  • LLVM trunk seems to give up vectorizing at all, as you saw too.
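For context, "summing four scalar streams, like the old sum_seq does" refers to a manually unrolled loop with independent accumulators, roughly like this (a simplified sketch of @lindahua's approach, not the exact Base code):

```julia
# Simplified sketch of a 4-accumulator unrolled sum: four independent
# partial sums break the sequential dependence chain, letting the CPU
# overlap the floating-point additions even without SIMD instructions.
function unrolled_sum(a::Vector{Float64})
    n = length(a)
    s1 = s2 = s3 = s4 = 0.0
    i = 1
    @inbounds while i + 3 <= n
        s1 += a[i]; s2 += a[i+1]; s3 += a[i+2]; s4 += a[i+3]
        i += 4
    end
    @inbounds while i <= n   # handle the remainder elements
        s1 += a[i]
        i += 1
    end
    return (s1 + s2) + (s3 + s4)
end
```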

@ViralBShah
Copy link
Member

Should we be filing an issue against LLVM as a regression?

@ArchRobison
Copy link
Contributor

It's such a trivial but important case that I suspect that the problem is on our end, probably in the switch-over to MCJIT. I need to poke around some more with the LLVM trunk version.

@Keno
Copy link
Member

Keno commented May 27, 2014

Just a heads up on llvm trunk. I'm currently in the process of fixing a bunch of LLVM bugs on trunk that are causing our tests to fail, so don't worry about those.

@ArchRobison
Copy link
Contributor

It seems LLVM trunk has changed the rules for when use-def information is available. I've tracked the problem down to odd behavior for this LLVM construct in `src/llvm-simdloop.cpp`:

    for (Value::use_iterator UI = I->use_begin(), UE = I->use_end(); UI != UE; ++UI) {
        Instruction *U = cast<Instruction>(*UI);

In prior versions of LLVM, UI iterated over the instructions that use instruction I. But now I'm seeing U==I when I is a Phi, when it should be the instruction that uses the Phi.

Unless someone knows the solution, I'll start poking through other LLVM trunk passes that employ Value::use_iterator to see what they do.

@simonster
Copy link
Member Author

#8452 may have helped here. @simd is now ~17% faster than the manually unrolled loop on my Sandy Bridge system. With master:

julia> a = randn(10^9);

julia> @time Base.sum(a);
elapsed time: 0.558682776 seconds (820384 bytes allocated)

julia> @time Base.sum(a);
elapsed time: 0.55152656 seconds (96 bytes allocated)

julia> @time Base.sum(a);
elapsed time: 0.553031988 seconds (96 bytes allocated)

and with this PR:

julia> @time Base.sum(a);
elapsed time: 0.469242632 seconds (96 bytes allocated)

julia> @time Base.sum(a);
elapsed time: 0.46907222 seconds (96 bytes allocated)

julia> @time Base.sum(a);
elapsed time: 0.469068448 seconds (96 bytes allocated)

@simonster
Copy link
Member Author

Will merge later today unless someone tells me not to. The logic seems sound as long as rearranging sysimg.jl this way won't cause problems later.

@JeffBezanson
Copy link
Member

👍

@simonster simonster changed the title WIP: Use atsign-simd for sum Use atsign-simd for sum Sep 26, 2014
simonster added a commit that referenced this pull request Sep 26, 2014
@simonster simonster merged commit f973420 into master Sep 26, 2014
@simonster simonster deleted the sjk/simd-sum branch September 26, 2014 18:10