
Use atsign-simd for sum #6928

Merged
merged 2 commits into from
Sep 26, 2014
Conversation

simonster
Copy link
Member

Thanks to @ArchRobison's work in #6926, this version of sum_seq gets auto-vectorized and partially unrolled, but @lindahua's code still beats LLVM on my system (a Core i7-3930K with AVX). With master, I see:

julia> a = randn(1000000000);

julia> @time sum(a)
elapsed time: 0.527515614 seconds (64 bytes allocated)

With this PR, I see:

julia> @time sum(a)
elapsed time: 0.660926643 seconds (64 bytes allocated)

I would expect @simd to be at least slightly faster, but instead, it is slightly slower. I tried adjusting PAIRWISE_SUM_BLOCKSIZE but it makes no difference in performance; @lindahua's code is faster even for naive summation. What gives?
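For reference, the `@simd` version under discussion has roughly this shape (a minimal sketch in the spirit of the PR's `sum_seq`, not the exact Base implementation):

```julia
# Sketch of a @simd reduction loop: @simd tells LLVM it may vectorize
# and reassociate the floating-point additions in this loop.
function simd_sum(a::Vector{Float64})
    s = 0.0
    @simd for i in eachindex(a)
        @inbounds s += a[i]
    end
    return s
end
```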

@simonster simonster changed the title Use atsign-simd for sum WIP: Use atsign-simd for sum May 23, 2014
@lindahua
Copy link
Contributor

Can you compare the native codes that these two implementations generate?

@simonster
Copy link
Member Author

Output of code_native and code_llvm for master and this PR is here.

@cbecker
Copy link
Contributor

cbecker commented May 23, 2014

This could also have to do with LLVM's AVX bug, see #6430 (comment)

@ArchRobison
Copy link
Contributor

I noticed that sum_seq has a comment

# a fast implementation of sum in sequential order (from left to right).

The PR should remove the comment since it's not always true with @simd.

I'll poke around with Amplifier to see if it can tell me something about the performance difference.
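(Why the "sequential order" comment no longer holds: `@simd` permits LLVM to reassociate the floating-point additions, and reassociation can change the computed sum. A minimal illustration:)

```julia
# Floating-point addition is not associative, so reordering a reduction
# (as @simd allows) can change the result.
a, b, c = 1.0e16, -1.0e16, 1.0
left  = (a + b) + c   # cancellation first, then add 1.0 -> 1.0
right = a + (b + c)   # 1.0 is absorbed into -1.0e16 -> 0.0
```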

@ArchRobison
Copy link
Contributor

@simonster - what model machine generated your example? I.e. the n in Intel "nth" generation or codename (Sandy Bridge, Ivy Bridge, Haswell)?

@simonster
Copy link
Member Author

It's a Sandy Bridge-E system. In case the suboptimal performance was related to the LLVM version, I also tried compiling with LLVM SVN, but then the loop was not vectorized at all and performance was even worse. Thanks for looking into this!

@ArchRobison
Copy link
Contributor

Thanks for the info. I was curious because I'm on a pre-production Haswell and am not seeing any AVX instructions. I'll try a Sandy Bridge. If I get AVX instructions there, I need to track down why they are not showing up for Haswell.

So far, I'm seeing a weird "de-evolution" of LLVM for the @simd version of sum_seq:

  • LLVM 3.3 makes a total mess.
  • LLVM 3.4 generates clean "vector" code, but it's summing four scalar streams, like the old sum_seq does.
  • LLVM trunk seems to give up vectorizing at all, as you saw too.
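For context, "summing four scalar streams, like the old sum_seq does" refers to a manually unrolled loop with independent accumulators, roughly like this (a simplified sketch of @lindahua's approach, not the exact Base code):

```julia
# Simplified sketch of a 4-accumulator unrolled sum: four independent
# partial sums break the sequential dependence chain, letting the CPU
# overlap the floating-point additions even without SIMD instructions.
function unrolled_sum(a::Vector{Float64})
    n = length(a)
    s1 = s2 = s3 = s4 = 0.0
    i = 1
    @inbounds while i + 3 <= n
        s1 += a[i]; s2 += a[i+1]; s3 += a[i+2]; s4 += a[i+3]
        i += 4
    end
    @inbounds while i <= n   # handle the remainder elements
        s1 += a[i]
        i += 1
    end
    return (s1 + s2) + (s3 + s4)
end
```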

@ViralBShah
Copy link
Member

Should we be filing an issue against LLVM as a regression?

@ArchRobison
Copy link
Contributor

It's such a trivial but important case that I suspect that the problem is on our end, probably in the switch-over to MCJIT. I need to poke around some more with the LLVM trunk version.

@Keno
Copy link
Member

Keno commented May 27, 2014

Just a heads up on llvm trunk. I'm currently in the process of fixing a bunch of LLVM bugs on trunk that are causing our tests to fail, so don't worry about those.

@ArchRobison
Copy link
Contributor

It seems LLVM trunk has changed the rules for when use-def information is available. I've tracked the problem down to odd behavior for this LLVM construct in `src/llvm-simdloop.cpp`:

    for (Value::use_iterator UI = I->use_begin(), UE = I->use_end(); UI != UE; ++UI) {
        Instruction *U = cast<Instruction>(*UI);

In prior versions of LLVM, UI iterated over the instructions that use instruction I. But now I'm seeing U==I when I is a Phi, when it should be the instruction that uses the Phi.

Unless someone knows the solution, I'll start poking through other LLVM trunk passes that employ Value::use_iterator to see what they do.

@simonster
Copy link
Member Author

#8452 may have helped here. @simd is now ~17% faster than the manually unrolled loop on my Sandy Bridge system. With master:

julia> a = randn(10^9);

julia> @time Base.sum(a);
elapsed time: 0.558682776 seconds (820384 bytes allocated)

julia> @time Base.sum(a);
elapsed time: 0.55152656 seconds (96 bytes allocated)

julia> @time Base.sum(a);
elapsed time: 0.553031988 seconds (96 bytes allocated)

and with this PR:

julia> @time Base.sum(a);
elapsed time: 0.469242632 seconds (96 bytes allocated)

julia> @time Base.sum(a);
elapsed time: 0.46907222 seconds (96 bytes allocated)

julia> @time Base.sum(a);
elapsed time: 0.469068448 seconds (96 bytes allocated)

@simonster
Copy link
Member Author

Will merge later today unless someone tells me not to. The logic seems sound as long as rearranging sysimg.jl this way won't cause problems later.

@JeffBezanson
Copy link
Member

👍

@simonster simonster changed the title WIP: Use atsign-simd for sum Use atsign-simd for sum Sep 26, 2014
simonster added a commit that referenced this pull request Sep 26, 2014
@simonster simonster merged commit f973420 into master Sep 26, 2014
@simonster simonster deleted the sjk/simd-sum branch September 26, 2014 18:10