
Add support for @simd #5355

Merged: 9 commits, Mar 31, 2014

Conversation

ArchRobison
Contributor

This pull request enables the LLVM loop vectorizer. It's not quite ready for production. I'd like feedback and help fixing some issues. The overall design is explained in this comment to issue #4786, except that it no longer relies on the "banana interface" mentioned in that comment.

Here is an example that it can vectorize when a is of type Float32, and x and y are of type Array{Float32,1}:

function saxpy( a, x, y )
    @simd for i=1:length(x)
        @inbounds y[i] = y[i]+a*x[i];
    end
end

I've seen the vectorized version run 3x faster than the unvectorized version when data fits in cache. When AVX can be enabled, the results are likely even better.

Programmers can put the @simd macro in front of one-dimensional for loops that have ranges of the form m:n, where the type of the loop index supports < and +. The decoration guarantees that the loop does not rely on wrap-around behavior and that the loop iterations are safe to execute in parallel, even if chunks are executed in lockstep.
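
For contrast, here is the kind of loop that must not be marked with @simd, since each iteration reads the result of the previous one (a sketch; the function is only illustrative):

function prefix_sum!(x)
    for i = 2:length(x)
        @inbounds x[i] = x[i-1] + x[i]   # loop-carried dependence: iterations are not independent
    end
    x
end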

The patch implements type-based alias analysis, which may help LLVM optimize better in general and is essential for vectorization. The name "type-based alias analysis" is a bit of a misnomer, since it's really based on hierarchically partitioning memory. I've implemented it for Julia assuming that type-punning is never done for parts of data structures that users cannot access directly, but that user data can be type-punned freely.
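
For example, the analysis still has to allow user-level punning like the following (nothing new here, just an illustration of the assumption):

x = Float32[1.0f0, 2.0f0, 3.0f0]
y = reinterpret(Int32, x)   # y shares memory with x, so stores through y may alias loads of x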

Problems that I seek advice on:

  • The @simd macro is not found. Currently I have to do the following within the REPL:
include("base/simdloop.jl")
using SimdLoop.@simd

I tried to copy the way @printf is defined/exported, but something is wrong with my patch. What?

  • LLVM 3.3 disallows attaching metadata to a block, so I've attached it to an instruction in the block. It's kind of ad-hoc, but seems to work. Is there a better way to do it?
  • An alternative to attaching metadata is to eliminate src/llvm-simdloop.cpp and instead rely on LLVM's auto-vectorization capability, which inserts memory dependence tests. That indeed works for the saxpy example above, i.e. it vectorizes without the support of src/llvm-simdloop.cpp. However, @simd would still be necessary to transform the loop into a form for which LLVM can compute a trip count.
  • An alternative to the trip-count issue is to eliminate @simd altogether and instead somehow ensure that m:n is lowered to a form for which LLVM can compute a trip count.
  • I'm a neophyte at writing macros, so base/simdloop.jl could use a review by an expert.

Apologies for the useless comment:

This file defines two entry points:

I just noticed it, but since it's late on a Friday, I'll fix it later. It's supposed to say that one entry point is for marking simd loops and the other is for later lowering marked loops.

Thanks to @simonster for his information on enabling the loop vectorizer. It was a big help to get me going.

@simonster
Member

Amazing!

@jiahao
Member

jiahao commented Jan 10, 2014

😺

@johnmyleswhite
Member

💯

@JeffBezanson
Member

Amazing, I look forward to reading this in detail. Even just the TBAA part is great to have.

@@ -175,6 +175,9 @@ using .I18n
using .Help
push!(I18n.CALLBACKS, Help.clear_cache)

# SIMD loops
include("simdloop.jl")
Member

I think you might need a

importall .SimdLoop

here?

Contributor Author

Thanks! Now added.

@lindahua
Contributor

Eagerly looking forward to this.

@ViralBShah
Member

Likewise. Waiting for this to land.

@ArchRobison
Contributor Author

One feature of the pull request is that it enables auto-vectorization of some loops without @simd. But it's quirky, and the underlying reason for the quirkiness needs discussion, because with a small change we might be able to enable wider use of auto-vectorization in Julia. Consider the following example:

function saxpy( a, x, y )
    for i in 1:length(x)
        @inbounds y[i] = y[i]+a*x[i];
    end
end

LLVM will not auto-vectorize it because it cannot compute a trip count. Now change 1:length(x) to (1:length(x))+0. Then (with the current PR) the example does vectorize!

The root issue is that the documented way Julia lowers for loops works just fine for the vectorizer, but there is an undocumented optimization that gets in the way. If a loop has the form for i in a:b, then it is custom-lowered differently. (See 'for in src/julia-syntax.scm.) The custom lowering likely helps compilation time by short-cutting a lot of analysis and transformation. Regrettably, it puts the loop in a form where LLVM cannot compute a trip count. Here's a sketch of the form (I'm abstracting out some details):

i = a
while i<=b 
    ...
    i = i+1

Assume a and b are of type Int. LLVM cannot compute a trip count because the loop is an infinite loop if b == typemax(Int). The "no signed wrap" flag (see #3929) would enable LLVM to rule out this possibility. So I think we should consider one of two changes to the short-cut lowering of for loops:

  • Somehow set the "no signed wrap" flag on the right add instruction, by using an intrinsic per the suggestion of @simonster.
  • Change the lowering to:
i = a
while i<b+1
    ...
    i = i+1

I think an annotation such as @simd is essential for trickier cases where run-time memory disambiguation is impractical. But I think we should consider whether the "short cut" lowering of for loops should be more friendly to auto-vectorization.
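
For example, here is a sketch of such a trickier case: the stores and the gathered loads go through an index vector, so a cheap run-time overlap test isn't possible, and any vectorization has to rely on the programmer's @simd assertion that the iterations are independent (the function name is only illustrative):

function gathered_axpy!(y, a, x, idx)
    @simd for i = 1:length(idx)
        @inbounds y[i] = y[i] + a*x[idx[i]]
    end
    y
end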

Comments?

@simonster
Member

This seems kind of like a bug in the current lowering, since for i = typemax(Int):typemax(Int); end should probably not be an infinite loop. Changing the lowering to i < b+1 would cause a loop ending in typemax(Int) not to be executed at all, which is still not quite right (although if the current behavior is acceptable, this seems equally acceptable). If we care about handling loops ending in typemax(Int), it seems like we could lower to:

if b >= a
  i = a
  while i != b+1
      ...
      i = i+1
  end
end

Can LLVM compute a trip count in that case?

@JeffBezanson
Member

Wow, it is quite satisfying that the shortcut hack works worse than the general case :)
This is indeed a bug.

It looks to me like @simonster's solution is the only one that will handle the full range. However, the Range type used in the general case can only have up to typemax(Int) elements. The special-case lowering could mimic that:

n = b-a+1
# error if range too big
c = 0
while c < n
    i = c+a
    ...
    c = c+1
end

@StefanKarpinski
Member

If you move the check to the end of the loop, then the fact that it's typemax doesn't matter:

i = a - 1
goto check
while true
    # body
    label check
    i < b || break
    i += 1
end

Edit: fix starting value.

@StefanKarpinski
Member

If you're willing to have an additional branch, then you can avoid the subtraction at the beginning.

@ArchRobison
Contributor Author

Is the short-cut expected to be semantically equivalent to the long path? E.g., how finicky should we be about which signatures are expected for the types of the bounds? If I understand correctly, the lowering at this point happens before type inference. Do we have any measurements on what the short-cut is buying in terms of JIT+execution time or code space? I'm wondering whether the short-cut could be removed and whatever savings it provided could be made up somewhere else in the compilation chain.

Here are some tricky examples to consider in proposing shortcuts/semantics:

for i=0.0:.1:.25   # Fun with floating-point round.  Tripcount should be 3.
       println(i)
end
for j=typemin(Int):typemin(Int)+1   # Tripcount should be 2.
       println(j)
end
for k=typemax(Int)-1:typemax(Int) # Tripcount should be 2
       println(k)
end

All of these deliver the correct (or at least obvious :-)) results with the long path, but may go astray with some shortcut solutions.

Besides user expectations, something else to consider is the path through the rest of the compilation chain. I suspect that the loop optimizations will almost invariably transform a test-at-top loop into a test-at-bottom loop wrapped in a zero-trip guard, i.e. something like this:

if (loop-test) {
      loop-preheader (compute loop invariants, initialize induction variables)
      do {
          loop body
      } while(loop-test);
}

So if we lower a loop into this form in the first place for semantic reasons, we're probably not creating any extra code bloat since the compiler was going to do it anyway.

@StefanKarpinski
Member

Maybe we should remove the special-case handling altogether? At this point, with range objects being immutable types and the compiler being quite smart about such things, I suspect the special case may no longer be necessary. It originally was very necessary because neither of those things was true.

@simonster
Member

Without special lowering, we have to make a function call to colon, which has to call the Range1 constructor. This appears to have noticeable overhead if the time to execute the loop is short. Consider:

function f(A)
    c = 0.0
    for i = 1:10000000
        for j = 1:length(A)
            @inbounds c += A[j]
        end
    end
    c
end

function g(A)
    c = 0.0
    for i = 1:10000000
        rg = 1:length(A)
        for j = rg
            @inbounds c += A[j]
        end
    end
    c
end

The only difference here should be that f(A) gets the special lowering whereas g(A) does not. For A = rand(5), after compilation, f(A) is consistently almost twice as fast:

julia> @time f(A);
elapsed time: 0.03747795 seconds (64 bytes allocated)

julia> @time f(A);
elapsed time: 0.037112331 seconds (64 bytes allocated)

julia> @time g(A);
elapsed time: 0.066732369 seconds (64 bytes allocated)

julia> @time g(A);
elapsed time: 0.066190191 seconds (64 bytes allocated)

If A = rand(100), the difference is almost non-existent, but I don't think we should deoptimize small loops. OTOH, if we could fully inline colon and the optimizer can elide the non-negative length check for Range1 construction, maybe this would generate the same code as @JeffBezanson's proposal.

@JeffBezanson
Member

Getting rid of the special case would be great. I'll explore what extra inlining might get us here.

@JeffBezanson
Member

LLVM seems to generate far more compact code with these definitions:

start(r::Range1) = r.start
next{T}(r::Range1{T}, i) = (i, oftype(T, i+1))
done(r::Range1, i) = i==(r.start+r.len)

With that plus full inlining I think we will be ok without the special case. Just need to make sure it can still vectorize the result.

@JeffBezanson
Member

Another idea: use the Range1 type only for integers, and have it store start and stop instead of length. That way the start and stop values can simply be accepted with no checks, and the length method can throw an overflow error if the length can't be represented as an Int. The reason for this is that computing the length is the hard part, and you often don't need it.

Otherwise we are faced with the following:

  1. Check stop<start, set length to 0 if so
  2. Compute checked_add(checked_sub(stop,start),1) to check for over-long ranges
  3. Call Range1 constructor, which must check length<0 in case somebody calls the constructor directly

So there are 3 layers of checks, the third of which is redundant when called from colon. We could have a hidden unsafe constructor that elides check (3), for use by colon, but that's kind of a hack and only addresses a small piece.

More broadly, it looks like a mistake to try to use the same type for integer and floating-point ranges. Floats need the start/step/length representation, but to the extent you want to write start:step:stop with integers, you're better off keeping those same three numbers since you get more range.
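
A minimal sketch of the integer-only idea (the type and helper names here are hypothetical, not necessarily what would land): the endpoints are stored as given, and only the length computation does checked arithmetic, per the three-layer-check discussion above.

# Sketch only: an integer range that stores its endpoints unchecked.
immutable IntRange1{T<:Integer}
    start::T
    stop::T
end

# Computing the length is the only place overflow needs to be checked.
function Base.length(r::IntRange1)
    r.stop < r.start && return 0
    checked_add(checked_sub(int(r.stop), int(r.start)), 1)   # throws if the length overflows
end

# Iteration never needs the length at all.
Base.start(r::IntRange1)   = r.start
Base.done(r::IntRange1, i) = i == r.stop + one(i)   # a typemax(T) endpoint still needs the care discussed above
Base.next(r::IntRange1, i) = (i, i + one(i))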

@ArchRobison
Contributor Author

I verified that the auto-vectorizer can vectorize this example, which I believe is equivalent to the code after "full inlining" of @JeffBezanson's changes to Range1.

function saxpy( a, x, y )
    r = 1:length(x)
    s = r.start
    while !(s==(r.start+r.len))
        (i,s) = (s,oftype(Int,s+1))
        @inbounds y[i] = y[i]+a*x[i];
    end
end

@ArchRobison
Contributor Author

By the way, it's probably good to limit the shortcut to integer loops, or at least avoid any schemes that rely on floating-point induction variables. Otherwise round-off can cause surprises. Here's a surprise with the current Julia:

a=2.0^53
b=a+2
r = a:b
for i in r        # Performs 3 iterations as expected
    println(i)
end
for i in a:b      # Infinite loop
    println(i)
end

@JeffBezanson
Member

Clearly we need to just remove the special case. That will be a great change.

JeffBezanson added a commit that referenced this pull request Jan 17, 2014
this fixes some edge-case loops that the special lowering did not handle correctly.

colon() now checks for overflow in computing the length, which avoids some buggy Range1s that used to be possible.

this required some changes to make sure Range1 is fast enough: specialized start, done, next, and a hack to avoid one of the checks and allow better inlining.

in general performance is about the same, but a few cases are actually faster, since Range1 is now faster (comprehensions used Range1 instead of the special-case lowering, for example). also, more loops should be vectorizable when the appropriate LLVM passes are enabled. all that plus better correctness and a simpler front-end, and I'm sold.

@StefanKarpinski
Member

More broadly, it looks like a mistake to try to use the same type for integer and floating-point ranges. Floats need the start/step/length representation, but to the extent you want to write start:step:stop with integers, you're better off keeping those same three numbers since you get more range.

This seems quite sensible. I believe this actually addresses things like ranges of Char, BigInt, and other non-traditional types that you might want ranges of. There was another example recently, which I don't recall.

@ArchRobison
Contributor Author

Where in the manual should I document @simd? It's fundamentally about relaxing control flow, so doc/manual/control-flow.rst is a logical place. However, @simd is a bit esoteric and might be a distraction there. It's different from the parallel programming model, so doc/manual/parallel-computing.rst doesn't seem like the right place. Should I give @simd its own section in the manual?

@ivarne
Member

ivarne commented Jan 22, 2014

I would expect to find something like @inbounds and @simd in a performance chapter. They are both about making the user do something that ideally would be the compiler's job.

How about performance-tips.rst?

@jiahao
Member

jiahao commented Jan 22, 2014

I like the idea of a new "performance tweaks" chapter

@simonster
Member

If we're still planning to implement #2299, I suspect we'll eventually need a whole chapter just for SIMD.

@tknopp
Contributor

tknopp commented Jan 22, 2014

@simonster Hopefully not. The autovectorizer of LLVM is pretty good, and I doubt that hand-written SIMD code is always faster. In my experience, writing a simple matrix-vector multiplication in C with autovectorization is as fast as the SIMD-optimized Eigen routines (I was using gcc when I tested this).

@lindahua
Contributor

I agree that when this lands, #2299 might be less urgent than before. Still, there are plenty of cases where explicit use of SIMD instructions is desired.

The latest advances in compiler technology have made compilers more intelligent, and they are now able to detect and vectorize simple loops (e.g. mapping and simple reduction, or sometimes matrix multiplication patterns).

However, they are still not smart enough to automatically vectorize more complex computations: for example, image filtering, small matrix algebra (where an entire matrix can fit in a small number of AVX registers, and one can finish an 8x8 matrix multiplication in less than 100 CPU cycles using carefully crafted SIMD code), as well as transcendental functions, etc.

@jakebolewski
Member

@ArchRobison that article was fantastic!

@vchuravy
Member

Recently there has been work on enabling interleaved memory accesses [1] in LLVM. I am wondering how best to use this in combination with the SIMD work.

[1] http://reviews.llvm.org/rL239291

@ArchRobison
Contributor Author

I see the feature is off by default. Maybe we could enable it with -O? My initial take is that the poster child for vectorizing interleaved memory accesses is complex arithmetic, but that typically involves complex multiplications, which will require more work in LLVM to vectorize.
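
For concreteness, here is a sketch of the interleaved-access pattern in question: complex values stored as adjacent re/im pairs, with a complex multiply-accumulate over them (the function name is illustrative only):

function cmuladd!(y, ar, ai, x)
    # x and y hold interleaved complex numbers: re, im, re, im, ...
    @inbounds for i = 1:2:length(x)-1
        xr, xi = x[i], x[i+1]
        y[i]   += ar*xr - ai*xi
        y[i+1] += ar*xi + ai*xr
    end
    y
end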

@jackmott

I would like to add a vote for some method of doing SIMD by hand, whether as part of a standard library or as a language feature. Probably 90% of the potential benefit of SIMD is not going to be realized with automatic vectorization, and compilers aren't going to bridge that gap significantly, ever. Consider, for example, implementations of common noise functions like Perlin noise. These involve dozens of steps, a few branches, and lookup tables: things the compilers won't be figuring out in my lifetime. My hand-written SIMD achieved a 3-5x speedup (128- vs 256-bit-wide varieties) over what the latest compilers manage to do automatically, and I am a complete novice. There is a whole universe of applications (games, image processing, video streaming, video editing, physics and number theory research) where programmers are forced to drop down to C or accept code that is 3x to 10x slower than it needs to be. With 512-bit-wide SIMD coming into the market, it is too powerful to ignore, and adding good support for SIMD immediately differentiates your language from the other new languages out there, which mostly ignore SIMD.

@iamed2
Contributor

iamed2 commented Jan 19, 2016

@jackmott You may be able to manually vectorize using llvmcall, but that would require knowledge of LLVM IR
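
For instance, a minimal sketch of what that looks like (llvmcall is an internal, unexported interface, and the exact IR it accepts depends on the Julia/LLVM version):

# Add two Int32 values with hand-written LLVM IR. The arguments arrive as
# %0 and %1; the implicit entry block takes %2, so the first result is %3.
add32(x::Int32, y::Int32) = Base.llvmcall("""
    %3 = add i32 %0, %1
    ret i32 %3""",
    Int32, Tuple{Int32,Int32}, x, y)

add32(Int32(2), Int32(3))   # => 5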

@eschnett
Contributor

I've been wanting to write a small library based on NTuple and llvmcall for some time...

@JeffBezanson
Member

That would be awesome. Would be great to have simd types and operations within easy reach.

@JeffBezanson
Member

We could reopen #2299

@eschnett
Contributor

Here we go:

julia> workspace(); using SIMD; code_native(sqrt, (Vec{4,Float64},))
    .section    __TEXT,__text,regular,pure_instructions
Filename: SIMD.jl
Source line: 0
    pushq   %rbp
    movq    %rsp, %rbp
Source line: 186
    vsqrtpd (%rsi), %ymm0
    vextractf128    $1, %ymm0, %xmm1
Source line: 5
    vmovhpd %xmm1, 24(%rdi)
    vmovlpd %xmm1, 16(%rdi)
    vmovhpd %xmm0, 8(%rdi)
    vmovlpd %xmm0, (%rdi)
    movq    %rdi, %rax
    popq    %rbp
    vzeroupper
    retq

This is with Julia master, using LLVM 3.7.1. LLVM seems to be a bit confused about how to store an array to memory, leading to the ugly vmov sequence at the end, but the actual vectorization works like a charm. See https://github.com/eschnett/SIMD.jl for the proof of concept.
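
For reference, constructing and using such a vector looks roughly like this (a sketch against the proof-of-concept package; the API may change):

using SIMD
v = Vec{4,Float64}((1.0, 2.0, 3.0, 4.0))   # pack four Float64s into one SIMD value
w = sqrt(v)                                 # the method whose native code is shown above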

@vchuravy
Member

@eschnett I assume I am too quick, but SIMD.jl is still empty ;)

@eschnett
Contributor

Thank you; I forgot to push after adding the code.

@eschnett
Contributor

@JeffBezanson I notice that Julia tuples are mapped to LLVM arrays, not LLVM vectors. To generate SIMD instructions, one has to convert via a series of extractvalue and insertelement instructions. Unfortunately, it turns out that LLVM (3.7, x86-64) is not good at optimizing these, sometimes leading to cumbersome generated code that breaks vectors into scalars and re-assembles them.

Is there a chance to represent tuples as LLVM vectors instead?

I'm currently representing SIMD types as a bitstype in Julia, since these can be efficiently bitcast to LLVM vector types. That leads to efficient code, but is more complex on the Julia side.

@ArchRobison
Contributor Author

I'm on sabbatical (four more days!) and largely ignoring email and GitHub. But apropos of this issue, I have an extant LLVM patch that fixes the "cumbersome code" problem that Erik observed. The patch was developed after I discovered from experience that mapping tuples to LLVM vectors was not going to work well.


@yuyichao
Contributor

It would be nice if we had a standardized type for LLVM vectors, since they might be necessary to ccall some vector math libraries.

@eschnett
Contributor

@ArchRobison I'm looking forward to trying your patch.

For the record, this is how a simple loop (summing an array of Float64) currently looks:

L224:
    vmovq   %rdx, %xmm0
    vmovq   %rbx, %xmm1
    vunpcklpd   %xmm1, %xmm0, %xmm0 ## xmm0 = xmm0[0],xmm1[0]
    vmovq   %rdi, %xmm1
    vmovq   %rsi, %xmm2
    vunpcklpd   %xmm2, %xmm1, %xmm1 ## xmm1 = xmm1[0],xmm2[0]
    vinsertf128 $1, %xmm1, %ymm0, %ymm0
    vaddpd  (%rcx), %ymm0, %ymm0
    vextractf128    $1, %ymm0, %xmm1
    vpextrq $1, %xmm1, %rsi
    vmovq   %xmm1, %rdi
    vpextrq $1, %xmm0, %rbx
    vmovq   %xmm0, %rdx
    addq    $32, %rcx
    addq    $-4, %rax
    jne L224

Only the add instructions are real; the move, extract, unpack, and insert instructions are strictly redundant.

@JeffBezanson
Member

I recall some problems in mapping tuples to vectors, very likely involving alignment, calling convention, and/or bugs in LLVM. It's clear that only a small subset of tuple types can potentially be vector types, so there's ambiguity about whether a given tuple will be a struct or vector or array, which can cause subtle bugs interoperating with native code.

@ArchRobison
Contributor Author

I had the mapping from tuples to vectors working, with all the fixes for alignment. It was messily context-sensitive. But that wasn't the show-stopper. What killed it was that it hurt performance as often as it helped. My conclusion was that the mapping to vectors needs to happen much later in the compilation pipeline, when LLVM can be sure it will likely pay off. On Monday, when I return to the office after my 9-week absence, I'll track down the status of the review of my LLVM patch. (Its context is probably bit-rotted by now.)

- Arch


@eschnett
Contributor

@ArchRobison Did you have time to look for the patch?

@ArchRobison
Contributor Author

Yes, and I updated it this morning per suggestions from LLVM reviewers while I was out. The patch has two parts:

http://reviews.llvm.org/D14185
http://reviews.llvm.org/D14260

@eschnett
Contributor

@ArchRobison I'm currently generating SIMD code like this:

julia> @code_llvm Vec{2,Float64}(1) + Vec{2,Float64}(2)

define void @"julia_+_23864.1"(%Vec.12* sret, %Vec.12*, %Vec.12*) #0 {
top:
  %3 = getelementptr inbounds %Vec.12, %Vec.12* %1, i64 0, i32 0
  %4 = load [2 x double], [2 x double]* %3, align 8
  %5 = getelementptr inbounds %Vec.12, %Vec.12* %2, i64 0, i32 0
  %6 = load [2 x double], [2 x double]* %5, align 8
  %arg1arr_0.i = extractvalue [2 x double] %4, 0
  %arg1_0.i = insertelement <2 x double> undef, double %arg1arr_0.i, i32 0
  %arg1arr_1.i = extractvalue [2 x double] %4, 1
  %arg1.i = insertelement <2 x double> %arg1_0.i, double %arg1arr_1.i, i32 1
  %arg2arr_0.i = extractvalue [2 x double] %6, 0
  %arg2_0.i = insertelement <2 x double> undef, double %arg2arr_0.i, i32 0
  %arg2arr_1.i = extractvalue [2 x double] %6, 1
  %arg2.i = insertelement <2 x double> %arg2_0.i, double %arg2arr_1.i, i32 1
  %res.i = fadd <2 x double> %arg1.i, %arg2.i
  %res_0.i = extractelement <2 x double> %res.i, i32 0
  %resarr_0.i = insertvalue [2 x double] undef, double %res_0.i, 0
  %res_1.i = extractelement <2 x double> %res.i, i32 1
  %resarr.i = insertvalue [2 x double] %resarr_0.i, double %res_1.i, 1
  %7 = getelementptr inbounds %Vec.12, %Vec.12* %0, i64 0, i32 0
  store [2 x double] %resarr.i, [2 x double]* %7, align 8
  ret void
}

That is:

  • a sequence of extractvalue/insertelement to convert the Julia tuple/LLVM array to an LLVM vector
  • a single LLVM vector operation (here: add)
  • a sequence of extractelement/insertvalue to convert the LLVM vector back to an LLVM array/Julia tuple

With your patches, would this still be a good way to proceed?
Or should this be a sequence of scalar operations instead, omitting the insert-/extractelement statements?

@ArchRobison
Contributor Author

Yes and no. The patch http://reviews.llvm.org/D14260 deals with optimizing the store. I ran your example through it (using %Vec.12 = type { [2 x double] }), and the store was indeed optimized to:

  %res.i = fadd <2 x double> %arg1.i, %arg2.i
  %7 = bitcast %Vec.12* %0 to <2 x double>*
  store <2 x double> %res.i, <2 x double>* %7, align 8
  ret void

But the load sequence was not optimized. The problem is that http://reviews.llvm.org/D14185 targets the situation where the tuple code is still fully scalar LLVM IR (such as this example from the unit tests), not partially vectorized code as in your example. For what you are doing, is it practical to generate fully scalar LLVM IR? Or do we need to consider adding another instruction-combining transform to LLVM?

@eschnett
Contributor

Yes, emitting scalar operations would be straightforward to do.

In the past -- with much older versions of LLVM, and/or with GCC -- it was important to emit arithmetic operations as vector operations, since they would otherwise not be synthesized. It seems newer versions of LLVM are much better at this, so this might be the way to go.

@eschnett
Contributor

Yay! Success!

@ArchRobison Your patch D14260, applied to LLVM 3.7.1, with Julia's master branch and my LLVM-vector version of SIMD, is generating proper SIMD vector instructions without the nonsensical scalarization.

Here are two examples of generated AVX2 code (with bounds checking disabled; keeping it enabled still vectorizes the code, but has two additional branches at every loop iteration):

Adding two arrays:

L176:
    movq    (%r15), %rdx
Source line: 766
    vmovupd (%rcx,%rdx), %ymm0
Source line: 458
    movq    (%rbx), %rsi
Source line: 419
    vaddpd  (%rcx,%rsi), %ymm0, %ymm0
Source line: 803
    vmovupd %ymm0, (%rcx,%rdx)
    movq    %r14, -64(%rbp)
Source line: 62
    addq    $32, %rcx
    addq    $-4, %rax
    jne L176

Calculating the sum of an array:

L128:
    vaddpd  (%rcx), %ymm0, %ymm0
Source line: 55
    addq    $32, %rcx
    addq    $-4, %rax
    jne L128

Accessing the array elements in the first kernel is still too complicated. I assume that LLVM needs to be told that the two arrays don't overlap with the array descriptors. Also, some loop unrolling is called for.

Thanks a million!

@ArchRobison
Contributor Author

Good to hear it worked. Was that just D14260, or D14260 and D14185 (http://reviews.llvm.org/D14185)? (Logically the two diffs belong together, but LLVM review formalities caused the split.)

@eschnett
Contributor

This was only D14260. D14185 didn't apply, so I tried without it, and it worked.

eschnett added a commit to eschnett/julia that referenced this pull request Feb 7, 2016
Arch Robison proposed the patch <http://reviews.llvm.org/D14260> "Optimize store of "bitcast" from vector to aggregate" for LLVM. This patch applies cleanly to LLVM 3.7.1. It seems to be the last missing puzzle piece on the LLVM side to allow generating efficient SIMD instructions via `llvmcall` in Julia. For an example package, see e.g. <https://github.com/eschnett/SIMD.jl>.

Some discussion relevant to this PR is in JuliaLang#5355. @ArchRobison, please comment.

Julia stores tuples as LLVM arrays, whereas LLVM SIMD instructions require LLVM vectors. The respective conversions are unfortunately not always optimized out unless the patch above is applied, leading to a cumbersome sequence of instructions to disassemble and reassemble a SIMD vector. An example is given here <eschnett/SIMD.jl#1 (comment)>.

Without this patch, the loop kernel looks like (x86-64, AVX2 instructions):

```
    vunpcklpd   %xmm4, %xmm3, %xmm3 # xmm3 = xmm3[0],xmm4[0]
    vunpcklpd   %xmm2, %xmm1, %xmm1 # xmm1 = xmm1[0],xmm2[0]
    vinsertf128 $1, %xmm3, %ymm1, %ymm1
    vmovupd 8(%rcx), %xmm2
    vinsertf128 $1, 24(%rcx), %ymm2, %ymm2
    vaddpd  %ymm2, %ymm1, %ymm1
    vpermilpd   $1, %xmm1, %xmm2 # xmm2 = xmm1[1,0]
    vextractf128    $1, %ymm1, %xmm3
    vpermilpd   $1, %xmm3, %xmm4 # xmm4 = xmm3[1,0]
Source line: 62
    vaddsd  (%rcx), %xmm0, %xmm0
```

Note that the SIMD vector is kept in register `%ymm1`, but is unnecessarily scalarized into registers `%xmm{0,1,2,3}` at the end of the kernel, and re-assembled in the beginning.

With this patch, the loop kernel looks like:

```
L192:
	vaddpd	(%rdx), %ymm1, %ymm1
Source line: 62
	addq	%rsi, %rdx
	addq	%rcx, %rdi
	jne	L192
```

which is perfect.
eschnett added commits to eschnett/julia that referenced this pull request Feb 7 and Feb 8, 2016, with the same commit message as above.