Add support for @simd #5355
Conversation
Amazing!

😺

💯

Amazing, I look forward to reading this in detail. Even just the TBAA part is great to have.
```diff
@@ -175,6 +175,9 @@ using .I18n
 using .Help
 push!(I18n.CALLBACKS, Help.clear_cache)
+
+# SIMD loops
+include("simdloop.jl")
```
I think you might need an `importall .SimdLoop` here?
Thanks! Now added.
Eagerly looking forward to this.

Likewise. Waiting for this to land.
One feature of the pull request is that it enables auto-vectorization of some loops without `@simd`. LLVM will not auto-vectorize a loop when it cannot compute a trip count. The root issue is that the documented way that Julia lowers `for` loops works just fine for the vectorizer, but there is an undocumented optimization that gets in the way when a loop has the form `for i = a:b`.

Assume `a` and `b` are of type `Int`. LLVM cannot compute a trip count because the loop is an infinite loop if `b == typemax(Int)`. The "no signed wrap" flag (see #3929) would enable LLVM to rule out this possibility. So I think we should consider one of two changes to the short-cut lowering of `for` loops.

I think an annotation such as `@simd` is essential for trickier cases where run-time memory disambiguation is impractical. But I think we should consider whether the "short cut" lowering of `for` loops should change. Comments?
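As an illustration (my sketch, not the actual lowering code), the short-cut lowering of `for i = a:b` described above behaves roughly like this:

```julia
# Rough sketch of the short-cut lowering of `for i = a:b`:
i = a
while i <= b
    # loop body goes here
    i += 1   # when b == typemax(Int), i wraps around instead of
end          # exceeding b, so `i <= b` never becomes false: the
             # loop is infinite and LLVM cannot compute a trip count
```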
This seems kind of like a bug in the current lowering, since we could instead lower to:

```julia
if b >= a
    i = a
    while i != b+1
        # ...
        i = i+1
    end
end
```

Can LLVM compute a trip count in that case?
Wow, it is quite satisfying that the shortcut hack works worse than the general case :) It looks to me like @simonster's solution is the only one that will handle the full range. However, the …
If you move the check to the end of the loop, then the fact that it's typemax doesn't matter:

```julia
# (pseudocode -- goto/label written out for clarity)
i = a - 1
goto check
while true
    # body
    label check
    i < b || break
    i += 1
end
```

Edit: fix starting value.
If you're willing to have an additional branch, then you can avoid the subtraction at the beginning.
Is the short-cut expected to be semantically equivalent to the long path? E.g., how finicky should we be about which signatures are expected for the types of the bounds? If I understand correctly, the lowering at this point happens before type inference. Do we have any measurements on what the short-cut is buying in terms of JIT+execution time or code space? I wonder whether the short-cut could be removed and whatever savings it provided could be made up somewhere else in the compilation chain. Here are some tricky examples to consider in proposing shortcuts/semantics:

All of these deliver the correct (or at least obvious :-)) results with the long path, but may go astray with some shortcut solutions. Besides user expectations, something else to consider is the path through the rest of the compilation chain. I suspect that the loop optimizations will almost invariably transform a test-at-top loop into a test-at-bottom loop wrapped in a zero-trip guard.

So if we lower a loop into this form in the first place for semantic reasons, we're probably not creating any extra code bloat, since the compiler was going to do it anyway.
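A hedged sketch of the "test-at-bottom loop wrapped in a zero-trip guard" form being described, for a loop over `a:b` (illustrative only, not the actual compiler output):

```julia
# Zero-trip guard: skip the loop entirely when the range is empty.
if a <= b
    i = a
    while true
        # loop body goes here
        i == b && break  # test at the bottom: exit before incrementing,
        i += 1           # so i never overflows, even when b == typemax(Int)
    end
end
```

With this shape the exit test is exact equality against `b`, so the loop terminates for every value of `b`, and LLVM can derive a trip count without worrying about wrap-around.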
Maybe we should remove the special-case handling altogether? At this point, with range objects being immutable types and the compiler being quite smart about such things, I suspect the special case may no longer be necessary. It originally was very necessary because neither of those things was true.
Without special lowering, we have to make a function call to construct the range object. This can matter in an inner loop:

```julia
function f(A)
    c = 0.0
    for i = 1:10000000
        for j = 1:length(A)
            @inbounds c += A[j]
        end
    end
    c
end

function g(A)
    c = 0.0
    for i = 1:10000000
        rg = 1:length(A)
        for j = rg
            @inbounds c += A[j]
        end
    end
    c
end
```

The only difference here should be that `g` actually constructs the range object:

```julia
julia> @time f(A);
elapsed time: 0.03747795 seconds (64 bytes allocated)

julia> @time f(A);
elapsed time: 0.037112331 seconds (64 bytes allocated)

julia> @time g(A);
elapsed time: 0.066732369 seconds (64 bytes allocated)

julia> @time g(A);
elapsed time: 0.066190191 seconds (64 bytes allocated)
```

If …
Getting rid of the special case would be great. I'll explore what extra inlining might get us here.
LLVM seems to generate far more compact code with these definitions:
With that plus full inlining I think we will be ok without the special case. Just need to make sure it can still vectorize the result.
Another idea: use the … Otherwise we are faced with the following:

So there are 3 layers of checks, the third of which is redundant when called from … More broadly, it looks like a mistake to try to use the same type for integer and floating-point ranges. Floats need the start/step/length representation, but to the extent you want to write …
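For illustration, an overflow-checked length computation in the spirit of the `colon` check discussed in this thread might look like the following. This is a hedged sketch: the function name is made up, and Base implements the check differently.

```julia
# Compute length(a:b) for Ints, guarding against overflow by
# doing the arithmetic in a wider integer type (Int128).
function unitrange_length_checked(a::Int, b::Int)
    b < a && return 0                # empty range
    len = Int128(b) - Int128(a) + 1  # exact in Int128, cannot wrap
    len > typemax(Int) && throw(OverflowError("range length does not fit in Int"))
    return Int(len)
end
```

The wide-arithmetic trick avoids the subtle case `b - a` overflowing `Int` when `a` is very negative and `b` very positive.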
I verified that the auto-vectorizer can vectorize this example, which I believe is equivalent to the code after "full inlining" of @JeffBezanson's changes to …
By the way, it's probably good to limit the shortcut to integer loops, or at least avoid any schemes that rely on floating-point induction variables. Otherwise round-off can cause surprises. Here's a surprise with the current Julia:
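The example itself is collapsed in the thread, but the kind of round-off surprise being described can be reproduced with a sketch like this (hypothetical; exact behavior depends on the Julia version):

```julia
# A floating-point induction variable accumulates round-off:
# 0.1 is not exactly representable in binary floating point, so a
# naive `x += 0.1` lowering drifts away from the intended sequence.
x = 0.1
n = 0
while x <= 0.3      # one might expect 3 iterations (0.1, 0.2, 0.3)
    n += 1
    x += 0.1        # after two additions, x == 0.30000000000000004 > 0.3
end
```

Here the naive accumulating lowering runs the body only twice, whereas counting iterations with an integer and deriving `x` from the count can give the expected three.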
Clearly we need to just remove the special case. That will be a great change.
this fixes some edge-case loops that the special lowering did not handle correctly. colon() now checks for overflow in computing the length, which avoids some buggy Range1s that used to be possible. this required some changes to make sure Range1 is fast enough: specialized start, done, next, and a hack to avoid one of the checks and allow better inlining. in general performance is about the same, but a few cases are actually faster, since Range1 is now faster (comprehensions used Range1 instead of the special-case lowering, for example). also, more loops should be vectorizable when the appropriate LLVM passes are enabled. all that plus better correctness and a simpler front-end, and I'm sold.
This seems quite sensible. I believe this actually addresses things like ranges of …
Where in the manual should I document `@simd`?
I would expect to find something like … How about …
I like the idea of a new "performance tweaks" chapter.
If we're still planning to implement #2299, I suspect we'll eventually need a whole chapter just for SIMD.
@simonster Hopefully not. The autovectorizer of LLVM is pretty good, and I doubt that hand-written SIMD code is always faster. In my experience, writing a simple matrix-vector multiplication in C with auto-vectorization is as fast as the SIMD-optimized Eigen routines (I was using gcc when I tested this).
I agree that when this lands, #2299 might be less urgent than before. Still, there are plenty of cases where explicit use of SIMD instructions is desired. The latest advances in compiler technology make compilers more intelligent, and they are now able to detect and vectorize simple loops (e.g. mapping and simple reduction, or sometimes matrix multiplication patterns). However, they are still not smart enough to automatically vectorize more complex computations: for example, image filtering, small matrix algebra (where an entire matrix can fit in a small number of AVX registers, and one can finish an 8x8 matrix multiplication in less than 100 CPU cycles using carefully crafted SIMD code), as well as transcendental functions, etc.
@ArchRobison that article was fantastic!
Recently there has been work on enabling interleaved memory accesses [1] in LLVM. I am wondering how best to use this in combination with the SIMD work.
I see the feature is off by default. Maybe we could enable it with …
I would like to add a vote for some method of doing SIMD by hand, whether as part of a standard library or as a language feature. Probably 90% of the potential benefit of SIMD is not going to be realized with automatic vectorization, and compilers aren't ever going to bridge that gap significantly. Consider, for example, the implementation of common noise functions like Perlin noise. These involve dozens of steps, a few branches, and lookup tables — things the compilers won't be figuring out in my lifetime. My hand-written SIMD achieved a 3-5x speedup (128- vs 256-bit-wide varieties) over what the latest compilers manage to do automatically, and I am a complete novice. There is a whole universe of applications — games, image processing, video streaming, video editing, physics and number theory research — where programmers are forced to drop down to C or accept code that is 3x-10x slower than it needs to be. With 512-bit-wide SIMD coming onto the market, it is too powerful to ignore, and adding good support for SIMD immediately differentiates your language from the other new languages out there, which mostly ignore SIMD.
I've been wanting to write a small library based on …
That would be awesome. It would be great to have SIMD types and operations within easy reach.
We could reopen #2299.
Here we go:
This is with Julia master, using LLVM 3.7.1. LLVM seems to be a bit confused about how to store an array to memory, leading to the ugly …
@eschnett I assume I am too quick, but SIMD.jl is still empty ;)
Thank you, forgot to push after adding the code.
@JeffBezanson I notice that Julia tuples are mapped to LLVM arrays, not LLVM vectors. To generate SIMD instructions, one has to convert via a series of … Is there a chance to represent tuples as LLVM vectors instead? I'm currently representing SIMD types as …
I'm on sabbatical (four more days!) and largely ignoring email and Github.
It would be nice to have a standardized type for LLVM vectors, since they might be necessary to (c)call some vector math libraries.
@ArchRobison I'm looking forward to trying your patch. For the record, this is how a simple loop (summing an array of …

Only the …
I recall some problems in mapping tuples to vectors, very likely involving alignment, calling convention, and/or bugs in LLVM. It's clear that only a small subset of tuple types can potentially be vector types, so there's ambiguity about whether a given tuple will be a struct or vector or array, which can cause subtle bugs interoperating with native code.
I had the mapping from tuples to vectors working, with all the fixes for
@ArchRobison Did you have time to look for the patch?
Yes, and I updated it this morning per suggestions from LLVM reviewers while I was out. The patch has two parts: http://reviews.llvm.org/D14185 and http://reviews.llvm.org/D14260
@ArchRobison I'm currently generating SIMD code like this:
That is:
With your patches, would this still be a good way to proceed? |
Yes and no. The patch http://reviews.llvm.org/D14260 deals with optimizing the store. I ran your example through (using …

But the load sequence was not optimized. The problem is that …
Yes, emitting scalar operations would be straightforward to do. In the past — with much older versions of LLVM, and/or with GCC — it was important to emit arithmetic operations as vector operations, since they would otherwise not be synthesized. Newer versions of LLVM seem much better at this, so this might be the way to go.
Yay! Success! @ArchRobison Your patch D14260, applied to LLVM 3.7.1, with Julia's master branch and my LLVM-vector version of SIMD, is generating proper SIMD vector instructions without the nonsensical scalarization. Here are two examples of generated AVX2 code (with bounds checking disabled; keeping it enabled still vectorizes the code, but has two additional branches at every loop iteration): Adding two arrays:
Calculating the sum of an array:
Accessing the array elements in the first kernel is still too complicated. I assume that LLVM needs to be told that the two arrays don't overlap with the array descriptors. Also, some loop unrolling is called for. Thanks a million!
Good to hear it worked. Was that just D14260, or D14260 and D14185 (http://reviews.llvm.org/D14185)? (Logically the two diffs belong together, but LLVM review formalities caused the split.)
This was only D14260. D14185 didn't apply, so I tried without it, and it worked.
Arch Robison proposed the patch <http://reviews.llvm.org/D14260> "Optimize store of "bitcast" from vector to aggregate" for LLVM. This patch applies cleanly to LLVM 3.7.1. It seems to be the last missing puzzle piece on the LLVM side to allow generating efficient SIMD instructions via `llvm_call` in Julia. For an example package, see e.g. <https://github.com/eschnett/SIMD.jl>. Some discussion relevant to this PR are in JuliaLang#5355. @ArchRobison, please comment. Julia stores tuples as LLVM arrays, whereas LLVM SIMD instructions require LLVM vectors. The respective conversions are unfortunately not always optimized out unless the patch above is applied, leading to a cumbersome sequence of instructions to disassemble and reassemble a SIMD vector. An example is given here <eschnett/SIMD.jl#1 (comment)>. Without this patch, the loop kernel looks like (x86-64, AVX2 instructions): ``` vunpcklpd %xmm4, %xmm3, %xmm3 # xmm3 = xmm3[0],xmm4[0] vunpcklpd %xmm2, %xmm1, %xmm1 # xmm1 = xmm1[0],xmm2[0] vinsertf128 $1, %xmm3, %ymm1, %ymm1 vmovupd 8(%rcx), %xmm2 vinsertf128 $1, 24(%rcx), %ymm2, %ymm2 vaddpd %ymm2, %ymm1, %ymm1 vpermilpd $1, %xmm1, %xmm2 # xmm2 = xmm1[1,0] vextractf128 $1, %ymm1, %xmm3 vpermilpd $1, %xmm3, %xmm4 # xmm4 = xmm3[1,0] Source line: 62 vaddsd (%rcx), %xmm0, %xmm0 ``` Note that the SIMD vector is kept in register `%ymm1`, but is unnecessarily scalarized into registers `%xmm{0,1,2,3}` at the end of the kernel, and re-assembled in the beginning. With this patch, the loop kernel looks like: ``` L192: vaddpd (%rdx), %ymm1, %ymm1 Source line: 62 addq %rsi, %rdx addq %rcx, %rdi jne L192 ``` which is perfect.
This pull request enables the LLVM loop vectorizer.

It's not quite ready for production; I'd like feedback and help fixing some issues. The overall design is explained in this comment on issue #4786, except that it no longer relies on the "banana interface" mentioned in that comment. There is an example it can vectorize when `a` is of type `Float32` and `x` and `y` are of type `Array{Float32,1}`. I've seen the vectorized version run 3x faster than the unvectorized version when the data fits in cache. When AVX can be enabled, the results are likely even better.

Programmers can put the `@simd` macro in front of one-dimensional `for` loops that have ranges of the form `m:n`, where the type of the loop index supports `<` and `+`. The decoration guarantees that the loop does not rely on wrap-around behavior and that the loop iterations are safe to execute in parallel, even if chunks are done in lockstep.

The patch implements type-based alias analysis, which may help LLVM optimize better in general, and is essential for vectorization. The name "type-based alias analysis" is a bit of a misnomer, since it's really based on hierarchically partitioning memory. I've implemented it for Julia assuming that type-punning is never done for parts of data structures that users cannot access directly, but that user data can be type-punned freely.

Problems that I seek advice on:

- The `@simd` macro is not found; currently I have to import it by hand within the REPL. I tried to copy the way `@printf` is defined/exported, but something is wrong with my patch. What?
- We could drop `src/llvm-simdloop.cpp` and instead rely on LLVM's auto-vectorization capability, which inserts memory dependence tests. That indeed works for the `saxpy` example above, i.e. it vectorizes without the support in `src/llvm-simdloop.cpp`. However, `@simd` would still be necessary to transform the loop into a form such that LLVM can compute a trip count.
- We could drop `@simd` altogether and instead somehow ensure that `m:n` is lowered to a form for which LLVM can compute a trip count.
- `base/simdloop.jl` could use a review by an expert.

Apologies for the useless comment:
I just noticed it, but being late on a Friday, I'll fix it later. It's supposed to say that one entry point is for marking simd loops and the other is for later lowering marked loops.
Thanks to @simonster for his information on enabling the loop vectorizer. It was a big help to get me going.
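For reference, a saxpy-style loop of the kind the pull request description mentions might be written like this. This is my sketch: the function name and signature are illustrative, not taken from the PR.

```julia
# `@simd` asserts that iterations are independent and free of
# wrap-around assumptions, letting LLVM vectorize the loop.
function axpy!(a::Float32, x::Array{Float32,1}, y::Array{Float32,1})
    @simd for i = 1:length(x)
        @inbounds y[i] += a * x[i]
    end
    return y
end
```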