-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Segfault when creating a Vec{7, Float64} #1
Comments
@KristofferC The package is certainly ready for issues. I'm just not sure how much I can do in this case. I can reproduce the problem locally. This seems to be a segfault in LLVM (although technically, it might also be Julia's fault). The input I'm giving to LLVM is a string, containing instructions in the LLVM language http://llvm.org/docs/LangRef.html. I would expect that LLVM either reports an error, or knows how to interpret the instruction. Here, you seem to have come across a bug in LLVM. Most CPU architectures support native vector widths that are a power of two. Thus, I don't know yet what a good work-around would be. Can you restrict your code to vector sizes that are powers of 2? What if you introduce "unused" values? Or maybe the SIMD package should do this. I would also like to submit a bug report to LLVM. For this I'm going to need the LLVM code that Julia actually passes to LLVM (which is different from what SIMD passes to Julia). I'll have to investigate how to get this. No, ArchRobinson's patch is not necessary to generate vectorized code. It does, however, improve code quality (and presumably performance) in certain cases. I plan to submit a PR to Julia with this patch. |
I don't know if it helps but just calling |
This is from GDB...
|
Just fyi, I don't get the example to vectorize on my julia installation. The generated LLVM is: https://gist.github.com/KristofferC/3561adc06fd328812293 Looks like one problem is the type stability of julia> @code_warntype vadd!(a, b)
Variables:
#self#::#vadd!
xs::Array{Float32,1}
ys::Array{Float32,1}
N::Int64
#s22::Int64
i::Int64
yv::Any
xv::Any |
The segfaults happens in LLVM, as it is translating the LLVM code to machine code. This points to a bug in LLVM. The function name In your statement Here is how you use julia> using SIMD
julia> function vadd!{N,T}(xs::Vector{T}, ys::Vector{T}, ::Type{Vec{N,T}})
@assert length(ys) == length(xs)
@assert length(xs) % N == 0
@inbounds for i in 1:N:length(xs)
xv = vload(Vec{N,T}, xs, i)
yv = vload(Vec{N,T}, ys, i)
xv += yv
vstore(xv, xs, i)
end
end
vadd! (generic function with 1 method)
julia> @code_llvm vadd!(Vector{Float64}(), Vector{Float64}(), Vec{4,Float64})
julia> @code_native vadd!(Vector{Float64}(), Vector{Float64}(), Vec{4,Float64}) In the output of
where the Similarly, in the output of
where the |
Sorry, my xs::Array{Float32,1}
ys::Array{Float32,1} ) I tried with I however only tried with the exact function from the README. I will try your version where you specify the Type of the |
I see. I have some local changes in my repository -- maybe they corrected an error that I didn't see. |
With the function you just posted I get the vectorized instructions. With the one in the README, I don't (because of some type instability indicated by |
If you are interesting, I did some battle testing with SIMD.jl: JuliaDiff/ForwardDiff.jl#98 (comment) The generated code contains packed instructions but is unfortunately a factor 2x slower than the old NTuple code. It might have to do with my implementations at JuliaDiff/ForwardDiff.jl@3fcd796#diff-8e9d15484d2c994e12d78a9a7c2c8201R122 It seems that multiplication of a julia> Vec{3,Float32}((1.0,2.0,2.0))*3
3-element Array{Float32,1}:
3.0
6.0
6.0 |
Thanks for pointing me to the README; I was looking at the test suite, which contains a similar function. Yes, the function in the README is too clever for Julia's type inference mechanism. I have corrected it. |
Mixed scalar/vector operations are not yet implemented. It is currently necessary to explicitly convert the scalar to a vector first. In your case, Julia apparently realizes that a I will add the missing type promotion rules. |
Also: Kudos for moving into benchmarking. I'm still at the "getting things to work" stage, albeit with masked vector load/store operations that are probably not relevant to you. |
I looked a bit closer at the generated code for the function: function simd_sum{T}(x::Vector{T})
s = zero(T)
@simd for i in eachindex(x)
@inbounds s = s + x[i]
end
return s
end The immutable Ttuple{N, T}
val::T
partials::NTuple{N, T}
end or immutable TVec{N, T}
val::T
partials::Vec{N, T}
end where +(a::TVec, b::Tvec) = TVec(a.val + b.val, a.partials + b.partials) I ran the code with
while the
I am not good at all at assembly but it seems to arrange the data into the AVX registers Unfortunately in this case it seems to be about 2 * slower than the |
@KristofferC The kind of code you show improved very much for me when using Arch Robison's second patch. Without the patch, the packing and unpacking you see makes vectorization useless. Are you using the patch? |
No, I am not. I will try with it! Thanks for being so responsive :) |
Arch Robison proposed the patch <http://reviews.llvm.org/D14260> "Optimize store of "bitcast" from vector to aggregate" for LLVM. This patch applies cleanly to LLVM 3.7.1. It seems to be the last missing puzzle piece on the LLVM side to allow generating efficient SIMD instructions via `llvm_call` in Julia. For an example package, see e.g. <https://github.com/eschnett/SIMD.jl>. Some discussion relevant to this PR are in JuliaLang#5355. @ArchRobison, please comment. Julia stores tuples as LLVM arrays, whereas LLVM SIMD instructions require LLVM vectors. The respective conversions are unfortunately not always optimized out unless the patch above is applied, leading to a cumbersome sequence of instructions to disassemble and reassemble a SIMD vector. An example is given here <eschnett/SIMD.jl#1 (comment)>. Without this patch, the loop kernel looks like (x86-64, AVX2 instructions): ``` vunpcklpd %xmm4, %xmm3, %xmm3 # xmm3 = xmm3[0],xmm4[0] vunpcklpd %xmm2, %xmm1, %xmm1 # xmm1 = xmm1[0],xmm2[0] vinsertf128 $1, %xmm3, %ymm1, %ymm1 vmovupd 8(%rcx), %xmm2 vinsertf128 $1, 24(%rcx), %ymm2, %ymm2 vaddpd %ymm2, %ymm1, %ymm1 vpermilpd $1, %xmm1, %xmm2 # xmm2 = xmm1[1,0] vextractf128 $1, %ymm1, %xmm3 vpermilpd $1, %xmm3, %xmm4 # xmm4 = xmm3[1,0] Source line: 62 vaddsd (%rcx), %xmm0, %xmm0 ``` Note that the SIMD vector is kept in register `%ymm1`, but is unnecessarily scalarized into registers `%xmm{0,1,2,3}` at the end of the kernel, and re-assembled in the beginning. With this patch, the loop kernel looks like: ``` L192: vaddpd (%rdx), %ymm1, %ymm1 Source line: 62 addq %rsi, %rdx addq %rcx, %rdi jne L192 ``` which is perfect.
Arch Robison proposed the patch <http://reviews.llvm.org/D14260> "Optimize store of "bitcast" from vector to aggregate" for LLVM. This patch applies cleanly to LLVM 3.7.1. It seems to be the last missing puzzle piece on the LLVM side to allow generating efficient SIMD instructions via `llvm_call` in Julia. For an example package, see e.g. <https://github.com/eschnett/SIMD.jl>. Some discussion relevant to this PR are in JuliaLang#5355. @ArchRobison, please comment. Julia stores tuples as LLVM arrays, whereas LLVM SIMD instructions require LLVM vectors. The respective conversions are unfortunately not always optimized out unless the patch above is applied, leading to a cumbersome sequence of instructions to disassemble and reassemble a SIMD vector. An example is given here <eschnett/SIMD.jl#1 (comment)>. Without this patch, the loop kernel looks like (x86-64, AVX2 instructions): ``` vunpcklpd %xmm4, %xmm3, %xmm3 # xmm3 = xmm3[0],xmm4[0] vunpcklpd %xmm2, %xmm1, %xmm1 # xmm1 = xmm1[0],xmm2[0] vinsertf128 $1, %xmm3, %ymm1, %ymm1 vmovupd 8(%rcx), %xmm2 vinsertf128 $1, 24(%rcx), %ymm2, %ymm2 vaddpd %ymm2, %ymm1, %ymm1 vpermilpd $1, %xmm1, %xmm2 # xmm2 = xmm1[1,0] vextractf128 $1, %ymm1, %xmm3 vpermilpd $1, %xmm3, %xmm4 # xmm4 = xmm3[1,0] Source line: 62 vaddsd (%rcx), %xmm0, %xmm0 ``` Note that the SIMD vector is kept in register `%ymm1`, but is unnecessarily scalarized into registers `%xmm{0,1,2,3}` at the end of the kernel, and re-assembled in the beginning. With this patch, the loop kernel looks like: ``` L192: vaddpd (%rdx), %ymm1, %ymm1 Source line: 62 addq %rsi, %rdx addq %rcx, %rdi jne L192 ``` which is perfect.
Hm, I did a new clone of julia, checked out your recent PR and built everything and I still get the packing / unpacking into the scalar registers. If you want to reproduce: function simd_sum{T}(x::Vector{T})
s = SIMD.create(T, 0.0)
@simd for i in eachindex(x)
@inbounds s = s + x[i]
end
return s
end
vec = [Vec{4, Float64}((rand(), rand(), rand(), rand())) for i in 1:500]
@code_native simd_sum(vec) Edit: Changed the repo so that it only uses SIMD.jl and not ForwardDiff. |
The LLVM code for the loop kernel is:
The code looks reasonable, apart from the usual vector/array conversions that Arch Robison's patch should delete during code generation. Compared to other loops, the vector is here nested inside another structure, hence the multi-layer I believe we need a stronger version of @ArchRobison's D14260 patch. An alternative (I don't know how difficult to implement, though) would be to explicitly add LLVM's vector types to Julia, e.g. as a new version of |
I edited my post above to only include a raw vector of
Do you get the same? Maybe it is the same thing since the |
For your edited post I get
In the future, can you just create a new post instead of editing an old one? This makes it possible for me to look at your old example again, which is now kind of difficult for me since I didn't save it. |
To reiterate: I have |
Arch Robison proposed the patch <http://reviews.llvm.org/D14260> "Optimize store of "bitcast" from vector to aggregate" for LLVM. This patch applies cleanly to LLVM 3.7.1. It seems to be the last missing puzzle piece on the LLVM side to allow generating efficient SIMD instructions via `llvm_call` in Julia. For an example package, see e.g. <https://github.com/eschnett/SIMD.jl>. Some discussion relevant to this PR are in JuliaLang#5355. @ArchRobison, please comment. Julia stores tuples as LLVM arrays, whereas LLVM SIMD instructions require LLVM vectors. The respective conversions are unfortunately not always optimized out unless the patch above is applied, leading to a cumbersome sequence of instructions to disassemble and reassemble a SIMD vector. An example is given here <eschnett/SIMD.jl#1 (comment)>. Without this patch, the loop kernel looks like (x86-64, AVX2 instructions): ``` vunpcklpd %xmm4, %xmm3, %xmm3 # xmm3 = xmm3[0],xmm4[0] vunpcklpd %xmm2, %xmm1, %xmm1 # xmm1 = xmm1[0],xmm2[0] vinsertf128 $1, %xmm3, %ymm1, %ymm1 vmovupd 8(%rcx), %xmm2 vinsertf128 $1, 24(%rcx), %ymm2, %ymm2 vaddpd %ymm2, %ymm1, %ymm1 vpermilpd $1, %xmm1, %xmm2 # xmm2 = xmm1[1,0] vextractf128 $1, %ymm1, %xmm3 vpermilpd $1, %xmm3, %xmm4 # xmm4 = xmm3[1,0] Source line: 62 vaddsd (%rcx), %xmm0, %xmm0 ``` Note that the SIMD vector is kept in register `%ymm1`, but is unnecessarily scalarized into registers `%xmm{0,1,2,3}` at the end of the kernel, and re-assembled in the beginning. With this patch, the loop kernel looks like: ``` L192: vaddpd (%rdx), %ymm1, %ymm1 Source line: 62 addq %rsi, %rdx addq %rcx, %rdi jne L192 ``` which is perfect.
Yes, of course. My original example was: Pkg.checkout("ForwardDiff", "kc/simd")
using ForwardDiff
using SIMD
function simd_sum{T}(x::Vector{T})
s = zero(T)
@simd for i in eachindex(x)
@inbounds s = s + x[i]
end
return s
end
vec = [ForwardDiff.GradientNumber{4, Float64, Vec{4, Float64}}(1.0, ForwardDiff.Partials(Vec{4, Float64}((rand(), rand(), rand(), rand())))) for i in 1:500]
@code_native simd_sum(vec) Maybe the patch didn't get applied for me. I will look into it further, |
Yes, I failed somehow and the patch didn't apply at my last test. Now it does and for the Hm, maybe Archs patch can be tweaked for the |
Patch Yes, I am also worried about the fragility here. |
I'll take a look at |
What do I need to do to reproduce the
|
You need to |
If you want to get rid of the external using SIMD
immutable GradientNumber{N, T}
val::T
partials::Vec{N,T}
end
Base.zero{N,T}(::Type{GradientNumber{N, T}}) = GradientNumber(zero(T), zero(Vec{N,T}))
function Base.(:+){N,T}(g1::GradientNumber{N, T}, g2::GradientNumber{N, T})
return GradientNumber(g1.val + g2.val, g1.partials + g2.partials)
end
function simd_sum{T}(x::Vector{T})
s = zero(T)
@simd for i in eachindex(x)
@inbounds s = s + x[i]
end
return s
end
code_llvm(simd_sum, (Vector{GradientNumber{4, Float64}},))
code_native(simd_sum, (Vector{GradientNumber{4, Float64}},)) |
Thanks. Examples with fewer dependencies always appreciated. Now I'm stuck on:
I also tried:
Evidently I'm missing some context. I'm working off Julia sources pulled on Monday. |
Sorry, I should have been more thorough. You need this repo that we are talking in but it is not registered, that is |
For some more context, the problem is that the code I just posted above have the unnecessary packing and unpacking into
|
Using Your example is nice because it exposes several issues where the current pattern matching designed to help SLP vectorizer deal with Julia tuples is not enough to handle what's coming out of Maybe for packages already engaging in llvmcall machinations it would make sense to have a way to tell the compiler to use a LLVM vector. But I'm not ready to give up quite yet. |
@ArchRobison Did you have time to look into this issue? I've looked at Julia to see where the decision regarding LLVM structures, arrays, and vectors is made. I can think of two ways to introduce LLVM vectors to Julia:
I don't know how difficult it would be to introduce a new tuple-like type to Julia. It seems overkill. I don't know how robustly one can "flag" a type in Julia. For testing, looking at the type's name would suffice for me. I think LLVM vectors are supposed to be more strictly aligned than structures or arrays. In the code I generated, I can circumvent this by specifying a weaker alignment. Is there other code that is also relevant? For example, would Julia automatically generate code to access or copy these vectors that then might break due to alignment restrictions? Any advice you have would be help. |
I pondered it a bit. If LLVM had a cast operation from vector to aggregate types, and vice-versa, we wouldn't have a problem. Adding LLVM intrinsics for the casts might work, but that's a pain to maintain for each target machine. Lacking a cast, my approach had been to recognize the idioms that perform such casts, e.g. extracting each element and inserting it into an aggregate. The trouble with the approach is that it depends on being able to rewrite the idiom. In the context of load or store, that's easy -- just do pointer type punning. But the
Looking back though the definition chains for where an outer aggregate was loaded from won't help either, because type-punning the outer aggregate to a different aggregate type won't always work, since changing a component from aggregate to vector might change it's alignment. E.g., I would dread complicating the fundamental Julia type system for this use case, both from the viewpoint of teaching users and complicating type inference. The flagged type approach seems worth examining. The flagged type needs a name (because just using a tuple won't do since it could be confused with ordinary tuples). At the Julia level, the flagged type could be an immutable datatype with a single field that is a homogeneous tuple. That gives it a name to distinguish it while staying within the current Julia type system. The flag would indicate that the field should be mapped to an LLVM vector. Generation of LLVM code gets trickier, but I had most of that figured out for an earlier patch that I had abandoned. The flag would need to live in I'm not sure what the best approach for the Julia syntax would be. Maybe have an intrinsic type constructor? Or a macro that operates on a stylized |
I experimented with a setup such as you suggested in https://github.com/eschnett/julia/tree/eschnett/vector-types. I'm simply keying the flag on the type name ( I tried putting the flag into the surrounding immutable type, not into the contained tuple type. Is this a good idea? This means that I need to pass a flag Also, I didn't manage to observe any change in behaviour. I would have expected Julia to access the tuple as I didn't look at struct layout, offsets, or alignments either. |
I was initially thinking the flag should go on the surrounding immutable type, but as you point out, that necessitates passing the flag as a parameter in places. (I ended up in the same boat when trying to automatically lay out some tuples as LLVM vectors.) I was worried about putting the flag on the tuple type itself because then we would have two tuple types that differ only by the hidden flag and confuse reflection. But maybe that works out right? They are different types and the internals of I recall there is lots of pointer type-punning in the generation of LLVM code, so it's possible the punning undid your changes. (LLVM is moving from typed pointers to typed loads/stores anyway; then the punning will be free.) I'm wondering about trying an inverted flagging scheme, like this:
The inversion would give us kosher Julia types that are clearly distinguished from ordinary tuples. I think it would avoid the flag-passing hassle. We would need to modify
|
First, please say if you don't think this package is ready for any issues.
I always get a segfault when loading a length 7
Vec
from aFloat64
array.Also, is ArchRobinsons recent patch to LLVM needed for this package to properly vectorize?
The text was updated successfully, but these errors were encountered: