Simple reductions of Vectors of ForwardDiff numbers fail to use SIMD #98
Unfortunately, I am not well versed enough in LLVM IR to be able to pinpoint exactly what the problem is...
This generates very similar code and also fails to vectorize:

```julia
function tuple_simd_sum{T}(x::Vector{NTuple{4, T}})
    s = (0.0, 0.0, 0.0, 0.0)
    @inbounds @simd for i in eachindex(x)
        x_i = x[i]
        s = (s[1] + x_i[1], s[2] + x_i[2], s[3] + x_i[3], s[4] + x_i[4])
    end
    return s
end

tuple_vec = [(rand(), rand(), rand(), rand()) for i = 1:20]
@code_llvm tuple_simd_sum(tuple_vec)
```
Would it help to move from the current storage type to something else?
I don't know what other storage type that would be. A vector of GradientNumber is already packed efficiently in contiguous memory, so I don't think the storage type is wrong; it is just that it is translated to LLVM IR in a non-optimal way. There are some patches pending to LLVM that might help: JuliaLang/julia#5355 (comment). It would be interesting to experiment with https://github.com/eschnett/SIMD.jl when it is mature enough.
I think manually calling SIMD instructions might be the most consistent way to resolve this issue (instead of having to rely on the auto-vectorizer).
I made a small proof of concept using SIMD.jl. Doing something like:

```julia
using ForwardDiff
using SIMD

function simd_sum{T}(x::Vector{T})
    s = zero(T)
    @simd for i in eachindex(x)
        @inbounds s = s + x[i]
    end
    return s
end

vec = [ForwardDiff.GradientNumber{4, Float64, Vec{4, Float64}}(1.0, ForwardDiff.Partials(Vec{4, Float64}((rand(), rand(), rand(), rand())))) for i in 1:500]

@code_native simd_sum(vec)
```

shows a bunch of vectorized operations being used. It is, however, about twice as slow as the original code, hehe. Maybe my implementation is not perfect, or SIMD.jl needs a bit more battle experience. Still pretty interesting.
I made another proof of concept of getting the operations on the partial numbers to be SIMD accelerated. First I started on a package, SIMDVectors.jl, and then hooked it into ForwardDiff's partials (simd_fdiff.jl). The result is quite nicely generated SIMD code:

```julia
include("simd_fdiff.jl")
using SIMDVectors

a = GradientNumber(rand(), Partials(load(SIMDVector{9}, rand(9))))
b = GradientNumber(rand(), Partials(load(SIMDVector{9}, rand(9))))

@code_native a + b
```
```
...
vaddpd 16(%rdx), %xmm5, %xmm5
vaddpd 32(%rdx), %xmm4, %xmm4
vaddpd 48(%rdx), %xmm3, %xmm3
vaddpd 64(%rdx), %xmm2, %xmm2
vaddsd 80(%rdx), %xmm1, %xmm1
vaddsd (%rdx), %xmm0, %xmm0
...
```
And `@code_native a * b` gives:

```
...
vmulpd 32(%rsi), %xmm1, %xmm10
vmulpd 48(%rsi), %xmm1, %xmm11
vmulpd 64(%rsi), %xmm1, %xmm8
Source line: 6
vmulsd 80(%rsi), %xmm0, %xmm5
Source line: 34
vmovsd (%rdx), %xmm6 # xmm6 = mem[0],zero
Source line: 8
vmovddup %xmm6, %xmm7 # xmm7 = xmm6[0,0]
vmulpd 16(%rdx), %xmm7, %xmm1
vmulpd 32(%rdx), %xmm7, %xmm2
vmulpd 48(%rdx), %xmm7, %xmm3
vmulpd 64(%rdx), %xmm7, %xmm7
Source line: 6
vmulsd 80(%rdx), %xmm6, %xmm4
Source line: 8
vaddpd %xmm1, %xmm9, %xmm1
vaddpd %xmm2, %xmm10, %xmm2
vaddpd %xmm3, %xmm11, %xmm3
vaddpd %xmm7, %xmm8, %xmm7
Source line: 6
vaddsd %xmm4, %xmm5, %xmm4
vmulsd %xmm6, %xmm0, %xmm0
...
```
This uses lovely AVX 256-bit instructions, and the example in the OP also runs faster.
This is so cool. I suppose now the race is on for me to wrap up #102 before JuliaLang/julia#15244 lands.
There are of course a few caveats; for example, SIMD vectors are limited to length 16. It shouldn't be that much code, though, since only simple operations are needed on the partials.
I guess that means no nested differentiation as well...
SIMD + nested differentiation is not a use case that I care too much about.
So how much faster will a multiplication of two 16-length gradient numbers be? 4x faster?
Theoretically yes, but I'm segfaulting now when I run the benchmarks.
Couldn't …
The best would be if whatever is used to hold the partials would just degenerate to a normal tuple when the type is not something that SIMD would work on. As in https://github.com/KristofferC/SIMDVectors.jl#user-defined-number-types.
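For illustration, one way such a fallback could be arranged is to select the storage type by dispatch on the element type. This is only a sketch; `simd_compatible` and `partials_storage` are hypothetical names, not SIMDVectors.jl's or ForwardDiff's API:

```julia
# Hypothetical sketch: choose SIMD-friendly storage only for element types the
# hardware can actually vectorize, and fall back to a plain tuple otherwise.
simd_compatible{T}(::Type{T}) = T <: Union{Float32, Float64}

partials_storage{N,T}(::Type{Val{N}}, ::Type{T}) =
    simd_compatible(T) ? NTuple{N, VecElement{T}} : NTuple{N, T}

# partials_storage(Val{4}, Float64)  -> NTuple{4, VecElement{Float64}}
# partials_storage(Val{4}, BigFloat) -> NTuple{4, BigFloat}
```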
@jrevels I was just reading about and experimenting with Julia 0.5's SIMD support. Are there any technical hurdles to implementing SIMD optimizations for the partials? As I think you suggested earlier, it seems like we could define:

```julia
immutable Partials{N,T}
    values::NTuple{N,VecElement{T}}
end
```

with the corresponding operations defined on that representation.
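As a concrete sketch of such an operation (the `tupadd` helper is illustrative only, not ForwardDiff's API): the elementwise add is still written as scalar operations, and the hope is that LLVM's SLP vectorizer (with -O3) fuses them into a single packed instruction for primitive element types.

```julia
# Hypothetical helper: elementwise add over a tuple of VecElement values.
# Written as N scalar adds; with -O3 the SLP vectorizer can fuse them into
# one packed <N x T> fadd when T is a hardware-friendly primitive type.
@inline tupadd{N,T}(a::NTuple{N,VecElement{T}}, b::NTuple{N,VecElement{T}}) =
    map((x, y) -> VecElement(x.value + y.value), a, b)

a = (VecElement(1.0), VecElement(2.0), VecElement(3.0), VecElement(4.0))
b = (VecElement(0.5), VecElement(0.5), VecElement(0.5), VecElement(0.5))
tupadd(a, b)
```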
Note: this is what SIMD.jl is doing.
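Roughly speaking, SIMD.jl builds its vector types on NTuple{N,VecElement{T}} and emits the packed instructions itself via llvmcall rather than waiting for the auto-vectorizer. A minimal sketch of that idea, hardcoded to 4 x Float64 (the `vadd` name is illustrative):

```julia
const Vec4d = NTuple{4, VecElement{Float64}}

# Explicitly emit a packed <4 x double> add; this is the core trick behind
# SIMD.jl's operators (requires a Julia where VecElement tuples lower to
# LLVM vector types).
@inline vadd(a::Vec4d, b::Vec4d) =
    Base.llvmcall("""
        %res = fadd <4 x double> %0, %1
        ret <4 x double> %res
        """, Vec4d, Tuple{Vec4d, Vec4d}, a, b)

a = (VecElement(1.0), VecElement(2.0), VecElement(3.0), VecElement(4.0))
b = (VecElement(0.5), VecElement(0.5), VecElement(0.5), VecElement(0.5))
vadd(a, b)
```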
@KristofferC Understood. Also looked at SIMD.jl. I thought @jrevels expressed the opinion that he would prefer to insert the SIMD optimizations directly into ForwardDiff. Is SIMD.jl mature enough to be used as a dependency at this time?
Since we do not require many fancy operations on the partials, it should be possible to just roll our own. When I experimented with this I got random crashes when trying to benchmark, but perhaps the situation has improved.
Attempting to answer my first question about this approach (i.e. what is the cost of defining the partials as a tuple of VecElement?), I created a branch of ForwardDiff.jl where the Partials are defined in this way (for Julia 0.5 only). No SIMD optimizations have been added (yet) -- only the storage of the partials is changed. For 1st derivatives, the performance seems to be unchanged (as expected). However, I'm seeing a 2-4x slowdown when using nested duals.

It would be really great to change the definition of the partials' storage type depending on the element type. Is there any good way to do this? Otherwise, we may need multiple types of Partials. Thoughts?
What is the problem with using VecElement if T is not a bits type? It should degenerate to exactly the same LLVM representation as a tuple of T for non-SIMD-supported T.
The problem with nesting …
How does the generated code look for something really simple, like adding two nested dual numbers, with VecElement vs. Tuple?
@KristofferC Please see the attached files, which show the generated code and timings for sizes of {1,3,6,9} x {1,3,6,9}. In the 1x1, 3x1, 6x1, and 9x1 cases, there is virtually no difference in the generated code. However, as the number of partials in the nested dual is increased, the generated code becomes considerably lengthier (with the 9x9 case being quite lengthy, indeed). This seems to imply that the VecElement representation generates worse code for nested duals. (Attachment: dualtest.jl.txt) UPDATE: It appears that the generated code for (m by n; n > 1) duals is suffering from unnecessary allocations and memory-copy calls.
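A quick way to confirm that kind of overhead (a sketch; `a` and `b` stand for nested duals built on the experimental branch): after warming up, adding two isbits duals should not allocate at all, so a nonzero `@allocated` result points at the boxing/copying described above.

```julia
a + b                     # run once so compilation doesn't pollute the measurement
bytes = @allocated a + b  # expected to be 0 for a fully unboxed representation
println(bytes)
```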
Hopefully that works. However, I don't think there is any inherent reason why nested duals should generate worse code.
Update: My above approach won't work. Thanks @KristofferC for encouraging me to further distill the problem and file an issue.
Feel free to read JuliaLang/julia#18857. It's possible that introducing the …
One of the problems with using VecElement directly is that you still need llvmcall to get at the vector operations. I have been working on supporting vector intrinsics in base Julia (JuliaLang/julia#18470), but I haven't made much progress on that recently (the problem is what to do in the case that LLVM is not available). The goal of that work was to prevent people from having to roll their own llvmcall implementations (LLVM IR is not stable). In the meantime I would recommend using SIMD.jl; it is a very nice package and it has solved several of the hard problems already.
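For anyone following along, a small taste of what SIMD.jl provides (a sketch; consult the package README for the authoritative API):

```julia
using SIMD

xs = rand(Float64, 8)
v1 = vload(Vec{4, Float64}, xs, 1)  # packed load of xs[1:4]
v2 = vload(Vec{4, Float64}, xs, 5)  # packed load of xs[5:8]
v3 = v1 + v2                        # one vaddpd instead of four vaddsd
println(sum(v3))                    # horizontal reduction over the lanes
```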
@dbeach24 Just caught up with this issue after returning from vacation, thanks for looking into it. I'm a SIMD noob, and haven't really been following recent developments in Julia/the ecosystem (thanks for your insight, @vchuravy). Upon revisiting this, I've just found that LLVM's SLP auto-vectorizer now seems to work in some cases for Julia v0.5. Lo and behold:

```
➜ data julia -O3 -q
julia> using ForwardDiff: Dual
julia> a = Dual(1., (2., 3.))
Dual(1.0,2.0,3.0)
julia> b = Dual(4., (5., 6.))
Dual(4.0,5.0,6.0)
julia> @code_llvm a + b
define void @"julia_+_70898"(%Dual* noalias sret, %Dual*, %Dual*) #0 {
top:
%3 = getelementptr inbounds %Dual, %Dual* %1, i64 0, i32 1, i32 0, i64 1
%4 = load double, double* %3, align 8
%5 = getelementptr inbounds %Dual, %Dual* %2, i64 0, i32 1, i32 0, i64 1
%6 = load double, double* %5, align 8
%7 = fadd double %4, %6
%8 = bitcast %Dual* %1 to <2 x double>*
%9 = load <2 x double>, <2 x double>* %8, align 8
%10 = bitcast %Dual* %2 to <2 x double>*
%11 = load <2 x double>, <2 x double>* %10, align 8
%12 = fadd <2 x double> %9, %11
%13 = bitcast %Dual* %0 to <2 x double>*
store <2 x double> %12, <2 x double>* %13, align 8
%14 = getelementptr inbounds %Dual, %Dual* %0, i64 0, i32 1, i32 0, i64 1
store double %7, double* %14, align 8
ret void
}
julia> @code_native a + b
.section __TEXT,__text,regular,pure_instructions
Filename: dual.jl
pushq %rbp
movq %rsp, %rbp
Source line: 114
vmovsd 16(%rsi), %xmm0 ## xmm0 = mem[0],zero
vaddsd 16(%rdx), %xmm0, %xmm0
Source line: 182
vmovupd (%rsi), %xmm1
vaddpd (%rdx), %xmm1, %xmm1
vmovupd %xmm1, (%rdi)
vmovsd %xmm0, 16(%rdi)
movq %rdi, %rax
popq %rbp
retq
nopw %cs:(%rax,%rax)
```

If I don't use -O3, I don't get the vectorized instructions. Note that the example in this issue's initial description also now auto-vectorizes for me (in both the LLVM IR and the native assembly) with -O3.
@jrevels Very interesting, indeed. If Julia's auto-vectorizer is generally smart enough to vectorize the innermost partials of a nested Dual type, then perhaps there is no point in a manual optimization effort?
Also noted at the end of https://github.com/JuliaArrays/StaticArrays.jl/blob/master/README.md#speed. However, I don't think the auto-vectorizer ever generates the more advanced SIMD stuff like 4x Float64.
It's working for me (it also works for the wider 4 x Float64 case):

```
julia> a = Dual(1., (2., 3., 4., 5.))
Dual(1.0,2.0,3.0,4.0,5.0)
julia> b = copy(a)
Dual(1.0,2.0,3.0,4.0,5.0)
julia> @code_llvm a + b
define void @"julia_+_70874"(%Dual.66* noalias sret, %Dual.66*, %Dual.66*) #0 {
top:
%3 = getelementptr inbounds %Dual.66, %Dual.66* %1, i64 0, i32 1, i32 0, i64 3
%4 = load double, double* %3, align 8
%5 = getelementptr inbounds %Dual.66, %Dual.66* %2, i64 0, i32 1, i32 0, i64 3
%6 = load double, double* %5, align 8
%7 = fadd double %4, %6
%8 = bitcast %Dual.66* %1 to <4 x double>*
%9 = load <4 x double>, <4 x double>* %8, align 8
%10 = bitcast %Dual.66* %2 to <4 x double>*
%11 = load <4 x double>, <4 x double>* %10, align 8
%12 = fadd <4 x double> %9, %11
%13 = bitcast %Dual.66* %0 to <4 x double>*
store <4 x double> %12, <4 x double>* %13, align 8
%14 = getelementptr inbounds %Dual.66, %Dual.66* %0, i64 0, i32 1, i32 0, i64 3
store double %7, double* %14, align 8
ret void
}
```

It can even vectorize operations on the values + partials simultaneously, if the lengths work out correctly:

```
julia> a = Dual(1., 2., 3., 4.)
Dual(1.0,2.0,3.0,4.0)
julia> b = copy(a)
Dual(1.0,2.0,3.0,4.0)
julia> @code_llvm a + b
define void @"julia_+_71231"(%Dual.74* noalias sret, %Dual.74*, %Dual.74*) #0 {
top:
%3 = bitcast %Dual.74* %1 to <4 x double>*
%4 = load <4 x double>, <4 x double>* %3, align 8
%5 = bitcast %Dual.74* %2 to <4 x double>*
%6 = load <4 x double>, <4 x double>* %5, align 8
%7 = fadd <4 x double> %4, %6
%8 = bitcast %Dual.74* %0 to <4 x double>*
store <4 x double> %7, <4 x double>* %8, align 8
ret void
}
```
Possibly. I guess it could depend on the context-sensitivity of the auto-vectorizer (it's LLVM's SLP vectorizer, BTW, not Julia's) -- I don't know if it works in every case. So far, though, it seems to work where I'd expect it to. My original goal was to vectorize arithmetic operations on Dual numbers. Anyway, do we consider this issue closed since it now works with -O3?
Just had to make sure this works, and it does (soooo cool that this works):

```
julia> a = Dual(Dual(1., 2.), Dual(3., 4.))
Dual(Dual(1.0,2.0),Dual(3.0,4.0))
julia> b = copy(a)
Dual(Dual(1.0,2.0),Dual(3.0,4.0))
julia> @code_llvm a + b
define void @"julia_+_70918"(%Dual.66* noalias sret, %Dual.66*, %Dual.66*) #0 {
top:
%3 = bitcast %Dual.66* %1 to <4 x double>*
%4 = load <4 x double>, <4 x double>* %3, align 8
%5 = bitcast %Dual.66* %2 to <4 x double>*
%6 = load <4 x double>, <4 x double>* %5, align 8
%7 = fadd <4 x double> %4, %6
%8 = bitcast %Dual.66* %0 to <4 x double>*
store <4 x double> %7, <4 x double>* %8, align 8
ret void
}
```
I don't think we currently can define a convert method between the two representations.
I tried various operations on nested duals, and it appears that the SIMD instructions appear in more complex cases as well -- as long as I specify -O3. I did not see vectorized instructions without it. UPDATE: Apparently …
There are many factors that are at play here:
It is a shame that we can't do:

```julia
immutable Double
    x::Float64
    y::Float64
end

reinterpret(NTuple{2,VecElement{Float64}}, Double(1.0, 2.0))
```

but this works: …
Now we just need to lobby to enable the vector passes with the default -O level :)
Would be interesting to run the benchmarks with and without -O3 to see if it gives any cool speedups.
Yeah, I saw similar perf differences on my machine. Wonder what's causing the slowdown for some of the benchmarks? It could be from some other pass enabled via -O3.
Perhaps the way forward is just to document this and close the issue? I'm pretty mind-blown by how good the code generated with -O3 is... When did this start happening (or have we just not tried it before)? A speedup of Rosenbrock by almost a factor of 3 is pretty great, no?
I agree with this.
I played around with trying to get the SLP vectorizer to actually trigger/emit vectorized instructions for ForwardDiff's old number types back in the days of Julia v0.3/early v0.4, but never succeeded. When @dbeach24 started reviving discussion here, I remembered that I should try again with the new Dual type.
Yeah, I was really excited to see that 😃
Is there any way we can add a test for regressions here?
We could have a bunch of tests like:

```julia
@test contains(sprint(io -> Base.code_llvm(io, +, (typeof(a), typeof(b)))), "fadd <4 x double>")
```

Would this be too fragile?
It would fail if you forgot to run with -O3.
Well yeah, we'd have to put -O3 in the Travis script to catch SIMD regressions. The LLVM IR generated should be the same either way, no (i.e. it's just that the native code won't include SIMD)? Do we emit different IR based on the architecture? I thought it was LLVM's job to abstract over the architecture.
For local testing, we might be able to get the optimization level from within Julia, then only run the SIMD tests if it's high enough.
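One possible way to do that (a sketch, assuming `Base.JLOptions().opt_level` reflects the -O flag passed on the command line, which it does on current Julia):

```julia
using Base.Test

# Only assert on vectorized IR when the process was started with -O3;
# otherwise the SLP vectorizer never runs and the check would be meaningless.
if Base.JLOptions().opt_level >= 3
    llvm_ir = sprint(io -> Base.code_llvm(io, +, (typeof(a), typeof(b))))
    @test contains(llvm_ir, "fadd <4 x double>")
end
```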
The IR is also different depending on whether the sysimg is locally compiled or not.
A standard reduction function (a plain @simd sum; see the sketch below), which vectorizes for a Vector{Float64}, fails to vectorize with ForwardDiff numbers. It is possible that there is nothing that can be done on the ForwardDiff side to solve this, but I thought it could be interesting to track nonetheless.
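For concreteness, the reduction in question is essentially the `simd_sum` used throughout the comments above (sketch):

```julia
function simd_sum{T}(x::Vector{T})
    s = zero(T)
    @inbounds @simd for i in eachindex(x)
        s += x[i]
    end
    return s
end

# Vectorizes for plain Float64 input: packed fadd instructions appear in the IR.
@code_llvm simd_sum(rand(1000))
```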