Manual vectorization of the accumulation of Partials
#555
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master     #555      +/-   ##
==========================================
+ Coverage   85.07%   86.01%   +0.94%
==========================================
  Files           9        9
  Lines         844      901      +57
==========================================
+ Hits          718      775      +57
  Misses        126      126
Continue to review full report at Codecov.
|
Is the speed improvement here just from the injected fastmath flags? For example, ForwardDiff already auto-vectorizes seemingly ok:
Did you try just slapping @fastmath on it?

I think before this can be merged we need to figure out what optimization LLVM misses (or what is causing it not to be able to do that optimization) with the normal tuples, and why going to llvmcall is strictly needed. Also, instead of hand-rolling our own LLVM SIMD everywhere, if this is really required, wouldn't SIMD.jl be easier to use?

Also, please use text instead of screenshots for benchmark output. |
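For reference, what the injected fast-math flags change is visible directly in the LLVM IR: a fast flag on the floating-point instructions permits reassociation, which is what lets a reduction be packed into SIMD lanes. A minimal, generic illustration, not code from this PR:

plain_sum(x, y, z) = x + y + z              # lowers to plain fadd double
fast_sum(x, y, z)  = @fastmath x + y + z    # lowers to fadd fast double
# Compare the IR:
# @code_llvm debuginfo=:none plain_sum(1.0, 2.0, 3.0)
# @code_llvm debuginfo=:none fast_sum(1.0, 2.0, 3.0)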
It is not:

➜ ForwardDiff git:(master) julia --math-mode=fast --startup-file=no ~/pollu.jl
1.289900 seconds (4.68 M allocations: 241.667 MiB, 18.83% gc time, 99.99% compilation time)
BenchmarkTools.Trial: 10000 samples with 73 evaluations.
Range (min … max): 870.356 ns … 3.028 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 890.062 ns ┊ GC (median): 0.00%
Time (mean ± σ): 911.691 ns ± 113.366 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▇▃▂▁▂▂ ▁ ▁
████████▇██▇▇▇▆▆▆▅▅▆▃▄▅▆▅▅▅▄▄▄▃▃▅▃▄▄▅▅▅▅▄▄▃▄▅▄▃▄▂▃▂▂▃▅▂▃▃▂▄▃▅ █
870 ns Histogram: log(frequency) by time 1.47 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
➜ ForwardDiff git:(master) julia --math-mode=fast --startup-file=no ~/pollu.jl
1.137902 seconds (4.68 M allocations: 241.667 MiB, 13.23% gc time, 99.99% compilation time)
BenchmarkTools.Trial: 10000 samples with 71 evaluations.
Range (min … max): 882.451 ns … 3.489 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 895.127 ns ┊ GC (median): 0.00%
Time (mean ± σ): 918.756 ns ± 126.937 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▇█▄▃▁▁▂ ▁ ▁
███████▇██▇▆▆▅▇▅▄▆▄▃▅▅▅▅▂▄▃▄▅▃▄▅▅▅▅▄▅▄▄▃▃▄▄▄▄▄▃▃▄▄▄▄▃▃▃▄▃▃▃▄▃ █
882 ns Histogram: log(frequency) by time 1.53 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
Not necessarily; if you look at the histogram, most of the samples are nowhere near the min time. Why would you say the min is closest to the truth? If anything, it's misleading to believe that this function would really run that fast in a real-world setting. Also, since we need to make a cost-benefit analysis here, having more information certainly doesn't hurt. (Also, I like those colors, hence the screenshots. I posted text this time if you prefer that.) Further, I am not sure that changes in the native code don't shift the min/mean/median in unintuitive ways. For instance, maybe the min time is better, yet the mean is much worse, leading to an overall pessimization in real-world use cases.
I first tried to use VectorizationBase, but that leads to compile-time regressions. I think it's simple enough that we should just emit the IR. Especially because ForwardDiff is so low in the dependency stack, we need to be careful about compile-time regressions. |
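For reference, a minimal sketch of what emitting the IR directly can look like for a single operation; the names Vec4, add4, to_vec, and from_vec are illustrative only, not the PR's actual code:

const Vec4 = NTuple{4,VecElement{Float64}}   # maps to the LLVM vector type <4 x double>

@inline function add4(a::Vec4, b::Vec4)
    Base.llvmcall("""
        %res = fadd fast <4 x double> %0, %1
        ret <4 x double> %res
        """, Vec4, Tuple{Vec4,Vec4}, a, b)
end

# Wrap/unwrap plain tuples around the VecElement representation.
to_vec(t::NTuple{4,Float64}) = map(VecElement, t)
from_vec(v::Vec4) = map(e -> e.value, v)

from_vec(add4(to_vec((1.0, 2.0, 3.0, 4.0)), to_vec((5.0, 6.0, 7.0, 8.0))))   # (6.0, 8.0, 10.0, 12.0)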
The end result is still the optimization of the sum of the parts.
Yes, because your computer has noise which is always additive. The fastest run is the one with the least noise. We are not writing an HTTP server here where there is scheduling and we care about percentiles etc. The mode of the samples is close to the min (and the more samples you take the more the mode will move towards the min).
I agree that colors are good if they show something useful. |
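Since the disagreement is about which estimator to report, note that BenchmarkTools exposes all of them on a trial, so several can be shown at once. A generic usage sketch; the workload below is a made-up stand-in, and t_new / t_old are hypothetical trials:

using BenchmarkTools, Statistics

work(x) = sum(abs2, x)     # stand-in workload
x = rand(100)

t = @benchmark work($x)
minimum(t), median(t), mean(t)          # the estimators being debated
# To compare two trials with a chosen estimator:
# judge(minimum(t_new), minimum(t_old))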
That's not true. If that were the case, ...

Maybe you can also run the benchmark to see if there's a speedup.
ForwardDiff.jl doesn't run on perfect computers. I don't see how you can prove that changes in the native code cannot produce the kind of pathological cases I mentioned here.
|
I am not saying that the contract flag is the actual cause; I am saying it could be. Or one of the other fast-math flags introduced. Or something else. But it is not magic.
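For reference on contract: it is the LLVM fast-math flag that allows a multiply and an add to be fused into a single fma instruction, and muladd opts in explicitly from the Julia side. A generic illustration, not code from the PR:

f_plain(a, b, c) = a * b + c          # separate mul and add unless contraction is allowed
f_fused(a, b, c) = muladd(a, b, c)    # may lower to a single vfmadd on CPUs with FMA
# Compare the generated code:
# @code_native debuginfo=:none f_plain(1.0, 2.0, 3.0)
# @code_native debuginfo=:none f_fused(1.0, 2.0, 3.0)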
See https://youtu.be/vrfYLlR8X8k?t=915 and the following ~10 min. No one cares about the particular noise of your computer; we want to benchmark the algorithm as well as we can, and that's the most unbiased result. Specifically https://youtu.be/vrfYLlR8X8k?t=1487.

I am getting some really strange things. If I run with the local ForwardDiff:

~/JuliaPkgs/ForwardDiff.jl tags/v0.10.23*> julia -q --project
julia> using ForwardDiff
julia> d = ForwardDiff.Dual(1.0, 2.0, 3.0, 4.0, 5.0);
julia> @code_native d * 3.0
.text
; ┌ @ dual.jl:144 within `*'
movq %rdi, %rax
; │ @ dual.jl:243 within `*' @ float.jl:332
vmovsd (%rsi), %xmm2 # xmm2 = mem[0],zero
; │ @ dual.jl:244 within `*' @ partials.jl:84 @ partials.jl:0
vmovsd 8(%rsi), %xmm1 # xmm1 = mem[0],zero
; │ @ dual.jl:244 within `*' @ partials.jl:84 @ partials.jl:95
; │┌ @ float.jl:452 within `isfinite'
; ││┌ @ float.jl:329 within `-'
vsubsd %xmm0, %xmm0, %xmm3
vxorpd %xmm4, %xmm4, %xmm4
; ││└
; ││┌ @ float.jl:401 within `==' @ float.jl:365
vucomisd %xmm4, %xmm3
; │└└
; │ @ dual.jl:244 within `*' @ partials.jl:84 @ partials.jl:0
vmovsd 16(%rsi), %xmm3 # xmm3 = mem[0],zero
; │ @ dual.jl:244 within `*' @ partials.jl:84 @ partials.jl:95
jne L33
jnp L72
L33:
vucomisd %xmm4, %xmm1
jne L72
jp L72
; │ @ dual.jl:244 within `*' @ partials.jl:84 @ partials.jl:0
vxorpd %xmm5, %xmm5, %xmm5
; │ @ dual.jl:244 within `*' @ partials.jl:84 @ partials.jl:95
vucomisd %xmm5, %xmm3
jne L72
jp L72
; │┌ @ partials.jl:37 within `iszero'
; ││┌ @ partials.jl:167 within `iszero_tuple'
; │││┌ @ partials.jl:172 within `macro expansion'
; ││││┌ @ float.jl:365 within `=='
vmovsd 24(%rsi), %xmm4 # xmm4 = mem[0],zero
....

If I run with the registered version:

(@v1.6) pkg> activate --temp
Activating new environment at `/tmp/jl_p9UUFZ/Project.toml`
(jl_p9UUFZ) pkg> add ForwardDiff@0.10.23
julia> using ForwardDiff
julia> d = ForwardDiff.Dual(1.0, 2.0, 3.0, 4.0, 5.0);
julia> @code_native d * 3.0
.text
; ┌ @ dual.jl:144 within `*'
movq %rdi, %rax
; │ @ dual.jl:243 within `*' @ float.jl:332
vbroadcastsd %xmm0, %ymm1
vmulpd (%rsi), %ymm1, %ymm1
; │ @ dual.jl:244 within `*' @ partials.jl:84 @ partials.jl:111
; │┌ @ partials.jl:200 within `scale_tuple'
; ││┌ @ partials.jl:157 within `macro expansion'
; │││┌ @ float.jl:332 within `*'
vmulsd 32(%rsi), %xmm0, %xmm0
; │└└└
; │ @ dual.jl:244 within `*'
vmovupd %ymm1, (%rdi)
vmovsd %xmm0, 32(%rdi)
vzeroupper
retq
nop
; └

Why is the local one so trash? Can you repro that? |
I get:

julia> @code_native debuginfo=:none d * 3.0
.section __TEXT,__text,regular,pure_instructions
movq %rdi, %rax
vmulsd (%rsi), %xmm0, %xmm1
vbroadcastsd %xmm0, %ymm0
vmulpd 8(%rsi), %ymm0, %ymm0
vmovsd %xmm1, (%rdi)
vmovupd %ymm0, 8(%rdi)
vzeroupper
retq
nop
What do you mean by magic? SLP is not magic. So maybe you can also accept that LLVM cannot always somehow deduce the way to group unrolled expressions into optimal SIMD code.

Here's the benchmark without the --math-mode=fast flag, first on this branch (myb/vec) and then on master:

➜ ForwardDiff git:(myb/vec) ✗ julia --startup-file=no ~/pollu.jl
0.979201 seconds (4.80 M allocations: 243.740 MiB, 13.99% gc time, 99.99% compilation time)
833.165 ns (0 allocations: 0 bytes)
➜ ForwardDiff git:(myb/vec) ✗ julia --startup-file=no ~/pollu.jl
1.007542 seconds (4.80 M allocations: 243.740 MiB, 14.13% gc time, 99.99% compilation time)
842.387 ns (0 allocations: 0 bytes)
➜ ForwardDiff git:(myb/vec) ✗ julia --startup-file=no ~/pollu.jl
0.991858 seconds (4.80 M allocations: 243.740 MiB, 13.58% gc time, 99.99% compilation time)
773.643 ns (0 allocations: 0 bytes)

➜ ForwardDiff git:(master) julia --startup-file=no ~/pollu.jl
1.106445 seconds (4.68 M allocations: 241.588 MiB, 20.34% gc time, 99.99% compilation time)
902.947 ns (0 allocations: 0 bytes)
➜ ForwardDiff git:(master) julia --startup-file=no ~/pollu.jl
0.952378 seconds (4.68 M allocations: 241.588 MiB, 13.72% gc time, 99.99% compilation time)
903.885 ns (0 allocations: 0 bytes)
➜ ForwardDiff git:(master) julia --startup-file=no ~/pollu.jl
0.968903 seconds (4.68 M allocations: 241.588 MiB, 13.50% gc time, 99.99% compilation time)
921.333 ns (0 allocations: 0 bytes)

Indeed, I have a noisy computer. But the speedup is still there. |
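To make the SLP point above concrete: the Partials tuple operations are fully unrolled scalar arithmetic, and it is the SLP vectorizer's job to re-group that into packed instructions. A generic illustration, not ForwardDiff's actual generated code:

unrolled_add(a::NTuple{4,Float64}, b::NTuple{4,Float64}) =
    (a[1] + b[1], a[2] + b[2], a[3] + b[3], a[4] + b[4])
# Whether this lowers to a single fadd <4 x double> is up to the SLP vectorizer:
# @code_llvm debuginfo=:none unrolled_add((1.0, 2.0, 3.0, 4.0), (5.0, 6.0, 7.0, 8.0))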
Ok, I had
locally, which apparently trashes the whole thing.
Thanks, just look how easy that was to read :).
Yes, let's figure out why :) |
With this branch, locally I get:
If I remove the changes to
Can you repro that? |
I cannot observe the difference:
In fact, I changed the div function to return

tupexpr(i -> :((println("Hello")); return tup[$i] / x), N)

and nothing was printed, so I am pretty sure it won't change anything :-p |
Maybe it is some code layout effect then (https://easyperf.net/blog/2018/01/18/Code_alignment_issues), because it is quite consistent for me. I keep doing it over and over :P Or it was noise. 🤷♂️ Do you get an improvement from the fast-math annotations that are added in the LLVM IR? |
I seem to get the same performance (and the same number of instructions) using SIMD.jl:
using SIMD

function scale_tuple(tup::NTuple{N}, x) where N
    return @fastmath Tuple(Vec(tup) * x)
end

function div_tuple_by_scalar(tup::NTuple{N}, x) where N
    return @fastmath Tuple(Vec(tup) / x)
end

function add_tuples(a::NTuple{N}, b::NTuple{N}) where N
    return @fastmath Tuple(Vec(a) + Vec(b))
end

function sub_tuples(a::NTuple{N}, b::NTuple{N}) where N
    return @fastmath Tuple(Vec(a) - Vec(b))
end

function minus_tuple(tup::NTuple{N}) where N
    return @fastmath Tuple(-Vec(tup))
end

function mul_tuples(a::NTuple{N,V1}, b::NTuple{N,V2}, afactor::S1, bfactor::S2) where {N,V1,V2,S1,S2}
    af = Vec{N,V1}(afactor)
    bf = Vec{N,V2}(bfactor)
    @fastmath Tuple(muladd(Vec(a), af, bf * Vec(b)))
end

Removing |
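A quick way to sanity-check definitions like the ones above, assuming SIMD.jl is loaded and the functions above are in scope; expected results shown in comments:

t = (1.0, 2.0, 3.0, 4.0)
scale_tuple(t, 2.0)      # expected: (2.0, 4.0, 6.0, 8.0)
add_tuples(t, t)         # expected: (2.0, 4.0, 6.0, 8.0)
# Confirm packed instructions (e.g. a single vmulpd) in the generated code:
# @code_native debuginfo=:none scale_tuple(t, 2.0)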
Alternative to #555 Co-authored-by: Yingbo Ma <[email protected]>
I put up a PR so it is easier to compare: #557 |
Incorporated into #557. |
The original goal for manual vectorization was to decrease the compile time, as I thought the SLP vectorizer could take a long time to run. Unfortunately, it doesn't seem to improve the compile time at all.
BUT!!! Just when I was about to stash my changes, I found out that this improves the runtime by about 20% for the Jacobian computation of a chemical differential equation system.
Master:
This branch:
The benchmark script:
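The script itself is not reproduced above; a rough, hypothetical sketch of what a Jacobian benchmark of this shape could look like (the three-variable system below is a made-up stand-in, not the actual pollu problem):

using ForwardDiff, BenchmarkTools

# Made-up right-hand side standing in for the chemical system.
f(u) = [u[1] * u[2] - u[3]^2,
        u[2] + exp(-u[1]),
        u[1] * u[3] - u[2]]

u = rand(3)
cfg = ForwardDiff.JacobianConfig(f, u)
@btime ForwardDiff.jacobian($f, $u, $cfg)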