
Manual vectorization of the accumulation of Partials #555

Closed · wants to merge 4 commits

Conversation

YingboMa
Member

The original goal of the manual vectorization was to decrease compile time, as I thought the SLP vectorizer could take a long time to run. Unfortunately, it doesn't seem to improve compile time at all.

BUT!!! Just when I was about to stash my changes, I found that this improves the runtime by about 20% for the Jacobian computation of a chemical differential equation system.
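
For context, the Partials of a Dual are just a wrapped NTuple, so "the accumulation of Partials" boils down to element-wise tuple arithmetic like the following (illustrative snippet; the PR replaces the generated scalar tuple code behind these operations with explicit SIMD):

using ForwardDiff

p = ForwardDiff.Partials((1.0, 2.0, 3.0, 4.0))
q = ForwardDiff.Partials((0.5, 0.5, 0.5, 0.5))

p + q    # element-wise addition of the underlying tuples
2.0 * p  # element-wise scaling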

Master: [benchmark screenshot]

This branch: [benchmark screenshot]

The benchmark script:

using DiffEqBase, ForwardDiff, BenchmarkTools
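
# Rate constants of the 20-species POLLU air-pollution chemistry problem (25 reactions)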
const k1=.35e0
const k2=.266e2
const k3=.123e5
const k4=.86e-3
const k5=.82e-3
const k6=.15e5
const k7=.13e-3
const k8=.24e5
const k9=.165e5
const k10=.9e4
const k11=.22e-1
const k12=.12e5
const k13=.188e1
const k14=.163e5
const k15=.48e7
const k16=.35e-3
const k17=.175e-1
const k18=.1e9
const k19=.444e12
const k20=.124e4
const k21=.21e1
const k22=.578e1
const k23=.474e-1
const k24=.178e4
const k25=.312e1

function rhs!(dy,y,p,t)
    @inbounds begin
        r1  = k1 *y[1]
        r2  = k2 *y[2]*y[4]
        r3  = k3 *y[5]*y[2]
        r4  = k4 *y[7]
        r5  = k5 *y[7]
        r6  = k6 *y[7]*y[6]
        r7  = k7 *y[9]
        r8  = k8 *y[9]*y[6]
        r9  = k9 *y[11]*y[2]
        r10 = k10*y[11]*y[1]
        r11 = k11*y[13]
        r12 = k12*y[10]*y[2]
        r13 = k13*y[14]
        r14 = k14*y[1]*y[6]
        r15 = k15*y[3]
        r16 = k16*y[4]
        r17 = k17*y[4]
        r18 = k18*y[16]
        r19 = k19*y[16]
        r20 = k20*y[17]*y[6]
        r21 = k21*y[19]
        r22 = k22*y[19]
        r23 = k23*y[1]*y[4]
        r24 = k24*y[19]*y[1]
        r25 = k25*y[20]

        dy[1]  = -r1-r10-r14-r23-r24+
        r2+r3+r9+r11+r12+r22+r25
        dy[2]  = -r2-r3-r9-r12+r1+r21
        dy[3]  = -r15+r1+r17+r19+r22
        dy[4]  = -r2-r16-r17-r23+r15
        dy[5]  = -r3+r4+r4+r6+r7+r13+r20
        dy[6]  = -r6-r8-r14-r20+r3+r18+r18
        dy[7]  = -r4-r5-r6+r13
        dy[8]  = r4+r5+r6+r7
        dy[9]  = -r7-r8
        dy[10] = -r12+r7+r9
        dy[11] = -r9-r10+r8+r11
        dy[12] = r9
        dy[13] = -r11+r10
        dy[14] = -r13+r12
        dy[15] = r14
        dy[16] = -r18-r19+r16
        dy[17] = -r20
        dy[18] = r20
        dy[19] = -r21-r22-r24+r23+r25
        dy[20] = -r25+r24
    end
    nothing
end
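
# Wrap rhs! as a function of (du, u) with t and p fixed, the form ForwardDiff.jacobian! expects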
f! = DiffEqBase.UJacobianWrapper(rhs!, 0.0, nothing)

u0 = zeros(20)
u0[2]  = 0.2
u0[4]  = 0.04
u0[7]  = 0.1
u0[8]  = 0.3
u0[9]  = 0.01
u0[17] = 0.007

J = similar(u0, length(u0), length(u0))
du = similar(u0)
cfg = ForwardDiff.JacobianConfig(f!, du, u0, ForwardDiff.Chunk(8))
#tinf = @snoopi_deep ForwardDiff.jacobian!(J, f!, du, u0, cfg)
@time ForwardDiff.jacobian!(J, f!, du, u0, cfg)
display(@benchmark ForwardDiff.jacobian!($J, $f!, $du, $u0, $cfg))
println()

@codecov-commenter

codecov-commenter commented Nov 19, 2021

Codecov Report

Merging #555 (e50d653) into master (0af523a) will increase coverage by 0.94%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #555      +/-   ##
==========================================
+ Coverage   85.07%   86.01%   +0.94%     
==========================================
  Files           9        9              
  Lines         844      901      +57     
==========================================
+ Hits          718      775      +57     
  Misses        126      126              
Impacted Files Coverage Δ
src/partials.jl 89.53% <100.00%> (+5.18%) ⬆️


@YingboMa
Member Author

With chunk size 4, there is about a 10% improvement.

Master: [benchmark screenshot]

This branch: [benchmark screenshot]

@KristofferC
Collaborator

KristofferC commented Nov 19, 2021

Is the speed improvement here just from the injected fastmath flags? For example, ForwardDiff already seems to auto-vectorize OK:

julia> using ForwardDiff

julia> d = ForwardDiff.Dual(1.0, 2.0, 3.0, 4.0, 5.0);

julia> @code_native d * 3.0
        .section        __TEXT,__text,regular,pure_instructions
; ┌ @ dual.jl:144 within `*'
        movq    %rdi, %rax
; │ @ dual.jl:243 within `*' @ float.jl:332
        vbroadcastsd    %xmm0, %ymm1
        vmulpd  (%rsi), %ymm1, %ymm1
...

julia> @code_native d + d
        .section        __TEXT,__text,regular,pure_instructions
; ┌ @ dual.jl:139 within `+'
        movq    %rdi, %rax
; │ @ dual.jl:422 within `+' @ float.jl:326
        vmovupd (%rsi), %ymm0
        vaddpd  (%rdx), %ymm0, %ymm0
...

Did you try just slapping a @fastmath on the partial computations?
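
Concretely, something like this on the tuple kernels in partials.jl (sketch only; the generated-function machinery is replaced by plain ntuple and the names are suffixed to mark them as hypothetical):

# Hypothetical sketch: keep the plain tuple code and only add @fastmath, so the
# relaxed-FP flags reach LLVM without any hand-written IR.
@inline add_tuples_fm(a::NTuple{N,T}, b::NTuple{N,T}) where {N,T} =
    @fastmath ntuple(i -> a[i] + b[i], Val(N))

@inline scale_tuple_fm(tup::NTuple{N,T}, x::T) where {N,T} =
    @fastmath ntuple(i -> tup[i] * x, Val(N))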

I think before this can be merged we need to figure out which optimization LLVM misses with the normal tuples (or what prevents it from performing that optimization) and why going to llvmcall is strictly needed. Also, if this is really required, wouldn't SIMD.jl be easier to use than hand-rolling our own LLVM SIMD everywhere?

Also, please use @btime unless the histogram and all the other numbers are actually useful (which they aren't here, since we just want the value closest to the truth, which the minimum already shows best).

@YingboMa
Member Author

YingboMa commented Nov 19, 2021

For example, ForwardDiff already seems to auto-vectorize OK:

@code_native on a small expression is highly misleading. How well the SLP vectorizer performs depends on the context: if that code is inlined into some complex code with indexing and such, it won't produce the same code quality.

Is the speed improvement here just from the injected fastmath flags?

It is not:

➜  ForwardDiff git:(master) julia --math-mode=fast --startup-file=no ~/pollu.jl
  1.289900 seconds (4.68 M allocations: 241.667 MiB, 18.83% gc time, 99.99% compilation time)
BenchmarkTools.Trial: 10000 samples with 73 evaluations.
 Range (min … max):  870.356 ns …   3.028 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     890.062 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   911.691 ns ± 113.366 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   █▇▃▂▁▂▂  ▁                                                   ▁
  ████████▇██▇▇▇▆▆▆▅▅▆▃▄▅▆▅▅▅▄▄▄▃▃▅▃▄▄▅▅▅▅▄▄▃▄▅▄▃▄▂▃▂▂▃▅▂▃▃▂▄▃▅ █
  870 ns        Histogram: log(frequency) by time       1.47 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
➜  ForwardDiff git:(master) julia --math-mode=fast --startup-file=no ~/pollu.jl
  1.137902 seconds (4.68 M allocations: 241.667 MiB, 13.23% gc time, 99.99% compilation time)
BenchmarkTools.Trial: 10000 samples with 71 evaluations.
 Range (min … max):  882.451 ns …   3.489 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     895.127 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   918.756 ns ± 126.937 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇█▄▃▁▁▂  ▁                                                    ▁
  ███████▇██▇▆▆▅▇▅▄▆▄▃▅▅▅▅▂▄▃▄▅▃▄▅▅▅▅▄▅▄▄▃▃▄▄▄▄▄▃▃▄▄▄▄▃▃▃▄▃▃▃▄▃ █
  882 ns        Histogram: log(frequency) by time       1.53 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Also, please use @btime unless the histogram and all the other numbers are actually useful (which they aren't here, since we just want the value closest to the truth, which the minimum already shows best).

Not necessarily. If you look at the histogram, most of the samples are nowhere near the min time. Why would you say the min is closest to the truth? If anything, it's misleading to believe that this function would really run that fast in a real-world setting. Also, since we need to make a cost-benefit analysis here, having more information certainly doesn't hurt. (Also, I like those colors, hence the screenshots. I posted text this time if you prefer that.) Further, I am not sure that changes in the native code don't shift the min/mean/median in unintuitive ways. For instance, maybe the min time is better, yet the mean is much worse, leading to an overall pessimization in real-world use cases.

Also, instead of hand-rolling our own LLVM SIMD everywhere, if this is really required, wouldn't SIMD.jl be easier to use?

I first tried VectorizationBase, but that leads to compile-time regressions. I think this is simple enough that we should just emit the IR. Especially because ForwardDiff sits so low in the dependency chain, we need to be careful about compile-time regressions.
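
For concreteness, "just emit the IR" means something along these lines (a minimal sketch, not the PR's actual code; the real generated functions in src/partials.jl handle arbitrary widths and element types):

# Hypothetical sketch: a 4-wide tuple add written directly as LLVM IR.
# NTuple{4,VecElement{Float64}} lowers to the LLVM vector type <4 x double>.
const Vec4 = NTuple{4,VecElement{Float64}}

@inline function vadd(a::Vec4, b::Vec4)
    Base.llvmcall("""
        %3 = fadd fast <4 x double> %0, %1
        ret <4 x double> %3
        """, Vec4, Tuple{Vec4,Vec4}, a, b)
end

vadd(ntuple(i -> VecElement(Float64(i)), 4), ntuple(_ -> VecElement(0.5), 4))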

@KristofferC
Collaborator

KristofferC commented Nov 19, 2021

@code_native on a small expression is highly misleading. How well the SLP vectorizer performs depends on the context: if that code is inlined into some complex code with indexing and such, it won't produce the same code quality.

The end result is still the optimization of the sum of the parts. For example, the contract fast-math flag used here could perhaps make a difference when optimizing the whole.

Not necessarily. If you look at the histogram, most of the samples are nowhere near the min time. Why would you say the min is closest to the truth?

Yes, because your computer has noise, which is always additive. The fastest run is the one with the least noise. We are not writing an HTTP server here, where there is scheduling and we care about percentiles etc. The mode of the samples is close to the min (and the more samples you take, the closer the mode moves toward the min).
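
For reference, BenchmarkTools exposes all of these estimators on the returned Trial, so none of this requires the full histogram printout; an illustrative snippet, not specific to this PR:

using BenchmarkTools, Statistics

t = @benchmark sum($(rand(1000)))   # stand-in workload, illustrative only
minimum(t), median(t), mean(t)      # the estimators discussed above

# Given two trials (say one per branch), compare them by their minima:
# judge(minimum(t_new), minimum(t_old)) classifies the change as
# :improvement / :invariant / :regression.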

Also, I like those colors, hence the screenshots

I agree that colors are good if they show something useful.

@YingboMa
Member Author

The end result is still the optimization of the sum of the parts. For example, the contract fast-math flag used here could perhaps make a difference when optimizing the whole.

That's not true. If that were the case, --math-mode=fast would bring the master branch up to this branch's speed as well. If you actually look at the native code for the rhs function, the manual vectorization is better. Also, the speedup is demonstrated empirically, so I don't see the dispute. SLP objectively produces worse code when there's complex code surrounding the unrolled part, as shown in the benchmark.

Maybe you can also run the benchmark to see if there's a speedup.

Yes, because your computer has noise, which is always additive. The fastest run is the one with the least noise. We are not writing an HTTP server here, where there is scheduling and we care about percentiles etc. The mode of the samples is close to the min (and the more samples you take, the closer the mode moves toward the min).

ForwardDiff.jl doesn't run on perfect computers. I don't see how you can prove that changes in the native code cannot produce the pathological cases I mentioned here:

Further, I am not sure that changes in the native code don't shift the min/mean/median in unintuitive ways. For instance, maybe the min time is better, yet the mean is much worse, leading to an overall pessimization in real-world use cases.

@KristofferC
Collaborator

KristofferC commented Nov 19, 2021

That's not true. If that were the case, --math-mode=fast would bring the master branch up to this branch's speed as well. If you actually look at the native code for the rhs function, the manual vectorization is better.

I am not saying that contract is the actual cause; I am saying it could be. Or one of the other fast-math flags introduced. Or something else. But it is not magic.

ForwardDiff.jl doesn't run on perfect computers. I don't see how you can prove that changes in the native code cannot produce the pathological cases I mentioned here.

See https://youtu.be/vrfYLlR8X8k?t=915 and the following ~10 min. No one cares about the particular noise of your computer; we want to benchmark the algorithm as well as we can, and that's the most unbiased result. Specifically, https://youtu.be/vrfYLlR8X8k?t=1487


I am getting some really strange results. If I run with the local ForwardDiff:

~/JuliaPkgs/ForwardDiff.jl tags/v0.10.23*> julia -q --project

julia> using ForwardDiff

julia> d = ForwardDiff.Dual(1.0, 2.0, 3.0, 4.0, 5.0);

julia> @code_native d * 3.0
        .text
; ┌ @ dual.jl:144 within `*'
        movq    %rdi, %rax
; │ @ dual.jl:243 within `*' @ float.jl:332
        vmovsd  (%rsi), %xmm2                   # xmm2 = mem[0],zero
; │ @ dual.jl:244 within `*' @ partials.jl:84 @ partials.jl:0
        vmovsd  8(%rsi), %xmm1                  # xmm1 = mem[0],zero
; │ @ dual.jl:244 within `*' @ partials.jl:84 @ partials.jl:95
; │┌ @ float.jl:452 within `isfinite'
; ││┌ @ float.jl:329 within `-'
        vsubsd  %xmm0, %xmm0, %xmm3
        vxorpd  %xmm4, %xmm4, %xmm4
; ││└
; ││┌ @ float.jl:401 within `==' @ float.jl:365
        vucomisd        %xmm4, %xmm3
; │└└
; │ @ dual.jl:244 within `*' @ partials.jl:84 @ partials.jl:0
        vmovsd  16(%rsi), %xmm3                 # xmm3 = mem[0],zero
; │ @ dual.jl:244 within `*' @ partials.jl:84 @ partials.jl:95
        jne     L33
        jnp     L72
L33:
        vucomisd        %xmm4, %xmm1
        jne     L72
        jp      L72
; │ @ dual.jl:244 within `*' @ partials.jl:84 @ partials.jl:0
        vxorpd  %xmm5, %xmm5, %xmm5
; │ @ dual.jl:244 within `*' @ partials.jl:84 @ partials.jl:95
        vucomisd        %xmm5, %xmm3
        jne     L72
        jp      L72
; │┌ @ partials.jl:37 within `iszero'
; ││┌ @ partials.jl:167 within `iszero_tuple'
; │││┌ @ partials.jl:172 within `macro expansion'
; ││││┌ @ float.jl:365 within `=='
        vmovsd  24(%rsi), %xmm4                 # xmm4 = mem[0],zero
....

If I run with the released version:

(@v1.6) pkg> activate --temp
  Activating new environment at `/tmp/jl_p9UUFZ/Project.toml`

(jl_p9UUFZ) pkg> add ForwardDiff@0.10.23

julia> using ForwardDiff

julia> d = ForwardDiff.Dual(1.0, 2.0, 3.0, 4.0, 5.0);

julia> @code_native d * 3.0
        .text
; ┌ @ dual.jl:144 within `*'
        movq    %rdi, %rax
; │ @ dual.jl:243 within `*' @ float.jl:332
        vbroadcastsd    %xmm0, %ymm1
        vmulpd  (%rsi), %ymm1, %ymm1
; │ @ dual.jl:244 within `*' @ partials.jl:84 @ partials.jl:111
; │┌ @ partials.jl:200 within `scale_tuple'
; ││┌ @ partials.jl:157 within `macro expansion'
; │││┌ @ float.jl:332 within `*'
        vmulsd  32(%rsi), %xmm0, %xmm0
; │└└└
; │ @ dual.jl:244 within `*'
        vmovupd %ymm1, (%rdi)
        vmovsd  %xmm0, 32(%rdi)
        vzeroupper
        retq
        nop
; └

Why is the local one such trash? Can you repro that...?

@YingboMa
Member Author

YingboMa commented Nov 19, 2021

I get:

julia> @code_native debuginfo=:none d * 3.0
        .section        __TEXT,__text,regular,pure_instructions
        movq    %rdi, %rax
        vmulsd  (%rsi), %xmm0, %xmm1
        vbroadcastsd    %xmm0, %ymm0
        vmulpd  8(%rsi), %ymm0, %ymm0
        vmovsd  %xmm1, (%rdi)
        vmovupd %ymm0, 8(%rdi)
        vzeroupper
        retq
        nop

I am not saying that contract is the actual cause; I am saying it could be. Or one of the other fast-math flags introduced. Or something else. But it is not magic.

What do you mean by magic? SLP is not magic, so maybe you can also accept that LLVM cannot always deduce how to group unrolled expressions into optimal SIMD code. Here's the benchmark without flags, with @btime (just for you :-P)

➜  ForwardDiff git:(myb/vec) ✗ julia --startup-file=no ~/pollu.jl
  0.979201 seconds (4.80 M allocations: 243.740 MiB, 13.99% gc time, 99.99% compilation time)
  833.165 ns (0 allocations: 0 bytes)
➜  ForwardDiff git:(myb/vec) ✗ julia --startup-file=no ~/pollu.jl
  1.007542 seconds (4.80 M allocations: 243.740 MiB, 14.13% gc time, 99.99% compilation time)
  842.387 ns (0 allocations: 0 bytes)
➜  ForwardDiff git:(myb/vec) ✗ julia --startup-file=no ~/pollu.jl
  0.991858 seconds (4.80 M allocations: 243.740 MiB, 13.58% gc time, 99.99% compilation time)
  773.643 ns (0 allocations: 0 bytes)
➜  ForwardDiff git:(master) julia --startup-file=no ~/pollu.jl
  1.106445 seconds (4.68 M allocations: 241.588 MiB, 20.34% gc time, 99.99% compilation time)
  902.947 ns (0 allocations: 0 bytes)
➜  ForwardDiff git:(master) julia --startup-file=no ~/pollu.jl
  0.952378 seconds (4.68 M allocations: 241.588 MiB, 13.72% gc time, 99.99% compilation time)
  903.885 ns (0 allocations: 0 bytes)
➜  ForwardDiff git:(master) julia --startup-file=no ~/pollu.jl
  0.968903 seconds (4.68 M allocations: 241.588 MiB, 13.50% gc time, 99.99% compilation time)
  921.333 ns (0 allocations: 0 bytes)

Indeed, I have a noisy computer. But the speedup is still there.

@KristofferC
Collaborator

KristofferC commented Nov 19, 2021

I get:

Ok, I had

❯ cat LocalPreferences.toml
[ForwardDiff]
nansafe_mode = true

locally, which apparently trashes the whole thing.

Here's the benchmark without flags, with @btime (just for you :-P)

Thanks, just look how easy that was to read :).

But the speedup is still there.

Yes, let's figure out why :)

@KristofferC
Collaborator

With this branch, locally I get:

❯ julia --project ~/JuliaTests/fdiffbench.jl
  0.770050 seconds (4.75 M allocations: 260.031 MiB, 7.25% gc time, 99.99% compilation time)
  995.077 ns (0 allocations: 0 bytes)

If I remove the changes to div_tuple_by_scalar I get:

❯ julia --project ~/JuliaTests/fdiffbench.jl
  0.743958 seconds (4.75 M allocations: 260.048 MiB, 7.28% gc time, 99.99% compilation time)
  875.692 ns (0 allocations: 0 bytes)

Can you repro that?

@YingboMa
Member Author

I cannot observe the difference:

➜  ForwardDiff git:(myb/vec) julia --startup-file=no ~/pollu.jl
  1.128815 seconds (4.80 M allocations: 243.678 MiB, 13.70% gc time, 99.99% compilation time)
  827.237 ns (0 allocations: 0 bytes)
➜  ForwardDiff git:(myb/vec) ✗ julia --startup-file=no ~/pollu.jl
  0.971470 seconds (4.80 M allocations: 243.692 MiB, 13.62% gc time, 99.99% compilation time)
  831.758 ns (0 allocations: 0 bytes)

In fact, I changed the div function to

    return tupexpr(i -> :(println("Hello"); tup[$i] / x), N)

and nothing was printed, so I am pretty sure it won't change anything :-p

@KristofferC
Collaborator

KristofferC commented Nov 19, 2021

and nothing was printed, so I am pretty sure it won't change anything :-p

Maybe it is some code-layout effect then (https://easyperf.net/blog/2018/01/18/Code_alignment_issues), because it is quite consistent for me; I keep doing it over and over :P Or it was noise. 🤷‍♂️

Do you get an improvement from the fast-math annotations that are added in the LLVM IR?

@KristofferC
Collaborator

KristofferC commented Nov 19, 2021

I seem to get the same performance (and the same number of instructions) using:

using SIMD

function scale_tuple(tup::NTuple{N}, x) where N
    return @fastmath Tuple(Vec(tup) * x)
end

function div_tuple_by_scalar(tup::NTuple{N}, x) where N
    return @fastmath Tuple(Vec(tup) / x)
end

function add_tuples(a::NTuple{N}, b::NTuple{N})  where N
    return @fastmath Tuple(Vec(a) + Vec(b))
end

function sub_tuples(a::NTuple{N}, b::NTuple{N})  where N
    return @fastmath Tuple(Vec(a) - Vec(b))
end

function minus_tuple(tup::NTuple{N}) where N
    return @fastmath Tuple(-Vec(tup))
end

function mul_tuples(a::NTuple{N,V1}, b::NTuple{N,V2}, afactor::S1, bfactor::S2) where {N,V1,V2,S1,S2}
    af = Vec{N,V1}(afactor)
    bf = Vec{N,V2}(bfactor)
    @fastmath Tuple(muladd(Vec(a), af, bf * Vec(b)))
end

Removing @fastmath adds something like 10 instructions, so removing it is probably not worth it. Being able to use muladd in mul_tuples is nice, indeed.
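
As a quick illustrative check of mul_tuples above: it computes the weighted sum afactor .* a .+ bfactor .* b, which is the accumulation pattern the partials need (e.g. for the product rule):

mul_tuples((1.0, 2.0, 3.0, 4.0), (10.0, 20.0, 30.0, 40.0), 2.0, 0.5)
# == (2*1 + 0.5*10, 2*2 + 0.5*20, 2*3 + 0.5*30, 2*4 + 0.5*40)
# == (7.0, 14.0, 21.0, 28.0)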

KristofferC added three commits that referenced this pull request on Nov 19, 2021.
@KristofferC
Collaborator

I put up a PR so it is easier to compare: #557

KristofferC added two commits that referenced this pull request on Nov 19, 2021.
@KristofferC
Collaborator

Incorporated into #557.
