
Manual vectorization of the accumulation of Partials #555

Closed · wants to merge 4 commits

Conversation

YingboMa
Member

The original goal of the manual vectorization was to decrease compile time, as I thought the SLP vectorizer could take a long time to run. Unfortunately, it doesn't seem to improve compile time at all.

BUT!!! Just when I was about to stash my changes, I found that this improves the runtime by about 20% for the Jacobian computation of a chemical differential equation system.
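
For context, the Partials of a Dual are just a wrapped NTuple, so "the accumulation of Partials" boils down to element-wise tuple arithmetic like the following (illustrative snippet; the PR replaces the generated scalar tuple code behind these operations with explicit SIMD):

using ForwardDiff

p = ForwardDiff.Partials((1.0, 2.0, 3.0, 4.0))
q = ForwardDiff.Partials((0.5, 0.5, 0.5, 0.5))

p + q    # element-wise addition of the underlying tuples
2.0 * p  # element-wise scaling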

Master: [benchmark screenshot]

This branch: [benchmark screenshot]

The benchmark script:

using DiffEqBase, ForwardDiff, BenchmarkTools
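
# Rate constants of the 20-species POLLU air-pollution chemistry problem (25 reactions)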
const k1=.35e0
const k2=.266e2
const k3=.123e5
const k4=.86e-3
const k5=.82e-3
const k6=.15e5
const k7=.13e-3
const k8=.24e5
const k9=.165e5
const k10=.9e4
const k11=.22e-1
const k12=.12e5
const k13=.188e1
const k14=.163e5
const k15=.48e7
const k16=.35e-3
const k17=.175e-1
const k18=.1e9
const k19=.444e12
const k20=.124e4
const k21=.21e1
const k22=.578e1
const k23=.474e-1
const k24=.178e4
const k25=.312e1

function rhs!(dy,y,p,t)
    @inbounds begin
        r1  = k1 *y[1]
        r2  = k2 *y[2]*y[4]
        r3  = k3 *y[5]*y[2]
        r4  = k4 *y[7]
        r5  = k5 *y[7]
        r6  = k6 *y[7]*y[6]
        r7  = k7 *y[9]
        r8  = k8 *y[9]*y[6]
        r9  = k9 *y[11]*y[2]
        r10 = k10*y[11]*y[1]
        r11 = k11*y[13]
        r12 = k12*y[10]*y[2]
        r13 = k13*y[14]
        r14 = k14*y[1]*y[6]
        r15 = k15*y[3]
        r16 = k16*y[4]
        r17 = k17*y[4]
        r18 = k18*y[16]
        r19 = k19*y[16]
        r20 = k20*y[17]*y[6]
        r21 = k21*y[19]
        r22 = k22*y[19]
        r23 = k23*y[1]*y[4]
        r24 = k24*y[19]*y[1]
        r25 = k25*y[20]

        dy[1]  = -r1-r10-r14-r23-r24+
        r2+r3+r9+r11+r12+r22+r25
        dy[2]  = -r2-r3-r9-r12+r1+r21
        dy[3]  = -r15+r1+r17+r19+r22
        dy[4]  = -r2-r16-r17-r23+r15
        dy[5]  = -r3+r4+r4+r6+r7+r13+r20
        dy[6]  = -r6-r8-r14-r20+r3+r18+r18
        dy[7]  = -r4-r5-r6+r13
        dy[8]  = r4+r5+r6+r7
        dy[9]  = -r7-r8
        dy[10] = -r12+r7+r9
        dy[11] = -r9-r10+r8+r11
        dy[12] = r9
        dy[13] = -r11+r10
        dy[14] = -r13+r12
        dy[15] = r14
        dy[16] = -r18-r19+r16
        dy[17] = -r20
        dy[18] = r20
        dy[19] = -r21-r22-r24+r23+r25
        dy[20] = -r25+r24
    end
    nothing
end
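
# Wrap rhs! as a function of (du, u) with t and p fixed, the form ForwardDiff.jacobian! expects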
f! = DiffEqBase.UJacobianWrapper(rhs!, 0.0, nothing)

u0 = zeros(20)
u0[2]  = 0.2
u0[4]  = 0.04
u0[7]  = 0.1
u0[8]  = 0.3
u0[9]  = 0.01
u0[17] = 0.007

J = similar(u0, length(u0), length(u0))
du = similar(u0)
cfg = ForwardDiff.JacobianConfig(f!, du, u0, ForwardDiff.Chunk(8))
#tinf = @snoopi_deep ForwardDiff.jacobian!(J, f!, du, u0, cfg)
@time ForwardDiff.jacobian!(J, f!, du, u0, cfg)
display(@benchmark ForwardDiff.jacobian!($J, $f!, $du, $u0, $cfg))
println()

@codecov-commenter

codecov-commenter commented Nov 19, 2021

Codecov Report

Merging #555 (e50d653) into master (0af523a) will increase coverage by 0.94%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #555      +/-   ##
==========================================
+ Coverage   85.07%   86.01%   +0.94%     
==========================================
  Files           9        9              
  Lines         844      901      +57     
==========================================
+ Hits          718      775      +57     
  Misses        126      126              
Impacted Files Coverage Δ
src/partials.jl 89.53% <100.00%> (+5.18%) ⬆️


@YingboMa
Member Author

With chunk size 4, there is about a 10% improvement.

Master: [benchmark screenshot]

This branch: [benchmark screenshot]

@KristofferC
Collaborator

KristofferC commented Nov 19, 2021

Is the speed improvement here just from the injected fastmath flags? For example, ForwardDiff already seems to auto-vectorize OK:

julia> using ForwardDiff

julia> d = ForwardDiff.Dual(1.0, 2.0, 3.0, 4.0, 5.0);

julia> @code_native d * 3.0
        .section        __TEXT,__text,regular,pure_instructions
; ┌ @ dual.jl:144 within `*'
        movq    %rdi, %rax
; │ @ dual.jl:243 within `*' @ float.jl:332
        vbroadcastsd    %xmm0, %ymm1
        vmulpd  (%rsi), %ymm1, %ymm1
...

julia> @code_native d + d
        .section        __TEXT,__text,regular,pure_instructions
; ┌ @ dual.jl:139 within `+'
        movq    %rdi, %rax
; │ @ dual.jl:422 within `+' @ float.jl:326
        vmovupd (%rsi), %ymm0
        vaddpd  (%rdx), %ymm0, %ymm0
...

Did you try just slapping a @fastmath on the partial computations?
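
Concretely, something like this on the tuple kernels in partials.jl (sketch only; the generated-function machinery is replaced by plain ntuple and the names are suffixed to mark them as hypothetical):

# Hypothetical sketch: keep the plain tuple code and only add @fastmath, so the
# relaxed-FP flags reach LLVM without any hand-written IR.
@inline add_tuples_fm(a::NTuple{N,T}, b::NTuple{N,T}) where {N,T} =
    @fastmath ntuple(i -> a[i] + b[i], Val(N))

@inline scale_tuple_fm(tup::NTuple{N,T}, x::T) where {N,T} =
    @fastmath ntuple(i -> tup[i] * x, Val(N))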

I think before this can be merged we need to figure out which optimization LLVM misses with the normal tuples (or what prevents it from performing that optimization) and why going to llvmcall is strictly needed. Also, if this is really required, wouldn't SIMD.jl be easier to use than hand-rolling our own LLVM SIMD everywhere?

Also, please use @btime unless the histogram and all the other numbers are actually useful (which they aren't here, since we just want the value closest to the truth, which the minimum already shows best).

@YingboMa
Member Author

YingboMa commented Nov 19, 2021

For example, ForwardDiff already seems to auto-vectorize OK:

@code_native on a small expression is highly misleading. How well the SLP vectorizer performs depends on the context: if that code is inlined into some complex code with indexing and such, it won't produce the same code quality.

Is the speed improvement here just from the injected fastmath flags?

It is not:

➜  ForwardDiff git:(master) julia --math-mode=fast --startup-file=no ~/pollu.jl
  1.289900 seconds (4.68 M allocations: 241.667 MiB, 18.83% gc time, 99.99% compilation time)
BenchmarkTools.Trial: 10000 samples with 73 evaluations.
 Range (min … max):  870.356 ns …   3.028 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     890.062 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   911.691 ns ± 113.366 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   █▇▃▂▁▂▂  ▁                                                   ▁
  ████████▇██▇▇▇▆▆▆▅▅▆▃▄▅▆▅▅▅▄▄▄▃▃▅▃▄▄▅▅▅▅▄▄▃▄▅▄▃▄▂▃▂▂▃▅▂▃▃▂▄▃▅ █
  870 ns        Histogram: log(frequency) by time       1.47 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
➜  ForwardDiff git:(master) julia --math-mode=fast --startup-file=no ~/pollu.jl
  1.137902 seconds (4.68 M allocations: 241.667 MiB, 13.23% gc time, 99.99% compilation time)
BenchmarkTools.Trial: 10000 samples with 71 evaluations.
 Range (min … max):  882.451 ns …   3.489 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     895.127 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   918.756 ns ± 126.937 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇█▄▃▁▁▂  ▁                                                    ▁
  ███████▇██▇▆▆▅▇▅▄▆▄▃▅▅▅▅▂▄▃▄▅▃▄▅▅▅▅▄▅▄▄▃▃▄▄▄▄▄▃▃▄▄▄▄▃▃▃▄▃▃▃▄▃ █
  882 ns        Histogram: log(frequency) by time       1.53 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Also, please use @btime unless the histogram and all the other numbers are actually useful (which they aren't here, since we just want the value closest to the truth, which the minimum already shows best).

Not necessarily. If you look at the histogram, most of the samples are nowhere near the min time. Why would you say the min is closest to the truth? If anything, it's misleading to believe that this function would really run that fast in a real-world setting. Also, since we need to make a cost-benefit analysis here, having more information certainly doesn't hurt. (Also, I like those colors, hence the screenshots. I posted text this time if you prefer that.) Further, I am not sure that changes in the native code don't shift the min/mean/median in unintuitive ways. For instance, maybe the min time is better, yet the mean is much worse, leading to an overall pessimization in real-world use cases.

Also, instead of hand-rolling our own LLVM SIMD everywhere, if this is really required, wouldn't SIMD.jl be easier to use?

I first tried VectorizationBase, but that leads to compile-time regressions. I think this is simple enough that we should just emit the IR. Especially because ForwardDiff sits so low in the dependency chain, we need to be careful about compile-time regressions.
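
For concreteness, "just emit the IR" means something along these lines (a minimal sketch, not the PR's actual code; the real generated functions in src/partials.jl handle arbitrary widths and element types):

# Hypothetical sketch: a 4-wide tuple add written directly as LLVM IR.
# NTuple{4,VecElement{Float64}} lowers to the LLVM vector type <4 x double>.
const Vec4 = NTuple{4,VecElement{Float64}}

@inline function vadd(a::Vec4, b::Vec4)
    Base.llvmcall("""
        %3 = fadd fast <4 x double> %0, %1
        ret <4 x double> %3
        """, Vec4, Tuple{Vec4,Vec4}, a, b)
end

vadd(ntuple(i -> VecElement(Float64(i)), 4), ntuple(_ -> VecElement(0.5), 4))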

@KristofferC
Collaborator

KristofferC commented Nov 19, 2021

@code_native on a small expression is highly misleading. How well the SLP vectorizer performs depends on the context: if that code is inlined into some complex code with indexing and such, it won't produce the same code quality.

The end result is still the optimization of the sum of the parts. For example, the contract fast-math flag used here could perhaps make a difference when optimizing the whole.

Not necessarily. If you look at the histogram, most of the samples are nowhere near the min time. Why would you say the min is closest to the truth?

Yes, because your computer has noise, which is always additive. The fastest run is the one with the least noise. We are not writing an HTTP server here, where there is scheduling and we care about percentiles etc. The mode of the samples is close to the min (and the more samples you take, the closer the mode moves toward the min).
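
For reference, BenchmarkTools exposes all of these estimators on the returned Trial, so none of this requires the full histogram printout; an illustrative snippet, not specific to this PR:

using BenchmarkTools, Statistics

t = @benchmark sum($(rand(1000)))   # stand-in workload, illustrative only
minimum(t), median(t), mean(t)      # the estimators discussed above

# Given two trials (say one per branch), compare them by their minima:
# judge(minimum(t_new), minimum(t_old)) classifies the change as
# :improvement / :invariant / :regression.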

Also, I like those colors, hence the screenshots

I agree that colors are good if they show something useful.

@YingboMa
Member Author

The end result is still the optimization of the sum of the parts. For example, the contract fast-math flag used here could perhaps make a difference when optimizing the whole.

That's not true. If that were the case, --math-mode=fast would bring the master branch up to this branch's speed as well. If you actually look at the native code for the rhs function, the manual vectorization is better. Also, the speedup is demonstrated empirically, so I don't see the dispute. SLP objectively produces worse code when there's complex code surrounding the unrolled part, as shown in the benchmark.

Maybe you can also run the benchmark to see if there's a speedup.

Yes, because your computer has noise, which is always additive. The fastest run is the one with the least noise. We are not writing an HTTP server here, where there is scheduling and we care about percentiles etc. The mode of the samples is close to the min (and the more samples you take, the closer the mode moves toward the min).

ForwardDiff.jl doesn't run on perfect computers. I don't see how you can prove that changes in the native code cannot produce the pathological cases I mentioned here:

Further, I am not sure that changes in the native code don't shift the min/mean/median in unintuitive ways. For instance, maybe the min time is better, yet the mean is much worse, leading to an overall pessimization in real-world use cases.

@KristofferC
Collaborator

KristofferC commented Nov 19, 2021

That's not true. If that were the case, --math-mode=fast would bring the master branch up to this branch's speed as well. If you actually look at the native code for the rhs function, the manual vectorization is better.

I am not saying that contract is the actual cause; I am saying it could be. Or one of the other fast-math flags introduced. Or something else. But it is not magic.

ForwardDiff.jl doesn't run on perfect computers. I don't see how you can prove that changes in the native code cannot produce the pathological cases I mentioned here.

See https://youtu.be/vrfYLlR8X8k?t=915 and the following ~10 min. No one cares about the particular noise of your computer; we want to benchmark the algorithm as well as we can, and that's the most unbiased result. Specifically, https://youtu.be/vrfYLlR8X8k?t=1487


I am getting some really strange results. If I run with the local ForwardDiff:

~/JuliaPkgs/ForwardDiff.jl tags/v0.10.23*> julia -q --project

julia> using ForwardDiff

julia> d = ForwardDiff.Dual(1.0, 2.0, 3.0, 4.0, 5.0);

julia> @code_native d * 3.0
        .text
; ┌ @ dual.jl:144 within `*'
        movq    %rdi, %rax
; │ @ dual.jl:243 within `*' @ float.jl:332
        vmovsd  (%rsi), %xmm2                   # xmm2 = mem[0],zero
; │ @ dual.jl:244 within `*' @ partials.jl:84 @ partials.jl:0
        vmovsd  8(%rsi), %xmm1                  # xmm1 = mem[0],zero
; │ @ dual.jl:244 within `*' @ partials.jl:84 @ partials.jl:95
; │┌ @ float.jl:452 within `isfinite'
; ││┌ @ float.jl:329 within `-'
        vsubsd  %xmm0, %xmm0, %xmm3
        vxorpd  %xmm4, %xmm4, %xmm4
; ││└
; ││┌ @ float.jl:401 within `==' @ float.jl:365
        vucomisd        %xmm4, %xmm3
; │└└
; │ @ dual.jl:244 within `*' @ partials.jl:84 @ partials.jl:0
        vmovsd  16(%rsi), %xmm3                 # xmm3 = mem[0],zero
; │ @ dual.jl:244 within `*' @ partials.jl:84 @ partials.jl:95
        jne     L33
        jnp     L72
L33:
        vucomisd        %xmm4, %xmm1
        jne     L72
        jp      L72
; │ @ dual.jl:244 within `*' @ partials.jl:84 @ partials.jl:0
        vxorpd  %xmm5, %xmm5, %xmm5
; │ @ dual.jl:244 within `*' @ partials.jl:84 @ partials.jl:95
        vucomisd        %xmm5, %xmm3
        jne     L72
        jp      L72
; │┌ @ partials.jl:37 within `iszero'
; ││┌ @ partials.jl:167 within `iszero_tuple'
; │││┌ @ partials.jl:172 within `macro expansion'
; ││││┌ @ float.jl:365 within `=='
        vmovsd  24(%rsi), %xmm4                 # xmm4 = mem[0],zero
....

If I run with the released version:

(@v1.6) pkg> activate --temp
  Activating new environment at `/tmp/jl_p9UUFZ/Project.toml`

(jl_p9UUFZ) pkg> add ForwardDiff@0.10.23

julia> using ForwardDiff

julia> d = ForwardDiff.Dual(1.0, 2.0, 3.0, 4.0, 5.0);

julia> @code_native d * 3.0
        .text
; ┌ @ dual.jl:144 within `*'
        movq    %rdi, %rax
; │ @ dual.jl:243 within `*' @ float.jl:332
        vbroadcastsd    %xmm0, %ymm1
        vmulpd  (%rsi), %ymm1, %ymm1
; │ @ dual.jl:244 within `*' @ partials.jl:84 @ partials.jl:111
; │┌ @ partials.jl:200 within `scale_tuple'
; ││┌ @ partials.jl:157 within `macro expansion'
; │││┌ @ float.jl:332 within `*'
        vmulsd  32(%rsi), %xmm0, %xmm0
; │└└└
; │ @ dual.jl:244 within `*'
        vmovupd %ymm1, (%rdi)
        vmovsd  %xmm0, 32(%rdi)
        vzeroupper
        retq
        nop
; └

Why is the local one such trash? Can you repro that...?

@YingboMa
Member Author

YingboMa commented Nov 19, 2021

I get:

julia> @code_native debuginfo=:none d * 3.0
        .section        __TEXT,__text,regular,pure_instructions
        movq    %rdi, %rax
        vmulsd  (%rsi), %xmm0, %xmm1
        vbroadcastsd    %xmm0, %ymm0
        vmulpd  8(%rsi), %ymm0, %ymm0
        vmovsd  %xmm1, (%rdi)
        vmovupd %ymm0, 8(%rdi)
        vzeroupper
        retq
        nop

I am not saying that contract is the actual cause; I am saying it could be. Or one of the other fast-math flags introduced. Or something else. But it is not magic.

What do you mean by magic? SLP is not magic, so maybe you can also accept that LLVM cannot always deduce how to group unrolled expressions into optimal SIMD code. Here's the benchmark without flags, with @btime (just for you :-P)

➜  ForwardDiff git:(myb/vec) ✗ julia --startup-file=no ~/pollu.jl
  0.979201 seconds (4.80 M allocations: 243.740 MiB, 13.99% gc time, 99.99% compilation time)
  833.165 ns (0 allocations: 0 bytes)
➜  ForwardDiff git:(myb/vec) ✗ julia --startup-file=no ~/pollu.jl
  1.007542 seconds (4.80 M allocations: 243.740 MiB, 14.13% gc time, 99.99% compilation time)
  842.387 ns (0 allocations: 0 bytes)
➜  ForwardDiff git:(myb/vec) ✗ julia --startup-file=no ~/pollu.jl
  0.991858 seconds (4.80 M allocations: 243.740 MiB, 13.58% gc time, 99.99% compilation time)
  773.643 ns (0 allocations: 0 bytes)
➜  ForwardDiff git:(master) julia --startup-file=no ~/pollu.jl
  1.106445 seconds (4.68 M allocations: 241.588 MiB, 20.34% gc time, 99.99% compilation time)
  902.947 ns (0 allocations: 0 bytes)
➜  ForwardDiff git:(master) julia --startup-file=no ~/pollu.jl
  0.952378 seconds (4.68 M allocations: 241.588 MiB, 13.72% gc time, 99.99% compilation time)
  903.885 ns (0 allocations: 0 bytes)
➜  ForwardDiff git:(master) julia --startup-file=no ~/pollu.jl
  0.968903 seconds (4.68 M allocations: 241.588 MiB, 13.50% gc time, 99.99% compilation time)
  921.333 ns (0 allocations: 0 bytes)

Indeed, I have a noisy computer. But the speedup is still there.

@KristofferC
Collaborator

KristofferC commented Nov 19, 2021

I get:

Ok, I had

❯ cat LocalPreferences.toml
[ForwardDiff]
nansafe_mode = true

locally, which apparently trashes the whole thing.

Here's the benchmark without flags, with @btime (just for you :-P)

Thanks, just look how easy that was to read :).

But the speedup is still there.

Yes, let's figure out why :)

@KristofferC
Collaborator

With this branch, locally I get:

❯ julia --project ~/JuliaTests/fdiffbench.jl
  0.770050 seconds (4.75 M allocations: 260.031 MiB, 7.25% gc time, 99.99% compilation time)
  995.077 ns (0 allocations: 0 bytes)

If I remove the changes to div_tuple_by_scalar I get:

❯ julia --project ~/JuliaTests/fdiffbench.jl
  0.743958 seconds (4.75 M allocations: 260.048 MiB, 7.28% gc time, 99.99% compilation time)
  875.692 ns (0 allocations: 0 bytes)

Can you repro that?

@YingboMa
Member Author

I cannot observe the difference:

➜  ForwardDiff git:(myb/vec) julia --startup-file=no ~/pollu.jl
  1.128815 seconds (4.80 M allocations: 243.678 MiB, 13.70% gc time, 99.99% compilation time)
  827.237 ns (0 allocations: 0 bytes)
➜  ForwardDiff git:(myb/vec) ✗ julia --startup-file=no ~/pollu.jl
  0.971470 seconds (4.80 M allocations: 243.692 MiB, 13.62% gc time, 99.99% compilation time)
  831.758 ns (0 allocations: 0 bytes)

In fact, I changed the div function to

    return tupexpr(i -> :(println("Hello"); tup[$i] / x), N)

and nothing was printed, so I am pretty sure it won't change anything :-p

@KristofferC
Collaborator

KristofferC commented Nov 19, 2021

and nothing was printed, so I am pretty sure it won't change anything :-p

Maybe it is some code-layout effect then (https://easyperf.net/blog/2018/01/18/Code_alignment_issues), because it is quite consistent for me; I keep doing it over and over :P Or it was noise. 🤷‍♂️

Do you get an improvement from the fast-math annotations that are added in the LLVM IR?

@KristofferC
Collaborator

KristofferC commented Nov 19, 2021

I seem to get the same performance (and the same number of instructions) using:

using SIMD

function scale_tuple(tup::NTuple{N}, x) where N
    return @fastmath Tuple(Vec(tup) * x)
end

function div_tuple_by_scalar(tup::NTuple{N}, x) where N
    return @fastmath Tuple(Vec(tup) / x)
end

function add_tuples(a::NTuple{N}, b::NTuple{N})  where N
    return @fastmath Tuple(Vec(a) + Vec(b))
end

function sub_tuples(a::NTuple{N}, b::NTuple{N})  where N
    return @fastmath Tuple(Vec(a) - Vec(b))
end

function minus_tuple(tup::NTuple{N}) where N
    return @fastmath Tuple(-Vec(tup))
end

function mul_tuples(a::NTuple{N,V1}, b::NTuple{N,V2}, afactor::S1, bfactor::S2) where {N,V1,V2,S1,S2}
    af = Vec{N,V1}(afactor)
    bf = Vec{N,V2}(bfactor)
    @fastmath Tuple(muladd(Vec(a), af, bf * Vec(b)))
end

Removing @fastmath adds something like 10 instructions, so removing it is probably not worth it. Being able to use muladd in mul_tuples is nice, indeed.
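
As a quick illustrative check of mul_tuples above: it computes the weighted sum afactor .* a .+ bfactor .* b, which is the accumulation pattern the partials need (e.g. for the product rule):

mul_tuples((1.0, 2.0, 3.0, 4.0), (10.0, 20.0, 30.0, 40.0), 2.0, 0.5)
# == (2*1 + 0.5*10, 2*2 + 0.5*20, 2*3 + 0.5*30, 2*4 + 0.5*40)
# == (7.0, 14.0, 21.0, 28.0)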

KristofferC added three commits that referenced this pull request on Nov 19, 2021.
@KristofferC
Collaborator

I put up a PR so it is easier to compare: #557

KristofferC added two commits that referenced this pull request on Nov 19, 2021.
@KristofferC
Collaborator

Incorporated into #557.
