Add some precompiles #29
Conversation
Together with a couple of changes to LoopVectorization, this shaves about one second off the initial `mygemmavx!` demo. There may be more methods that could be added, but this is a start. Overall, VectorizationBase is the only substantive source of inference time in that demo.
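For context on the mechanism being used here: an explicit `precompile(f, argtypes)` call made while the package image is being generated caches the inference results for that signature. Below is a minimal sketch of the usual pattern; the module and function names are hypothetical placeholders, not VectorizationBase's actual layout.

```julia
# Minimal sketch of the usual precompile-directive pattern.
# `MyPkg`, `Kernel`, and `run!` are hypothetical placeholders, not VectorizationBase's code.
module MyPkg

struct Kernel end

run!(::Kernel, x::Vector{T}) where {T} = (x .+= one(T); x)  # stand-in for an expensive-to-infer function

function _precompile_()
    # Emit directives only while the package image is being generated.
    ccall(:jl_generating_output, Cint, ()) == 1 || return nothing
    for T in (Float32, Float64)
        precompile(run!, (Kernel, Vector{T}))
    end
end
_precompile_()

end # module
```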
Codecov Report

```diff
@@            Coverage Diff             @@
##           master      #29      +/-   ##
==========================================
- Coverage   56.17%   55.53%   -0.65%
==========================================
  Files          28       28
  Lines        2672     2685      +13
==========================================
- Hits         1501     1491      -10
- Misses       1171     1194      +23
```

Continue to review full report at Codecov.
For your amusement, here's the compiler flamegraph that can be produced using the master branch of SnoopCompile. (The mouse pointer was mis-captured; it was actually in the middle of the wide blue bar.) The "empty" spaces correspond to times when the process is doing something other than inference. The thick stack on the right is what's left of the inference needed for the final stage of codegen, mostly in VectorizationBase. As you pointed out on Discourse, inference time isn't dominant for LV, and that's true even of the first run, mostly because you've done a good job of adding precompiles to LoopVectorization itself. The "skinny" flames are mostly due to JuliaLang/julia#38983, but they don't add up to much.
The remaining inference time, excluding things like `LoopVectorization._avx_!(::Val{(0, 0, 0, 4)}, ::Type{Tuple{:numericconstant, ...` (which will never be practical to precompile), is around 1.9s. That's out of a total of ~10.5s on my machine for the entire benchmark, so there's still hope for more improvement, though we're definitely into diminishing returns from inference itself.
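As a rough sketch of how such per-method numbers can be extracted (assuming SnoopCompile's `@snoopi_deep`/`flatten` API; the workload call is a placeholder):

```julia
# Sketch: attributing inference time with SnoopCompile.
# `run_first_call()` is a placeholder for the first-use workload being timed.
using SnoopCompile

tinf = @snoopi_deep run_first_call()

itimes = flatten(tinf)        # one InferenceTiming per inferred MethodInstance,
                              # sorted by exclusive inference time
# The last entries are the most expensive; note that the ROOT entry measures
# time spent outside inference rather than in it.
foreach(println, itimes[end-9:end])
```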
```julia
for I ∈ (Int8, UInt8, Int16, UInt16, Int32, UInt32, Int64, UInt64)
    precompile(vload_quote, (Type{T}, Type{I}, Symbol, Int, Int, Int, Int, Bool, Bool))
end
# precompile(vfmadd, (Vec{4, T}, Vec{4, T}, Vec{4, T})) # doesn't "take" (too bad, this is expensive)
```
Any idea why it might not "take"? Is there anything that can help? The inference result for `vfmadd(::Vec{W,T}, ::Vec{W,T}, ::Vec{W,T})` should trivially be `Vec{W,T}`. Would defining the `llvmcall` functions to assert the return type in this manner help inference? And it also seems like inference should be easy for `llvmcall(string_or_tuple, return_type, arg_types, args...)` (i.e., `return_type` is the return type), so perhaps this is a Julia issue?
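To make the `return_type` point concrete, here is a generic `Base.llvmcall` illustration (not VectorizationBase's actual code): the return type is passed explicitly as the second argument, so in principle inference knows it without doing any work.

```julia
# Generic Base.llvmcall illustration (not VectorizationBase's code).
# The second argument, Int32, is the declared return type.
add1(x::Int32) = Base.llvmcall("""
    %2 = add i32 %0, 1
    ret i32 %2""", Int32, Tuple{Int32}, x)

add1(Int32(41))  # 42
```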
I wish I knew. There are still some weird things about precompilation (e.g., JuliaLang/julia#38951) that need attention. This would be a prime candidate, because this one `MethodInstance` (and its callees) costs you something like 200ms, which is quite a lot. (It looks like even more than that on the flamegraph, but that's because of other bits of codegen happening in the middle while this call tree is being inferred.)
Actually, maybe the fact that codegen runs while the callees are being inferred could be related? No clue, really, just grasping for explanations.
I still know very little about Julia's internals, so it's hard for me to speculate about what will or won't work. E.g., is there any inference short-circuiting that could quickly tell the return type of a function like

```julia
@inline function (muladd(v1::Vec{W,T}, v2::Vec{W,T}, v3::Vec{W,T})::Vec{W,T}) where {W,T}
    _muladd(v1, v2, v3)
end
# define `_muladd` as the current definition of `muladd`, wrapping `llvmcall_expr`
# (the `vfmadd` family calls `muladd`)
```

and thereby avoid descending into `_muladd` and having to evaluate `llvmcall_expr` to infer the (annotated) return type?
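One way to check what inference actually concludes for such an annotated wrapper is a quick `Base.return_types`/`@inferred` test; the functions below are hypothetical stand-ins, just to show the check itself.

```julia
# Hypothetical stand-ins to show the check; `_kernel` plays the role of the
# llvmcall_expr-based `_muladd`, and `wrapped` is the annotated wrapper.
using Test

_kernel(x) = x + 1.0
wrapped(x::Float64) = _kernel(x)::Float64

Base.return_types(wrapped, (Float64,))  # should report [Float64]
@inferred wrapped(1.0)                  # throws if the result type cannot be inferred
```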
Not sure. I would mostly just report this as an issue. I'm really hoping precompilation gets serious attention in Julia 1.7; we really are at the threshold of nuking the latency problem for many workloads.
Good to merge?
Maybe I should run something to see what they're specializing on.
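If the elided tool here is something along the lines of MethodAnalysis, listing the existing specializations might look like this (a sketch, not necessarily what was meant):

```julia
# Sketch: listing a function's existing specializations with MethodAnalysis.jl.
using MethodAnalysis, VectorizationBase

mis = methodinstances(VectorizationBase.vfmadd)   # every MethodInstance created so far
foreach(mi -> println(mi.specTypes), mis)         # the signature each one specializes on
```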
I am curious if a different design of the internals, like having type-asserted …
I just tried turning on

```julia
W = pick_vector_width(T) # the size we want to compile is type- and CPU-dependent
precompile(vfmadd, (Vec{W, T}, Vec{W, T}, Vec{W, T})) # doesn't "take" (too bad, this is expensive)
```

and got

```julia
julia> using MethodAnalysis, AbstractTrees

julia> using VectorizationBase
[ Info: Precompiling VectorizationBase [3d5dd08c-fd9d-11e8-17fa-ed2836048c2f]

julia> mi = methodinstance(VectorizationBase.vfmadd, (Vec{8,Float64}, Vec{8,Float64}, Vec{8,Float64}))
MethodInstance for VectorizationBase.vfmadd(::Vec{8, Float64}, ::Vec{8, Float64}, ::Vec{8, Float64})
```

So it seems to have taken?
Nice! Also check whether it shows up in the flamegraph:

```julia
using SnoopCompile
tinf = @snoopi_deep whatever...
fg = flamegraph(tinf)
using ProfileView
ProfileView.view(fg)
```

and then hover over various stacks, working your way along the call chain to the point where it might appear.
VectorizationBase's test suite doesn't seem to be a good choice, as it's dominated by broadcasts. I should probably try replacing them with masks, just for the sake of speeding up the test suite. The more important change for users, however, is that …
Yeah, I often find it's nice to start with a "single thing" (e.g., a single …