
Add some precompiles #29

Merged: 1 commit merged into JuliaSIMD:master from teh/precompile on Dec 24, 2020

Conversation

@timholy (Contributor) commented Dec 24, 2020

Together with a couple of changes to LoopVectorization,
this shaves about one second off the initial `mygemmavx!` demo.

There may be more methods that could be added, but this is a start.
Overall, VectorizationBase is the only substantive source of
inference time in that demo.
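For readers unfamiliar with the pattern, precompile directives live in a file like src/precompile.jl. The following is only an illustrative sketch of that pattern; the actual directives added here are quoted further down in the review thread.

# Illustrative sketch of a precompile file; not the exact contents of this PR.
function _precompile_()
    # Only emit directives while a package image is being generated.
    ccall(:jl_generating_output, Cint, ()) == 1 || return nothing
    precompile(vload_quote, (Type{Float64}, Type{Int64}, Symbol, Int, Int, Int, Int, Bool, Bool))
    # ...one `precompile` call per MethodInstance worth caching...
end
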
@codecov-io commented Dec 24, 2020

Codecov Report

Merging #29 (67105ae) into master (739e614) will decrease coverage by 0.64%.
The diff coverage is 92.30%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #29      +/-   ##
==========================================
- Coverage   56.17%   55.53%   -0.65%     
==========================================
  Files          28       28              
  Lines        2672     2685      +13     
==========================================
- Hits         1501     1491      -10     
- Misses       1171     1194      +23     
Impacted Files Coverage Δ
src/precompile.jl 86.66% <92.30%> (+36.66%) ⬆️
src/llvm_intrin/memory_addr.jl 52.70% <0.00%> (-3.19%) ⬇️
src/llvm_intrin/binary_ops.jl 53.33% <0.00%> (-1.67%) ⬇️
src/llvm_intrin/masks.jl 66.54% <0.00%> (-0.37%) ⬇️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 739e614...67105ae. Read the comment docs.

@timholy (Contributor, Author) commented Dec 24, 2020

For your amusement, here's the compiler-flamegraph that can be produced using the master branch of SnoopCompile:

[flamegraph screenshot]

(The mouse pointer seems to have been mis-captured, it was actually in the middle of the wide blue bar.)

"empty" spaces correspond to times when it's doing something other than inference. The thick stack on the right is what's left of inference needed for the final stage of codegen, mostly in VectorizationBase. As you pointed out on discourse, for LV inference time isn't dominant---and that's true even of the first run---mostly because you've done a good job of adding precompiles to LoopVectorization itself.

The "skinny" flames are mostly due to JuliaLang/julia#38983, but in fact they don't add up to much.

@timholy (Contributor, Author) commented Dec 24, 2020

xref JuliaSIMD/LoopVectorization.jl#180

@timholy (Contributor, Author) commented Dec 24, 2020

The remaining inference time, excluding things like

LoopVectorization._avx_!(::Val{(0, 0, 0, 4)}, ::Type{Tuple{:numericconstant, ...

which will never be practical to precompile, is around 1.9s. That's out of a total of ~10.5s on my machine for the entire benchmark, so there's still hope for more improvement, though we're definitely into diminishing returns from inference itself.

for I ∈ (Int8, UInt8, Int16, UInt16, Int32, UInt32, Int64, UInt64)
precompile(vload_quote, (Type{T}, Type{I}, Symbol, Int, Int, Int, Int, Bool, Bool))
end
# precompile(vfmadd, (Vec{4, T}, Vec{4, T}, Vec{4, T})) # doesn't "take" (too bad, this is expensive)
@chriselrod (Member) commented on the diff:

Any idea why it might not "take"?
Is there anything that can help?
The inference result for vfmadd(::Vec{W,T}, ::Vec{W,T}, ::Vec{W,T}) should trivially be Vec{W,T}.

Would defining the llvmcall functions to assert the return type (along the lines sketched below) help inference?

And, it also seems like inference should be easy for llvmcall(string_or_tuple, return_type, arg_types, args...) (i.e., return_type is the return type), so perhaps this is a Julia issue?
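For concreteness, here is a minimal, self-contained illustration with a hypothetical function name rather than anything from VectorizationBase: the second argument to `Base.llvmcall` is the declared return type, and the wrapper can assert it as well.

# Hypothetical example, not VectorizationBase code: the declared return type is
# passed to `Base.llvmcall` explicitly and asserted on the wrapper too.
@inline function add2(x::Float64)::Float64
    Base.llvmcall(
        """
        %y = fadd double %0, 2.000000e+00
        ret double %y
        """,
        Float64,          # declared return type
        Tuple{Float64},   # argument types
        x)
end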

@timholy (Contributor, Author) replied Dec 24, 2020

I wish I knew. There are still some weird things about precompilation (e.g., JuliaLang/julia#38951) that need attention. This would be a prime candidate because this one MethodInstance (and its callees) costs you something like 200ms, which is quite a lot. (It looks like even more than that on the flamegraph, but that's because of other bits of codegen happening in the middle while this call-tree is being inferred.)

@timholy (Contributor, Author) replied

Actually, maybe the fact that codegen runs while the callees are being inferred could be related? No clue, really, just grasping for explanations.

@chriselrod (Member) replied Dec 24, 2020

I still know very little about Julia's internals, so it's hard for me to speculate about what will or will not work.

E.g., if there's any inference short circuiting that could quickly tell the return type of a function like:

@inline function (muladd(v1::Vec{W,T}, v2::Vec{W,T}, v3::Vec{W,T})::Vec{W,T}) where {W,T}
    _muladd(v1,v2,v3)
end
# define `_muladd` as the current definition of `muladd`, wrapping `llvmcall_expr`
# the `vfmadd` family calls `muladd`

could it then avoid descending into `_muladd` and having to evaluate `llvmcall_expr` to infer the (annotated) return type?

@timholy (Contributor, Author) replied

Not sure. I would mostly just report this as an issue. I'm really hoping precompilation gets serious attention in Julia 1.7; we really are at the threshold of nuking the latency problem for many workloads.

@chriselrod (Member) left a review comment

Good to merge?

@chriselrod (Member) commented

The "skinny" flames are mostly due to JuliaLang/julia#38983, but in fact they don't add up to much.

Maybe I should run it to see what they're specializing on.
LoopVectorization and VectorizationBase make heavy use of ArrayInterface.StaticInt, but that should already be creating separate method instances for each constant value.
I'm curious which constant values are seeing lots of specializations. If they're integers (and they must be?), maybe using StaticInts there would at least give them separate entries in the method table and avoid repeated inference on the same values?
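To make that concrete, here is an illustrative sketch (a hypothetical function, not code from either package; it assumes an ArrayInterface version providing StaticInt, as referenced above) of how dispatching on StaticInt gives each constant value its own method-table entry:

using ArrayInterface: StaticInt

# The constant is part of the argument type, so each value gets its own
# MethodInstance with its own cacheable inference result.
unroll_by(x, ::StaticInt{N}) where {N} = ntuple(_ -> x, Val(N))

unroll_by(1.0, StaticInt(4))  # specializes unroll_by(::Float64, ::StaticInt{4})
unroll_by(1.0, StaticInt(8))  # specializes unroll_by(::Float64, ::StaticInt{8})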

@chriselrod (Member) commented Dec 24, 2020

I am curious whether a different design of the internals, like having a type-asserted `muladd` wrap `_muladd`, could improve the fairly heavy inference times we're seeing now, or whether there are any inference short circuits we could take advantage of, but that sort of restructuring would be a bigger change for another PR.

@chriselrod chriselrod merged commit 77caf46 into JuliaSIMD:master Dec 24, 2020
@timholy timholy deleted the teh/precompile branch December 24, 2020 17:33
@chriselrod (Member) commented

I just tried turning on the `vfmadd` precompile statement:

W = pick_vector_width(T) # the size we want to compile is type and CPU-dependent
precompile(vfmadd, (Vec{W, T}, Vec{W, T}, Vec{W, T}))  # doesn't "take" (too bad, this is expensive)

and got

julia> using MethodAnalysis, AbstractTrees

julia> using VectorizationBase
[ Info: Precompiling VectorizationBase [3d5dd08c-fd9d-11e8-17fa-ed2836048c2f]

julia> mi = methodinstance(VectorizationBase.vfmadd, (Vec{8,Float64}, Vec{8,Float64}, Vec{8,Float64}))
MethodInstance for VectorizationBase.vfmadd(::Vec{8, Float64}, ::Vec{8, Float64}, ::Vec{8, Float64})

So it seems to have taken?

@timholy (Contributor, Author) commented Jan 1, 2021

Nice! Also check whether it shows up in the flamegraph:

using SnoopCompile
tinf = @snoopi_deep whatever...
fg = flamegraph(tinf)
using ProfileView
ProfileView.view(fg)

and then hover over various stacks, working your way along the call chain to the point where it might appear.

@chriselrod (Member) commented

VectorizationBase's test suite doesn't seem to be a good choice, as it's dominated by broadcasts. I should probably try replacing them with masks just for the sake of speeding up the test suite.

The more important change for users, however, is that the llvmcall functions shouldn't themselves be overloads of methods from other libraries (e.g. Base); instead, those overloads should be defined as calls to the VectorizationBase functions. That should increase the amount of inference results getting saved.
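A rough sketch of that layout, with a toy Vec type and hypothetical definitions rather than the package's actual code: the real implementation lives on a function owned by the package, the Base method is a thin forwarder, and the precompile directive targets the package-owned method.

struct Vec{W,T}
    data::NTuple{W,T}
end

# Stand-in for the llvmcall-based kernel owned by the package.
vfmadd(a::Vec{W,T}, b::Vec{W,T}, c::Vec{W,T}) where {W,T} =
    Vec{W,T}(ntuple(i -> muladd(a.data[i], b.data[i], c.data[i]), Val(W)))

# Thin overload of the Base generic that just forwards to the owned function.
Base.muladd(a::Vec, b::Vec, c::Vec) = vfmadd(a, b, c)

# The directive targets the owned function, whose inference result can be
# cached with the package.
precompile(vfmadd, (Vec{8,Float64}, Vec{8,Float64}, Vec{8,Float64}))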

@timholy (Contributor, Author) commented Jan 2, 2021

Yeah, I often find it's nice to start with a "single thing" (e.g., a single @avxed function) and explore that. Test suites are a bit messy for this kind of analysis because they have their own costs. That said, once you get the hang of this, they can be quite informative.
