Add some precompiles #29
Conversation
Together with a couple of changes to LoopVectorization, this shaves about one second off the initial `mygemmavx!` demo. There may be more methods that could be added, but this is a start. Overall, VectorizationBase is the only substantive source of inference time in that demo.
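For context on the mechanism being used here: an explicit `precompile(f, argtypes)` call made while the package image is being generated caches the inference results for that signature. Below is a minimal sketch of the usual pattern; the module and function names are hypothetical placeholders, not VectorizationBase's actual layout.

```julia
# Minimal sketch of the usual precompile-directive pattern.
# `MyPkg`, `Kernel`, and `run!` are hypothetical placeholders, not VectorizationBase's code.
module MyPkg

struct Kernel end

run!(::Kernel, x::Vector{T}) where {T} = (x .+= one(T); x)  # stand-in for an expensive-to-infer function

function _precompile_()
    # Emit directives only while the package image is being generated.
    ccall(:jl_generating_output, Cint, ()) == 1 || return nothing
    for T in (Float32, Float64)
        precompile(run!, (Kernel, Vector{T}))
    end
end
_precompile_()

end # module
```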
Codecov Report

```diff
@@            Coverage Diff             @@
##           master      #29      +/-   ##
==========================================
- Coverage   56.17%   55.53%   -0.65%
==========================================
  Files          28       28
  Lines        2672     2685      +13
==========================================
- Hits         1501     1491      -10
- Misses       1171     1194      +23
```

Continue to review full report at Codecov.
For your amusement, here's the compiler flamegraph that can be produced using the master branch of SnoopCompile. (The mouse pointer was mis-captured; it was actually in the middle of the wide blue bar.) The "empty" spaces correspond to times when the process is doing something other than inference. The thick stack on the right is what's left of the inference needed for the final stage of codegen, mostly in VectorizationBase. As you pointed out on Discourse, inference time isn't dominant for LV, and that's true even of the first run, mostly because you've done a good job of adding precompiles to LoopVectorization itself. The "skinny" flames are mostly due to JuliaLang/julia#38983, but they don't add up to much.
The remaining inference time, excluding things like `LoopVectorization._avx_!(::Val{(0, 0, 0, 4)}, ::Type{Tuple{:numericconstant, ...` (which will never be practical to precompile), is around 1.9s. That's out of a total of ~10.5s on my machine for the entire benchmark, so there's still hope for more improvement, though we're definitely into diminishing returns from inference itself.
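As a rough sketch of how such per-method numbers can be extracted (assuming SnoopCompile's `@snoopi_deep`/`flatten` API; the workload call is a placeholder):

```julia
# Sketch: attributing inference time with SnoopCompile.
# `run_first_call()` is a placeholder for the first-use workload being timed.
using SnoopCompile

tinf = @snoopi_deep run_first_call()

itimes = flatten(tinf)        # one InferenceTiming per inferred MethodInstance,
                              # sorted by exclusive inference time
# The last entries are the most expensive; note that the ROOT entry measures
# time spent outside inference rather than in it.
foreach(println, itimes[end-9:end])
```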
```julia
for I ∈ (Int8, UInt8, Int16, UInt16, Int32, UInt32, Int64, UInt64)
    precompile(vload_quote, (Type{T}, Type{I}, Symbol, Int, Int, Int, Int, Bool, Bool))
end
# precompile(vfmadd, (Vec{4, T}, Vec{4, T}, Vec{4, T})) # doesn't "take" (too bad, this is expensive)
```
Any idea why it might not "take"? Is there anything that can help? The inference result for `vfmadd(::Vec{W,T}, ::Vec{W,T}, ::Vec{W,T})` should trivially be `Vec{W,T}`. Would defining the `llvmcall` functions to assert the return type in this manner help inference? And it also seems like inference should be easy for `llvmcall(string_or_tuple, return_type, arg_types, args...)` (i.e., `return_type` is the return type), so perhaps this is a Julia issue?
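To make the `return_type` point concrete, here is a generic `Base.llvmcall` illustration (not VectorizationBase's actual code): the return type is passed explicitly as the second argument, so in principle inference knows it without doing any work.

```julia
# Generic Base.llvmcall illustration (not VectorizationBase's code).
# The second argument, Int32, is the declared return type.
add1(x::Int32) = Base.llvmcall("""
    %2 = add i32 %0, 1
    ret i32 %2""", Int32, Tuple{Int32}, x)

add1(Int32(41))  # 42
```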
I wish I knew. There are still some weird things about precompilation (e.g., JuliaLang/julia#38951) that need attention. This would be a prime candidate, because this one `MethodInstance` (and its callees) costs you something like 200ms, which is quite a lot. (It looks like even more than that on the flamegraph, but that's because of other bits of codegen happening in the middle while this call tree is being inferred.)
Actually, maybe the fact that codegen runs while the callees are being inferred could be related? No clue, really, just grasping for explanations.
I still know very little about Julia's internals, so it's hard for me to speculate about what will or won't work. E.g., is there any inference short-circuiting that could quickly tell the return type of a function like

```julia
@inline function (muladd(v1::Vec{W,T}, v2::Vec{W,T}, v3::Vec{W,T})::Vec{W,T}) where {W,T}
    _muladd(v1, v2, v3)
end
# define `_muladd` as the current definition of `muladd`, wrapping `llvmcall_expr`
# (the `vfmadd` family calls `muladd`)
```

and thereby avoid descending into `_muladd` and having to evaluate `llvmcall_expr` to infer the (annotated) return type?
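One way to check what inference actually concludes for such an annotated wrapper is a quick `Base.return_types`/`@inferred` test; the functions below are hypothetical stand-ins, just to show the check itself.

```julia
# Hypothetical stand-ins to show the check; `_kernel` plays the role of the
# llvmcall_expr-based `_muladd`, and `wrapped` is the annotated wrapper.
using Test

_kernel(x) = x + 1.0
wrapped(x::Float64) = _kernel(x)::Float64

Base.return_types(wrapped, (Float64,))  # should report [Float64]
@inferred wrapped(1.0)                  # throws if the result type cannot be inferred
```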
Not sure. I would mostly just report this as an issue. I'm really hoping precompilation gets serious attention in Julia 1.7; we really are at the threshold of nuking the latency problem for many workloads.
Good to merge?
Maybe I should run something to see what they're specializing on.
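If the elided tool here is something along the lines of MethodAnalysis, listing the existing specializations might look like this (a sketch, not necessarily what was meant):

```julia
# Sketch: listing a function's existing specializations with MethodAnalysis.jl.
using MethodAnalysis, VectorizationBase

mis = methodinstances(VectorizationBase.vfmadd)   # every MethodInstance created so far
foreach(mi -> println(mi.specTypes), mis)         # the signature each one specializes on
```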
I am curious if a different design of the internals, like having type-asserted …
I just tried turning on

```julia
W = pick_vector_width(T) # the size we want to compile is type- and CPU-dependent
precompile(vfmadd, (Vec{W, T}, Vec{W, T}, Vec{W, T})) # doesn't "take" (too bad, this is expensive)
```

and got

```julia
julia> using MethodAnalysis, AbstractTrees

julia> using VectorizationBase
[ Info: Precompiling VectorizationBase [3d5dd08c-fd9d-11e8-17fa-ed2836048c2f]

julia> mi = methodinstance(VectorizationBase.vfmadd, (Vec{8,Float64}, Vec{8,Float64}, Vec{8,Float64}))
MethodInstance for VectorizationBase.vfmadd(::Vec{8, Float64}, ::Vec{8, Float64}, ::Vec{8, Float64})
```

So it seems to have taken?
Nice! Also check whether it shows up in the flamegraph:

```julia
using SnoopCompile
tinf = @snoopi_deep whatever...
fg = flamegraph(tinf)
using ProfileView
ProfileView.view(fg)
```

and then hover over various stacks, working your way along the call chain to the point where it might appear.
VectorizationBase's test suite doesn't seem to be a good choice, as it's dominated by broadcasts. I should probably try replacing them with masks, just for the sake of speeding up the test suite. The more important change for users, however, is that …
Yeah, I often find it's nice to start with a "single thing" (e.g., a single …