precompile interacts badly with Const-specialization (#38983)
Along the lines of this issue and #38080, the new SnoopCompile 2.2.0 has a not-yet-documented "profile-guided deoptimization" toolset. Here's a demo running on Franklin: each dot is a Method, plotting the inference time (including the MethodInstance's callees) against the amount of "self" time (not including callees) it spent at runtime. Some of the ones with more inference time are very cheap to infer once but are inferred a whole bunch of times for different constants.
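For context, here is a minimal sketch of collecting the two quantities such a plot combines, using documented SnoopCompile/Profile calls; `workload()` is a placeholder for whatever you actually measure, and pairing the two datasets per Method is what the new tooling automates:

```julia
using SnoopCompileCore, Profile

workload() = sum(sqrt, abs.(randn(10_000)))    # placeholder workload

tinf = @snoopi_deep workload()                 # inference timing, per MethodInstance
Profile.clear()
@profile for _ in 1:100; workload(); end       # runtime ("self" time) samples

using SnoopCompile
inf_by_method = accumulate_by_source(flatten(tinf))  # aggregate inference time per Method
```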
Here's a concrete example of the impact of this issue: the kwfunc for this method in DataFrames ends up taking a whopping 23s of inference time while running the DataFrames tests (the total test suite is ~700s, so inference on just that one method is ~3% of the total time). It gets inferred 1778 separate times. For the first 4 arguments of the kwfunc,

```julia
(::DataFrames.var"#_combine_prepare##kw")(::Any, ::typeof(DataFrames._combine_prepare), gd::GroupedDataFrame, cs)
```

the number of distinct slottypes is 705, 1, 1, and 573, respectively. So the Boolean kwargs account for a ton of re-inference. The NamedTuple of keyword arguments (first argument) and the fourth argument both show enormous diversity; a small sample of the latter:

```julia
julia> arg4
Base.IdSet{Any} with 573 elements:
Tuple{Pair{Vector{Symbol}, Pair{Main.TestGrouping.var"#367#376"{Int64, Int64, UnitRange{Int64}}, Symbol}}}
Tuple{Pair{Vector{Int64}, Pair{Main.TestGrouping.var"#198#275", Symbol}}}
Core.PartialStruct(Tuple{Colon, Vararg{Union{Regex, AbstractString, Function, Signed, Symbol, Unsigned, Pair, AbstractVector{T} where T, Type, All, Between, Cols, InvertedIndex}, N} where N}, Any[Core…
Tuple{Pair{Symbol, Symbol}, Pair{Symbol, ByRow{typeof(sin)}}}
Tuple{Pair{Symbol, Pair{Main.TestGrouping.var"#510#522", Symbol}}, Pair{Symbol, Pair{Main.TestGrouping.var"#511#523", Symbol}}}
Tuple{Pair{Symbol, Pair{Main.TestGrouping.var"#573#587"{typeof(sum)}, Symbol}}}
Tuple{Pair{Symbol, Pair{Main.TestGrouping.var"#581#595"{ComposedFunction{typeof(prod), typeof(skipmissing)}}, Symbol}}}
Tuple{Pair{Symbol, Pair{Main.TestGrouping.var"#278#293"{typeof(maximum)}, Symbol}}}
Core.Const((Main.TestGrouping.var"#633#673"(),))
⋮
```

Here's the sorted set (essentially the cumulative distribution) of inference times for each of these runs; you can see there are quite a few that take non-negligible time.
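To make the mechanism concrete, here is a small self-contained illustration (a sketch, not DataFrames code) of how `Bool` keyword arguments interact with constant propagation: each distinct combination of literal keyword values at a call site can trigger its own `Const`-specialized inference of the keyword sorter and body method.

```julia
# Hypothetical stand-in for a kwarg-heavy API like _combine_prepare.
process(x; copycols::Bool=true, keepkeys::Bool=true) =
    keepkeys ? (copycols ? copy(x) : x) : reverse(x)

process([1, 2, 3])                                  # defaults
process([1, 2, 3]; copycols=false)                  # may be re-inferred with Const(false)
process([1, 2, 3]; copycols=false, keepkeys=false)  # and again for this combination
```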
Now that #38080 is merged, I'd like to propose that we turn off const-prop on keyword arguments by default; anyone who wants it can annotate with `@aggressive_constprop`.
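A sketch of what that opt-in might look like, assuming keyword-argument const-prop were off by default (whether the annotation propagates to the keyword body method is exactly the open question noted in the closing PR below; on current Julia the spelling is `Base.@constprop :aggressive`):

```julia
# Opt back in to aggressive constant propagation for a method whose body
# genuinely benefits from knowing the Bool keyword value at compile time.
Base.@aggressive_constprop function select_rows(x; rev::Bool=false)
    return rev ? reverse(x) : x
end
```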
Just investigated a case where precompilation was not as effective as I hoped it would be and reduced it until I realized it was this issue. Here is the MWE:

```julia
module Foo
bar(x::Int) = hcat(rand())
foo(x) = bar(1)
#foo(x) = bar(x) # this precompiles just fine
@assert precompile(foo, (Int,))
end # module
```

Then, in a fresh session:

```julia
julia> first(methods(hcat, Tuple{Number})).specializations # none yet
svec()
julia> using Foo
julia> first(methods(Foo.foo)).specializations
svec(MethodInstance for Foo.foo(::Int64), nothing, nothing, nothing, nothing, nothing, nothing, nothing)
julia> first(methods(Foo.bar)).specializations
svec(MethodInstance for Foo.bar(::Int64), nothing, nothing, nothing, nothing, nothing, nothing, nothing)
julia> first(methods(hcat, Tuple{Number})).specializations # contains hcat(::Float64) from Foo
svec(MethodInstance for hcat(::Float64), nothing, nothing, nothing, nothing, nothing, nothing, nothing)
julia> using SnoopCompileCore
julia> tinf = @snoopi_deep Foo.foo(1);
julia> using SnoopCompile, AbstractTrees
julia> print_tree(tinf, maxdepth=3)
InferenceTimingNode: 0.018412/0.044012 on Core.Compiler.Timings.ROOT() with 1 direct children
└─ InferenceTimingNode: 0.000297/0.025600 on Foo.foo(::Int64) with 1 direct children
└─ InferenceTimingNode: 0.000358/0.025303 on Foo.bar(1::Int64) with 2 direct children
├─ InferenceTimingNode: 0.000404/0.003161 on rand() with 1 direct children
│ ⋮
│
└─ InferenceTimingNode: 0.000777/0.021783 on hcat(::Float64) with 13 direct children
⋮
```

So precompilation has generated the expected `MethodInstance`s, yet running `Foo.foo(1)` still triggers a fresh round of inference on `Foo.bar(1::Int64)` and, through it, on `hcat(::Float64)`.
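For comparison, here is a sketch of the workaround hinted at by the commented-out line in the MWE ("this precompiles just fine"): avoid handing inference a literal constant, so no `Const`-specialization is involved. `Foo2` is a hypothetical name used only for illustration.

```julia
module Foo2

bar(x::Int) = hcat(rand())
foo(x) = bar(x)              # pass the argument through instead of the literal 1

@assert precompile(foo, (Int,))

end # module
```

With no constant in the call chain, the cached results for `foo`, `bar`, and `hcat(::Float64)` can be reused in a fresh session.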
Funny you should ask. I got tired of not really understanding this well myself, so I am partway through writing PrecompileCacheUtils.jl, with the aim of providing a non-instantiating mirror of the content in the `*.ji` file. We could add a macro for this as well. Meanwhile, to get a sense for how common this issue is, I'm contemplating special coloration in the flamegraphs to highlight constprop (maybe orange, just like we use for gc in performance profiling). Thoughts?
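For reference, the flamegraph pipeline such a coloration would plug into (standard SnoopCompile usage; the constprop-specific coloring itself is only a proposal at this point):

```julia
using SnoopCompileCore
using Foo                          # the MWE package from the previous comment
tinf = @snoopi_deep Foo.foo(1)

using SnoopCompile, ProfileView
fg = flamegraph(tinf)              # convert the inference-timing tree into a flamegraph
ProfileView.view(fg)               # constprop frames could get a dedicated color, like gc in runtime profiles
```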
I'm still trying to get a handle on what the problem is, exactly. To systematically go through different cases, I slightly changed the above module to

```julia
foo() = bar(1)
foo(x) = bar(x)
```

and then added precompiles for either of them or both, and also ran either of them or both under `@snoopi_deep` in fresh sessions. Here is what I got (:x: means it was inferred):

(In the last two columns, the first result is of course redundant with the respective column before.)

So if I may venture a generalization, it looks like this: when inference is within a call chain that involves constant propagation, it cannot use inference results from precompile files, only those produced in the running session (and probably those from the sysimg?). Does that comply with your observations?
Very interesting! One thing to check my understanding: when you say "was inferred" you mean "had previously been inferred so that it didn't need to infer it again when I ran it," right? Moreover, in all cases you're loading the module from a cache file (via `using`), right?

Your conclusion makes sense, and in particular suggests that this is not a limitation of our cache files but instead a constraint in the current running session.
One other thing we don't do is cache the `Const`-specialized results themselves. I also wonder if #41328 would help here.
You mean in "(:x: means it was inferred)"? Correct.

So defining the module in the REPL with `precompile` statements works just as well as running without precompilation.
So to avoid any misunderstanding, this is what I see, if I interpret my results correctly: the precompiled inference results get reused only when no constant propagation is involved in the call chain.

Ah. So it's actually the reuse of those uncached results that's at issue here.
If our analysis is correct up to here, then ... yes, that sounds useful.

Right, but I should be able to conduct the same experiment with `@inline`, right? Will do and report back...
Hm, no, inlining there doesn't matter, but the extra indirection helps. My new module:

```julia
module Foo
# as before
bar(x::Int) = hcat(rand())
# with extra indirection
@inline bar() = hcat(rand())
bar(x::Float64) = bar()
foo1() = bar(1) # previously known as foo
foo2() = bar(1.0) # with extra indirection
@assert precompile(foo1, ())
@assert precompile(foo2, ())
end # module
```

(Again defined in a package and loaded from a precompile file.)

```julia
julia> print_tree(@snoopi_deep(Foo.foo1()), maxdepth=3)
InferenceTimingNode: 0.018147/0.045257 on Core.Compiler.Timings.ROOT() with 1 direct children
└─ InferenceTimingNode: 0.000343/0.027110 on Foo.foo1() with 1 direct children
└─ InferenceTimingNode: 0.000385/0.026768 on Foo.bar(1::Int64) with 2 direct children
├─ InferenceTimingNode: 0.000289/0.002451 on rand() with 1 direct children
│ ⋮
│
└─ InferenceTimingNode: 0.000925/0.023932 on hcat(::Float64) with 13 direct children
⋮
```

This was to be expected given everything discussed above. But now, in a fresh session:

```julia
julia> print_tree(@snoopi_deep(Foo.foo2()), maxdepth=3)
InferenceTimingNode: 0.019031/0.020039 on Core.Compiler.Timings.ROOT() with 1 direct children
└─ InferenceTimingNode: 0.000323/0.001008 on Foo.foo2() with 2 direct children
├─ InferenceTimingNode: 0.000412/0.000412 on Foo.bar(::Float64) with 0 direct children
└─ InferenceTimingNode: 0.000273/0.000273 on Foo.bar(1.0::Float64) with 0 direct children
```

Not sure what to make of it.
Trying to drill a little deeper, I've extended the above module with

```julia
foo3() = bar()
foo4() = hcat(rand())
```

and their respective `precompile` statements. Now:

```julia
julia> print_tree(@snoopi_deep(Foo.foo3()), maxdepth=3)
InferenceTimingNode: 0.019327/0.019715 on Core.Compiler.Timings.ROOT() with 1 direct children
└─ InferenceTimingNode: 0.000388/0.000388 on Foo.foo3() with 0 direct children
```

and

```julia
julia> print_tree(@snoopi_deep(Foo.foo4()), maxdepth=3)
InferenceTimingNode: 0.018091/0.018091 on Core.Compiler.Timings.ROOT() with 0 direct children
```

(each in a fresh session). So for these, the precompiled inference results are reused and nothing has to be re-inferred. Let's look at the cached `CodeInstance`s and their `max_world`s:

```julia
julia> visit(Foo) do item
if item isa Core.CodeInstance
println(item.def, ": ", repr(item.max_world))
end
return item isa Union{Module,Method,Function,Core.MethodInstance,Core.CodeInstance}
end
MethodInstance for Foo.bar(): 0xffffffffffffffff
MethodInstance for Foo.bar(::Int64): 0xffffffffffffffff
MethodInstance for Foo.bar(::Float64): 0x0000000000000000
MethodInstance for Foo.foo1(): 0x0000000000000000
MethodInstance for Foo.foo2(): 0x0000000000000000
MethodInstance for Foo.foo3(): 0x0000000000000000
MethodInstance for Foo.foo4(): 0xffffffffffffffff
```

Comparing with the inference trees, this makes sense: inference can stop going deeper once it reaches a matching method/specialization for which there is a `CodeInstance` with non-zero `max_world`. But why are some of the `max_world`s zero in the first place?
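A small helper sketch for reading the numbers above (hypothetical helper names; it relies on the same internal `CodeInstance` fields the `visit` loop already uses):

```julia
# max_world == typemax(UInt64) (0xffffffffffffffff) means the cached code is
# currently valid; max_world == 0 means it was never validated (or was
# invalidated) when the cache file was loaded.
is_valid_ci(ci::Core.CodeInstance)   = ci.max_world == typemax(UInt64)
is_invalid_ci(ci::Core.CodeInstance) = ci.max_world == 0
```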
Ah, I think I've found something. Doesn't have anything to do with const-propagation directly, so sorry for hijacking this issue, but it should probably be sorted out before worrying about the interaction of precompilation and const-propagation. So here is what I believe is the cause of all the invalidated `CodeInstance`s: during deserialization, only those with external backedges to verify ever get enabled; everything else keeps `max_world == 0`. So I've tried reversing the logic, starting with all deserialized `CodeInstance`s valid and disabling only those whose backedges fail to verify:

```diff
diff --git a/src/dump.c b/src/dump.c
index 49fa6efa43..167acbf70a 100644
--- a/src/dump.c
+++ b/src/dump.c
@@ -1616,8 +1616,10 @@ static jl_value_t *jl_deserialize_value_code_instance(jl_serializer_state *s, jl
codeinst->precompile = 1;
codeinst->next = (jl_code_instance_t*)jl_deserialize_value(s, (jl_value_t**)&codeinst->next);
jl_gc_wb(codeinst, codeinst->next);
- if (validate)
+ if (validate) {
codeinst->min_world = jl_world_counter;
+ codeinst->max_world = ~(size_t)0;
+ }
return (jl_value_t*)codeinst;
}
@@ -2042,15 +2044,15 @@ static void jl_insert_backedges(jl_array_t *list, jl_array_t *targets)
jl_method_table_add_backedge(mt, callee, (jl_value_t*)caller);
}
}
- // then enable it
+ }
+ else {
+ // invalid, so disable it
jl_code_instance_t *codeinst = caller->cache;
while (codeinst) {
if (codeinst->min_world > 0)
- codeinst->max_world = ~(size_t)0;
+ codeinst->max_world = 0;
codeinst = jl_atomic_load_relaxed(&codeinst->next);
}
- }
- else {
if (_jl_debug_method_invalidation) {
jl_array_ptr_1d_push(_jl_debug_method_invalidation, (jl_value_t*)caller);
loctag = jl_cstr_to_string("insert_backedges");
```

Don't know whether this is 100% sound, it might be too optimistic, but what we have at the moment is definitely too pessimistic. @vtjnash?
Of course it's not, as it is missing transitive invalidations. By which I mean: if one cached `CodeInstance` gets invalidated, anything in the cache that (transitively) called it should be invalidated as well, but with this patch it isn't. But even though we thus invalidate too little with this patch, it seems to have near to zero effect latency-wise on real-world packages. Probably a large fraction of functions without external dependencies are trivial wrappers (e.g. providing default args) which don't add any relevant amount of time to inference, so invalidating them unnecessarily isn't too bad. But for the kind of analysis we are doing here, it would still be nice to get rid of these, I guess.
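For illustration, the missing transitive step, sketched as a worklist over caller edges (hypothetical data structures, not the actual C implementation in dump.c):

```julia
# Given an initial set of invalidated items and a map from each item to the
# cached items that call it, propagate invalidation to all transitive callers.
function invalidate_transitively!(invalid::Set{Symbol}, callers_of::Dict{Symbol,Vector{Symbol}})
    queue = collect(invalid)
    while !isempty(queue)
        callee = pop!(queue)
        for caller in get(callers_of, callee, Symbol[])
            if caller ∉ invalid
                push!(invalid, caller)   # caller depended on invalidated code
                push!(queue, caller)
            end
        end
    end
    return invalid
end

# e.g. :h was invalidated; :g calls :h and :f calls :g, so both must go too
invalidate_transitively!(Set([:h]), Dict(:h => [:g], :g => [:f]))
```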
It seems we could solve this by maintaining an explicit list of the deserialized `CodeInstance`s and revisiting it at the end of deserialization.
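That is essentially the direction the eventual fix took (see the commit message quoted further down): a "second chance" pass at the end of deserialization. A sketch with a hypothetical stand-in struct, since the real bookkeeping lives in C:

```julia
# Hypothetical stand-in for the C-level CodeInstance world bounds.
mutable struct CacheEntry
    min_world::UInt
    max_world::UInt
    validated::Bool      # set when backedge verification succeeded
    invalidated::Bool    # set when backedge verification failed
end

# At the end of deserialization, any entry that was neither validated nor
# invalidated gets a second chance and is enabled for the current world.
function second_chance!(entries::Vector{CacheEntry}, world::UInt)
    for e in entries
        if !e.validated && !e.invalidated
            e.min_world = world
            e.max_world = typemax(UInt)
        end
    end
    return entries
end
```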
Let's take this to #41872 and then come back to the interaction with const-specialization here eventually.
Our heuristics for constant propagation are imperfect (and probably never will be perfect), and I've now seen many examples of methods that no developer would ask to have const-propped get that treatment. In some cases the cost for latency/precompilation is very large. This renames `@aggressive_constprop` to `@constprop` and allows two settings, `:aggressive` and `:none`. Closes #38983 (presuming this will propagate to kw body methods)
In JuliaLang#38983 and JuliaLang#41872, it was discovered that only CodeInstances with external backedges get validated after deserialization. This adds a "second chance" for each CodeInstance: it validates any that have neither been validated nor invalidated by the end of deserialization. Closes JuliaLang#41872
(I am not 100% certain this is the right title for this issue, but it seems likely to be the cause.)
TL;DR
In reference to #30488 (comment), this illustrates at least one (and probably more than one) problem with precompilation. I think the most fundamental issue is that our backedges only record MethodInstances and not optimizations due to constant-propagation/specialization; when you go to use the code, Julia seems to decide that it needs to re-optimize the code and thus runs a fresh round of inference on it. A potential solution is to incorporate `Const`-data into our backedges, and cache all specializations. (There are potential cache-blowup issues to worry about, of course.) An alternative is to just use the partially-specialized code and not run inference again.

It's worth noting that the tiny function in this demo accounts for the majority of remaining (aka, unfixed) inference time in JuliaImages/ImageFiltering.jl#201, despite being outweighed by at least 100x in terms of the total amount of code that needs to be inferred to run `imfilter`. Another example I've encountered occurs in `JuliaInterpreter.step_expr!`, where I suspect `istoplevel::Bool` suffers from the same problem due to well-intentioned but very unhelpful `Const`-specialization. (I'm finding this issue to be a bigger problem with LoopVectorization, where it crops up for dozens of methods and is the main source of currently-unfixable "inference triggers.") Overall, this is a sign of just how well `precompile` works in many circumstances. The failures illustrated here are serious but specific.

Demo:
Session (requires the master branch of SnoopCompile):
Generally I would have expected no inference at all, given that we already had the MethodInstance. Note that in the tree output, you can see the `Const`-specialization from sequences like `foo(1::Int64)`, which indicate that `foo` was `Const`-specialized with an argument of value `1::Int64`.

It's worth starting a fresh session and doing the same analysis for `gethiddengaussian`, to see that it has different entrance points into inference. But if you directly precompile `hiddengaussian` (uncommenting that line), then you get back to something almost identical to `getgaussian`.