This repository has been archived by the owner on Jul 13, 2021. It is now read-only.

Add some nospecialize and a few precompiles #502

Closed
timholy wants to merge 1 commit from teh/ttfp

Conversation

timholy (Contributor) commented Sep 22, 2020

This, together with some companion PRs in several other packages, shaves about 30s off the time to run the tests in this package (from 214s to 187s). Its effect on TTFP (time to first plot) is fairly minimal: maybe 2.5s, which appears to be bigger than the noise but not massively so.

A few comments are in order. The main trick here is to add @nospecialize. There is some risk that this could lead to a runtime penalty, but I don't know these packages well enough to say; this is something better analyzed by people more knowledgeable than I am. Of course, lack of specialization only hurts methods that do "real work," and from what I could tell most or all of these just forward to the things that do the real work. But I could easily be mistaken.
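To illustrate the pattern (a minimal sketch with hypothetical names, not code from this PR):

# a stand-in for the function that does the real work
_apply_attributes!(scene, attributes) = scene

# the forwarding method is annotated so Julia doesn't compile a fresh
# specialization of it for every concrete type of `attributes`
function apply_attributes!(scene, @nospecialize(attributes))
    return _apply_attributes!(scene, attributes)
end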

Second, I added a few precompiles. I typically put an @assert in front of these so that I know when they have gone stale (if precompile can't find a method matching the supplied signature, it just returns false rather than erroring). Feel free to strip the @asserts if you'd rather not be bothered when API refactoring invalidates them. Doing this revealed something quite interesting: some of your signatures simply can't be precompiled. I left a note in one of these packages about the ultimate origin, but I did not poke at it long enough to figure out why it doesn't work.
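For instance, using the workload from later in this thread (the signature is illustrative):

# `precompile` returns `false` if no matching method exists, so the @assert
# makes a stale signature fail loudly instead of silently doing nothing
@assert precompile(plot, (Vector{Float64},))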

Third, there are quite a lot of anonymous functions that it would be nice to precompile. I've marked many of them, but to make that work robustly you'd have to split them out and name them, as sketched below. Alternatively, perhaps one could insert @nospecialize in those do blocks too? (I've never tried.)
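Here is a sketch of the split-and-name approach, using boundingbox and a hypothetical plots collection:

# an anonymous function's type has an auto-generated name that can change
# as the package evolves, so precompile directives for it are fragile:
bbs = map(plots) do p
    boundingbox(p)
end

# a named function has a stable identity you can precompile:
plot_bbox(p) = boundingbox(p)
bbs = map(plot_bbox, plots)
precompile(plot_bbox, (Scatter{Tuple{Vector{Point{2, Float32}}}},))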

Finally, I only tested on Julia 1.6; on 1.5, you might get so much invalidation that these precompiles won't help. But the @nospecialize should be useful regardless.

At the end, display(plot(rand(5))) still spent about half its time in inference. That suggests there is probably significantly more progress to be made, so let me summarize my workflow. I use SnoopCompileCore's @snoopi, but these days I almost never run the "automated" precompile-generation stuff. Instead, I look at the results, think about them, and then intervene manually. I only worry about the big entries, say those with an inference time above 100ms, since for now you still have some really expensive-to-infer calls.
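Concretely, the collection step looks something like this (run in a fresh session; the workload is the one used throughout this thread):

using SnoopCompileCore
tinf = @snoopi begin
    using AbstractPlotting
    display(plot(rand(5)))
end
# tinf is a Vector{Tuple{Float64,MethodInstance}} sorted by inference time;
# focus on the expensive tail, say everything above 100ms
expensive = filter(pr -> first(pr) > 0.1, tinf)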

To detect excessive specialization, here are some good tricks. Let's say you've collected the @snoopi results into a variable tinf::Vector{Tuple{Float64,MethodInstance}}. That list (which is sorted in order of inference time) shows you the "bad actors" for specific argument types, but if a method is being inferred for dozens of specializations, even a cheap-to-infer method can add up. Consequently, a nice trick (which I discovered while looking at this package) is

# accumulate the total inference time per Method, summing over all of its
# inferred MethodInstances
mtimes = Dict{Method,Float64}()
for (t, mi) in tinf
    meth = mi.def                # a Method, or a Module for toplevel thunks
    if isa(meth, Method)
        mtimes[meth] = t + get(mtimes, meth, 0.0)
    end
end
tinfm = [(v, k) for (k, v) in mtimes]
sort!(tinfm; by=first)           # cheapest first, matching the order of tinf

which aggregates all the MethodInstances associated with a single Method. If you see some methods that are "bad actors," then you can use filter to pull all the corresponding MethodInstances out of tinf; if there are a lot of them, consider adding a @nospecialize.
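For example, with tinfm and tinf from above:

bad = last(tinfm[end])                              # the most expensive Method
bad_mis = filter(pr -> last(pr).def === bad, tinf)  # all of its entries in tinf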

Finally, in the long run you should consider whether you need to encode as much in type parameters as you do. Some of the structs have abstract fields, and in such cases you're not going to be able to infer your dispatches anyway, so the specialization is only costing you. An example of a package where I've successfully defeated the overhead of excessive specialization is FileIO: it used to be horrible for latency, but I've now essentially disabled all the relevant specialization (see a sequence of PRs of mine over the last several months) and it's much lighter weight. A good alternative, though, is simply not to use so many type parameters. For example, in FileIO, with 20/20 hindsight, we could have defined a file format to have a ::Module or ::Vector{Module} field, and then used something like

load = getfield(format.mod, :load)   # look up the module's `load` at runtime
load(filename, args...; kwargs...)

and had only one format type (instead of one type per file format).
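A minimal sketch of that single-type design (type and field names are hypothetical):

struct FileFormat
    extension::String
    mod::Module       # the module that implements `load`/`save` for this format
end

function load(format::FileFormat, filename, args...; kwargs...)
    loadfn = getfield(format.mod, :load)   # runtime lookup instead of dispatch
    return loadfn(filename, args...; kwargs...)
end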

timholy (Contributor, Author) commented Sep 23, 2020

By the way, if you want to more systematically find methods that have lots of specializations, this is quite easy:

using MethodAnalysis

nmis = Pair{Method,Int}[]
visit(AbstractPlotting) do item
    if item isa Method
        push!(nmis, item => length(methodinstances(item)))
        return false    # a Method is a leaf for our purposes; don't descend
    end
    true                # keep descending into modules, functions, etc.
end
sort!(nmis; by=last)

After just running display(plot(rand(5))), some methods have more than 100 specializations (the display here is awkward, so if you're curious it's best to just try it and see).

Note that each specialization entails inference, but not all of them get native code compiled (xref JuliaLang/julia#35131). That said, the inference alone is costing you a lot.

timholy (Contributor, Author) commented Sep 23, 2020

In conjunction with #503, it's worth asking whether we can find a much more comprehensive set of precompiles. That, perhaps together with some improvements to inference, might get rid of some of the incentive to have so many @nospecializes.

timholy (Contributor, Author) commented Sep 23, 2020

Example:

julia> methodinstances(boundingbox)
15-element Vector{Core.MethodInstance}:
 MethodInstance for boundingbox(::Scene)
 MethodInstance for boundingbox(::String, ::Any, ::Any, ::Any, ::Any, ::Any, ::StaticArrays.SMatrix{4, 4, Float32, 16}, ::Any, ::Any)
 MethodInstance for boundingbox(::String, ::Vector{Point{3, Float32}}, ::Vector{Float32}, ::Vector{FreeTypeAbstraction.FTFont}, ::Vec{2, Float32}, ::Vector{Quaternionf0}, ::StaticArrays.SMatrix{4, 4, Float32, 16}, ::Float64, ::Float64)
 MethodInstance for boundingbox(::String, ::Any, ::Any, ::Any, ::Any, ::Any, ::StaticArrays.SMatrix{4, 4, _A, 16} where _A, ::Any, ::Any)
 MethodInstance for boundingbox(::AbstractPlotting.Text{Tuple{String}}, ::String)
 MethodInstance for boundingbox(::AbstractPlotting.Text{ArgType} where ArgType, ::String)
 MethodInstance for boundingbox(::AbstractPlotting.Text{var"#s194"} where var"#s194"<:Tuple{Arg1}, ::String)
 MethodInstance for boundingbox(::AbstractPlotting.Text{Tuple{String}})
 MethodInstance for boundingbox(::AbstractPlotting.Text{ArgType} where ArgType)
 MethodInstance for boundingbox(::AbstractPlotting.Text{var"#s194"} where var"#s194"<:Tuple{Arg1})
 MethodInstance for boundingbox(::Scatter{Tuple{Vector{Point{2, Float32}}}})
 MethodInstance for boundingbox(::LineSegments{Tuple{Vector{Point{2, Float32}}}})
 MethodInstance for boundingbox(::Atomic{Arg} where Arg)
 MethodInstance for boundingbox(::Axis2D{Tuple{Tuple{Tuple{Float32, Float32}, Tuple{Float32, Float32}}}})
 MethodInstance for boundingbox(::Annotations{Tuple{Vector{Tuple{String, Point{2, Float32}}}}})

Ideally, the non-concrete methods just wouldn't be here. At least for Text (and probably the others as well), MethodAnalysis's findcallers can tell you they arise from calls to data_limits(::Atomic) where the specific concrete type is not inferrable. I don't know this package well enough to trace it back farther than that (findcallers can't find all callers, as its docs spell out, and in this case the trail grows cold).

I'm not quite sure what to think here. I understand why you use types and type parameters to control your dispatch---I don't yet have a proposal for keeping your code nicely organized without them---but wow is it bad for your latency. Combined kind of seems like the poster child for these challenges: I see why you have it, but I also see why it's causing such serious latency problems.

I think adding rampant @nospecializes is probably your only hope. Maybe you should just turn off all specialization and set the optlevel to 1 (Base.Experimental.@optlevel 1), and then when you encounter runtime performance bottlenecks, move them to a special HighPerf submodule that has all the optimizations turned on? You can also add fully-concrete signatures for everything that matters, and "funnel" all argument-type diversity into those concrete types, kind of like I did to avoid invalidations in REPL (which has @nospecialize module-wide): JuliaLang/julia#37081.
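A minimal sketch of that funnel idea (names hypothetical):

# accept diverse argument types, but immediately convert to one concrete
# signature, so only `_scaled` ever needs real inference and compilation
scaled(v::AbstractVector{<:Real}, f::Real) = _scaled(convert(Vector{Float64}, v), Float64(f))

_scaled(v::Vector{Float64}, f::Float64) = v .* f   # the single concrete workhorse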

But sometimes that doesn't help; abstract inference is more expensive than concrete inference. Maybe the only way to fix it is a long slog of patient analysis on individual cases.

timholy (Contributor, Author) commented Sep 23, 2020

Ah, wait, one more important tool I've not explored or mentioned: Base.invokelatest. You can use this to force runtime dispatch. For example,

for plot in scene.plots    # scene.plots is a Vector{AbstractPlot}, so the concrete type of `plot` can't be inferred
    Base.invokelatest(foo, plot)
end

might be better than

for plot in scene.plots
    foo(plot)              # inferred (expensively, abstractly) as foo(::AbstractPlot)
end

because the invokelatest version skips abstract inference of foo(::AbstractPlot) at the call site and instead forces concrete inference of the foo pipeline at runtime, once per concrete plot type.

SimonDanisch (Member) commented

Wow, thank you so much :)

> but wow is it bad for your latency. Combined kind of seems like the poster child for these challenges: I see why you have it, but I also see why it's causing such serious latency problems.

I plan to refactor that and use way less type parameters!

timholy (Contributor, Author) commented Sep 23, 2020

Interesting. But I do think invokelatest in combination with lots of precompile directives might be one of the more effective strategies. The basic idea is to infer just what is needed; if all of those signatures are concrete and the relevant packages own the methods, that should work most of the time.
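For instance, reusing the hypothetical foo from the invokelatest sketch above, one could precompile the concrete signatures that runtime dispatch will land on:

for P in (Scatter{Tuple{Vector{Point{2, Float32}}}},
          LineSegments{Tuple{Vector{Point{2, Float32}}}})
    @assert precompile(foo, (P,))
end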

Not to say it doesn't need refactoring, but I am just acknowledging that reducing the latency here is subtle and may take some experimentation.

timholy marked this pull request as draft on September 24, 2020
SimonDanisch (Member) commented

Is there a way forward to merge this? :)

timholy (Contributor, Author) commented Dec 7, 2020

Let's hold it a little longer. There is a whole new generation of analysis tools coming:

Once those merge, and if you don't have big refactors you're partway through, then I think it would make sense to do the analysis here from scratch. The new tools should be much better at finding the "best precompilable unit" and give a clearer sense of where your inference latency is coming from.

timholy (Contributor, Author) commented Dec 22, 2020

I am pretty sure there are more effective interventions possible, and more easily discoverable now with the new @snoopi_deep. So I'll close this.

timholy closed this Dec 22, 2020
timholy deleted the teh/ttfp branch on December 22, 2020