Improve inference in `vcat` #2559

timholy · 2020-11-24T14:21:44Z

Master:

julia> @time include("cat.jl")
Test Summary: | Pass  Total
...
 13.921540 seconds (30.55 M allocations: 1.835 GiB, 3.81% gc time, 97.72% compilation time)

This PR:

 12.296551 seconds (28.88 M allocations: 1.728 GiB, 3.05% gc time, 97.42% compilation time)

This is of course from a fresh session in both cases, and all the gains are to inference time on first use.

The idea is that improvements in inference quality let your precompile statements do you more good because they reach deeper into the call stack. It may be helpful to outline some of the principles that guide these changes:

the change to reduce tries to "standardize" the types that get passed to _vcat; any DF that is fully concrete can be specialized, otherwise we standardize on AbstractDataFrame (which is already used in some places) as the fallback. This reduces the number of times you have to infer _vcat for abstract types.
_vcat had some inference problems from the infamous julia#15276
methods that get called for AbstractDataFrame are likely to return poorly-inferred types (they may be well-inferred for any concrete subtype, but poorly inferred for the abstract type). However, in some cases you can fix them, as exemplified by annotating the return type of names. This stabilizes a lot of _vcat's body when dfs::Vector{AbstractDataFrame}.

There is one part of _vcat that is still poorly inferred:

DataFrames.jl/src/abstractdataframe/abstractdataframe.jl

Lines 1617 to 1623 in 8645651

    
           newcols = map(dfs) do df 
        
               if hasproperty(df, name) 
        
                   return df[!, name] 
        
               else 
        
                   Iterators.repeated(missing, nrow(df)) 
        
               end 
        
           end

Basically anything that uses newcols is uninferrable. I didn't know enough to fix it. However, if that Iterators.repeated is just going to be expanded to a Vector{Missing} anyway (I'm not sure it will), then just writing it in terms of fill instead might fix it. Another option would be to annotate the type of newcols itself, or to annotate lens as a Vector{Int}. However, I expect these to be relatively minor gains.

I discovered these through a combination of @snoopi,@snoopi_deep (which still needs documentation), and of course Cthulhu (which sadly seems to have broken on 1.6 in the last few days, see JuliaDebug/Cthulhu.jl#100. For the moment one has to diagnose problems on 1.5 and analyze the impact of improvements on 1.6).

bkamins · 2020-11-24T14:39:11Z

src/abstractdataframe/abstractdataframe.jl

-                        AbstractVector{<:AbstractString}}=:setequal) =
-    _vcat([df for df in dfs if ncol(df) != 0]; cols=cols)
+                        AbstractVector{<:AbstractString}}=:setequal) where DF<:AbstractDataFrame
+    if @isdefined(DF) && isconcretetype(DF)


why calling @isfefined(DF) is required here?

Tuple inputs with 2 or more different types.

But then isconcretetype(DF) should return false - right? (or would it error)

If I take the @isdefined check out and preface that conditional with a @show typeof(dfs), here's what I get from running the cat.jl test:

julia> include("cat.jl") [ Info: Precompiling DataFrames [a93c6f00-e57d-5684-b7b6-d8193f3e46c0] Test Summary: | Pass Total hcat | 10 10 Test Summary: | Pass Total hcat: copying | 26 26 Test Summary: | Pass Total hcat ::AbstractDataFrame | 2 2 Test Summary: | Pass Total hcat ::AbstractDataFrame | 2 2 Test Summary: | Pass Total hcat ::AbstractVectors | 11 11 Test Summary: | Pass Total hcat: copycols | 76 76 typeof(dfs) = Tuple{DataFrames.DataFrame} typeof(dfs) = Tuple{DataFrames.DataFrame, DataFrames.DataFrame} typeof(dfs) = Tuple{DataFrames.DataFrame} typeof(dfs) = Tuple{DataFrames.DataFrame, DataFrames.DataFrame} typeof(dfs) = Tuple{DataFrames.DataFrame, DataFrames.DataFrame} typeof(dfs) = Tuple{DataFrames.DataFrame, DataFrames.DataFrame} typeof(dfs) = Tuple{DataFrames.DataFrame, DataFrames.DataFrame} typeof(dfs) = Tuple{DataFrames.DataFrame, DataFrames.DataFrame} typeof(dfs) = Tuple{DataFrames.DataFrame, DataFrames.DataFrame} typeof(dfs) = Tuple{DataFrames.DataFrame, DataFrames.DataFrame, DataFrames.DataFrame} typeof(dfs) = Tuple{DataFrames.DataFrame, DataFrames.DataFrame} typeof(dfs) = Tuple{DataFrames.DataFrame, DataFrames.DataFrame} typeof(dfs) = NTuple{4, DataFrames.DataFrame} typeof(dfs) = Tuple{DataFrames.DataFrame, DataFrames.DataFrame} typeof(dfs) = Tuple{DataFrames.SubDataFrame{DataFrames.DataFrame, DataFrames.Index, UnitRange{Int64}}, DataFrames.SubDataFrame{DataFrames.DataFrame, DataFrames.SubIndex{DataFrames.Index, Vector{Int64}, Vector{Int64}}, UnitRange{Int64}}} vcat: Error During Test at /home/tim/.julia/dev/DataFrames/test/cat.jl:269 Test threw exception Expression: vcat(view(df, 1:2, :), view(df, 3:5, [3, 2, 1])) == df UndefVarError: DF not defined Stacktrace: [1] reduce(::typeof(vcat), dfs::Tuple{DataFrames.SubDataFrame{DataFrames.DataFrame, DataFrames.Index, UnitRange{Int64}}, DataFrames.SubDataFrame{DataFrames.DataFrame, DataFrames.SubIndex{DataFrames.Index, Vector{Int64}, Vector{Int64}}, UnitRange{Int64}}}; cols::Symbol) @ DataFrames ~/.julia/dev/DataFrames/src/abstractdataframe/abstractdataframe.jl:1556 ...

So as soon as it gets a heterogeneous tuple (one has an Index and the other a SubIndex), it throws an UndefVarError.

But what I do not understand is the fact that you use DF <: AbstractDataFrame (instead of the original implementation). And if I try to do a MWE I get:

julia> using DataFrames julia> f(::Tuple{Vararg{DF}}) where {DF<:AbstractDataFrame} = "matched" f (generic function with 1 method) julia> df = DataFrame(a=1,b=2) 1×2 DataFrame Row │ a b │ Int64 Int64 ─────┼────────────── 1 │ 1 2 julia> df1 = @view df[:, :] 1×2 SubDataFrame Row │ a b │ Int64 Int64 ─────┼────────────── 1 │ 1 2 julia> df2 = @view df[:, 1:1] 1×1 SubDataFrame Row │ a │ Int64 ─────┼─────── 1 │ 1 julia> f((df, df)) "matched" julia> f((df1, df1)) "matched" julia> f((df2, df2)) "matched" julia> f((df, df1)) ERROR: MethodError: no method matching f(::Tuple{DataFrame,SubDataFrame{DataFrame,DataFrames.Index,Base.OneTo{Int64}}}) julia> f((df1, df2)) ERROR: MethodError: no method matching f(::Tuple{SubDataFrame{DataFrame,DataFrames.Index,Base.OneTo{Int64}},SubDataFrame{DataFrame,DataFrames.SubIndex{DataFrames.Index,UnitRange{Int64},UnitRange{Int64}},Base.OneTo{Int64}}}) julia> h(::Tuple{Vararg{AbstractDataFrame}}) = "matched" h (generic function with 1 method) julia> h((df, df)) "matched" julia> h((df, df1)) "matched" julia> h((df2, df1)) "matched"

so it seems that the pattern Tuple{Vararg{DF}} where DF <: AbstractDataFrame is not the same as the original and intended Tuple{Vararg{AbstractDataFrame}}

Yes, this seems to exploit a relatively recent & subtle change in how type parameters. Basically the Tuple{Vararg{DF}} is requiring a single concrete value to set DF, but signature matching is the same as <:AbstractDataFrame. EDIT: but your example shows the matching is incomplete.

Another way we could do this is nominal-runtime:

DF = typeof(first(dfs)) if isconcretetype(DF) && all(df->typeof(df) === DF, dfs) ... end

Do you like that better?

My workflow for these things isn't very deliberative; I typically write it first without the @isdefined, and then when I get an error like the one above I add @isdefined. If you grep this in Base you'll see quite a few similar uses.

Oh dear, now I see your point with the two different dfs. Very interesting. I'll do the runtime version perhaps.

I would separate AbstractVector and Tuple implementations then.
Also for Tuple please make sure that in the run-time version a case when the tuple is empty is correctly handled.

I have now noticed a subtle bug that should be fixed:

julia> reduce(vcat, ()) ERROR: ArgumentError: reducing over an empty collection is not allowed julia> using DataFrames julia> reduce(vcat, ()) 0×0 DataFrame

To be decided if we keep:

julia> reduce(vcat, AbstractDataFrame[]) 0×0 DataFrame

or also error here.

That is weird. It does match in situ, even though your example fails for me. 😕

can you show me the output of the same commands in the fresh Julia session (and what Julia are you using I am on 1.5.3). Thank you! (extending functions from Base is hard)

src/other/index.jl

timholy · 2020-11-24T15:44:55Z

That's interesting: versions older than Julia 1.5 don't dispatch to the new version of reduce. I pushed a commit restricting that change to versions 1.5 and above.

nalimilan

Thanks!

These are both concrete types so parametrizing/nospecialize don't have any effect.

bkamins

We still have the problem:

julia> reduce(vcat, ())
0×0 DataFrame

in this branch

timholy · 2020-11-25T10:27:08Z

An even simpler approach seems to work pretty well. I was worried about the lack of specialization hurting performance, but given that a single return-type annotation fixes inference on AbstractDataFrame for this function, that proves unfounded. This branch:

julia> using DataFrames, BenchmarkTools

julia> df = DataFrame(a=1,b=2);

julia> df1 = @view df[:, :];

julia> df2 = @view df[:, 1:1];

julia> cols = [:a];

julia> @btime vcat($df1, $df2; cols=$cols);
  4.527 μs (88 allocations: 6.69 KiB)

julia> @btime vcat($df, $df);
  5.719 μs (79 allocations: 5.98 KiB)

vs 0.22.1:

julia> @btime vcat($df1, $df2; cols=$cols);
  5.697 μs (97 allocations: 6.78 KiB)

julia> @btime vcat($df, $df);
  7.484 μs (88 allocations: 6.39 KiB)

And this implementation gives approximately the same speedup of running the cat.jl tests (which is much more about compile-time latency than runtime performance).

I threw in a couple of changes to signatures that I noticed during other perusals, though I can strip them out if you'd prefer:

AsTable is concrete, so <:AsTable is the same as AsTable
Vector{Any} is actually a fully-specialized type (its elements are not, but the container is). Hence @nospecialize(v::Vector{Any}) doesn't do anything.

src/abstractdataframe/abstractdataframe.jl

Co-authored-by: Bogumił Kamiński <[email protected]>

bkamins · 2020-11-25T17:11:46Z

Thank you! I will add tests in a separate PR.

Improve inference in vcat

6927f23

bkamins added non-breaking The proposed change is not breaking performance labels Nov 24, 2020

bkamins added this to the 1.0 milestone Nov 24, 2020

bkamins reviewed Nov 24, 2020

View reviewed changes

src/other/index.jl Show resolved Hide resolved

restrict new signature to >=1.5

33d53f3

bkamins added the bug label Nov 24, 2020

Avoid the type-parameter shenanigans

acd7ee9

nalimilan approved these changes Nov 25, 2020

View reviewed changes

timholy added 2 commits November 25, 2020 03:59

standardize on AbstractDataFrame

107294a

Simplify a couple of signatures

96ba775

These are both concrete types so parametrizing/nospecialize don't have any effect.

bkamins requested changes Nov 25, 2020

View reviewed changes

bkamins reviewed Nov 25, 2020

View reviewed changes

src/abstractdataframe/abstractdataframe.jl Outdated Show resolved Hide resolved

Update src/abstractdataframe/abstractdataframe.jl

2a40aee

Co-authored-by: Bogumił Kamiński <[email protected]>

bkamins approved these changes Nov 25, 2020

View reviewed changes

bkamins merged commit 86a5ee6 into JuliaData:master Nov 25, 2020

timholy deleted the teh/inference branch November 25, 2020 17:31

nalimilan pushed a commit that referenced this pull request Nov 25, 2020

Improve inference in vcat (#2559)

ac55a99

timholy mentioned this pull request Nov 28, 2020

Split methods to reduce compile cost of argument diversity #2563

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve inference in `vcat` #2559

Improve inference in `vcat` #2559

timholy commented Nov 24, 2020 •

edited

Loading

bkamins Nov 24, 2020

timholy Nov 24, 2020

bkamins Nov 24, 2020

timholy Nov 24, 2020 •

edited

Loading

bkamins Nov 24, 2020

timholy Nov 24, 2020 •

edited

Loading

timholy Nov 24, 2020

bkamins Nov 24, 2020

timholy Nov 24, 2020

bkamins Nov 24, 2020

timholy commented Nov 24, 2020

nalimilan left a comment

bkamins left a comment

timholy commented Nov 25, 2020 •

edited

Loading

bkamins commented Nov 25, 2020

	newcols = map(dfs) do df
	if hasproperty(df, name)
	return df[!, name]
	else
	Iterators.repeated(missing, nrow(df))
	end
	end

Improve inference in vcat #2559

Improve inference in vcat #2559

Conversation

timholy commented Nov 24, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

timholy Nov 24, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

timholy Nov 24, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

timholy commented Nov 24, 2020

nalimilan left a comment

Choose a reason for hiding this comment

bkamins left a comment

Choose a reason for hiding this comment

timholy commented Nov 25, 2020 • edited Loading

bkamins commented Nov 25, 2020

Improve inference in `vcat` #2559

Improve inference in `vcat` #2559

timholy commented Nov 24, 2020 •

edited

Loading

timholy Nov 24, 2020 •

edited

Loading

timholy Nov 24, 2020 •

edited

Loading

timholy commented Nov 25, 2020 •

edited

Loading