Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve inference in vcat #2559

Merged
merged 6 commits into from
Nov 25, 2020
Merged

Improve inference in vcat #2559

merged 6 commits into from
Nov 25, 2020

Conversation

timholy
Copy link
Contributor

@timholy timholy commented Nov 24, 2020

Master:

julia> @time include("cat.jl")
Test Summary: | Pass  Total
...
 13.921540 seconds (30.55 M allocations: 1.835 GiB, 3.81% gc time, 97.72% compilation time)

This PR:

 12.296551 seconds (28.88 M allocations: 1.728 GiB, 3.05% gc time, 97.42% compilation time)

This is of course from a fresh session in both cases, and all the gains are to inference time on first use.

The idea is that improvements in inference quality let your precompile statements do you more good because they reach deeper into the call stack. It may be helpful to outline some of the principles that guide these changes:

  • the change to reduce tries to "standardize" the types that get passed to _vcat; any DF that is fully concrete can be specialized, otherwise we standardize on AbstractDataFrame (which is already used in some places) as the fallback. This reduces the number of times you have to infer _vcat for abstract types.
  • _vcat had some inference problems from the infamous julia#15276
  • methods that get called for AbstractDataFrame are likely to return poorly-inferred types (they may be well-inferred for any concrete subtype, but poorly inferred for the abstract type). However, in some cases you can fix them, as exemplified by annotating the return type of names. This stabilizes a lot of _vcat's body when dfs::Vector{AbstractDataFrame}.

There is one part of _vcat that is still poorly inferred:

newcols = map(dfs) do df
if hasproperty(df, name)
return df[!, name]
else
Iterators.repeated(missing, nrow(df))
end
end

Basically anything that uses newcols is uninferrable. I didn't know enough to fix it. However, if that Iterators.repeated is just going to be expanded to a Vector{Missing} anyway (I'm not sure it will), then just writing it in terms of fill instead might fix it. Another option would be to annotate the type of newcols itself, or to annotate lens as a Vector{Int}. However, I expect these to be relatively minor gains.

I discovered these through a combination of @snoopi,@snoopi_deep (which still needs documentation), and of course Cthulhu (which sadly seems to have broken on 1.6 in the last few days, see JuliaDebug/Cthulhu.jl#100. For the moment one has to diagnose problems on 1.5 and analyze the impact of improvements on 1.6).

@bkamins bkamins added non-breaking The proposed change is not breaking performance labels Nov 24, 2020
@bkamins bkamins added this to the 1.0 milestone Nov 24, 2020
AbstractVector{<:AbstractString}}=:setequal) =
_vcat([df for df in dfs if ncol(df) != 0]; cols=cols)
AbstractVector{<:AbstractString}}=:setequal) where DF<:AbstractDataFrame
if @isdefined(DF) && isconcretetype(DF)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why calling @isfefined(DF) is required here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tuple inputs with 2 or more different types.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But then isconcretetype(DF) should return false - right? (or would it error)

Copy link
Contributor Author

@timholy timholy Nov 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I take the @isdefined check out and preface that conditional with a @show typeof(dfs), here's what I get from running the cat.jl test:

julia> include("cat.jl")
[ Info: Precompiling DataFrames [a93c6f00-e57d-5684-b7b6-d8193f3e46c0]
Test Summary: | Pass  Total
hcat          |   10     10
Test Summary: | Pass  Total
hcat: copying |   26     26
Test Summary:            | Pass  Total
hcat ::AbstractDataFrame |    2      2
Test Summary:            | Pass  Total
hcat ::AbstractDataFrame |    2      2
Test Summary:          | Pass  Total
hcat ::AbstractVectors |   11     11
Test Summary:  | Pass  Total
hcat: copycols |   76     76
typeof(dfs) = Tuple{DataFrames.DataFrame}
typeof(dfs) = Tuple{DataFrames.DataFrame, DataFrames.DataFrame}
typeof(dfs) = Tuple{DataFrames.DataFrame}
typeof(dfs) = Tuple{DataFrames.DataFrame, DataFrames.DataFrame}
typeof(dfs) = Tuple{DataFrames.DataFrame, DataFrames.DataFrame}
typeof(dfs) = Tuple{DataFrames.DataFrame, DataFrames.DataFrame}
typeof(dfs) = Tuple{DataFrames.DataFrame, DataFrames.DataFrame}
typeof(dfs) = Tuple{DataFrames.DataFrame, DataFrames.DataFrame}
typeof(dfs) = Tuple{DataFrames.DataFrame, DataFrames.DataFrame}
typeof(dfs) = Tuple{DataFrames.DataFrame, DataFrames.DataFrame, DataFrames.DataFrame}
typeof(dfs) = Tuple{DataFrames.DataFrame, DataFrames.DataFrame}
typeof(dfs) = Tuple{DataFrames.DataFrame, DataFrames.DataFrame}
typeof(dfs) = NTuple{4, DataFrames.DataFrame}
typeof(dfs) = Tuple{DataFrames.DataFrame, DataFrames.DataFrame}
typeof(dfs) = Tuple{DataFrames.SubDataFrame{DataFrames.DataFrame, DataFrames.Index, UnitRange{Int64}}, DataFrames.SubDataFrame{DataFrames.DataFrame, DataFrames.SubIndex{DataFrames.Index, Vector{Int64}, Vector{Int64}}, UnitRange{Int64}}}
vcat: Error During Test at /home/tim/.julia/dev/DataFrames/test/cat.jl:269
  Test threw exception
  Expression: vcat(view(df, 1:2, :), view(df, 3:5, [3, 2, 1])) == df
  UndefVarError: DF not defined
  Stacktrace:
   [1] reduce(::typeof(vcat), dfs::Tuple{DataFrames.SubDataFrame{DataFrames.DataFrame, DataFrames.Index, UnitRange{Int64}}, DataFrames.SubDataFrame{DataFrames.DataFrame, DataFrames.SubIndex{DataFrames.Index, Vector{Int64}, Vector{Int64}}, UnitRange{Int64}}}; cols::Symbol)
     @ DataFrames ~/.julia/dev/DataFrames/src/abstractdataframe/abstractdataframe.jl:1556
...

So as soon as it gets a heterogeneous tuple (one has an Index and the other a SubIndex), it throws an UndefVarError.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But what I do not understand is the fact that you use DF <: AbstractDataFrame (instead of the original implementation). And if I try to do a MWE I get:

julia> using DataFrames

julia> f(::Tuple{Vararg{DF}}) where {DF<:AbstractDataFrame} = "matched"
f (generic function with 1 method)

julia> df = DataFrame(a=1,b=2)
1×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      2

julia> df1 = @view df[:, :]
1×2 SubDataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      2

julia> df2 = @view df[:, 1:1]
1×1 SubDataFrame
 Row │ a
     │ Int64
─────┼───────
   1 │     1

julia> f((df, df))
"matched"

julia> f((df1, df1))
"matched"

julia> f((df2, df2))
"matched"

julia> f((df, df1))
ERROR: MethodError: no method matching f(::Tuple{DataFrame,SubDataFrame{DataFrame,DataFrames.Index,Base.OneTo{Int64}}})

julia> f((df1, df2))
ERROR: MethodError: no method matching f(::Tuple{SubDataFrame{DataFrame,DataFrames.Index,Base.OneTo{Int64}},SubDataFrame{DataFrame,DataFrames.SubIndex{DataFrames.Index,UnitRange{Int64},UnitRange{Int64}},Base.OneTo{Int64}}})

julia> h(::Tuple{Vararg{AbstractDataFrame}}) = "matched"
h (generic function with 1 method)

julia> h((df, df))
"matched"

julia> h((df, df1))
"matched"

julia> h((df2, df1))
"matched"

so it seems that the pattern Tuple{Vararg{DF}} where DF <: AbstractDataFrame is not the same as the original and intended Tuple{Vararg{AbstractDataFrame}}

Copy link
Contributor Author

@timholy timholy Nov 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this seems to exploit a relatively recent & subtle change in how type parameters. Basically the Tuple{Vararg{DF}} is requiring a single concrete value to set DF, but signature matching is the same as <:AbstractDataFrame. EDIT: but your example shows the matching is incomplete.

Another way we could do this is nominal-runtime:

DF = typeof(first(dfs))
if isconcretetype(DF) && all(df->typeof(df) === DF, dfs) ... end

Do you like that better?

My workflow for these things isn't very deliberative; I typically write it first without the @isdefined, and then when I get an error like the one above I add @isdefined. If you grep this in Base you'll see quite a few similar uses.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh dear, now I see your point with the two different dfs. Very interesting. I'll do the runtime version perhaps.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would separate AbstractVector and Tuple implementations then.
Also for Tuple please make sure that in the run-time version a case when the tuple is empty is correctly handled.

I have now noticed a subtle bug that should be fixed:

julia> reduce(vcat, ())
ERROR: ArgumentError: reducing over an empty collection is not allowed

julia> using DataFrames

julia> reduce(vcat, ())
0×0 DataFrame

To be decided if we keep:

julia> reduce(vcat, AbstractDataFrame[])
0×0 DataFrame

or also error here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is weird. It does match in situ, even though your example fails for me. 😕

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you show me the output of the same commands in the fresh Julia session (and what Julia are you using I am on 1.5.3). Thank you! (extending functions from Base is hard)

@timholy
Copy link
Contributor Author

timholy commented Nov 24, 2020

That's interesting: versions older than Julia 1.5 don't dispatch to the new version of reduce. I pushed a commit restricting that change to versions 1.5 and above.

@bkamins bkamins added the bug label Nov 24, 2020
Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

These are both concrete types so parametrizing/nospecialize don't
have any effect.
Copy link
Member

@bkamins bkamins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still have the problem:

julia> reduce(vcat, ())
0×0 DataFrame

in this branch

@timholy
Copy link
Contributor Author

timholy commented Nov 25, 2020

An even simpler approach seems to work pretty well. I was worried about the lack of specialization hurting performance, but given that a single return-type annotation fixes inference on AbstractDataFrame for this function, that proves unfounded. This branch:

julia> using DataFrames, BenchmarkTools

julia> df = DataFrame(a=1,b=2);

julia> df1 = @view df[:, :];

julia> df2 = @view df[:, 1:1];

julia> cols = [:a];

julia> @btime vcat($df1, $df2; cols=$cols);
  4.527 μs (88 allocations: 6.69 KiB)

julia> @btime vcat($df, $df);
  5.719 μs (79 allocations: 5.98 KiB)

vs 0.22.1:

julia> @btime vcat($df1, $df2; cols=$cols);
  5.697 μs (97 allocations: 6.78 KiB)

julia> @btime vcat($df, $df);
  7.484 μs (88 allocations: 6.39 KiB)

And this implementation gives approximately the same speedup of running the cat.jl tests (which is much more about compile-time latency than runtime performance).

I threw in a couple of changes to signatures that I noticed during other perusals, though I can strip them out if you'd prefer:

  • AsTable is concrete, so <:AsTable is the same as AsTable
  • Vector{Any} is actually a fully-specialized type (its elements are not, but the container is). Hence @nospecialize(v::Vector{Any}) doesn't do anything.

@bkamins bkamins merged commit 86a5ee6 into JuliaData:master Nov 25, 2020
@bkamins
Copy link
Member

bkamins commented Nov 25, 2020

Thank you! I will add tests in a separate PR.

@timholy timholy deleted the teh/inference branch November 25, 2020 17:31
nalimilan pushed a commit that referenced this pull request Nov 25, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug non-breaking The proposed change is not breaking performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants