Improve inference in vcat
#2559
AbstractVector{<:AbstractString}}=:setequal) =
_vcat([df for df in dfs if ncol(df) != 0]; cols=cols)
AbstractVector{<:AbstractString}}=:setequal) where DF<:AbstractDataFrame
if @isdefined(DF) && isconcretetype(DF)
Why is calling @isdefined(DF) required here?
Tuple inputs with 2 or more different types.
But then isconcretetype(DF) should return false, right? (Or would it error?)
If I take the @isdefined check out and preface that conditional with a @show typeof(dfs), here's what I get from running the cat.jl test:
julia> include("cat.jl")
[ Info: Precompiling DataFrames [a93c6f00-e57d-5684-b7b6-d8193f3e46c0]
Test Summary: | Pass Total
hcat | 10 10
Test Summary: | Pass Total
hcat: copying | 26 26
Test Summary: | Pass Total
hcat ::AbstractDataFrame | 2 2
Test Summary: | Pass Total
hcat ::AbstractDataFrame | 2 2
Test Summary: | Pass Total
hcat ::AbstractVectors | 11 11
Test Summary: | Pass Total
hcat: copycols | 76 76
typeof(dfs) = Tuple{DataFrames.DataFrame}
typeof(dfs) = Tuple{DataFrames.DataFrame, DataFrames.DataFrame}
typeof(dfs) = Tuple{DataFrames.DataFrame}
typeof(dfs) = Tuple{DataFrames.DataFrame, DataFrames.DataFrame}
typeof(dfs) = Tuple{DataFrames.DataFrame, DataFrames.DataFrame}
typeof(dfs) = Tuple{DataFrames.DataFrame, DataFrames.DataFrame}
typeof(dfs) = Tuple{DataFrames.DataFrame, DataFrames.DataFrame}
typeof(dfs) = Tuple{DataFrames.DataFrame, DataFrames.DataFrame}
typeof(dfs) = Tuple{DataFrames.DataFrame, DataFrames.DataFrame}
typeof(dfs) = Tuple{DataFrames.DataFrame, DataFrames.DataFrame, DataFrames.DataFrame}
typeof(dfs) = Tuple{DataFrames.DataFrame, DataFrames.DataFrame}
typeof(dfs) = Tuple{DataFrames.DataFrame, DataFrames.DataFrame}
typeof(dfs) = NTuple{4, DataFrames.DataFrame}
typeof(dfs) = Tuple{DataFrames.DataFrame, DataFrames.DataFrame}
typeof(dfs) = Tuple{DataFrames.SubDataFrame{DataFrames.DataFrame, DataFrames.Index, UnitRange{Int64}}, DataFrames.SubDataFrame{DataFrames.DataFrame, DataFrames.SubIndex{DataFrames.Index, Vector{Int64}, Vector{Int64}}, UnitRange{Int64}}}
vcat: Error During Test at /home/tim/.julia/dev/DataFrames/test/cat.jl:269
Test threw exception
Expression: vcat(view(df, 1:2, :), view(df, 3:5, [3, 2, 1])) == df
UndefVarError: DF not defined
Stacktrace:
[1] reduce(::typeof(vcat), dfs::Tuple{DataFrames.SubDataFrame{DataFrames.DataFrame, DataFrames.Index, UnitRange{Int64}}, DataFrames.SubDataFrame{DataFrames.DataFrame, DataFrames.SubIndex{DataFrames.Index, Vector{Int64}, Vector{Int64}}, UnitRange{Int64}}}; cols::Symbol)
@ DataFrames ~/.julia/dev/DataFrames/src/abstractdataframe/abstractdataframe.jl:1556
...
So as soon as it gets a heterogeneous tuple (one has an Index and the other a SubIndex), it throws an UndefVarError.
But what I do not understand is why you use DF <: AbstractDataFrame (instead of the original implementation). If I try an MWE, I get:
julia> using DataFrames
julia> f(::Tuple{Vararg{DF}}) where {DF<:AbstractDataFrame} = "matched"
f (generic function with 1 method)
julia> df = DataFrame(a=1,b=2)
1×2 DataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 2
julia> df1 = @view df[:, :]
1×2 SubDataFrame
Row │ a b
│ Int64 Int64
─────┼──────────────
1 │ 1 2
julia> df2 = @view df[:, 1:1]
1×1 SubDataFrame
Row │ a
│ Int64
─────┼───────
1 │ 1
julia> f((df, df))
"matched"
julia> f((df1, df1))
"matched"
julia> f((df2, df2))
"matched"
julia> f((df, df1))
ERROR: MethodError: no method matching f(::Tuple{DataFrame,SubDataFrame{DataFrame,DataFrames.Index,Base.OneTo{Int64}}})
julia> f((df1, df2))
ERROR: MethodError: no method matching f(::Tuple{SubDataFrame{DataFrame,DataFrames.Index,Base.OneTo{Int64}},SubDataFrame{DataFrame,DataFrames.SubIndex{DataFrames.Index,UnitRange{Int64},UnitRange{Int64}},Base.OneTo{Int64}}})
julia> h(::Tuple{Vararg{AbstractDataFrame}}) = "matched"
h (generic function with 1 method)
julia> h((df, df))
"matched"
julia> h((df, df1))
"matched"
julia> h((df2, df1))
"matched"
So it seems that the pattern Tuple{Vararg{DF}} where DF <: AbstractDataFrame is not the same as the original and intended Tuple{Vararg{AbstractDataFrame}}.
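The same difference can be reproduced without DataFrames. The sketch below uses Real in place of AbstractDataFrame; the function names `same_type` and `any_real` are hypothetical and exist only for illustration. It relies on Julia's rule that a type variable repeated in a tuple (as Vararg{T} does) must be bound to one concrete type.

```julia
# T must be a single concrete type shared by every element of the tuple.
same_type(::Tuple{Vararg{T}}) where {T<:Real} = "matched"

# Each element only has to be some subtype of Real.
any_real(::Tuple{Vararg{Real}}) = "matched"

println(same_type((1, 2)))               # homogeneous tuple: both signatures apply
println(any_real((1, 2.0)))              # heterogeneous tuple: only this one applies
println(applicable(same_type, (1, 2.0))) # false: no single concrete T fits Int and Float64
```

So a signature written with a shared type parameter silently narrows the set of accepted tuples compared with the plain abstract element type.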
Yes, this seems to exploit a relatively recent & subtle change in how type parameters are handled. Basically the Tuple{Vararg{DF}} is requiring a single concrete value to set DF, but signature matching is the same as <:AbstractDataFrame. EDIT: but your example shows the matching is incomplete.
Another way we could do this is nominal-runtime:
DF = typeof(first(dfs))
if isconcretetype(DF) && all(df->typeof(df) === DF, dfs) ... end
Do you like that better?
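A minimal sketch of that runtime check, written against plain tuples so it runs without DataFrames. The helper name `common_concrete_type` and its `fallback` argument are assumptions for illustration, not code from this PR; the empty-tuple guard is added because `first` throws on an empty collection.

```julia
# Return the shared concrete type of all elements, or `fallback` if the
# tuple is empty or its elements have different types.
function common_concrete_type(xs::Tuple, fallback::Type)
    isempty(xs) && return fallback          # first(()) would throw
    T = typeof(first(xs))
    if isconcretetype(T) && all(x -> typeof(x) === T, xs)
        return T
    end
    return fallback
end

println(common_concrete_type((1, 2, 3), Real))
println(common_concrete_type((1, 2.0), Real))
println(common_concrete_type((), Real))
```

In the PR's setting the fallback would be AbstractDataFrame, and the result would decide which element type the collection is standardized to before calling the worker function.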
My workflow for these things isn't very deliberative; I typically write it first without the @isdefined, and then when I get an error like the one above I add @isdefined. If you grep this in Base you'll see quite a few similar uses.
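For readers unfamiliar with the idiom, here is a small sketch of one well-known way a static parameter ends up unbound: a Vararg method called with zero arguments. The function name `describe_eltype` is hypothetical. Whether this is the same mechanism as the heterogeneous-tuple failure above is not established in this thread.

```julia
# With zero arguments nothing binds T, so referencing T directly would
# throw an UndefVarError; @isdefined guards the access.
describe_eltype(xs::T...) where {T} = @isdefined(T) ? "T = $(T)" : "T is unbound"

println(describe_eltype(1, 2))
println(describe_eltype())
```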
Oh dear, now I see your point with the two different dfs. Very interesting. I'll do the runtime version perhaps.
I would separate the AbstractVector and Tuple implementations then.
Also, for Tuple, please make sure that the run-time version correctly handles the case when the tuple is empty.
I have now noticed a subtle bug that should be fixed:
julia> reduce(vcat, ())
ERROR: ArgumentError: reducing over an empty collection is not allowed
julia> using DataFrames
julia> reduce(vcat, ())
0×0 DataFrame
To be decided if we keep:
julia> reduce(vcat, AbstractDataFrame[])
0×0 DataFrame
or also error here.
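The empty-tuple behavior follows from how Vararg matches: a Vararg tuple type also matches the empty tuple, so a method on Tuple{Vararg{AbstractDataFrame}} captures reduce(vcat, ()) for everyone. A sketch with AbstractString standing in for AbstractDataFrame; `join_all` is a hypothetical name, and requiring a leading element is only one possible fix, not necessarily the one adopted here.

```julia
# The empty tuple matches any Vararg tuple type...
println(() isa Tuple{Vararg{AbstractString}})

# ...but not a signature that requires at least one element.
println(() isa Tuple{AbstractString, Vararg{AbstractString}})

# A method restricted this way is never selected for an empty tuple.
join_all(xs::Tuple{AbstractString, Vararg{AbstractString}}) = join(xs)

println(join_all(("a", "b")))
println(applicable(join_all, ()))
```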
That is weird. It does match in situ, even though your example fails for me. 😕
Can you show me the output of the same commands in a fresh Julia session? And which Julia version are you using? I am on 1.5.3. Thank you! (Extending functions from Base is hard.)
That's interesting: versions older than Julia 1.5 don't dispatch to the new version of |
Thanks!
These are both concrete types so parametrizing/nospecialize don't have any effect.
We still have the problem:
julia> reduce(vcat, ())
0×0 DataFrame
in this branch
An even simpler approach seems to work pretty well. I was worried about the lack of specialization hurting performance, but given that a single return-type annotation fixes inference on
julia> using DataFrames, BenchmarkTools
julia> df = DataFrame(a=1,b=2);
julia> df1 = @view df[:, :];
julia> df2 = @view df[:, 1:1];
julia> cols = [:a];
julia> @btime vcat($df1, $df2; cols=$cols);
4.527 μs (88 allocations: 6.69 KiB)
julia> @btime vcat($df, $df);
5.719 μs (79 allocations: 5.98 KiB)
vs 0.22.1:
julia> @btime vcat($df1, $df2; cols=$cols);
5.697 μs (97 allocations: 6.78 KiB)
julia> @btime vcat($df, $df);
7.484 μs (88 allocations: 6.39 KiB)
And this implementation gives approximately the same speedup of running the
I threw in a couple of changes to signatures that I noticed during other perusals, though I can strip them out if you'd prefer:
Co-authored-by: Bogumił Kamiński <[email protected]>
Thank you! I will add tests in a separate PR.
Master:
This PR:
This is of course from a fresh session in both cases, and all the gains are to inference time on first use.
The idea is that improvements in inference quality let your precompile statements do you more good because they reach deeper into the call stack. It may be helpful to outline some of the principles that guide these changes:
reduce tries to "standardize" the types that get passed to _vcat; any DF that is fully concrete can be specialized, otherwise we standardize on AbstractDataFrame (which is already used in some places) as the fallback. This reduces the number of times you have to infer _vcat for abstract types.
_vcat had some inference problems from the infamous julia#15276
AbstractDataFrame are likely to return poorly-inferred types (they may be well-inferred for any concrete subtype, but poorly inferred for the abstract type). However, in some cases you can fix them, as exemplified by annotating the return type of names. This stabilizes a lot of _vcat's body when dfs::Vector{AbstractDataFrame}.
There is one part of _vcat that is still poorly inferred:
DataFrames.jl/src/abstractdataframe/abstractdataframe.jl
Lines 1617 to 1623 in 8645651
Basically anything that uses newcols is uninferrable. I didn't know enough to fix it. However, if that Iterators.repeated is just going to be expanded to a Vector{Missing} anyway (I'm not sure it will), then just writing it in terms of fill instead might fix it. Another option would be to annotate the type of newcols itself, or to annotate lens as a Vector{Int}. However, I expect these to be relatively minor gains.
I discovered these through a combination of @snoopi, @snoopi_deep (which still needs documentation), and of course Cthulhu (which sadly seems to have broken on 1.6 in the last few days, see JuliaDebug/Cthulhu.jl#100. For the moment one has to diagnose problems on 1.5 and analyze the impact of improvements on 1.6).
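To make the return-type annotation principle concrete, here is a small sketch that does not depend on DataFrames. All names (`AbstractTable`, `Table`, `raw_names`, `annotated_names`) are hypothetical stand-ins; the actual change in this PR annotates names for AbstractDataFrame. The point is that a declared return type gives inference a concrete answer even when the argument is only known to be abstract.

```julia
abstract type AbstractTable end     # stand-in for AbstractDataFrame

struct Table <: AbstractTable
    colnames::Vector{String}
end

# For an argument known only as AbstractTable, inference cannot see which
# concrete struct (and therefore which field type) it is dealing with.
raw_names(t::AbstractTable) = getfield(t, :colnames)

# The declared return type inserts a convert plus a type assertion, so
# callers get Vector{String} regardless of the concrete subtype.
annotated_names(t::AbstractTable)::Vector{String} = raw_names(t)

# Inspect what inference concludes for the abstract argument type.
println(Base.return_types(raw_names, (AbstractTable,)))
println(Base.return_types(annotated_names, (AbstractTable,)))
```

Code downstream of the annotated accessor can then be inferred with concrete types even when it is compiled for a Vector of the abstract type, which is the effect described for the body of _vcat.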