Ideas for better static type inference #451
Comments
Those are intriguing ideas, Simon. I've been thinking about trying to make
Maybe we could have a "loose" DataFrame type and another "locked-in"
On Sun, Dec 15, 2013 at 7:08 PM, Simon Kornblith
Actually, it should be totally possible to make this work. It would just mean creating a new tuple, which might not be efficient for DataFrames with a lot of columns, but it should work.
But we need to parametrize the DataFrame by the type of the tuple to get type inference on it, and if we do that, we can't then go change that type without creating a new DataFrame.
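
To make the constraint in this comment concrete, here is a minimal sketch written in current Julia syntax (`struct` rather than the `immutable` used elsewhere in this thread). The names `TupleFrame` and `withcolumn` are hypothetical and are not part of DataFrames.jl. Because the column types appear in the type parameter, "adding" a column has to return a new object of a different type.

```julia
# Hypothetical sketch; not the DataFrames.jl internals.
struct TupleFrame{T<:Tuple}
    names::Vector{String}
    columns::T            # e.g. a tuple of column vectors
end

# Adding a column cannot mutate `df`: the new column's type would have to
# show up in the type parameter T. Instead, build and return a new frame.
function withcolumn(df::TupleFrame, name::String, col::AbstractVector)
    return TupleFrame(vcat(df.names, name), (df.columns..., col))
end

df  = TupleFrame(["a", "b"], ([1, 2, 3], [1.5, 2.5, 3.5]))
df2 = withcolumn(df, "c", ["x", "y", "z"])

# `df` is unchanged, and `df2` has a different concrete type.
println(typeof(df))
println(typeof(df2))
```

Splatting the old tuple into a new one is the step the previous comment flags as potentially inefficient for frames with many columns.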
Okay, that's true with your proposal. How about skipping that parameterization, but under the hood, still
getindex(df::DataFrame, row_ind::Real, col_ind::ColumnIndex) =
function df_getindex{T}(columns::T, row_ind::Real, col_ind::ColumnIndex)
Would this work in a similar manner to your initial proposal?
Kevin
On Sun, Dec 15, 2013 at 5:04 PM, Simon Kornblith
This is definitely a really cool idea. One thing we might do before making any decisions to scrap the current internals is to finally build up some reliable benchmarks. It seems, from a shallow reading, like the proposal would improve some things and worsen others. Assuming that's true, it would be nice to know that more benchmarks are improved by this change than are worsened. It seems like that should be the case, but it would be worth testing.
As I think about it more, this approach doesn't actually help much, because type inference for tuple indexing with a constant index is implemented by a t-function. Given:

immutable A{T}
a::T
end
Base.getindex(x::A, i::Int) = x.a[i]
f(x) = (y = x[1]; y)

then:

julia> code_typed(f, ((Int, Float64),))
1-element Array{Any,1}:
:($(Expr(:lambda, {:x}, {{:y},{{:x,(Int64,Float64),0},{:y,Int64,18}},{}}, quote # none, line 1:
y = tupleref(x::(Int64,Float64),1)::Union(Float64,Int64)
return y::Int64
end)))
julia> code_typed(f, (A{(Int, Float64)},))
1-element Array{Any,1}:
:($(Expr(:lambda, {:x}, {{:y},{{:x,A{(Int64,Float64)},0},{:y,Union(Float64,Int64),18}},{}}, quote # none, line 1:
y = tupleref(top(getfield)(x::A{(Int64,Float64)},:a)::(Int64,Float64),1)::Union(Float64,Int64)
return y::Union(Float64,Int64)
end)))

So we don't get the concrete type information we'd like if we index into a type that wraps the tuple with a constant index. To get concrete type information, we'd need "rerun type inference after inlining in some cases" from JuliaLang/julia#3440 or our own t-function. OTOH, even knowing the union type might be better than nothing, depending on what later code does.

I also worry that we'd be compiling code for every combination of DataFrame column types, which might be bad if there is code that uses a large number of DataFrames, but is probably fine in general.

@kmsquire I don't think that would help much, since calling

@johnmyleswhite I agree that any decisions we make should be informed by benchmarks.
Closing in favor of #1335.
I had a discussion with @malmaud today about type inference for DataFrames. The core problem is that when you index into a DataFrame, under the hood, we index into a Vector{Any} and so you get no type information about the result. Thus, if you want to iterate over one or more columns of a DataFrame, the only reasonably performant way to do so is to first pull out the columns with typeasserts (or pass them to a function) and then iterate over these columns, which is not really idiomatic.

Part of one potential solution would be to make the columns of a DataFrame a tuple instead of a Vector{Any}, and to parametrize the DataFrame by the type of this tuple. Then we'd have type information for the columns if referenced by index, and we'd at least have a type union for indexing by a string. The downside to this approach is that we wouldn't be able to add columns to an existing DataFrame, only create new DataFrames with more columns or re-ordered columns. Syntax like df["newcol"] = x wouldn't work.

This could make code that indexes a DataFrame as df[i, 1] nearly as fast as indexing a corresponding Array. However, ideally code that indexes a DataFrame as df[i, "mycol"] would also be as fast as indexing an Array. This doesn't seem to be possible without changes in Base. We'd have to push the process of looking up the column name in the Index to type inference time, so that 1) we don't have to perform the lookup on each loop iteration and 2) type inference has information about the type of the column. We'd either need a way to hook into the type inference process or some kind of "named tuple" primitive that has type information for each member like a tuple but can be indexed with strings.
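
The contrast described in this post can be sketched as follows. This is a hedged illustration in current Julia syntax, not the DataFrames.jl implementation: `LooseFrame`, `TupleFrame`, `sumcol2`, `sumvec`, and `sumcol2_barrier` are all hypothetical names. It shows columns held in a Vector{Any}, the "pass them to a function" workaround, and columns held in a tuple whose type parametrizes the frame.

```julia
# Hypothetical types for illustration only.

# Columns stored as Vector{Any}: the compiler does not know a column's
# element type at the point where the frame is indexed.
struct LooseFrame
    columns::Vector{Any}
end
Base.getindex(df::LooseFrame, row::Integer, col::Integer) = df.columns[col][row]

# Columns stored as a tuple whose type is a parameter of the frame.
struct TupleFrame{T<:Tuple}
    columns::T
end
Base.getindex(df::TupleFrame, row::Integer, col::Integer) = df.columns[col][row]

# The same row-by-row loop, usable with either frame.
function sumcol2(df)
    s = 0.0
    for i in 1:length(df.columns[2])
        s += df[i, 2]
    end
    return s
end

# The workaround from the post: pull the column out once, then hand it to
# a function that is compiled for the column's concrete type.
sumvec(v::AbstractVector) = sum(v)
sumcol2_barrier(df::LooseFrame) = sumvec(df.columns[2])

loose = LooseFrame(Any[[1, 2, 3], [1.5, 2.5, 3.5]])
tight = TupleFrame(([1, 2, 3], [1.5, 2.5, 3.5]))

println(sumcol2(loose), " ", sumcol2_barrier(loose), " ", sumcol2(tight))
```

All three calls compute the same sum; the difference is in what the compiler can know about `df[i, 2]` inside the loop. Comparing `@code_warntype sumcol2(loose)` with `@code_warntype sumcol2(tight)` in the REPL is one way to inspect what inference recovers in each case. Indexing by column name is not covered by this sketch.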