NamedTuple backing or switchable? #1949
Comments
See #1335. In particular #1335 (comment) (it is still the case on Julia 1.2) and #1335 (comment) about possible design ideas.
It can be a
I agree. This is what @nalimilan was working on some time ago. I am not sure what the state of this work is right now. I am focusing on DataFrames.jl API-level changes, so if someone is willing to go back to this issue there should be no overlap of effort. Random thoughts of things to consider:
Closing in favor of #1335.
Thanks! Great to see this is being attempted!
https://github.com/JuliaData/DataFrames.jl/blob/master/src/dataframe/dataframe.jl#L95
That choice seems to be the cause of almost all of the performance issues with DataFrames.jl. Basically, if you use anything out of a DataFrame, you need a function barrier or a type assertion, or it won't be inferred. Ouch. But it doesn't seem truly necessary. So let's reason a little about what a NamedTuple backing would change.
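To make the function-barrier point concrete, here is a minimal sketch. It assumes the columns are stored as a `Vector{AbstractVector}`; `UntypedTable`, `getcol`, `sum_kernel`, and `sum_column` are hypothetical names for illustration, not DataFrames.jl API.

```julia
# Hypothetical stand-in for a table whose columns are stored without
# concrete type information (an assumption, not the real DataFrame struct).
struct UntypedTable
    columns::Vector{AbstractVector}
    names::Vector{Symbol}
end

# Column lookup by name: the compiler only knows the result is some AbstractVector.
getcol(t::UntypedTable, name::Symbol) = t.columns[findfirst(==(name), t.names)]

# Kernel that gets compiled separately for each concrete column type it is called with.
function sum_kernel(col::AbstractVector)
    s = zero(eltype(col))
    for x in col
        s += x
    end
    return s
end

# Function barrier: one dynamic dispatch at the call to sum_kernel,
# after which the loop runs on a concretely typed vector.
sum_column(t::UntypedTable, name::Symbol) = sum_kernel(getcol(t, name))

t = UntypedTable(AbstractVector[[1.0, 2.0, 3.0], [1, 2, 3]], [:a, :b])
sum_column(t, :a)
```

Without the barrier, a loop written directly against `getcol(t, :a)` would work on a value whose element type is unknown at compile time.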
If that were a NamedTuple, then what would be set in stone would be the column names and their types. Everything coming out would be inferred given constant prop from the name. This would mean use with literals would be fast, while looping over all columns would fall back to the current speed. Since with this kind of data, names actually matter (i.e. some columns might be strings, so the mean of all columns doesn't really make sense), I think that fallback case would be rare.
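As a rough illustration of the literal-versus-runtime-name distinction, the sketch below uses a plain NamedTuple of vectors. `mean_a` and `column_length` are hypothetical helpers, and the NamedTuple is only a stand-in for the proposed backing.

```julia
# A NamedTuple of columns: names and column types are part of typeof(cols).
cols = (a = [1.0, 2.0, 3.0], b = ["x", "y", "z"])

# Literal name: `nt.a` is getproperty(nt, :a) with a constant symbol, so the
# column type can be read off the NamedTuple type.
mean_a(nt) = sum(nt.a) / length(nt.a)

# Runtime name: the compiler only knows the result is one of the column types.
column_length(nt, name::Symbol) = length(getproperty(nt, name))

mean_a(cols)
lengths = [column_length(cols, n) for n in keys(cols)]
```

`fieldtype(typeof(cols), :a)` gives `Vector{Float64}`, which is the information a literal access can exploit.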
Because everything would still be mutable arrays under the hood, mutating values would all still work the same. The only difference would be changes in structure. `names!` wouldn't work, but swapping `names` out of place would. Since you'd just be creating a new NamedTuple of pointers, that wouldn't cause a meaningful performance regression, since no actual arrays are ever created. In fact, since `DataFrame` is already a `struct`, a lot of the usage on it already doesn't assume it's mutable, so making its column setup immutable shouldn't be all that much work.
The only real difficulty is compiler strain. Julia 1.0 changed a lot of the tuple optimizations, so tuples of 100 elements are usually fine as long as you don't splat now. However, 1000 elements? I don't know (it might still be better for compile times, since function barriers and inference failures also hurt compile times).
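The out-of-place renaming mentioned above could look roughly like this. `rename_cols` is a hypothetical helper, not a DataFrames.jl function; the point is only that the new NamedTuple shares the existing arrays.

```julia
# Hypothetical helper: build a new NamedTuple with different names that
# reuses the same column arrays (nothing is copied).
rename_cols(nt::NamedTuple, newnames::Tuple{Vararg{Symbol}}) =
    NamedTuple{newnames}(values(nt))

cols = (a = [1.0, 2.0, 3.0], b = ["x", "y", "z"])
renamed = rename_cols(cols, (:x, :y))

# The columns are the very same arrays, so mutation through one is visible
# through the other.
renamed.x[1] = 10.0
cols.a[1]
```

Only the small immutable wrapper is rebuilt, which is why the structural change is cheap regardless of the number of rows.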
For this reason I think making a switchable backing is an option. The way I would do this is as follows. Allow an option `typedcols=false` in the DataFrame construction, which if true makes it be a NamedTuple{...}. Add a `getproperty` implementation such that, if it's a NamedTuple, then `colindex` just uses the type information from the NamedTuple; otherwise it uses a stored vector (and in the NamedTuple case that would just be `nothing`). Or there could be a `TypedDataFrame`.
Anyways, I see a use case for both of them because of the column number issue, but I'm not sure huge numbers of columns is the standard use case, so I'm not sure about the defaults and the choice to specialize on arbitrary size. But, looking at the implementation, I don't think a different backing would be that hard to do.
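A minimal sketch of the switchable idea, under stated assumptions: `SketchFrame`, `make_frame`, and `column` are hypothetical names, the `typedcols` keyword mirrors the option proposed above, and none of this reflects how DataFrames.jl is actually implemented.

```julia
# Hypothetical container: C is either a NamedTuple (typed backing) or a
# Vector{AbstractVector} (untyped backing); colindex is `nothing` when the
# names already live in the NamedTuple type.
struct SketchFrame{C, I<:Union{Nothing, Vector{Symbol}}}
    columns::C
    colindex::I
end

# Hypothetical constructor with the proposed `typedcols` switch.
function make_frame(names::Vector{Symbol}, cols::AbstractVector; typedcols::Bool=false)
    if typedcols
        return SketchFrame(NamedTuple{Tuple(names)}(Tuple(cols)), nothing)
    else
        return SketchFrame(Vector{AbstractVector}(cols), names)
    end
end

# Typed backing: the column is found via the NamedTuple's type information.
column(cols::NamedTuple, ::Nothing, name::Symbol) = getfield(cols, name)
# Untyped backing: the column is found via the stored name vector.
column(cols::Vector{AbstractVector}, idx::Vector{Symbol}, name::Symbol) =
    cols[findfirst(==(name), idx)]

# `df.a` routes to the right lookup; getfield avoids recursing into getproperty.
Base.getproperty(df::SketchFrame, name::Symbol) =
    column(getfield(df, :columns), getfield(df, :colindex), name)

typed = make_frame([:a, :b], Any[[1.0, 2.0], ["x", "y"]]; typedcols=true)
untyped = make_frame([:a, :b], Any[[1.0, 2.0], ["x", "y"]])
typed.a
untyped.b
```

Putting the backing in a type parameter means both modes share one set of methods, at the cost of the compiler specializing on every distinct column layout in the typed mode.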