NamedTuple backing or switchable? #1949

ChrisRackauckas · 2019-09-08T00:24:20Z

https://github.com/JuliaData/DataFrames.jl/blob/master/src/dataframe/dataframe.jl#L95

That choice seems to be a cause of almost all of the performance issues going on with DataFrames.jl. Basically, if you use anything out of a DataFrame you need a function barrier or a type assertion or it won't be inferred. Ouch. But it doesn't truly seem like it's necessary. So let's reason a little bit what a NamedTuple backing would cause to be done differently.

If that was a NamedTuple, then what would be set in stone would be the column names and their type. Everything coming out would be inferred given constant prop from the name. This would mean use with literals would be fast, while looping over all columns would fall back to the current speed. Since with this kind of data, names actually matter (i.e. some columns might be strings, so the mean of all columns doesn't really make sense), I think that fallback case would be rare.

Because everything would still be mutable arrays under the hood, mutating values would all still work the same. The only difference would be changes in structure. names! wouldn't work, but names swapping out of place would. Since you'd be just creating a new NamedTuple of pointers, that wouldn't have a meaningful performance regression since no actual arrays are ever created. In fact, since DataFrame is already a struct, a lot of the usage on it already doesn't assume it's mutable, so making its column setup immutable shouldn't be all that much work.

The only real difficulty is compiler strain. Julia 1.0 changed a lot of the tuple optimizations, so 100 tuples is usually fine as long as you don't splat now. However, 1000 tuples? I don't know (it might still be better for compile times since function barriers and inference failures also hurt compile times).

For this reason I think making a switchable backing is an option. The way I would do this is as follows. Allow an option typedcols=false in the DataFrame construction, which if true makes it be a NamedTuple{...}. Add a getproperty implementation such that, if it's a NamedTuple then colindex just uses the type information from the NamedTuple, otherwise it uses a stored vector (and in the NamedTuple case that would just be nothing). Or there could be a TypedDataFrame.

Anyways, I see a use case for both of them because of the column number issue, but I'm not sure huge numbers of columns is the standard use case, so I'm not sure about the defaults and the choice to specialize on arbitrary size. But, looking at the implementation, I don't think a different backing would be that hard to do.

The text was updated successfully, but these errors were encountered:

bkamins · 2019-09-08T03:57:09Z

See #1335.

In particular #1335 (comment) (it is still the case on Julia 1.2) and #1335 (comment) about possible design ideas.

since DataFrame is already a struct

It can be a struct because we have mutable fields in it.

For this reason I think making a switchable backing is an option.

I agree. This is what @nalimilan was working on some time ago. I am not sure what is the state of this work right now. I am focusing on DataFrames.jl API level changes so if someone would be willing to go back to this issue there should not be effort overlap.

Random thoughts of things to consider:

how to use AbstractIndex (we would have a duplicate information about column names).
should we switch DataFrame type to be mutable (to allow replacement of the "column storage container").
should we make DataFrame parametric or have two separate types.
There are parts of code that silently assume DataFrame is Vector{AbstractVector} based - they would have to be tracked down (this should not be too hard).

nalimilan · 2019-09-08T16:35:55Z

Closing in favor of #1335.

ChrisRackauckas · 2019-09-08T16:40:02Z

Thanks! Great to see this is being attempted!

nalimilan closed this as completed Sep 8, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NamedTuple backing or switchable? #1949

NamedTuple backing or switchable? #1949

ChrisRackauckas commented Sep 8, 2019

bkamins commented Sep 8, 2019

nalimilan commented Sep 8, 2019

ChrisRackauckas commented Sep 8, 2019

NamedTuple backing or switchable? #1949

NamedTuple backing or switchable? #1949

Comments

ChrisRackauckas commented Sep 8, 2019

bkamins commented Sep 8, 2019

nalimilan commented Sep 8, 2019

ChrisRackauckas commented Sep 8, 2019