Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NamedTuple backing or switchable? #1949

Closed
ChrisRackauckas opened this issue Sep 8, 2019 · 3 comments
Closed

NamedTuple backing or switchable? #1949

ChrisRackauckas opened this issue Sep 8, 2019 · 3 comments

Comments

@ChrisRackauckas
Copy link

https://github.com/JuliaData/DataFrames.jl/blob/master/src/dataframe/dataframe.jl#L95

That choice seems to be a cause of almost all of the performance issues going on with DataFrames.jl. Basically, if you use anything out of a DataFrame you need a function barrier or a type assertion or it won't be inferred. Ouch. But it doesn't truly seem like it's necessary. So let's reason a little bit what a NamedTuple backing would cause to be done differently.

If that was a NamedTuple, then what would be set in stone would be the column names and their type. Everything coming out would be inferred given constant prop from the name. This would mean use with literals would be fast, while looping over all columns would fall back to the current speed. Since with this kind of data, names actually matter (i.e. some columns might be strings, so the mean of all columns doesn't really make sense), I think that fallback case would be rare.

Because everything would still be mutable arrays under the hood, mutating values would all still work the same. The only difference would be changes in structure. names! wouldn't work, but names swapping out of place would. Since you'd be just creating a new NamedTuple of pointers, that wouldn't have a meaningful performance regression since no actual arrays are ever created. In fact, since DataFrame is already a struct, a lot of the usage on it already doesn't assume it's mutable, so making its column setup immutable shouldn't be all that much work.

The only real difficulty is compiler strain. Julia 1.0 changed a lot of the tuple optimizations, so 100 tuples is usually fine as long as you don't splat now. However, 1000 tuples? I don't know (it might still be better for compile times since function barriers and inference failures also hurt compile times).

For this reason I think making a switchable backing is an option. The way I would do this is as follows. Allow an option typedcols=false in the DataFrame construction, which if true makes it be a NamedTuple{...}. Add a getproperty implementation such that, if it's a NamedTuple then colindex just uses the type information from the NamedTuple, otherwise it uses a stored vector (and in the NamedTuple case that would just be nothing). Or there could be a TypedDataFrame.

Anyways, I see a use case for both of them because of the column number issue, but I'm not sure huge numbers of columns is the standard use case, so I'm not sure about the defaults and the choice to specialize on arbitrary size. But, looking at the implementation, I don't think a different backing would be that hard to do.

@bkamins
Copy link
Member

bkamins commented Sep 8, 2019

See #1335.

In particular #1335 (comment) (it is still the case on Julia 1.2) and #1335 (comment) about possible design ideas.

since DataFrame is already a struct

It can be a struct because we have mutable fields in it.

For this reason I think making a switchable backing is an option.

I agree. This is what @nalimilan was working on some time ago. I am not sure what is the state of this work right now. I am focusing on DataFrames.jl API level changes so if someone would be willing to go back to this issue there should not be effort overlap.

Random thoughts of things to consider:

  1. how to use AbstractIndex (we would have a duplicate information about column names).
  2. should we switch DataFrame type to be mutable (to allow replacement of the "column storage container").
  3. should we make DataFrame parametric or have two separate types.
  4. There are parts of code that silently assume DataFrame is Vector{AbstractVector} based - they would have to be tracked down (this should not be too hard).

@nalimilan
Copy link
Member

Closing in favor of #1335.

@ChrisRackauckas
Copy link
Author

Thanks! Great to see this is being attempted!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants