Generic performance problems #523
One extremely radical suggestion is to change DataArray's … so that …
Wouldn't a getindex that throws an exception on NA be about as performant as undefined behaviour?
Great summary, John. Handling the DataArrays typing issue is probably best done with isna/value as you outlined in JuliaStats/DataArrays.jl#71. For the DataFrames column indexing issue, there are a couple of options. The simplest is just to have a macro helper. For example, `@expose df a b` could be shorthand for:

```julia
a, b = df[:a], df[:b]
```

Another option is:

```julia
function dot3(df::DataFrame)
    @with df begin
        x = 0.0
        for i in 1:length(:a)
            x += value(:a)[i] * value(:b)[i]
        end
        x  # return the accumulator from the @with block itself
    end
end
```

This could be done along with …
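The `@expose` shorthand could be implemented as a small macro. Here is a hypothetical sketch; a `Dict{Symbol,Vector}` stands in for a DataFrame so the example is self-contained, but the expansion is the same `a, b = df[:a], df[:b]` rewrite described above:

```julia
# Hypothetical sketch of the proposed `@expose` helper; `df` here is a
# Dict{Symbol,Vector} standing in for a DataFrame so the example runs
# standalone. The real macro would index a DataFrame the same way.
macro expose(df, cols...)
    # Expand `@expose df a b` into `a = df[:a]; b = df[:b]`
    assignments = [:($(esc(c)) = $(esc(df))[$(QuoteNode(c))]) for c in cols]
    Expr(:block, assignments...)
end

df = Dict(:a => [1.0, 2.0, 3.0], :b => [4.0, 5.0, 6.0])
@expose df a b
# `a` and `b` are now concretely typed Vector{Float64} bindings, so
# downstream loops over them can be fully type-inferred.
```

The point of the rewrite is that once the columns are bound to local variables, Julia sees concrete vector types instead of an opaque DataFrame.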
I think we can throw exceptions just as efficiently.
@tshort, will the method you're proposing be type-inferrable? I think the main problem is that Julia compiles functions en masse and the types of the two columns can't be known from the input type of …
You're right on the type inference, John. That won't fix that.
Other than using the approach outlined in #471, I can't think of a way around the type inference issue for columns that doesn't require new Julia language features.
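The indirection workaround can be illustrated with a function barrier, a pattern Julia already supports: the outer function pulls the untyped columns out, and an inner kernel receives them as concretely typed arguments that the compiler specializes on. This is a hedged sketch with hypothetical names (`dot_kernel`, `dot_df`) and a `Dict{Symbol,Any}` standing in for a DataFrame's untyped column store:

```julia
# Columns stored untyped, as in a DataFrame: the outer function cannot be
# inferred, but the inner kernel is compiled for the concrete vector types
# it receives at the call site (the "function barrier" pattern).
cols = Dict{Symbol,Any}(:a => rand(1000), :b => rand(1000))

# Inner kernel: `a` and `b` have concrete types here, so the loop is fast.
function dot_kernel(a::Vector{Float64}, b::Vector{Float64})
    x = 0.0
    for i in 1:length(a)
        x += a[i] * b[i]
    end
    return x
end

# Outer function: one dynamic dispatch per call, not one per element.
dot_df(cols) = dot_kernel(cols[:a], cols[:b])
```

The dynamic dispatch cost is paid once when `dot_kernel` is called, rather than on every element access inside the loop.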
Actually, maybe …
Re: @ivarne's comment, unfortunately I think throwing errors on NA would cost us at least a little bit because it would make …

Another thing to consider is that the bounds checks for the …

```julia
function dot9(da::DataVector, db::DataVector)
    x = 0.0
    for i in 1:length(da)
        if !(Base.getindex_unchecked(da.na.chunks, i) || Base.getindex_unchecked(db.na.chunks, i))
            x += values(da)[i] * values(db)[i]
        end
    end
    return x
end
```

On my machine, this is nearly twice as fast as …

In the long term, it would be great if Julia could create more efficient code for the "naive" variants of these functions. I have a couple of ideas for this. First, type inference can already determine that …
I'm wondering whether it wouldn't be simpler to use bit patterns to signal NAs. All type inference issues would go away with such a design...
@nalimilan This is only possible for floating-point DataArrays, and I created an issue for it a while back (JuliaStats/DataArrays#5). For Bools, Ints, Strings, and other types, there are no bit patterns available that aren't otherwise valid. For DataFrames there's also the second problem of knowing the types of columns, which bit patterns can't fix.
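For `Float64` the idea works because NaN has many unused payload bits: one payload can be reserved to mean NA while ordinary computed NaNs use another. This is a hypothetical sketch (the constant and function names are invented, and this is not DataArrays' actual design); note the technique has no analogue for `Bool` or `Int`, where every bit pattern is already a valid value:

```julia
# Hypothetical bit-pattern NA for Float64 only. One reserved NaN payload
# means NA, so no separate mask array is needed; this is a sketch, not
# DataArrays' implementation.
const NA_PATTERN = 0x7ff00000deadbeef   # exponent all ones + nonzero mantissa: a NaN
na_float() = reinterpret(Float64, NA_PATTERN)
isna_float(x::Float64) = reinterpret(UInt64, x) == NA_PATTERN

function sum_nonna(v::Vector{Float64})
    total = 0.0
    for x in v
        isna_float(x) && continue       # skip NA entries by bit pattern
        total += x
    end
    return total
end

sum_nonna([1.0, na_float(), 3.0])  # 4.0
```

Because the sentinel is itself a NaN, it propagates through arithmetic the way an NA plausibly should, though it becomes indistinguishable from a genuine NaN after computation.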
Or, similarly, some sort of option to return a distinct value of the user's …
I coded an …

```julia
function dot9(df::DataFrame)
    @with df begin
        x = 0.0
        for i in 1:length(:a)
            x += values(:a)[i] * values(:b)[i]
        end
        x
    end
end
```

Here is a gist with the code for …
@tshort: Your approach is really cool. What do you think of creating a DataFramesMeta package to explore experimental applications of metaprogramming to DataFrames?
@johnmyleswhite, I'll create the metaprogramming package. Is that appropriate under JuliaStats, or should I just put it under my own account?
See the following for a DataFramesMeta package that includes …: https://github.com/tshort/DataFramesMeta.jl It's already better than the expression-based approaches we had. There are plenty of options to expand this. It needs more eyeballs to see where the kinks are. This package isn't in METADATA.
Thanks so much for doing this. I'm so much happier having this functionality in a place where it can grow until it's mature enough to merge into DataFrames. I'd be cool with moving this package into JuliaStats as long as we advertise right at the start that the package is experimental.
Looks great.
Tom, you've got the required permissions now.
I am new to Julia and still don't follow many details of the various proposed solutions I've read, including the @with macro from @tshort; I still have lots to learn on macros, expressions, symbols, etc., so please forgive me if my following attempt at offering two cents is lame. My understanding is that there are two problems: …
So here are my practical views on those.

1. NA overhead:
   1.1. …
   1.2. …
2. Type inference: to make a long story short here, we need to expose the column value types, but we don't have to expose each of them; we just need a practical approach to parametrizing a handful.
Those types would be very specific to each scenario, i.e., each data frame: they come with the specifics of the data frame at hand, for which we know which columns those types refer to. Then, in your code, you could annotate your arrays accordingly, and performance would follow. As an example, below are two dot functions exemplifying this approach. Does it make sense?
dot0 is the same as dot1 with type T explicit in Vector (it does not help performance); dot42 is the same as John's dot4; dot40 and dot41, not presented here, are variants of dot4's indirection that resolve type inference through dot0 and dot1 with direct access to the column arrays.
Combining the above NA and typed-DataFrame strategies, I think we'd end up with something like: …
My updated performance tests look as follows: …
0- Vector types as function params: 0.012684

We can simplify the parametric DataFrame definition a bit by removing the dummy variable for each type of interest, as the type annotation suffices; we just need to parametrize the type accessor functions …
We also don't need the type accessors (…). This strategy does not suffer from creating new data frames for every column addition or deletion, because the exposed types are conceptual: they apply to groups of columns, whatever makes sense for each data frame. To get optimized speed with this strategy, the extra usage efforts are: …
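The "handful of exposed types" idea can be sketched as a wrapper whose type parameters name representative column element types. Everything here is hypothetical (the `TypedFrame` name, the choice of one numeric and one string parameter), and plain `Dict`s stand in for column storage so the sketch runs standalone:

```julia
# Hypothetical "typed DataFrame": a couple of representative element types
# (N for numeric columns, S for string-like columns) are carried as type
# parameters, so kernels dispatching on the wrapper are fully inferrable.
struct TypedFrame{N,S}
    numeric::Dict{Symbol,Vector{N}}   # all numeric columns share element type N
    strings::Dict{Symbol,Vector{S}}   # all string-like columns share type S
end

function dot_typed(tf::TypedFrame{N}) where {N}
    a, b = tf.numeric[:a], tf.numeric[:b]   # Vector{N}: concrete, inferred
    x = zero(N)
    for i in 1:length(a)
        x += a[i] * b[i]
    end
    return x
end

tf = TypedFrame(Dict(:a => [1.0, 2.0], :b => [3.0, 4.0]),
                Dict(:name => ["x", "y"]))
dot_typed(tf)  # 1*3 + 2*4 = 11.0
```

Because the parameters describe groups of columns rather than individual ones, adding or dropping a column of an already-exposed type does not change the wrapper's type.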
For someone a bit familiar with macros, I suppose defining a … I am afraid this is nonsense to the language purist, but for practical purposes, for someone like me who likes explicit type annotations in the first place for type clarity despite being slightly more verbose, such a solution would work like a charm to get optimized performance in a practical way, without the ugliness of passing around dummy variables just for type information. My questions are: …
I know it is a bit long, but I am trying to be as clear as possible, since I believe it would be an acceptable solution for me. Again, sorry if this is nonsense. Unless it is such a huge bad idea, I would appreciate any help in defining a macro to do the job, perhaps starting with the container model for lack of a better one for now; defining macros is one of my next priorities to learn. Thanks.
Thanks for your ideas. You've got a lot of interesting points. I've been suffering from a bout of RSI lately, so I can't respond to a message as long as yours in full. Instead, I'll try to be brief. We have thought a lot about the issues you're raising. I would strongly encourage you to do more reading in this repo and in DataArrays, rather than try to implement something yourself without a full understanding of how we've been thinking about these concerns. In that vein, here is a very high-level overview of our strategy: …
@johnmyleswhite Sorry for my lengthy notes; I just wanted to express my points as clearly as possible. I definitely have a lot to learn in Julia, and I will keep on reading when I can. All the enhancements you mention sound great. But I need a stop-gap approach on 0.3 until we have a stable release with all of those great features that will help resolve, in particular, the type inference issue, the performance bottleneck. I have plenty of questions, way too many and not appropriate to discuss in this ticket. Thanks for all your contributions to the Julia community.
There are two stopgaps: …
It would be valuable to re-run some of the original benchmarks here with the transition from DataArray to Vector{T}/Vector{Union{T, Null}}, particularly on the current 0.7 master (9/7/2017), where performance optimizations have been implemented for Vector{Union{T, Null}} where …
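On released Julia versions the `Null` type mentioned above became `Missing`, so a re-run of the dot benchmark would look roughly like the sketch below (hypothetical function name; the compiler stores a compact type tag alongside the bits data, playing the role DataArray's na mask used to play):

```julia
# Dot kernel over Vector{Union{Float64, Missing}}: union splitting lets the
# compiler narrow `ai` and `bi` to Float64 inside the branch.
function dot_union(a::Vector{Union{Float64,Missing}},
                   b::Vector{Union{Float64,Missing}})
    x = 0.0
    for i in 1:length(a)
        ai, bi = a[i], b[i]
        if !(ismissing(ai) || ismissing(bi))   # per-element NA check
            x += ai * bi
        end
    end
    return x
end

a = Union{Float64,Missing}[1.0, missing, 3.0]
b = Union{Float64,Missing}[2.0, 5.0, missing]
dot_union(a, b)  # 2.0: only index 1 has both values present
```

Unlike the DataArray approach, no separate mask array or unchecked chunk access is needed for the loop to be well inferred.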
The example below does a good job of capturing where our current interface is making things excessively difficult for type inference to do its work: …
This gives the following results, which are pretty representative of other runs on my laptop: …
Until we come up with something better, we might want to embrace the element-wise `isna` function I added, as well as something like the `values` function I've defined here. (Which, like `isna`, was an idea that Tim Holy proposed.) That would at least get us to a sane place when using DataArrays.

If we follow my proposal to implement categorical data using an enum type stored in a DataArray, we only need to optimize the performance of DataArrays going forward.
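The `isna`/`values` interface can be sketched with a minimal mask-backed vector. This is a hypothetical struct for illustration, not DataArrays' actual implementation; `values` is reused from Base to mirror the proposed API:

```julia
import Base: values  # extend Base's `values` name, mirroring the proposed API

# Minimal mask-backed vector: raw data plus a parallel NA mask.
struct NAVector{T}
    data::Vector{T}   # raw values; entries flagged NA may hold garbage
    na::BitVector     # true where the entry is NA
end

values(v::NAVector) = v.data                 # raw storage, no NA checks
isna(v::NAVector, i::Integer) = v.na[i]      # element-wise NA test

function dot_na(da::NAVector{Float64}, db::NAVector{Float64})
    x = 0.0
    for i in 1:length(values(da))
        if !(isna(da, i) || isna(db, i))     # explicit mask check per index
            x += values(da)[i] * values(db)[i]
        end
    end
    return x
end

da = NAVector([1.0, 2.0, 3.0], BitVector([false, true, false]))
db = NAVector([4.0, 5.0, 6.0], BitVector([false, false, false]))
dot_na(da, db)  # 1*4 + 3*6 = 22.0
```

Because `values` returns a plain `Vector{Float64}`, the inner loop indexes concretely typed storage, which is exactly the sane performance baseline the comment asks for.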
That still leaves the question of how to optimize the DataFrames case. Needing to introduce indirection to get proper type inference and good performance isn't a great long-term strategy.