-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support type-stable map and combine on GroupedDataFrame #1601
Conversation
…gument This is inspired by the JuliaDB API. When columns are specified via select, the user-provided function is passed column vectors (as SubArray) rather than a SubDataFrame. This gives fully type-stable code and dramatically improves performance.
src/groupeddataframe/grouping.jl
Outdated
elseif select isa Symbol || select isa Integer | ||
incols = (gd.parent[select],) | ||
else | ||
incols = Tuple(columns(gd.parent[collect(select)])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not optimal in general but I guess it is good enough as I would assume that this is not the costly operation relative to the whole work that has to be done and select
is has few elements.
A general quick comment that the proposed solution is good if
I will have a more detailed look at the PR later. |
I think that keeping When reading the code I realized that we can do yet another optimization. Probably for another PR, but related. We could check if a data frame is sorted on columns on which we group and if yes then use an implementation that takes advantage of the fact. I do not know if it was discussed, but even if we called |
Yes, Regarding the API, I realize JuliaDB supports two different syntaxes: |
This is exactly what I mean and I suppose (have not tested) that it even justifies running The |
Wow this is truly excellent. I'm excited to look this over more closely when I try and port this logic to DataFramesMeta. Thanks for the work on this! One note about syntax, I think a lot of the DataFrames syntax is motivated by the fact that you don't want to ever do
pattern that I like in Stata. I have in fact recently required |
Really nice job! Concerning the JuliaDB API, as far as I can tell, both this approaches work in JuliaDB: julia> IndexedTables.groupby(:mean => :SepalWidth => mean, t, :Species);
julia> IndexedTables.groupby(mean, t, :Species, select = :SepalWidth); I think there was a API rewrite at some point and it was decided that most operations ( The "pair" syntax is a consequence of the selection API, so basically instead of passing a function you are passing a combination of a selection and a function. In terms of syntactic sugar, JuliaDBMeta provides both a
and a There is a difference however between JuliaDB and DataFrames here that I am only realizing now, in that passing a select with a tuple of symbols in JuliaDB is just an optimization, i.e. both of the following work: julia> IndexedTables.groupby(v -> (m = mean(column(v, :SepalLength)), s = std(column(v, :SepalWidth))), t, :Species)
julia> IndexedTables.groupby(v -> (m = mean(column(v, :SepalLength)), s = std(column(v, :SepalWidth))), t, :Species, select = (:SepalLength, :SepalWidth)) Whereas from what I understand, here the anonymous function would take as many arguments as you pass with by((:SepalLength, :SepalWidth) => (x, y) -> (m = mean(x), s = std(y)), t, :Species) in combination with a macro s.t. A final remark is that the pair selection syntax is probably more general in that it could also be used on the variable you are grouping by, to allow grouping by something that is not a column, i.e. by(:SepalLength => _ -> (m = mean(_),), t, :Species => _ -> _ == :viridis) |
Just to expand on what the function should accept. After thinking of it I get to prefer |
I'm also a bit sceptical of the "many arguments" approach. The In an ideal world I imagine one could have two versions of |
Actually this is what I want to put on a table very soon (but first I have to get |
OK. Let's use the pair approach rather than Regarding what the function should be passed, I'm not sure
|
OK - let us move forward with what you propose. When it is implemented I will checkout the PR and test it to see how it goes 😄. Thanks |
I think I understand the design a bit better now, and I believe there really are two separate concerns. There is the concern of a "typed selection" API, i.e. a way to select a fully typed subset of columns, which, to be consistent with JuliaDB, could be done as follows: select(x, :SepalLength) # select with a symbol -> get a vector
select(x, (:SepalLength, :SepalWidth) # select with a Tuple -> get the NamedTuple of columns, or "type stable DataFrame" And ideally many functions could accept map(f, df, select = ...) = map(f, DataFrames.select(df, select)) And then there is the issue of describing functions that operate on a few columns, which is orthogonal to this. So one could have a default: apply(f, df) = f(df)
used_cols(f, df) = names(df) and it could have many useful specializations. For example Then, for each of this overloads to EDIT: it has just occurred to me that if we decide to pass a by((:col1, :col2) => ((x,y),) -> mean(x) + mean(y), ...) so actually this option is slightly more general than the other. |
I'm not a fan of this because I also want to note that I find it very common to perform the same action on a variety of columns, something like
Would be ideal. (though this doesn't address the issue of column names) |
A possibility would be that we mimick @nalimilan how do you see this? |
Now I see that my proposal does not solve the recompilation issue introduced by
But I am not 100% sure here. Maybe typical use cases are 1 or 2 columns, when using positional arguments makes more sense - |
Also we should provide users with ability to know source column names (in case they wanted to dynamically generate column names in return value) and without a named tuple (or some equivalent, e.g. splatted keyword arguments) how do you plan to allow for this? |
Clever. Actually it looks like that's what JuliaDB does. I need to think about it, but that's appealing, and it would replace
You can create a keyword argument call from a symbol very easily, e.g.
Hmm, that interpretation of this syntax would conflict with other proposals and would prevent specifying functions which take several argument. I think you can do that with |
It wouldn't work if you have different subsets of columns you wanted to act on. Maybe a column has
This is really interesting! I will have to look into how this could be done with DataFramesMeta as well |
OK, I've pushed a commit to switch to the The pair needs to be passed as the first argument, even if the EDIT: actually, this doesn't pass a |
I think this is getting ready for a review. In particular, please have a look at the API. |
src/groupeddataframe/grouping.jl
Outdated
last(p)(NamedTuple{incols}(map(c -> x[c], incols))) : | ||
last(p)(x[incols]) | ||
if res isa Union{AbstractDataFrame, NamedTuple, DataFrameRow, AbstractVector, AbstractMatrix} | ||
throw(ArgumentError("a single value result is required when passing a vector or tuple " * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we do not allow returning AbstractVector
?
I can agree that we do not allow other forms (and this answers my question above).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this as an important problematic case - why don't you want to accept returning an AbstractVector
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to make my life simpler. :-)
But now that everything has settled, I've added that feature.
I have went through this. A massive work - thank you @nalimilan. The major concern I had is API inconsitency between |
There are a few unresolved things, but they are not very problematic I guess. I agree that this PR is complex-enough and I think that we should merge it when the small issues are discussed to move forward. The things to do in the future would be:
This in general means that the question is if we write somewhere in the documentation that the |
I agree with the agenda. But I don't think we need to write anywhere that the functions will continue to evolve, that's OK as long as we don't plan on breaking them. We can always mention it in the release post. Also I think I've covered several corner cases of mixing return values of different types, shapes or lengths. I've pushed the commit with the fixes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK - let's get rolling. I have left two questions relating to what f
can be if it is a first argument to combine
and by
.
In general probably we might be mildly breaking in the future as the subject is very complex (like in the case of combining levels categorical arrays JuliaData/CategoricalArrays.jl#172).
E.g. we currently support map(cols=>f, gdf)
and probably it should be dropped if we decide to introduce transform
(if we decide to add all the methods to map
then the question is if by
and combine
accept a Pair
as a first argument - if yes then we will not be breaking).
This is inspired by the JuliaDB API. When columns are specified via
select
, the user-provided function is passed column vectors (asSubArray
s) rather than aSubDataFrame
. This gives fully type-stable code and dramatically improves performance.This relatively limited change allows taking full advantage of the recent refactoring (#1520). The performance gain is really incredible: about 200× for a simple sum after compilation.
I still need to write tests, but otherwise the main question is the API. The discussion has already been started on Discourse. To sum up: while the
select
API illustrated above is reasonable, the slow variant is a bit shorter and more intuitive, in particular because it avoids repeating the column names in two different places. Yet we really don't want people to use the slow variant unless they really need to (for complex operations or when the number of columns isn't known in advance), since its terribly slow.The solution I proposed to that problem is to detect whether the user-provided function expects a
SubDataFrame
or columns, assuming the argument names are those of columns. This would allow writing the example above as simplycombine(b -> sum(b), gd)
. Implementation-wise, it is actually quite easy to extract the argument names and match them to column names. The main problem is that when the function takes a single argument, we don't know whether it's supposed to be aSubDataFrame
or a column. So far I haven't found a good solution to that, except writingcombine((b, args...) -> sum(b), gd)
or something silly like this.Anyway, we don't necessarily need to decide this right now, as the
select
approach could be complementary to another more convenient approach:select
can be useful to choose programmatically which column(s) to operate on, and a convenience syntax could be based on that for its implementation.Cc: @bkamins, @piever, @pdeffebach