-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change the way grouped transforms work #101
Change the way grouped transforms work #101
Conversation
This is likely the function to use from DataFrames
The way it deals with |
As far as I understand its main difference from |
This type promotion is exactly what we want, except for the splatting, which will be slow for high numbers of groups. It would be great if this function existed for arbitrary collections. |
Something like
I think that since |
This is proving to be a pain.
I almost want to do PRs to |
@nalimilan do you have a second to have look at it? You probably have the answer ready having designed and implemented all this 😄. Thanks. |
For clarity, this is the super contrived example I am imagining.
We want a command that
|
Just got a chance to work on this. I think all my worrying about types with categorical arrays is overblown! As far as I can tell I think this is ready for another review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from the comments - it would be nice to add several tests. Also this implementation will probably loose performance but for me it is acceptable.
src/DataFramesMeta.jl
Outdated
|
||
function spread_scalar(x::CategoricalArrays.CategoricalValue, length::Int) | ||
vec = CategoricalArray(fill(x, length)) | ||
levels!(vec, CategoricalArrays.index(CategoricalArrays.pool(x))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this line is needed? I understand that without it vec
has only one level? Yes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we have a CategoricalArray[1,2,1,2,1]
, I was thinking that if you take the first element of each group, filling in the rest of the group with 1 value, we still want to have the new vector have the same levels as the original one.
Without that line you also get this weird case where the pool
of the array and the pool
of the elements are different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what I thought. Just wanted to be sure.
src/DataFramesMeta.jl
Outdated
function transform(g::GroupedDataFrame; kwargs...) | ||
result = DataFrame(g) | ||
idx2 = cumsum(Int[size(g[i],1) for i in 1:length(g)]) | ||
idx1 = [1; 1 + idx2[1:end-1]] | ||
|
||
function spread_scalar(x::Vector, length::Int) | ||
return x |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if x
does not have length length
(also maybe not use length
as name as it clashes with length
function).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point on changing the length
name.
length
is the number of observations in each group. So we are telling the function how many times to replicate the result of a vector -> scalar
function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this is why I am asking - in line 395 you assume that v
returned a vector not a scalar and you keep it unchanged. What if the length of this vector is not correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You just get a
ERROR: ArgumentError: New columns must have the same length as old columns
Perhaps we should add an error about only doing only vector -> vector (right length)
or vector -> scalar
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You will not when by accident the total length of vectors is OK, but the vectors themselves are not of correct length (e.g. you have two groups of length 3, 6 in total, and vectors are of lengths 2 and 4). I think we should strictly check that if a Vector
is returned it has the length equal to the length of the group.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, so
length(x) == obs_in_group ? x : error...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me 👍.
src/DataFramesMeta.jl
Outdated
@@ -387,17 +386,29 @@ function transform(d::Union{AbstractDataFrame, AbstractDict}; kwargs...) | |||
return result | |||
end | |||
|
|||
|
|||
function transform(g::GroupedDataFrame; kwargs...) | |||
result = DataFrame(g) | |||
idx2 = cumsum(Int[size(g[i],1) for i in 1:length(g)]) | |||
idx1 = [1; 1 + idx2[1:end-1]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think idx1
and idx2
are not needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
true, sorry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I got a notification of your following comment:
Sorry maybe I'm wrong but what am I doing? We should just call
@based_on
, because it seems to do the exact thing we want.
But I cannot find it. Have you deleted it and if not what did you refer to?
src/DataFramesMeta.jl
Outdated
idx1 = [1; 1 + idx2[1:end-1]] | ||
|
||
function spread_scalar(x::Vector, obs_in_group::Int) | ||
length(x) == obs_in_group ? x : error("Functions must return either a vector the same length as each groupor a scalar") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can create the error message in the line above and pass it as a variable to avoid overlong line.
Additionally this statement is a bit imprecise (and there is no space between group and or).
What I mean by imprecise is that actually we accept non-scalars, e.g. matrices, and we would repeat them as if they were scalars.
Maybe it is better to write just that: "If a function returns a vector it must have the same length as a group"
(or something like this - I am not a native speaker so maybe there is a better way to word it 😞)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed!
Apologize. the Very excited to get |
src/DataFramesMeta.jl
Outdated
if length(x) == obs_in_group | ||
return x | ||
else | ||
errormessage = "If a function returns a vector, it must have the" * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
space is missing after the (actually if you do if-else you do not have to create the variable - just put the string inside error message).
Also using error
is not supported in Julia 1.0, I would use throw(DimensionError(" ....
as this is exactly the problem here.
src/DataFramesMeta.jl
Outdated
|
||
for (k, v) in kwargs | ||
spreading_helper = x -> spread_scalar(v(x), size(x, 1)) | ||
result[k] = reduce(vcat, spreading_helper(ig) for ig in g) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why it is better than simply:
result[k] = reduce(vcat, spread_scalar(v(ig), size(ig, 1)) for ig in g)
Excellent. I have left small comments. Can you copy-paste here the result of |
Just added those things. w.r.t. describe, I have to make some PRs to DataFrames to get it working. I was just saying that this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
src/DataFramesMeta.jl
Outdated
end | ||
end | ||
|
||
function spread_scalar(x::CategoricalArrays.CategoricalValue, obs_in_group::Int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DataFramesMeta currently doesn't depend on CategoricalArrays. I guess the best solution is to implement fill(x::CatValue, dims)
in CategoricalArrays so that the generic method below also works here.
But actually, I'm not even sure calling fill
for scalars is a good idea. Given my comment below about the missing vcat
optimization, I think you'd better return (x for x in 1:obs_in_group)
, so that you avoid allocating an unnecessary vector. Anyway vcat
(currently) chooses the type of the returned vector according to the type of its first argument, and it really won't take into account whether other arguments are CategoricalArray
or not. If you want to support that you need to call similar
on the first vector, with the type of the first returned entry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, (x for x in 1:obs_in_group)
isn't indexable, so my suggestion won't work. I guess the best solution is to have an if x isa AbstractArray
branch in the for
loop below, rather than dispatching on methods.
src/DataFramesMeta.jl
Outdated
for i in 2:length(g) | ||
result[idx1[i]:idx2[i], k] = v(g[i]) | ||
|
||
function spread_scalar(x::Vector, obs_in_group::Int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
::AbstractVector
and ::Integer
would be better (same below for the latter).
Also, maybe broadcast_scalar
, repeat_scalar
or recycle_scalar
would use a more common terminology?
src/DataFramesMeta.jl
Outdated
if length(x) == obs_in_group | ||
return x | ||
else | ||
throw(DimensionError("If a function returns a vector, the result " * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be nice to print the expected length and the actual length, as it can make debugging much easier.
test/grouping.jl
Outdated
@@ -21,4 +21,8 @@ d = DataFrame(n = 1:20, x = [3, 3, 3, 3, 1, 1, 1, 2, 1, 1, 2, 1, 1, 2, 2, 2, 3, | |||
g = groupby(d, :x, sort=true) | |||
@test @based_on(g, nsum = sum(:n))[:nsum] == [99, 84, 27] | |||
|
|||
d = DataFrame(a = [1,1,1,2,2], b = [1,2,3,missing,missing]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you expect a particular behavior for CategoricalArray
, it should be tested (and a dependency on it be added only for testing).
src/DataFramesMeta.jl
Outdated
@@ -378,7 +378,6 @@ end | |||
## transform & @transform | |||
## | |||
############################################################################## | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to remove this line, nor to add one below?
src/DataFramesMeta.jl
Outdated
end | ||
|
||
for (k, v) in kwargs | ||
result[k] = reduce(vcat, spread_scalar(v(ig), size(ig, 1)) for ig in g) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory using a generator as you do is better, but in practice I'm afraid that's going to be slow because the optimized reduce(vcat, ...)
method from JuliaLang/julia#27188 doesn't exist for generators. For a large number of groups, allocating a new copy for each new entry is going to kill performance.
I've filed JuliaLang/julia#28691 to track this. In the meantime, better keep the existing approach which creates columns and fills them manually, with a reference to that issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the meantime, better keep the existing approach which creates columns and fills them manually, with a reference to that issue.
Thanks. The original problem is that if the first group returns [1,2,3]
and the second group returns [missing, missing, missing]
, there is an error because the original vector allocated would be of type Int
.
Do you want me to re-work this PR so that promote_type
works better and we allocate the right vector the first time?
Or should we put this PR on hold and wait for vcat
to work better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Then the solution is to do something like mapreduce(eltype, promote_type, A)
to choose the best element type, as in JuliaLang/julia#27188.
I don't think we can wait for vcat
to improve, since it's not clear when it will happen (maybe it needs to wait for 2.0 to avoid breakage). See JuliaLang/julia#18472.
Based on the most recent comments, I think I might rebase and start anew. To recap: the goal was to fix three problems relating to
I thought all of these problems could be avoided, and the code would be simpler, by
My current proposal for this is to keep the current set up (allocating a vector of a certain type, then filling it in as needed), but with two changes.
We can worry about the behavior of |
Good plan, @pdeffebach. |
dba551c
to
649e78f
Compare
Continue working Better error message and no allocation for iterator Reduce number of changes, use mapreduce for T Go back to idx Add if-else for scalars
649e78f
to
1965c7b
Compare
I successfully rebased and squashed the commits.. I think. The PR now looks more similar to what the existing code was. For each key-value pair in I don't actually use dispatch on this. Rather, I just use plain old |
Sorry if I missed this in my previous comments, but there's a problem with that approach: you compute transformations twice, which is going to be much slower. The solution which is generally used these days (by I'm currently preparing a PR using this strategy for |
Actually, there's a very interesting challenge here which could give incredible performance with some transformations to how the code works. If instead of passing Now, the |
Thanks for the feedback. It looks like I had two misonceptions
I will try the approach you showed me. On the other hand, this kind of operation could arguably be in |
They are fast in the sense that they do not allocate a vector with all the results. But the downside is that they have to reevaluate the call each time you access them.
Re-typing is expensive since it requires copying all already processed elements. But that's less expensive than other solutions. Hopefully at some point we'll have a way of converting an
Cool!
Actually DataFramesMeta can do things more efficiently than DataFrames since thanks to macros it knows which columns are involved in a computation and can specialize on them. OTC DataFrames's |
I meant that |
I would like to see this through so let's add the recursive function. Do you mean something like this?
|
Yes, something like that. Thanks to the |
I just wrote a Recursive implementation that seems good to me (maybe). The good news is that we didn't hurt performance, as seen my comment on the gist above. Given that I probably made a number of errors in this re-write which will hurt performance, I am cautiously optimistic about this. |
src/DataFramesMeta.jl
Outdated
end | ||
t[starts[i]:ends[i]] = current | ||
if i != length(g) | ||
return _transform!(t, i + 1, current, v(g[i+1]), g, v, starts, ends) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is that you'd call _transform
only when re-allocating the column. Here you call it for each group, which will trigger a stack overflow for a large number of groups (and is probably slower).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay. I think I was misunderstanding .
We still have a for-loop. The only difference, really, is that we push the type promotion into a function. Which only ever has to run recursively once or twice.
Thanks for the feedback. I'm still a bit confused though. The implementation I have here is as follows
I hope I am understanding #1520 correctly, because I think this is more or less the same logic used in The issue is that the function
I don't think that works because we need to either increment recursively for all iterations or none of them. If we have a Let me know if this works. Thanks. |
The trick (as in JuliaData/DataFrames.jl#1520) is to have the loop inside |
Ah I see it! |
I think this is following the implementation in #1520, more or less. Performance is better than before without the error checks, and about on par with the error checks. Let me know if there are big performance traps I am stepping in! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, looks almost ready. You should be able to improve performance in another PR by specializing on the columns (as noted before).
src/DataFramesMeta.jl
Outdated
end | ||
return result | ||
end | ||
|
||
function _transform!(t::AbstractVector, first::AbstractVector, start::Int, g::GroupedDataFrame, v::Function, starts::Vector, ends::Vector) | ||
# handle the first case | ||
j = fill_column_vec!(t, first, starts[start], ends[start], size(g[start], 1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
j
is a bit weird here. I used that name in the other PR because it was an index. Maybe promoted
just like in the other place? Or even better newT
/newtype
(in both places): promoted
sounds like a Boolean.
src/DataFramesMeta.jl
Outdated
end | ||
return result | ||
end | ||
|
||
function _transform!(t::AbstractVector, first::AbstractVector, start::Int, g::GroupedDataFrame, v::Function, starts::Vector, ends::Vector) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrap all lines at 92 chars.
src/DataFramesMeta.jl
Outdated
T = eltype(t) | ||
promoted = promote_type(elout, T) | ||
if (elout <: T || promoted <: T) | ||
t[startpoint:endpoint] = out |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better put the return nothing
here for clarity.
src/DataFramesMeta.jl
Outdated
end | ||
|
||
function fill_column_any!(t::AbstractVector, out, startpoint::Int, endpoint::Int) | ||
if (out isa AbstractVector) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No parentheses.
if (out isa AbstractVector) | |
if out isa AbstractVector |
src/DataFramesMeta.jl
Outdated
elout = eltype(out) | ||
T = eltype(t) | ||
promoted = promote_type(elout, T) | ||
if (elout <: T || promoted <: T) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No parentheses:
if (elout <: T || promoted <: T) | |
if elout <: T || promoted <: T |
src/DataFramesMeta.jl
Outdated
T = eltype(t) | ||
promoted = promote_type(typout, T) | ||
if (typout <: T || promoted <: T) | ||
@views t[startpoint:endpoint] .= Ref(out) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@views
isn't needed on assignment.
@views t[startpoint:endpoint] .= Ref(out) | |
t[startpoint:endpoint] .= Ref(out) |
src/DataFramesMeta.jl
Outdated
for i in (start+1):length(g) | ||
out = v(g[i]) | ||
promoted = fill_column_any!(t, out, starts[i], ends[i]) | ||
if !(promoted === nothing) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if !(promoted === nothing) | |
if promoted !== nothing |
src/DataFramesMeta.jl
Outdated
for i in (start+1):length(g) | ||
out = v(g[i]) | ||
promoted = fill_column_vec!(t, out, starts[i], ends[i], size(g[i], 1)) | ||
if !(promoted === nothing) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if !(promoted === nothing) | |
if promoted !== nothing |
@@ -1,7 +1,6 @@ | |||
module DataFramesMeta | |||
|
|||
using DataFrames | |||
|
|||
using DataFrames, Tables |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re-add empty line.
test/grouping.jl
Outdated
# Type promotion | ||
@test (@transform(g, t = isequal(:b[1], 1) ? fill(1, length(:b)) : fill(2.0, length(:b)))[:t] .=== | ||
[1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0]) |> all | ||
# Vectors of different types |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Vectors of different types | |
# Vectors whose eltypes promote to Any |
Thanks for the feedback! |
The performance for the vector fill leaves something to be desired, it seems, using the gist here. With this new commit the "fewer vectors" case is ~ 180 on this branch and ~ 150 on master. I'm not sure what's going on, though. |
Maybe add |
I added |
src/DataFramesMeta.jl
Outdated
@@ -405,9 +405,29 @@ end | |||
|
|||
function _transform!(t::AbstractVector, first::AbstractVector, start::Int, | |||
g::GroupedDataFrame, v::Function, starts::Vector, ends::Vector) | |||
@inline function fill_column_vec!(t::AbstractVector, out, startpoint::Int, endpoint::Int, len::Int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I mean when I said you could move the code inside the function is that if you use @inline
, you can just drop the function barrier and put the code directly in the parent function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should probably still be a function, since we call it twice per _transform!
function. Once for the first case and once for the rest. We need to call it twice because we have to compute first
in transform(::GroupedDataFrame,...)
so not treating the first case separately would require calculating first
twice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. Should be OK as-is then.
src/DataFramesMeta.jl
Outdated
newtype = fill_column_vec!(t, first, starts[start], ends[start], size(g[start], 1)) | ||
@assert newtype === nothing | ||
newtype_first = fill_column_vec!(t, first, starts[start], ends[start], size(g[start], 1)) | ||
#@assert newtype_first === nothing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't expect this check to be costly. Is it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry. no its not.
src/DataFramesMeta.jl
Outdated
end | ||
elout = eltype(out) | ||
T = eltype(t) | ||
newtype = promote_type(elout, T) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you can make things slightly faster by moving this call after ||
below, that the type is only computed when it's needed. You could even try adding elout === T ||
as a first condition in case it's faster (not sure).
Also elout
sounds like it's an element, not a type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wow thats really smart that julia allows for that. But alas there isn't any speed gain from this:
if typout <: T || (newtype = promote_type(typout, T)) <: T
t[startpoint:endpoint] .= Ref(out)
return nothing
else
return newtype
end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, too bad.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
@tshort OK to merge this? It should allow moving on to the next step, which is generating efficient anonymous functions working only on the relevant columns.
+1 for me. Thanks for all the work @pdeffebach, and thanks for the great help and interaction, @nalimilan. |
With this PR, now we allocate a vector using
reduce(v ...generator for all operations on groups...)
however I still have an error with type promotions. Consider the following example:
It seems like
reduce(append!, t(ig) for ig in g
should automatically promote types, but it doesn't.