
Change the way grouped transforms work #101

Merged (25 commits), Oct 18, 2018

Conversation

pdeffebach
Collaborator

With this PR, we now allocate a vector using reduce over a generator of per-group results for all operations on groups.

However, I still have an error with type promotions. Consider the following example:

df = DataFrame(a = [1, 2, 3, missing, missing], b = [1, 1, 1, 2, 2])
g = groupby(df, :b)
@transform(g, x = mean(:a))

It seems like reduce(append!, t(ig) for ig in g) should automatically promote types, but it doesn't.
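For concreteness, a minimal sketch (not the PR's code; assumes current DataFrames and Statistics APIs) of the difference: append! keeps the element type of its first argument, while vcat promotes it.

using DataFrames, Statistics

df = DataFrame(a = [1, 2, 3, missing, missing], b = [1, 1, 1, 2, 2])
g = groupby(df, :b)

per_group = [[mean(ig.a)] for ig in g]   # [[2.0], [missing]]: one one-element vector per group
# reduce(append!, per_group)             # errors: a Vector{Float64} cannot store `missing`
reduce(vcat, per_group)                  # Union{Missing, Float64}[2.0, missing]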

@pdeffebach
Collaborator Author

@bkamins

This is likely the function to use from DataFrames

https://github.com/JuliaData/DataFrames.jl/blob/56a9b720b9bc61d778fd900c15d6b85309517136/src/abstractdataframe/abstractdataframe.jl#L859

@generated function promote_col_type(cols::AbstractVector...)
    T = mapreduce(x -> Missings.T(eltype(x)), promote_type, cols)
    if CategoricalArrays.iscatvalue(T)
        T = CategoricalArrays.leveltype(T)
    end
    if any(col -> eltype(col) >: Missing, cols)
        if any(col -> col <: AbstractCategoricalArray, cols)
            return :(CategoricalVector{Union{$T, Missing}})
        else
            return :(Vector{Union{$T, Missing}})
        end
    else
        if any(col -> col <: AbstractCategoricalArray, cols)
            return :(CategoricalVector{$T})
        else
            return :(Vector{$T})
        end
    end
end

The way it deals with missings is weird, and maybe it is a holdover from before the new small-Union optimizations. So maybe there is something to be done about this function first, while we are at it in this PR.

@bkamins
Member

bkamins commented Aug 3, 2018

As far as I understand, its main difference from promote_type is that it tries to preserve the categorical type if you mix categorical and non-categorical vectors.

@pdeffebach
Collaborator Author

This type promotion is exactly what we want, except for the splatting, which will be slow for large numbers of groups. It would be great if this function existed for arbitrary collections.

@pdeffebach
Collaborator Author

Something like

d = (v(ig) for ig in g) # simplified, because we have a spread_scalar function somewhere
T = DataFrames.promote_col_type(d...)
result[k] = T(reduce(vcat, d))

I think that since DataFrames.promote_col_type is a @generated function, the type-finding step of this is actually quite efficient, with the loop happening at compile time.

@pdeffebach
Collaborator Author

This is proving to be a pain.

  1. fill doesn't work with CategoricalArrays quite like it should. You can't get the pool to be the same for the vector as it is for the individual element. Though I will continue looking at the constructor code.
  2. It's hard to get the DataFrames function above working with a generator or other lazy object.

I almost want to do PRs to CategoricalArrays so that reduce(vcat, v(ig) for ig in g) "just works" without worrying about categorical arrays. But I am sure that CategoricalArrays is a lot more complicated, otherwise vcat would already just work.

@bkamins
Member

bkamins commented Aug 4, 2018

@nalimilan do you have a second to have a look at it? You probably have the answer ready, having designed and implemented all this 😄. Thanks.

@pdeffebach
Collaborator Author

pdeffebach commented Aug 4, 2018

For clarity, this is the super contrived example I am imagining.

df = DataFrame(id = [1, 2, 3, 1, 2, 3], year = [95, 95, 95, 96, 96, 96], x = rand(6), some_personal_variable = CategoricalArray([1, 2, missing, 1, missing, missing]))

6×4 DataFrames.DataFrame
│ Row │ id │ year │ x         │ some_personal_variable │
├─────┼────┼──────┼───────────┼────────────────────────┤
│ 1   │ 1  │ 95   │ 0.232321  │ 1                      │
│ 2   │ 2  │ 95   │ 0.0617226 │ 2                      │
│ 3   │ 3  │ 95   │ 0.970737  │ missing                │
│ 4   │ 1  │ 96   │ 0.0731739 │ 1                      │
│ 5   │ 2  │ 96   │ 0.555002  │ missing                │
│ 6   │ 3  │ 96   │ 0.963342  │ missing                │

some_personal_variable is time-invariant, like birth location. But we collected it inconsistently. So individual 2 has it for year 95 but not 96. We want to "spread" that value across other years, with a function like collect(skipmissing(:some_personal_variable))[1] on a grouped dataframe. We do this in a transform because we are too lazy to do a by (Stata collapse) operation and then perform an m:1 merge on the collapsed data. A sketch follows the list below.

We want a command that

  1. Knows that my function returns a scalar and spreads the values into a vector accordingly, which is close but not perfect for CategoricalArrays
  2. Promotes types in the appropriate way, which is non-trivial for CategoricalArrays
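A hedged sketch of what that call could look like on the example above, assuming the scalar-spreading behaviour this PR adds; the helper spread_first and the column name bp are hypothetical (individual 3 has no non-missing value, so the helper falls back to missing):

using DataFramesMeta

# hypothetical helper: first non-missing value in the group, or `missing` if there is none
spread_first(x) = isempty(skipmissing(x)) ? missing : first(skipmissing(x))

g = groupby(df, :id)
@transform(g, bp = spread_first(:some_personal_variable))
# the per-group scalar is spread across every row of its group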

@pdeffebach
Collaborator Author

Just got a chance to work on this.

I think all my worrying about types with categorical arrays is overblown! As far as I can tell reduce(vcat,...) seems to do type promotion the right way, in the example I described above.

I think this is ready for another review.

@bkamins bkamins left a comment

Apart from the comments, it would be nice to add several tests. Also, this implementation will probably lose performance, but for me it is acceptable.


function spread_scalar(x::CategoricalArrays.CategoricalValue, length::Int)
    vec = CategoricalArray(fill(x, length))
    levels!(vec, CategoricalArrays.index(CategoricalArrays.pool(x)))
Member

Why is this line needed? I understand that without it vec has only one level, yes?

Collaborator Author

If we have a CategoricalArray([1, 2, 1, 2, 1]) and you take the first element of each group, filling in the rest of the group with that one value, we still want the new vector to have the same levels as the original one.

Without that line you also get this weird case where the pool of the array and the pool of the elements are different.

Member

This is what I thought. Just wanted to be sure.

function transform(g::GroupedDataFrame; kwargs...)
    result = DataFrame(g)
    idx2 = cumsum(Int[size(g[i],1) for i in 1:length(g)])
    idx1 = [1; 1 + idx2[1:end-1]]

function spread_scalar(x::Vector, length::Int)
    return x
Member

What if x does not have length length? (Also, maybe don't use length as a name, as it clashes with the length function.)

Collaborator Author

Good point on changing the length name.

length is the number of observations in each group. So we are telling the function how many times to replicate the result of a vector -> scalar function.
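
For reference, a one-line sketch of the generic scalar fallback being discussed (an assumption, not necessarily the PR's exact code):

spread_scalar(x, obs_in_group::Int) = fill(x, obs_in_group)  # replicate the scalar across the group's rows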

Member

But this is why I am asking: in line 395 you assume that v returned a vector, not a scalar, and you keep it unchanged. What if the length of this vector is not correct?

Collaborator Author

You just get a

ERROR: ArgumentError: New columns must have the same length as old columns

Perhaps we should add an error saying we only allow vector -> vector (of the right length) or vector -> scalar.

Member

You will not get that error when, by accident, the total length of the vectors is OK but the vectors themselves are not of the correct length (e.g. you have two groups of length 3, 6 in total, and vectors of lengths 2 and 4). I think we should strictly check that if a Vector is returned it has length equal to the length of the group.

Collaborator Author

Okay, so

length(x) == obs_in_group ? x : error...

Member

Looks good to me 👍.

@@ -387,17 +386,29 @@ function transform(d::Union{AbstractDataFrame, AbstractDict}; kwargs...)
    return result
end


function transform(g::GroupedDataFrame; kwargs...)
    result = DataFrame(g)
    idx2 = cumsum(Int[size(g[i],1) for i in 1:length(g)])
    idx1 = [1; 1 + idx2[1:end-1]]
Member

I think idx1 and idx2 are not needed?

Collaborator Author

true, sorry.

@bkamins bkamins left a comment

Also I got a notification of your following comment:

Sorry maybe I'm wrong but what am I doing? We should just call @based_on, because it seems to do the exact thing we want.

But I cannot find it. Have you deleted it, and if not, what did you refer to?

idx1 = [1; 1 + idx2[1:end-1]]

function spread_scalar(x::Vector, obs_in_group::Int)
    length(x) == obs_in_group ? x : error("Functions must return either a vector the same length as each groupor a scalar")
Member

You can create the error message in the line above and pass it as a variable to avoid an overly long line.
Additionally, this statement is a bit imprecise (and there is no space between group and or).
What I mean by imprecise is that we actually accept non-scalars, e.g. matrices, and we would repeat them as if they were scalars.

Maybe it is better to write just: "If a function returns a vector it must have the same length as a group" (or something like this; I am not a native speaker so maybe there is a better way to word it 😞)?

Collaborator Author

fixed!

@pdeffebach
Collaborator Author

Apologies, the @based_on docstring had me confused. If you have two operations, one going to a scalar and the other to a vector, the scalar result is spread out, which made it seem like transform. But @based_on also drops columns.

Very excited to get describe working with a grouped dataframe, because the dplyr strategy of performing grouped operations just to get summary statistics is annoying.

if length(x) == obs_in_group
    return x
else
    errormessage = "If a function returns a vector, it must have the" *
Member

A space is missing after the (actually, if you do if-else you do not have to create the variable; just put the string inside the error call).
Also, using error is not supported in Julia 1.0; I would use throw(DimensionError("...")), as this is exactly the problem here.


for (k, v) in kwargs
    spreading_helper = x -> spread_scalar(v(x), size(x, 1))
    result[k] = reduce(vcat, spreading_helper(ig) for ig in g)
Member

Why is it better than simply:

result[k] = reduce(vcat, spread_scalar(v(ig), size(ig, 1)) for ig in g)

@bkamins
Member

bkamins commented Aug 13, 2018

Excellent. I have left small comments. Can you copy-paste here the result of describe (or maybe even add it to the tests - so that people see that it can be used this way)?

@pdeffebach
Collaborator Author

Just added those things.

w.r.t. describe, I have to make some PRs to DataFrames to get it working. I was just saying that this dplyr-esque workflow often uses grouped operations to get at-a-glance summary statistics, and it doesn't have to be that way.

@bkamins bkamins left a comment

Looks good to me.

end
end

function spread_scalar(x::CategoricalArrays.CategoricalValue, obs_in_group::Int)
Member

DataFramesMeta currently doesn't depend on CategoricalArrays. I guess the best solution is to implement fill(x::CatValue, dims) in CategoricalArrays so that the generic method below also works here.

But actually, I'm not even sure calling fill for scalars is a good idea. Given my comment below about the missing vcat optimization, I think you'd better return (x for x in 1:obs_in_group), so that you avoid allocating an unnecessary vector. Anyway vcat (currently) chooses the type of the returned vector according to the type of its first argument, and it really won't take into account whether other arguments are CategoricalArray or not. If you want to support that you need to call similar on the first vector, with the type of the first returned entry.

Member

Actually, (x for x in 1:obs_in_group) isn't indexable, so my suggestion won't work. I guess the best solution is to have an if x isa AbstractArray branch in the for loop below, rather than dispatching on methods.

for i in 2:length(g)
    result[idx1[i]:idx2[i], k] = v(g[i])

function spread_scalar(x::Vector, obs_in_group::Int)
Member

::AbstractVector and ::Integer would be better (same below for the latter).

Also, maybe broadcast_scalar, repeat_scalar, or recycle_scalar would be more common terminology?

if length(x) == obs_in_group
    return x
else
    throw(DimensionError("If a function returns a vector, the result " *
Member

Would be nice to print the expected length and the actual length, as it can make debugging much easier.

test/grouping.jl Outdated
@@ -21,4 +21,8 @@ d = DataFrame(n = 1:20, x = [3, 3, 3, 3, 1, 1, 1, 2, 1, 1, 2, 1, 1, 2, 2, 2, 3,
g = groupby(d, :x, sort=true)
@test @based_on(g, nsum = sum(:n))[:nsum] == [99, 84, 27]

d = DataFrame(a = [1,1,1,2,2], b = [1,2,3,missing,missing])
Member

If you expect a particular behavior for CategoricalArray, it should be tested (and a dependency on it added only for testing).

@@ -378,7 +378,6 @@ end
## transform & @transform
##
##############################################################################

Member

No need to remove this line, nor to add one below?

end

for (k, v) in kwargs
result[k] = reduce(vcat, spread_scalar(v(ig), size(ig, 1)) for ig in g)
Member

In theory using a generator as you do is better, but in practice I'm afraid that's going to be slow because the optimized reduce(vcat, ...) method from JuliaLang/julia#27188 doesn't exist for generators. For a large number of groups, allocating a new copy for each new entry is going to kill performance.

I've filed JuliaLang/julia#28691 to track this. In the meantime, better keep the existing approach which creates columns and fills them manually, with a reference to that issue.

Collaborator Author

In the meantime, better keep the existing approach which creates columns and fills them manually, with a reference to that issue.

Thanks. The original problem is that if the first group returns [1,2,3] and the second group returns [missing, missing, missing], there is an error because the original vector allocated would be of type Int.

Do you want me to re-work this PR so that promote_type works better and we allocate the right vector the first time?

Or should we put this PR on hold and wait for vcat to work better?

Member

OK. Then the solution is to do something like mapreduce(eltype, promote_type, A) to choose the best element type, as in JuliaLang/julia#27188.

I don't think we can wait for vcat to improve, since it's not clear when it will happen (maybe it needs to wait for 2.0 to avoid breakage). See JuliaLang/julia#18472.

@pdeffebach pdeffebach mentioned this pull request Aug 18, 2018
@pdeffebach
Collaborator Author

pdeffebach commented Aug 19, 2018

Based on the most recent comments, I think I might rebase and start anew. To recap: the goal was to fix three problems relating to @transform on a grouped dataframe, which is a grouped operation, but without the collapse.

  1. The original code called eltype on the returned value, meaning that functions returning strings got a char array. So when it tried to fill in that array with the output, an error would be thrown.
  2. The type of the new vector is determined by the first group only, so a function that returned all missing for a group, for example, would throw an error.
  3. Categorical arrays aren't preserved since we allocate a new array.

I thought all of these problems could be avoided, and the code would be simpler, by

  1. Making vector -> scalar functions return a vector via my spread_scalar operation.
  2. Using vcat's type promotion rules to avoid allocating a return vector of a specific type. This seemed especially worthwhile given CategoricalArray's odd type promotion rules.

My current proposal for this is to keep the current set up (allocating a vector of a certain type, then filling it in as needed), but with two changes.

  1. Better type promotion function using reduce as Milan suggested.
  2. Use dispatch to decide if something is
    1. A string
    2. A scalar in general
      And handle those cases separately.

We can worry about the behavior of CategoricalArrays later, and give more thought to what behavior we want upstream with regards to type promotion.

@tshort
Contributor

tshort commented Aug 19, 2018

Good plan, @pdeffebach.

@pdeffebach pdeffebach force-pushed the group-transform-overhaul branch from dba551c to 649e78f on September 17, 2018 20:20
Continue working

Better error message and no allocation for iterator

Reduce number of changes, use mapreduce for T

Go back to idx

Add if-else for scalars
@pdeffebach pdeffebach force-pushed the group-transform-overhaul branch from 649e78f to 1965c7b on September 17, 2018 22:10
@pdeffebach
Collaborator Author

I successfully rebased and squashed the commits... I think. The PR now looks more similar to what the existing code was.

For each key-value pair in kwargs I create a generator (v(gi) for gi in g) and then use mapreduce over it to find the right type promotion.

I don't actually use dispatch for this; rather, I just use plain old if...else. Given that none of this is type stable, I don't think it's an issue. And I don't really know how I would dispatch on the output from the first element of the generator.
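
A hedged sketch of that flow (names assumed; not the PR's exact code):

gen = (v(ig) for ig in g)                  # lazy: one result per group
T = mapreduce(eltype, promote_type, gen)   # e.g. Union{Missing, Int} if one group is all missing
col = Vector{T}(undef, nrow(parent(g)))    # allocate the output column once with the promoted eltype
# ...then fill `col` group by group; note that filling iterates the generator again,
# which is the double evaluation pointed out in the next comment.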

@nalimilan
Member

Sorry if I missed this in my previous comments, but there's a problem with that approach: you compute transformations twice, which is going to be much slower. The solution which is generally used these days (by map in Base but also in e.g. Tables.jl and IndexedTables.jl) is to allocate a vector based on the eltype of the first result, and then check for each group whether its eltype is a subtype of the vector's eltype. If not, allocate a new vector (choosing the eltype using promote_type), and copy the old data to it. In general that's quite efficient since for a Union{T,Missing} vector only one (partial) copy will be made (and generally it will happen quite soon so not many elements have to be copied).

I'm currently preparing a PR using this strategy for by in DataFrames. The code is a bit more involved since the user can return a DataFrame (and not just one vector per transformation), and because I'm calling separate functions which operate on tuples of resulting columns to ensure some type stability (but not full type stability unfortunately). It would be interesting to try this approach here, but even without the type stability tricks you can take inspiration from it to reallocate vectors when needed. See in particular these lines.
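
A rough, self-contained sketch of that widen-on-demand strategy (hypothetical helper name; assumes v returns a vector per group; not the PR's actual code):

function collect_groups(v, g)                  # g is a GroupedDataFrame
    total = sum(i -> size(g[i], 1), 1:length(g))
    out = v(g[1])
    col = Vector{eltype(out)}(undef, total)    # allocate from the first group's eltype
    copyto!(col, 1, out, 1, length(out))
    offset = length(out) + 1
    for i in 2:length(g)
        out = v(g[i])
        if !(eltype(out) <: eltype(col))       # widen only when a group's result doesn't fit
            wider = Vector{promote_type(eltype(out), eltype(col))}(undef, total)
            copyto!(wider, 1, col, 1, offset - 1)   # copy just the already-filled prefix
            col = wider
        end
        copyto!(col, offset, out, 1, length(out))
        offset += length(out)
    end
    return col
end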

@nalimilan
Member

Actually, there's a very interesting challenge here which could give incredible performance with some transformations to how the code works. If instead of passing g[i] to the anonymous function v we passed it directly the columns it operates on, and if we called that from a helper function dedicated to each transformation taking the said columns and applying it to all groups, we would have fully type-stable code and v would probably be inlined. In short, that's the best possible code one could write, even by hand. That would particularly increase performance when the number of groups is large.

Now, the v method taking a tuple of functions already exists, but we need to get the names of the columns it expects. Apparently this can be obtained by extracting the names of arguments via something like Symbol.(getindex.(Base.arg_decl_parts(methods(v).ms[2])[2][2:end], 1)). So it looks like it wouldn't be that hard.
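
A hedged sketch of that idea with a hypothetical helper (the starts/ends group boundaries and the extracted column are assumed; not part of the PR):

using Statistics

# `f` only ever sees a concrete column slice, so the comprehension can specialize on the
# column's element type and `f` can be inlined.
function map_groups(f, col::AbstractVector, starts::Vector{Int}, ends::Vector{Int})
    return [f(view(col, starts[i]:ends[i])) for i in eachindex(starts)]
end

# e.g. map_groups(mean, df.x, starts, ends) with precomputed group boundaries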

@pdeffebach
Collaborator Author

Thanks for the feedback. It looks like I had two misconceptions:

  1. I thought generators were faster. I thought it was a way to have Julia store stuff but not allocate a contiguous array for it.
  2. I assumed array re-typing is expensive. I'm glad to hear it's not.

I will try the approach you showed me.

On the other hand, this kind of operation could arguably be in by or something similar. So perhaps eventually DataFramesMeta could just call a by operation.

@nalimilan
Member

1. I thought generators were faster. I thought it was a way to have Julia store stuff but not allocate a contiguous array for it.

They are fast in the sense that they do not allocate a vector with all the results. But the downside is that they have to reevaluate the call each time you access them.

2. I assumed array re-typing is expensive. I'm glad to hear it's not.

Re-typing is expensive since it requires copying all already processed elements. But that's less expensive than other solutions. Hopefully at some point we'll have a way of converting an Array{T} to an Array{Union{T,Missing}} without making a copy.

I will try the approach you showed me.

Cool!

On the other hand, this kind of operation could arguably be in by or something similar. So perhaps eventually DataFramesMeta could just call a by operation.

Actually DataFramesMeta can do things more efficiently than DataFrames since, thanks to macros, it knows which columns are involved in a computation and can specialize on them. On the other hand, DataFrames' by passes a full data frame to the user-provided function, which creates a type instability (JuliaData/DataFrames.jl#1256). But you're right that @by could probably take advantage of the same approach as @transform(g::GroupedDataFrame, ...). AFAICT these are really the same operation.

@pdeffebach
Collaborator Author

AFAICT these are really the same operation.

I meant that @transform(g::GroupedDataFrame, ..) spreads the results so that there is no collapse at any point, while all the by operations in DataFrames always collapse, so you get a dataframe where each observation is a group. In Base DataFrames I think you would need a by and a join. Which is fine, because that's what DataFramesMeta is for.

@pdeffebach
Collaborator Author

pdeffebach commented Oct 2, 2018

I would like to see this through so let's add the recursive function.

Do you mean something like this?

function _transform(first::AbstractVector, g::GroupedDataFrame, v::Function, i::Int, t::Vector, N::Int, starts, ends)
    out = v(g[i])
    # check that its a vector and length is right here...
    S = eltype(out)
    T = eltype(t)
    if !(S <: T || promote_type(S, T) <: T)
       # Problem: We have to calculate v(g[i]) again for however many times we promote
        return _transform(first, g, v, i, Tables.allocatecolumn(promote_type(S, T), N))
    end
    t[starts[i]:ends[i]] = out
    # t= _transform(first, g, v, i+1, t) make it truly recursive? 
    return t 
end 

@nalimilan
Member

Yes, something like that. Thanks to the return, you don't need the line commented out at the bottom to make it "truly recursive". But you need to assign out to t before calling _transform, and pass i+1 to it rather than i; that avoids calling v twice.

@pdeffebach
Collaborator Author

I just wrote a recursive implementation that seems good to me (maybe). The good news is that we didn't hurt performance, as seen in my comment on the gist above. Given that I probably made a number of errors in this rewrite which will hurt performance, I am cautiously optimistic about this.

end
t[starts[i]:ends[i]] = current
if i != length(g)
return _transform!(t, i + 1, current, v(g[i+1]), g, v, starts, ends)
Member

The idea is that you'd call _transform only when re-allocating the column. Here you call it for each group, which will trigger a stack overflow for a large number of groups (and is probably slower).

Collaborator Author

Okay, I think I was misunderstanding.

We still have a for-loop. The only difference, really, is that we push the type promotion into a function, which only ever has to run recursively once or twice.

@pdeffebach
Collaborator Author

Thanks for the feedback. I'm still a bit confused though. The implementation I have here is as follows

function _transform(t, out, indexes)
    if we need to promote
        make promoted_vector
        return _transform(promoted_vector, out, indexes)
    end
    t[indexes] = out
    return t # otherwise we return a small view
end

function transform(g, v)
    first = v(g[1])
    allocate array t
    input first value
    for i in 2:length(g)
        t = _transform(t, v(g[i]), indexes)
    end
end

I hope I am understanding #1520 correctly, because I think this is more or less the same logic used in combine! there.

The issue is that the function _transform sometimes modifies the input vector t, in the case of no type promotion, and sometimes returns a new vector altogether. We also change the type of t throughout the loop, but it seems like that is inevitable. Thankfully Julia assigns by reference, so the extra t = _transform isn't allocating, but it needs to be there in the case where we promote the type.

But you need to assign out to t before calling _transform, and pass i+1 to it rather than i; that avoids calling v twice.

I don't think that works because we need to either increment recursively for all iterations or none of them. If we have a for loop, we have to stick to a single counter and not call out of order.

Let me know if this works. Thanks.

@nalimilan
Member

The trick (as in JuliaData/DataFrames.jl#1520) is to have the loop inside _transform!. You don't change the type of t because you never reassign to it: instead, you call return _transform!(...) with the newly allocated vector (stored using a different name).
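
For illustration, a hedged and simplified sketch of that pattern (not the PR's exact code): the loop lives inside _transform!, and recursion happens only when the column must be widened, restarting at the current group in a fresh, type-specialized call.

function _transform!(t::AbstractVector, out::AbstractVector, start::Int,
                     g, v, starts::Vector{Int}, ends::Vector{Int})
    for i in start:length(g)
        i > start && (out = v(g[i]))           # the result for group `start` is passed in
        S, T = eltype(out), eltype(t)
        if !(S <: T || promote_type(S, T) <: T)
            wider = Vector{promote_type(S, T)}(undef, length(t))
            copyto!(wider, 1, t, 1, starts[i] - 1)      # keep the already-filled prefix
            return _transform!(wider, out, i, g, v, starts, ends)
        end
        t[starts[i]:ends[i]] = out
    end
    return t
end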

@pdeffebach
Collaborator Author

Ah I see it! colstart there means you only restart the loop at the necessary location! That's clever, thanks.

@pdeffebach
Collaborator Author

I think this is following the implementation in #1520, more or less. Performance is better than before without the error checks, and about on par with the error checks.

Let me know if there are big performance traps I am stepping in!

@nalimilan nalimilan left a comment

Thanks, looks almost ready. You should be able to improve performance in another PR by specializing on the columns (as noted before).

end
return result
end

function _transform!(t::AbstractVector, first::AbstractVector, start::Int, g::GroupedDataFrame, v::Function, starts::Vector, ends::Vector)
# handle the first case
j = fill_column_vec!(t, first, starts[start], ends[start], size(g[start], 1))
Member

j is a bit weird here. I used that name in the other PR because it was an index. Maybe promoted just like in the other place? Or even better newT/newtype (in both places): promoted sounds like a Boolean.

end
return result
end

function _transform!(t::AbstractVector, first::AbstractVector, start::Int, g::GroupedDataFrame, v::Function, starts::Vector, ends::Vector)
Member

Wrap all lines at 92 chars.

T = eltype(t)
promoted = promote_type(elout, T)
if (elout <: T || promoted <: T)
t[startpoint:endpoint] = out
Member

Better put the return nothing here for clarity.

end

function fill_column_any!(t::AbstractVector, out, startpoint::Int, endpoint::Int)
if (out isa AbstractVector)
Member

No parentheses.

Suggested change
if (out isa AbstractVector)
if out isa AbstractVector

elout = eltype(out)
T = eltype(t)
promoted = promote_type(elout, T)
if (elout <: T || promoted <: T)
Member

No parentheses:

Suggested change
if (elout <: T || promoted <: T)
if elout <: T || promoted <: T

T = eltype(t)
promoted = promote_type(typout, T)
if (typout <: T || promoted <: T)
@views t[startpoint:endpoint] .= Ref(out)
Member

@views isn't needed on assignment.

Suggested change
@views t[startpoint:endpoint] .= Ref(out)
t[startpoint:endpoint] .= Ref(out)

for i in (start+1):length(g)
out = v(g[i])
promoted = fill_column_any!(t, out, starts[i], ends[i])
if !(promoted === nothing)
Member

Suggested change
if !(promoted === nothing)
if promoted !== nothing

for i in (start+1):length(g)
out = v(g[i])
promoted = fill_column_vec!(t, out, starts[i], ends[i], size(g[i], 1))
if !(promoted === nothing)
Member

Suggested change
if !(promoted === nothing)
if promoted !== nothing

@@ -1,7 +1,6 @@
module DataFramesMeta

using DataFrames

using DataFrames, Tables
Member

Re-add empty line.

test/grouping.jl Outdated
# Type promotion
@test (@transform(g, t = isequal(:b[1], 1) ? fill(1, length(:b)) : fill(2.0, length(:b)))[:t] .===
[1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0]) |> all
# Vectors of different types
Member

Suggested change
# Vectors of different types
# Vectors whose eltypes promote to Any

@pdeffebach
Collaborator Author

Thanks for the feedback!

@pdeffebach
Collaborator Author

The performance for the vector fill leaves something to be desired, it seems, using the gist here. With this new commit the "fewer vectors" case is ~ 180 on this branch and ~ 150 on master.

I'm not sure what's going on, though.

@nalimilan
Member

Maybe add @inline to fill_column_vec!? The reason why I used separate functions in my PR is that it allowed specializing on the number of columns to copy. But here, since there's a single column, there's no advantage, and the dynamic dispatch can have some overhead. You could even move them inside the function; the code would be clearer.

@pdeffebach
Collaborator Author

I added @noinline and things improved some. It's worth noting that for the "many groups" case, 5000 as opposed to 1000, there are performance gains of approx. 10%.

@@ -405,9 +405,29 @@ end

function _transform!(t::AbstractVector, first::AbstractVector, start::Int,
g::GroupedDataFrame, v::Function, starts::Vector, ends::Vector)
@inline function fill_column_vec!(t::AbstractVector, out, startpoint::Int, endpoint::Int, len::Int)
Member

What I mean when I said you could move the code inside the function is that if you use @inline, you can just drop the function barrier and put the code directly in the parent function.

Collaborator Author

It should probably still be a function, since we call it twice per _transform! function: once for the first case and once for the rest. We need to call it twice because we have to compute first in transform(::GroupedDataFrame, ...), so not treating the first case separately would require calculating first twice.

Member

Right. Should be OK as-is then.

newtype = fill_column_vec!(t, first, starts[start], ends[start], size(g[start], 1))
@assert newtype === nothing
newtype_first = fill_column_vec!(t, first, starts[start], ends[start], size(g[start], 1))
#@assert newtype_first === nothing
Member

I don't expect this check to be costly. Is it?

Collaborator Author

Sorry, no it's not.

end
elout = eltype(out)
T = eltype(t)
newtype = promote_type(elout, T)
Member

Maybe you can make things slightly faster by moving this call after the || below, so that the type is only computed when it's needed. You could even try adding elout === T || as a first condition, in case it's faster (not sure).

Also, elout sounds like it's an element, not a type.

Collaborator Author

Wow, that's really smart that Julia allows for that. But alas, there isn't any speed gain from this:

        if typout <: T || (newtype = promote_type(typout, T)) <: T
            t[startpoint:endpoint] .= Ref(out)
            return nothing
        else 
            return newtype
        end

@nalimilan nalimilan Oct 18, 2018

OK, too bad.

@nalimilan nalimilan left a comment

Looks good!

@tshort OK to merge this? It should allow moving on to the next step, which is generating efficient anonymous functions working only on the relevant columns.

@tshort
Contributor

tshort commented Oct 18, 2018

+1 for me. Thanks for all the work @pdeffebach, and thanks for the great help and interaction, @nalimilan.
