Change the way grouped transforms work #101

pdeffebach · 2018-08-02T23:45:38Z

With this PR, now we allocate a vector using reduce(v ...generator for all operations on groups...)

However, I still have an error with type promotion. Consider the following example:

df = DataFrame(a = [1,2,3, missing, missing], b = [1,1,1,2,2])
g = groupby(df, :b)
@transform(g, x = mean(:a))

It seems like reduce(append!, t(ig) for ig in g) should automatically promote types, but it doesn't.
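The difference can be reproduced with plain vectors. This is a minimal sketch rather than the PR's code: `a` and `b` are made-up stand-ins for the per-group results of the example above (the mean of the first group repeated three times, and `missing` for the second group).

```julia
# Hypothetical per-group results (not taken from the PR)
a = [2.0, 2.0, 2.0]        # Vector{Float64}
b = [missing, missing]     # Vector{Missing}

# append! mutates its first argument and keeps that argument's eltype,
# so adding `missing` to a Vector{Float64} throws.
append_failed = try
    append!(copy(a), b)
    false
catch
    true
end

# vcat allocates a new vector and promotes the element type.
combined = vcat(a, b)
```

Here `append_failed` ends up `true`, while `combined` has element type `Union{Missing, Float64}`.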

pdeffebach · 2018-08-03T04:04:54Z

@bkamins

This is likely the function to use from DataFrames

https://github.com/JuliaData/DataFrames.jl/blob/56a9b720b9bc61d778fd900c15d6b85309517136/src/abstractdataframe/abstractdataframe.jl#L859

@generated function promote_col_type(cols::AbstractVector...)
    T = mapreduce(x -> Missings.T(eltype(x)), promote_type, cols)
    if CategoricalArrays.iscatvalue(T)
        T = CategoricalArrays.leveltype(T)
    end
    if any(col -> eltype(col) >: Missing, cols)
        if any(col -> col <: AbstractCategoricalArray, cols)
            return :(CategoricalVector{Union{$T, Missing}})
        else
            return :(Vector{Union{$T, Missing}})
        end
    else
        if any(col -> col <: AbstractCategoricalArray, cols)
            return :(CategoricalVector{$T})
        else
            return :(Vector{$T})
        end
    end
end

The way it deals with missings is weird, and maybe is a holdover from before the new small union optimizations. So maybe there is something to be done about this function first while we are at it for this PR.

bkamins · 2018-08-03T05:59:49Z

As far as I understand, its main difference from promote_type is that it tries to preserve the categorical type if you mix categorical and non-categorical vectors.

pdeffebach · 2018-08-03T13:04:20Z

This type promotion is exactly what we want, except for the splatting, which will be slow for high numbers of groups. It would be great if this function existed for arbitrary collections.

pdeffebach · 2018-08-03T13:13:54Z

Something like

d = (v(gi) for gi in g) # simplified because we have a spread_scalar function somewhere
T = DataFrames.promote_col_type(d...)
result[k] = T(reduce(vcat, d))

I think that since DataFrames.promote_col_type is a @generated function, the type-finding step of this is actually quite efficient, with the loop happening at compile time.

pdeffebach · 2018-08-03T23:46:55Z

This is proving to be a pain.

fill doesn't work with CategoricalArrays quite like it should. You can't get the pool to be the same for the vector as it is for the individual element. I will continue looking at the constructor code, though.
It's hard to get the DataFrames function above working with a generator or other lazy object.

I almost want to do PRs to CategoricalArrays so that vcat(v(ig) for ig in g) "just works" without worrying about categorical arrays. But I am sure that CategoricalArrays is a lot more complicated than that, otherwise vcat would already just work.

bkamins · 2018-08-04T07:09:51Z

@nalimilan do you have a second to have a look at it? You probably have the answer ready, having designed and implemented all this 😄. Thanks.

pdeffebach · 2018-08-04T14:26:50Z

For clarity, this is the super contrived example I am imagining.

df = DataFrame(id = [1, 2, 3, 1, 2, 3], year = [95, 95, 95, 96, 96, 96], x = rand(6), some_personal_variable = CategoricalArray([1, 2, missing, 1, missing, missing]))

6×4 DataFrames.DataFrame
│ Row │ id │ year │ x         │ some_personal_variable │
├─────┼────┼──────┼───────────┼────────────────────────┤
│ 1   │ 1  │ 95   │ 0.232321  │ 1                      │
│ 2   │ 2  │ 95   │ 0.0617226 │ 2                      │
│ 3   │ 3  │ 95   │ 0.970737  │ missing                │
│ 4   │ 1  │ 96   │ 0.0731739 │ 1                      │
│ 5   │ 2  │ 96   │ 0.555002  │ missing                │
│ 6   │ 3  │ 96   │ 0.963342  │ missing                │

some_personal_variable is time-invariant, like birth location. But we collected it inconsistently, so individual 2 has it for year 95 but not 96. We want to "spread" that value across other years, with a function like collect(skipmissing(:some_personal_variable))[1] on a grouped dataframe. We do this in a transform because we are too lazy to do a by (Stata collapse) operation and then perform an m:1 merge on the collapsed data.

We want a command that

Knows that my function returns a scalar and spreads the values into a vector accordingly, which is close but not perfect for CategoricalArrays
Promotes types in the appropriate way, which is non-trivial for CategoricalArrays
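The desired behavior can be sketched on plain vectors. This is only an illustration under simplifying assumptions: it uses an ordinary `Vector` rather than a `CategoricalArray`, groups rows by hand instead of using `groupby`, and the helper name `spread_first_nonmissing` is made up. Unlike `collect(skipmissing(...))[1]`, it leaves a group as `missing` when no value was observed.

```julia
# Hypothetical helper: for each group, copy the first non-missing value of
# `values` to every row of that group. Groups are identified by `id`.
function spread_first_nonmissing(id::AbstractVector, values::AbstractVector)
    filled = similar(values, Union{eltype(values), Missing})
    for g in unique(id)
        rows = findall(isequal(g), id)
        present = collect(skipmissing(values[rows]))
        # A group with no observed value stays missing.
        filled[rows] .= (isempty(present) ? missing : first(present))
    end
    return filled
end

id = [1, 2, 3, 1, 2, 3]
some_personal_variable = [1, 2, missing, 1, missing, missing]
filled = spread_first_nonmissing(id, some_personal_variable)
```

With the data from the table above, individual 2 gets the value 2 in both years and individual 3 stays `missing`.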

pdeffebach · 2018-08-12T00:45:09Z

Just got a chance to work on this.

I think all my worrying about types with categorical arrays is overblown! As far as I can tell reduce(vcat,...) seems to do type promotion the right way, in the example I described above.

I think this is ready for another review.

bkamins

Apart from the comments, it would be nice to add several tests. Also, this implementation will probably lose performance, but for me that is acceptable.

bkamins · 2018-08-12T22:26:35Z

src/DataFramesMeta.jl

+
+    function spread_scalar(x::CategoricalArrays.CategoricalValue, length::Int)
+        vec = CategoricalArray(fill(x, length))
+        levels!(vec, CategoricalArrays.index(CategoricalArrays.pool(x)))


Why is this line needed? I understand that without it vec has only one level, yes?

If we have a CategoricalArray [1,2,1,2,1], I was thinking that if you take the first element of each group and fill the rest of the group with that one value, we still want the new vector to have the same levels as the original one.

Without that line you also get this weird case where the pool of the array and the pool of the elements are different.

This is what I thought. Just wanted to be sure.

bkamins · 2018-08-12T22:29:33Z

src/DataFramesMeta.jl

 function transform(g::GroupedDataFrame; kwargs...)
    result = DataFrame(g)
    idx2 = cumsum(Int[size(g[i],1) for i in 1:length(g)])
    idx1 = [1; 1 + idx2[1:end-1]]
+
+    function spread_scalar(x::Vector, length::Int)
+        return x 


what if x does not have length length (also maybe not use length as name as it clashes with length function).

Good point on changing the length name.

length is the number of observations in each group. So we are telling the function how many times to replicate the result of a vector -> scalar function.

But this is why I am asking - in line 395 you assume that v returned a vector not a scalar and you keep it unchanged. What if the length of this vector is not correct?

You just get a

ERROR: ArgumentError: New columns must have the same length as old columns

Perhaps we should add an error saying that only vector -> vector (of the right length) or vector -> scalar functions are allowed.

You will not get that error when, by accident, the total length of the vectors is OK but the vectors themselves are not of the correct length (e.g. you have two groups of length 3, 6 in total, and the vectors are of lengths 2 and 4). I think we should strictly check that if a Vector is returned, its length equals the length of the group.
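The case described here can be checked with a few lines. A minimal sketch using the numbers from the comment; `group_sizes` and `results` are illustrative names, not variables from the PR.

```julia
group_sizes = [3, 3]
results = [[10, 20], [30, 40, 50, 60]]   # wrong per-group lengths: 2 and 4

# A check on the concatenated column passes, because 2 + 4 == 3 + 3.
combined = reduce(vcat, results)
total_ok = length(combined) == sum(group_sizes)

# A per-group check finds both offending groups.
mismatched = [i for i in eachindex(results)
              if length(results[i]) != group_sizes[i]]
```

Only the per-group comparison reveals that the rows would be assigned to the wrong groups.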

Okay, so

length(x) == obs_in_group ? x : error...

Looks good to me 👍.

bkamins · 2018-08-12T22:32:46Z

src/DataFramesMeta.jl

@@ -387,17 +386,29 @@ function transform(d::Union{AbstractDataFrame, AbstractDict}; kwargs...)
    return result
 end

+
 function transform(g::GroupedDataFrame; kwargs...)
    result = DataFrame(g)
    idx2 = cumsum(Int[size(g[i],1) for i in 1:length(g)])
    idx1 = [1; 1 + idx2[1:end-1]]


I think idx1 and idx2 are not needed?

true, sorry.
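For reference, the idx1/idx2 bookkeeping in the hunk above computes the first and last row of each group. A small sketch with made-up group sizes; note that, as far as I know, on Julia 1.0 and later adding a scalar to a vector needs the dotted `.+`, so the expression is written that way here.

```julia
group_sizes = [3, 2, 4]                  # hypothetical number of rows per group

idx2 = cumsum(group_sizes)               # last row of each group
idx1 = [1; 1 .+ idx2[1:end-1]]           # first row of each group
```

Group `i` then occupies rows `idx1[i]:idx2[i]` of the result.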

bkamins

Also I got a notification of your following comment:

Sorry maybe I'm wrong but what am I doing? We should just call @based_on, because it seems to do the exact thing we want.

But I cannot find it. Have you deleted it and if not what did you refer to?

bkamins · 2018-08-13T07:16:57Z

src/DataFramesMeta.jl

-    idx1 = [1; 1 + idx2[1:end-1]]
+
+    function spread_scalar(x::Vector, obs_in_group::Int)
+        length(x) == obs_in_group ? x : error("Functions must return either a vector the same length as each groupor a scalar")


you can create the error message in the line above and pass it as a variable to avoid overlong line.
Additionally this statement is a bit imprecise (and there is no space between group and or).
What I mean by imprecise is that actually we accept non-scalars, e.g. matrices, and we would repeat them as if they were scalars.

Maybe it is better to write just that: "If a function returns a vector it must have the same length as a group" (or something like this - I am not a native speaker so maybe there is a better way to word it 😞)?

pdeffebach · 2018-08-13T14:49:03Z

Apologies, the @based_on docstring had me confused. If you have two operations, one going to a scalar and the other to a vector, the scalar result is spread out, which made it seem like transform. But @based_on also drops columns.

Very excited to get describe working with a grouped dataframe, because the dplyr strategy of performing grouped operations just to get summary statistics is annoying.

bkamins · 2018-08-13T18:59:39Z

src/DataFramesMeta.jl

+        if length(x) == obs_in_group 
+            return x 
+        else
+            errormessage = "If a function returns a vector, it must have the" *


space is missing after the (actually if you do if-else you do not have to create the variable - just put the string inside error message).
Also using error is not supported in Julia 1.0, I would use throw(DimensionError(" .... as this is exactly the problem here.

bkamins · 2018-08-13T19:04:35Z

src/DataFramesMeta.jl

+
+    for (k, v) in kwargs
+        spreading_helper = x -> spread_scalar(v(x), size(x, 1))
+        result[k] = reduce(vcat, spreading_helper(ig) for ig in g)


Why is this better than simply:

result[k] = reduce(vcat, spread_scalar(v(ig), size(ig, 1)) for ig in g)

bkamins · 2018-08-13T19:06:18Z

Excellent. I have left small comments. Can you copy-paste here the result of describe (or maybe even add it to the tests - so that people see that it can be used this way)?

pdeffebach · 2018-08-15T15:15:44Z

Just added those things.

w.r.t. describe, I have to make some PRs to DataFrames to get it working. I was just saying that this dplyr-esque workflow often uses grouped operations to get at-a-glance summary statistics, and it doesn't have to be that way.

bkamins

Looks good to me.

nalimilan · 2018-08-16T13:58:21Z

src/DataFramesMeta.jl

        end
    end
+
+    function spread_scalar(x::CategoricalArrays.CategoricalValue, obs_in_group::Int)


DataFramesMeta currently doesn't depend on CategoricalArrays. I guess the best solution is to implement fill(x::CatValue, dims) in CategoricalArrays so that the generic method below also works here.

But actually, I'm not even sure calling fill for scalars is a good idea. Given my comment below about the missing vcat optimization, I think you'd better return (x for _ in 1:obs_in_group), so that you avoid allocating an unnecessary vector. Anyway vcat (currently) chooses the type of the returned vector according to the type of its first argument, and it really won't take into account whether other arguments are CategoricalArray or not. If you want to support that you need to call similar on the first vector, with the type of the first returned entry.

Actually, (x for _ in 1:obs_in_group) isn't indexable, so my suggestion won't work. I guess the best solution is to have an if x isa AbstractArray branch in the for loop below, rather than dispatching on methods.

nalimilan · 2018-08-16T13:58:51Z

src/DataFramesMeta.jl

-        for i in 2:length(g)
-            result[idx1[i]:idx2[i], k] = v(g[i])
+
+    function spread_scalar(x::Vector, obs_in_group::Int)


::AbstractVector and ::Integer would be better (same below for the latter).

Also, maybe broadcast_scalar, repeat_scalar or recycle_scalar would use a more common terminology?

nalimilan · 2018-08-16T13:59:37Z

src/DataFramesMeta.jl

+        if length(x) == obs_in_group 
+            return x 
+        else
+            throw(DimensionError("If a function returns a vector, the result " * 


Would be nice to print the expected length and the actual length, as it can make debugging much easier.
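One way such a message could look is sketched below. The helper name `check_group_length` is hypothetical. The diff above writes `DimensionError`; the exception type that Base provides for this situation is `DimensionMismatch`, which is what the sketch throws.

```julia
# Hypothetical helper mirroring the length check discussed above.
function check_group_length(x::AbstractVector, obs_in_group::Integer)
    if length(x) != obs_in_group
        throw(DimensionMismatch(
            "If a function returns a vector, the result must have the same " *
            "length as the group: expected $obs_in_group, got $(length(x))"))
    end
    return x
end

# Capture the message for a result that is one element too short.
message = try
    check_group_length([1, 2], 3)
    ""
catch err
    err isa DimensionMismatch ? err.msg : rethrow()
end
```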

nalimilan · 2018-08-16T14:02:52Z

test/grouping.jl

@@ -21,4 +21,8 @@ d = DataFrame(n = 1:20, x = [3, 3, 3, 3, 1, 1, 1, 2, 1, 1, 2, 1, 1, 2, 2, 2, 3,
 g = groupby(d, :x, sort=true)
 @test @based_on(g, nsum = sum(:n))[:nsum] == [99, 84, 27]

+d = DataFrame(a = [1,1,1,2,2], b = [1,2,3,missing,missing])


If you expect a particular behavior for CategoricalArray, it should be tested (and a dependency on it be added only for testing).

nalimilan · 2018-08-16T14:07:03Z

src/DataFramesMeta.jl

@@ -378,7 +378,6 @@ end
 ## transform & @transform
 ##
 ##############################################################################
-


No need to remove this line, nor to add one below?

nalimilan · 2018-08-16T14:24:47Z

src/DataFramesMeta.jl

+    end
+
+    for (k, v) in kwargs
+        result[k] = reduce(vcat, spread_scalar(v(ig), size(ig, 1)) for ig in g)


In theory using a generator as you do is better, but in practice I'm afraid that's going to be slow because the optimized reduce(vcat, ...) method from JuliaLang/julia#27188 doesn't exist for generators. For a large number of groups, allocating a new copy for each new entry is going to kill performance.
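The two call shapes being compared are sketched below with made-up data. Both give the same result; according to the comment above, only the vector-of-vectors form hit the optimized method at the time, and whether that still holds depends on the Julia version.

```julia
parts = [fill(i, 3) for i in 1:4]              # hypothetical per-group results

# A vector of vectors.
from_vector = reduce(vcat, parts)

# A lazy generator over the same results.
from_generator = reduce(vcat, (p for p in parts))
```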

I've filed JuliaLang/julia#28691 to track this. In the meantime, better keep the existing approach which creates columns and fills them manually, with a reference to that issue.

In the meantime, better keep the existing approach which creates columns and fills them manually, with a reference to that issue.

Thanks. The original problem is that if the first group returns [1,2,3] and the second group returns [missing, missing, missing], there is an error because the original vector allocated would be of type Int.

Do you want me to re-work this PR so that promote_type works better and we allocate the right vector the first time?

Or should we put this PR on hold and wait for vcat to work better?

OK. Then the solution is to do something like mapreduce(eltype, promote_type, A) to choose the best element type, as in JuliaLang/julia#27188.
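A minimal sketch of that suggestion, assuming all per-group results are already available as vectors. The function name `concat_promoted` is made up, and the input is wrapped in `Any[...]` so that the array literal itself does not promote the inner element types.

```julia
# Choose the element type once with mapreduce, allocate, then fill.
function concat_promoted(results)
    T = mapreduce(eltype, promote_type, results)
    out = Vector{T}(undef, sum(length, results))
    pos = 1
    for r in results
        out[pos:pos + length(r) - 1] = r
        pos += length(r)
    end
    return out
end

out = concat_promoted(Any[[1, 2, 3], [missing, missing, missing]])
```

For the failing case described above, the column is allocated as `Vector{Union{Missing, Int}}` from the start, so filling in the second group no longer errors.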

I don't think we can wait for vcat to improve, since it's not clear when it will happen (maybe it needs to wait for 2.0 to avoid breakage). See JuliaLang/julia#18472.

pdeffebach · 2018-08-19T00:07:47Z

Based on the most recent comments, I think I might rebase and start anew. To recap: the goal was to fix three problems relating to @transform on a grouped dataframe, which is a grouped operation, but without the collapse.

The original code called eltype on the returned value, meaning that functions returning strings got a char array. So when it tried to fill in that array with the output, an error would be thrown.
The type of the new vector is determined by the first group only, so a function that returned all missing for a group, for example, would throw an error.
Categorical arrays aren't preserved since we allocate a new array.

I thought all of these problems could be avoided, and the code would be simpler, by

Making vector -> scalar functions return a vector via my spread_scalar operation.
Using vcat's type promotion rules to avoid allocating a return vector of a specific type. This seemed especially worthwhile given CategoricalArray's odd type promotion rules.

My current proposal for this is to keep the current set up (allocating a vector of a certain type, then filling it in as needed), but with two changes.

Better type promotion function using reduce as Milan suggested.
Use dispatch to decide if something is
1. A string
2. A scalar in general
  And handle those cases separately.

We can worry about the behavior of CategoricalArrays later, and give more thought to what behavior we want upstream with regard to type promotion.

tshort · 2018-08-19T01:22:55Z

Good plan, @pdeffebach.

pdeffebach · 2018-09-17T22:15:36Z

I successfully rebased and squashed the commits.. I think. The PR now looks more similar to what the existing code was.

For each key-value pair in kwargs I create a generator (v(gi) for gi in g) and then use that to use mapreduce and find the right type promotion.

I don't actually use dispatch for this. Rather, I just use plain old if...else. Given that none of this is type stable, I don't think it's an issue. And I don't really know how I would dispatch on the output from the first element of the generator.

nalimilan · 2018-09-18T08:34:56Z

Sorry if I missed this in my previous comments, but there's a problem with that approach: you compute transformations twice, which is going to be much slower. The solution which is generally used these days (by map in Base but also in e.g. Tables.jl and IndexedTables.jl) is to allocate a vector based on the eltype of the first result, and then check for each group whether its eltype is a subtype of the vector's eltype. If not, allocate a new vector (choosing the eltype using promote_type), and copy the old data to it. In general that's quite efficient since for a Union{T,Missing} vector only one (partial) copy will be made (and generally it will happen quite soon so not many elements have to be copied).
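The strategy described here can be sketched on plain vectors. This is an illustration under stated assumptions, not the code from DataFrames or this PR: `groups` is a vector of vectors standing in for a GroupedDataFrame, `f` must return a vector as long as its group, and `transform_groups` is a made-up name. Each group is evaluated exactly once.

```julia
function transform_groups(f, groups)
    n = sum(length, groups)
    out = f(groups[1])
    # Allocate based on the eltype of the first result.
    col = Vector{eltype(out)}(undef, n)
    col[1:length(out)] = out
    pos = length(out) + 1
    for i in 2:length(groups)
        out = f(groups[i])
        S, T = eltype(out), eltype(col)
        if !(S <: T || promote_type(S, T) <: T)
            # Widen: allocate a new column and copy what was already filled.
            wider = Vector{promote_type(S, T)}(undef, n)
            copyto!(wider, 1, col, 1, pos - 1)
            col = wider
        end
        col[pos:pos + length(out) - 1] = out
        pos += length(out)
    end
    return col
end

groups = [[1, 2], [4], [3, 5]]
# Hypothetical transformation: all-even groups become missing.
f = grp -> any(isodd, grp) ? grp : fill(missing, length(grp))
col = transform_groups(f, groups)
```

Only one widening copy happens in this run, when the second group returns `missing`. Reassigning `col` inside the loop makes the loop type-unstable, which is what the function-barrier variant discussed later in the thread avoids.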

I'm currently preparing a PR using this strategy for by in DataFrames. The code is a bit more involved since the user can return a DataFrame (and not just one vector per transformation), and because I'm calling separate functions which operate on tuples of resulting columns to ensure some type stability (but not full type stability unfortunately). It would be interesting to try this approach here, but even without the type stability tricks you can take inspiration from it to reallocate vectors when needed. See in particular these lines.

nalimilan · 2018-09-18T08:58:35Z

Actually, there's a very interesting challenge here which could give incredible performance with some transformations to how the code works. If instead of passing g[i] to the anonymous function v we passed it directly the columns it operates on, and if we called that from a helper function dedicated to each transformation taking the said columns and applying it to all groups, we would have fully type-stable code and v would probably be inlined. In short, that's the best possible code one could write, even by hand. That would particularly increase performance when the number of groups is large.

Now, the v method taking a tuple of functions already exists, but we need to get the names of the columns it expects. Apparently this can be obtained by extracting the names of arguments via something like Symbol.(getindex.(Base.arg_decl_parts(methods(v).ms[2])[2][2:end], 1)). So it looks like it wouldn't be that hard.

pdeffebach · 2018-09-18T13:06:21Z

Thanks for the feedback. It looks like I had two misconceptions:

1. I thought generators were faster. I thought they were a way to have Julia store stuff without allocating a contiguous array for it.
2. I assumed array re-typing is expensive. I'm glad to hear it's not.

I will try the approach you showed me.

On the other hand, this kind of operation could arguably be in by or something similar. So perhaps eventually DataFramesMeta could just call a by operation.

nalimilan · 2018-09-18T13:19:14Z

1. I thought generators were faster. I thought they were a way to have Julia store stuff without allocating a contiguous array for it.

They are fast in the sense that they do not allocate a vector with all the results. But the downside is that they have to reevaluate the call each time you access them.
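The re-evaluation is easy to observe with a counter. A small sketch; `expensive` and `calls` are illustrative names.

```julia
calls = Ref(0)
expensive(x) = (calls[] += 1; x^2)   # counts how often it is evaluated

gen = (expensive(x) for x in 1:3)    # nothing is computed yet

total_first = sum(gen)               # evaluates expensive three times
total_second = sum(gen)              # evaluates it three more times
```

After both passes the counter reads 6, which is why using one generator both to find the promoted type and to fill the column computes every transformation twice.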

2. I assumed array re-typing is expensive. I'm glad to hear it's not.

Re-typing is expensive since it requires copying all already processed elements. But that's less expensive than other solutions. Hopefully at some point we'll have a way of converting an Array{T} to an Array{Union{T,Missing}} without making a copy.

I will try the approach you showed me.

Cool!

On the other hand, this kind of operation could arguably be in by or something similar. So perhaps eventually DataFramesMeta could just call a by operation.

Actually DataFramesMeta can do things more efficiently than DataFrames since thanks to macros it knows which columns are involved in a computation and can specialize on them. OTC DataFrames's by passes a full data frame to the user-provided function, which creates a type instability (JuliaData/DataFrames.jl#1256). But you're right that @by could probably take advantage of the same approach as @transform(g::GroupedDataFrame, ...). AFAICT these are really the same operation.

pdeffebach · 2018-09-18T14:32:28Z

AFAICT these are really the same operation.

I meant that @transform(g::GroupedDataFrame, ..) spreads the results so that there is no collapse at any point, while all the by operations in DataFrames always collapse, so you get a dataframe where each observation is a group. In base DataFrames I think you would need a by and a join. Which is fine, because that's what DataFramesMeta is for.

pdeffebach · 2018-10-02T15:27:17Z

I would like to see this through so let's add the recursive function.

Do you mean something like this?

function _transform(first::AbstractVector, g::GroupedDataFrame, v::Function, i::Int, t::Vector, N::Int, starts, ends)
    out = v(g[i])
    # check that its a vector and length is right here...
    S = eltype(out)
    T = eltype(t)
    if !(S <: T || promote_type(S, T) <: T)
       # Problem: We have to calculate v(g[i]) again for however many times we promote
        return _transform(first, g, v, i, Tables.allocatecolumn(promote_type(S, T), N))
    end
    t[starts[i]:ends[i]] = out
    # t= _transform(first, g, v, i+1, t) make it truly recursive? 
    return t 
end

nalimilan · 2018-10-02T15:56:28Z

Yes, something like that. Thanks to the return, you don't need the line commented out at the bottom to make it "truly recursive". But you need to assign out to t before calling _transform, and pass i+1 to it rather than i; that avoids calling v twice.

pdeffebach · 2018-10-02T19:44:55Z

I just wrote a recursive implementation that seems good to me (maybe). The good news is that we didn't hurt performance, as seen in my comment on the gist above. Given that I probably made a number of errors in this rewrite which will hurt performance, I am cautiously optimistic about this.

nalimilan · 2018-10-02T20:15:04Z

src/DataFramesMeta.jl

+    end
+    t[starts[i]:ends[i]] = current
+    if i != length(g) 
+        return _transform!(t, i + 1, current, v(g[i+1]), g, v, starts, ends)


The idea is that you'd call _transform only when re-allocating the column. Here you call it for each group, which will trigger a stack overflow for a large number of groups (and is probably slower).

Okay. I think I was misunderstanding.

We still have a for-loop. The only difference, really, is that we push the type promotion into a function. Which only ever has to run recursively once or twice.

pdeffebach · 2018-10-03T14:49:07Z

Thanks for the feedback. I'm still a bit confused though. The implementation I have here is as follows

function _transform(t, out, indexes)
    if we need to promote
         make promoted_vector 
         _transform(promoted_vector, out, indexes)
    end
    t[indexes] = out
    return t # otherwise we return a small view
end 

function transform(g, v)
    first = v(g[1])
    allocate array t
    input first value
    for i in 2:length(g)
        t = _transform(t, v(g[i]), indexes)
    end
end

I hope I am understanding #1520 correctly, because I think this is more or less the same logic used in combine! there.

The issue is that the function _transform sometimes modifies the input vector t, in the case of no type promotion, and sometimes returns a new vector altogether. We also change the type of t throughout the loop, but it seems like that is inevitable. Thankfully Julia assigns by reference, so the extra t = _transform isn't allocating. But it needs to be there for the case where we promote the type.

But you need to assign out to t before calling _transform, and pass i+1 to it rather than i; that avoids calling v twice.

I don't think that works because we need to either increment recursively for all iterations or none of them. If we have a for loop, we have to stick to a single counter and not call out of order.

Let me know if this works. Thanks.

nalimilan · 2018-10-03T15:24:06Z

The trick (as in JuliaData/DataFrames.jl#1520) is to have the loop inside _transform!. You don't change the type of t because you never reassign to it: instead, you call return _transform!(...) with the newly allocated vector (stored using a different name).

pdeffebach · 2018-10-03T15:32:07Z

Ah I see it! colstart there means you only restart the loop at the necessary location! That's clever, thanks.
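The restart-from-where-you-stopped idea can be sketched as follows. This is a simplified illustration, not the code from #1520 or from this PR: `groups` is a vector of vectors standing in for a GroupedDataFrame, `fill_from!` is a made-up name, and the initial column is simply typed `Int` for brevity instead of being based on the first result.

```julia
# The loop lives inside the function. On a type mismatch, a wider column is
# allocated under a new name and the function calls itself to resume at the
# next group, so `col` never changes type within one call.
function fill_from!(col::AbstractVector, f, groups, starts, ends, start::Integer)
    for i in start:length(groups)
        out = f(groups[i])
        S, T = eltype(out), eltype(col)
        if S <: T || promote_type(S, T) <: T
            col[starts[i]:ends[i]] = out
        else
            wider = Vector{promote_type(S, T)}(undef, length(col))
            copyto!(wider, 1, col, 1, starts[i] - 1)   # keep earlier groups
            wider[starts[i]:ends[i]] = out             # reuse `out`, no second call to f
            return fill_from!(wider, f, groups, starts, ends, i + 1)
        end
    end
    return col
end

groups = [[1, 2], [4], [3, 5]]
ends = cumsum(length.(groups))
starts = [1; 1 .+ ends[1:end-1]]
# Hypothetical transformation: all-even groups become missing.
f = grp -> any(isodd, grp) ? grp : fill(missing, length(grp))
col = fill_from!(Vector{Int}(undef, last(ends)), f, groups, starts, ends, 1)
```

The recursion depth equals the number of widenings, not the number of groups, so a large number of groups does not grow the stack.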

pdeffebach · 2018-10-16T18:54:14Z

I think this is following the implementation in #1520, more or less. Performance is better than before without the error checks, and about on par with the error checks.

Let me know if there are big performance traps I am stepping in!

nalimilan

Thanks, looks almost ready. You should be able to improve performance in another PR by specializing on the columns (as noted before).

nalimilan · 2018-10-16T19:50:53Z

src/DataFramesMeta.jl

    end
    return result
 end

+function _transform!(t::AbstractVector, first::AbstractVector, start::Int, g::GroupedDataFrame, v::Function, starts::Vector, ends::Vector)
+    # handle the first case 
+    j = fill_column_vec!(t, first, starts[start], ends[start], size(g[start], 1))


j is a bit weird here. I used that name in the other PR because it was an index. Maybe promoted just like in the other place? Or even better newT/newtype (in both places): promoted sounds like a Boolean.

nalimilan · 2018-10-16T19:51:24Z

src/DataFramesMeta.jl

    end
    return result
 end

+function _transform!(t::AbstractVector, first::AbstractVector, start::Int, g::GroupedDataFrame, v::Function, starts::Vector, ends::Vector)


Wrap all lines at 92 chars.

nalimilan · 2018-10-16T19:53:31Z

src/DataFramesMeta.jl

+    T = eltype(t)
+    promoted = promote_type(elout, T)
+    if (elout <: T || promoted <: T)
+        t[startpoint:endpoint] = out


Better put the return nothing here for clarity.

nalimilan · 2018-10-16T19:53:44Z

src/DataFramesMeta.jl

+end 
+
+function fill_column_any!(t::AbstractVector, out, startpoint::Int, endpoint::Int)
+    if (out isa AbstractVector)


No parentheses.

Suggested change

if (out isa AbstractVector)

if out isa AbstractVector

nalimilan · 2018-10-16T19:54:22Z

src/DataFramesMeta.jl

+    elout = eltype(out)
+    T = eltype(t)
+    promoted = promote_type(elout, T)
+    if (elout <: T || promoted <: T)


No parentheses:

Suggested change

if (elout <: T || promoted <: T)

if elout <: T || promoted <: T

nalimilan · 2018-10-16T19:55:03Z

src/DataFramesMeta.jl

+    T = eltype(t)
+    promoted = promote_type(typout, T)
+    if (typout <: T || promoted <: T)
+        @views t[startpoint:endpoint] .= Ref(out)


@views isn't needed on assignment.

Suggested change

@views t[startpoint:endpoint] .= Ref(out)

t[startpoint:endpoint] .= Ref(out)
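The role of `Ref` in that assignment can be seen with a result that is itself a collection. A small sketch with made-up values: wrapping `out` in `Ref` makes broadcasting treat it as a single value, while without it a tuple (or array) is spread element by element.

```julia
out = (1, 2)                      # one tuple-valued result for a whole group

spread = Vector{Any}(undef, 3)
spread[1:3] .= Ref(out)           # each slot receives the tuple itself

elementwise = Vector{Any}(undef, 2)
elementwise[1:2] .= out           # without Ref, the tuple is iterated
```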

nalimilan · 2018-10-16T19:55:30Z

src/DataFramesMeta.jl

+    for i in (start+1):length(g)
+        out = v(g[i])
+        promoted = fill_column_any!(t, out, starts[i], ends[i])
+        if !(promoted === nothing)


Suggested change

if !(promoted === nothing)

if promoted !== nothing

nalimilan · 2018-10-16T19:56:06Z

src/DataFramesMeta.jl

+    for i in (start+1):length(g)
+        out = v(g[i])
+        promoted = fill_column_vec!(t, out, starts[i], ends[i], size(g[i], 1))
+        if !(promoted === nothing)


Suggested change

if !(promoted === nothing)

if promoted !== nothing

nalimilan · 2018-10-16T19:56:33Z

src/DataFramesMeta.jl

@@ -1,7 +1,6 @@
 module DataFramesMeta

-using DataFrames
-
+using DataFrames, Tables


Re-add empty line.

nalimilan · 2018-10-16T19:57:22Z

test/grouping.jl

+# Type promotion
+@test (@transform(g, t = isequal(:b[1], 1) ? fill(1, length(:b)) : fill(2.0, length(:b)))[:t] .=== 
+                  [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0]) |> all
+# Vectors of different types


Suggested change

# Vectors of different types

# Vectors whose eltypes promote to Any

pdeffebach · 2018-10-16T22:44:37Z

Thanks for the feedback!

pdeffebach · 2018-10-16T22:53:27Z

The performance for the vector fill leaves something to be desired, it seems, using the gist here. With this new commit the "fewer vectors" case is ~ 180 on this branch and ~ 150 on master.

I'm not sure what's going on, though.

nalimilan · 2018-10-17T09:17:35Z

Maybe add @inline fill_column_vec!? The reason why I used separate functions in my PR is that it allowed specializing on the number of columns to copy. But here since there's a single column there's no advantage and the dynamic dispatch can have some overhead. You could even move them inside the function, the code would be clearer.

pdeffebach · 2018-10-17T15:13:11Z

I added @noinline and things improved some. It's worth noting that for the "many groups" case, 5000 as opposed to 1000, there are performance gains of approx. 10%.

nalimilan · 2018-10-17T20:18:15Z

src/DataFramesMeta.jl

@@ -405,9 +405,29 @@ end

 function _transform!(t::AbstractVector, first::AbstractVector, start::Int, 
                     g::GroupedDataFrame, v::Function, starts::Vector, ends::Vector)
+    @inline function fill_column_vec!(t::AbstractVector, out, startpoint::Int, endpoint::Int, len::Int)


What I meant when I said you could move the code inside the function is that if you use @inline, you can just drop the function barrier and put the code directly in the parent function.

It should probably still be a function, since we call it twice per _transform! function. Once for the first case and once for the rest. We need to call it twice because we have to compute first in transform(::GroupedDataFrame,...) so not treating the first case separately would require calculating first twice.

Right. Should be OK as-is then.

nalimilan · 2018-10-17T20:18:45Z

src/DataFramesMeta.jl

-    newtype = fill_column_vec!(t, first, starts[start], ends[start], size(g[start], 1))
-    @assert newtype === nothing
+    newtype_first = fill_column_vec!(t, first, starts[start], ends[start], size(g[start], 1))
+    #@assert newtype_first === nothing


I don't expect this check to be costly. Is it?

Sorry, no, it's not.

nalimilan · 2018-10-17T20:23:27Z

src/DataFramesMeta.jl

+        end     
+        elout = eltype(out)
+        T = eltype(t)
+        newtype = promote_type(elout, T)


Maybe you can make things slightly faster by moving this call after || below, so that the type is only computed when it's needed. You could even try adding elout === T || as a first condition in case it's faster (not sure).

Also elout sounds like it's an element, not a type.

Wow, that's really smart that Julia allows for that. But alas, there isn't any speed gain from this:

if typout <: T || (newtype = promote_type(typout, T)) <: T
    t[startpoint:endpoint] .= Ref(out)
    return nothing
else
    return newtype
end

OK, too bad.

nalimilan

Looks good!

@tshort OK to merge this? It should allow moving on to the next step, which is generating efficient anonymous functions working only on the relevant columns.

tshort · 2018-10-18T19:53:33Z

+1 for me. Thanks for all the work @pdeffebach, and thanks for the great help and interaction, @nalimilan.

pdeffebach mentioned this pull request Aug 2, 2018

Allow for scalars in a grouped @transform #99

Closed

bkamins requested changes Aug 12, 2018

bkamins requested changes Aug 13, 2018

bkamins approved these changes Aug 15, 2018

nalimilan mentioned this pull request Aug 16, 2018

Add optimized implementations of reduce([hv]cat, itr) for iterators with known length JuliaLang/julia#28691

Open

nalimilan reviewed Aug 16, 2018

pdeffebach mentioned this pull request Aug 18, 2018

Upgrade to Julia 1.0 #104

Merged

pdeffebach force-pushed the group-transform-overhaul branch from dba551c to 649e78f Compare September 17, 2018 20:20

Change the way grouped transforms work

1965c7b

Continue working
Better error message and no allocation for iterator
Reduce number of changes, use mapreduce for T
Go back to idx
Add if-else for scalars

pdeffebach force-pushed the group-transform-overhaul branch from 649e78f to 1965c7b Compare September 17, 2018 22:10

Make everything recursive

26cac9c

nalimilan reviewed Oct 2, 2018

Improve recursion, need some comments

28a775b

nalimilan mentioned this pull request Oct 5, 2018

Add Julia h2oai/db-benchmark#30

Closed

Do recursive correctly following #1520

6839b9b

nalimilan reviewed Oct 16, 2018

Respond to recursion comments

dfc4a2f

add @noinline, fix some names

df6503a

fix line breaks

bdf4a79

nalimilan reviewed Oct 17, 2018

Final fixes

6aea278

nalimilan approved these changes Oct 18, 2018

nalimilan merged commit db41eed into JuliaData:master Oct 18, 2018

pdeffebach deleted the group-transform-overhaul branch October 18, 2018 21:26

pdeffebach mentioned this pull request Oct 18, 2018

Grouped by issue with type conversion #98

Closed

bkamins mentioned this pull request Nov 1, 2018

Use faster hashing approach for first CategoricalVector grouping key JuliaData/DataFrames.jl#1565

Merged

pdeffebach mentioned this pull request Oct 7, 2020

Don't rely on Tables #189

Closed

	if (elout <: T || promoted <: T)
	if elout <: T || promoted <: T

	@views t[startpoint:endpoint] .= Ref(out)
	t[startpoint:endpoint] .= Ref(out)

	# Vectors of different types
	# Vectors whose eltypes promote to Any
