Improve performance of by() using NamedTuples

Remove GroupApplied and deprecate combine in favor of map(f, ::GroupedDataFrame). This avoids storing a copy of the per-group data returned by the user-provided function. Take advantage of this by allowing that function to return a NamedTuple.

Introduce two separate code paths depending on whether the first returned object is a DataFrame or a NamedTuple, as the latter allows for more efficient operation by assuming that it represents a single row. Use the same progressive eltype widening approach as Base.map, so that column vectors are filled inside kernel functions where their element types are known. This does not eliminate the type instability caused by the user-provided function taking a DataFrame, but ensuring type stability for half of the operations still improves performance significantly.

Also parameterize GroupedDataFrame on the type of data frame it wraps, and make its column index have a concrete type. Deprecate an old map method for SubDataFrame. Fix a type instability in hcat!.
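
For readers unfamiliar with progressive eltype widening, the following is a minimal sketch of the general technique on a plain vector, not the code added by this commit. The names `collect_widening` and `fill_from!` are hypothetical, and `Base.promote_typejoin` is an internal Base helper used here on the assumption that it is available in the Julia version at hand.

```julia
# Illustrative sketch only: hypothetical helpers, not the functions from this commit.

# Apply `f` to each element of `xs` (assumed to be a 1-based AbstractVector),
# starting with a vector typed after the first result.
function collect_widening(f, xs::AbstractVector)
    isempty(xs) && return Union{}[]
    first_val = f(first(xs))
    dest = Vector{typeof(first_val)}(undef, length(xs))
    dest[1] = first_val
    return fill_from!(dest, f, xs, 2)
end

# Kernel function: the element type `T` of `dest` is known to the compiler here,
# so stores into `dest` are type-stable even though `f` itself may not be.
function fill_from!(dest::Vector{T}, f, xs::AbstractVector, start::Int) where {T}
    for i in start:length(xs)
        val = f(xs[i])
        if val isa T
            dest[i] = val
        else
            # The value does not fit: allocate a wider vector, copy what has
            # been filled so far, and continue in a newly specialized kernel.
            S = Base.promote_typejoin(T, typeof(val))
            wider = Vector{S}(undef, length(dest))
            copyto!(wider, 1, dest, 1, i - 1)
            wider[i] = val
            return fill_from!(wider, f, xs, i + 1)
        end
    end
    return dest
end
```

For example, `collect_widening(x -> x > 1 ? missing : x, [1, 2, 3])` starts with a `Vector{Int}` and switches to a vector that also accepts `missing` when the second result arrives. The cost of widening is paid only when a new type is actually encountered.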
JuliaData · Sep 20, 2018 · 671e69a
1 parent eb21906

Showing 8 changed files with 369 additions and 181 deletions.
diff --git a/docs/src/lib/functions.md b/docs/src/lib/functions.md
@@ -28,7 +28,7 @@ meltdf
 ```@docs
 allowmissing!
 categorical!
-combine
+map
 completecases
 deleterows!
 describe

diff --git a/src/DataFrames.jl b/src/DataFrames.jl
@@ -7,6 +7,7 @@ module DataFrames
 ##############################################################################
 
 using Reexport, StatsBase, SortingAlgorithms, Compat, Statistics, Unicode, Printf
+using Base.Iterators
 @reexport using CategoricalArrays, Missings
 using Base.Sort, Base.Order
 
@@ -28,7 +29,6 @@ export AbstractDataFrame,
        by,
        categorical!,
        colwise,
-       combine,
        completecases,
        deleterows!,
        describe,

diff --git a/src/dataframe/dataframe.jl b/src/dataframe/dataframe.jl
@@ -851,7 +851,7 @@ end
 
 # definition required to avoid hcat! ambiguity
 function hcat!(df1::DataFrame, df2::DataFrame; makeunique::Bool=false)
-    invoke(hcat!, Tuple{DataFrame, AbstractDataFrame}, df1, df2, makeunique=makeunique)
+    invoke(hcat!, Tuple{DataFrame, AbstractDataFrame}, df1, df2, makeunique=makeunique)::DataFrame
 end
 
 hcat!(df::DataFrame, x::AbstractVector; makeunique::Bool=false) =

diff --git a/src/deprecated.jl b/src/deprecated.jl
@@ -1370,3 +1370,8 @@ import Base: show
 @deprecate showall(io::IO, df::GroupedDataFrame) show(io, df, allgroups=true)
 @deprecate showall(df::GroupedDataFrame) show(df, allgroups=true)
 
+import Base: map
+@deprecate map(f::Function, sdf::SubDataFrame) f(sdf)
+
+@deprecate combine(f::Function, gd::GroupedDataFrame) map(f, gd)
+@deprecate combine(gd::GroupedDataFrame) map(identity, gd)