Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support type-stable map and combine on GroupedDataFrame #1601

Merged
merged 16 commits into from
Dec 3, 2018
72 changes: 50 additions & 22 deletions docs/src/man/split_apply_combine.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,25 @@

Many data analysis tasks involve splitting a data set into groups, applying some functions to each of the groups and then combining the results. A standardized framework for handling this sort of computation is described in the paper "[The Split-Apply-Combine Strategy for Data Analysis](http://www.jstatsoft.org/v40/i01)", written by Hadley Wickham.

The DataFrames package supports the Split-Apply-Combine strategy through the `by` function, which takes in three arguments: (1) a DataFrame, (2) one or more columns to split the DataFrame on, and (3) a function or expression to apply to each subset of the DataFrame.
The DataFrames package supports the split-apply-combine strategy through the `by` function, which is a shorthand for `groupby` followed by `map` and/or `combine`. `by` takes in three arguments: (1) a `DataFrame`, (2) one or more columns to split the `DataFrame` on, and (3) a specification of one or more functions to apply to each subset of the `DataFrame`. This specification can be of the following forms:
1. a `col => function` pair indicating that `function` should be called with the vector of values for column `col`, which can be a column name or index
2. a `cols => function` pair indicating that `function` should be called with a named tuple holding columns `cols`, which can be a tuple or vector of names or indices
3. several such pairs, either as positional arguments or as keyword arguments (mixing is not allowed), producing each a single separate column; keyword argument names are used as column names
4. equivalently, a (named) tuple or vector of such pairs
5. a function which will be called with a `SubDataFrame` corresponding to each group; this form should be avoided due to its poor performance

In all of these cases, the function can return either a single row or multiple rows, with a single or multiple columns:
nalimilan marked this conversation as resolved.
Show resolved Hide resolved
- a single value produces a single row and column per group
- a named tuple or `DataFrameRow` produces a single row and one column per field
- a vector produces a single column with one row per entry
- a named tuple of vectors produces one column per field with one row per entry in the vectors
- a `DataFrame` or a matrix produces as many rows and columns as it contains; note that returning a `DataFrame` should be avoided due to its poor performance when the number of groups is large

The kind of return value and the number and names of columns must be the same for all groups.

As a special case, if multiple pairs or a tuple of vectors or pairs is passed (forms 3 and 4 above), each function is required to return a single value or vector, which will produce each a separate column.

nalimilan marked this conversation as resolved.
Show resolved Hide resolved
The name for the resulting column can be chosen either by passing a named tuple of pairs, or by returning a named tuple or a data frame. If no name is provided, it is generated automatically. For functions taking a single column (first form), the input column name is concatenated with the function name: for standard functions like `mean` this will produce columns with names like `SepalLength_mean`; for anonymous functions like `x -> sqrt(x)^e`, the produced columns will be `SepalLength_function`. For functions taking multiple columns (second form), names are `x1`, `x2`, etc.

We show several examples of the `by` function applied to the `iris` dataset below:

Expand Down Expand Up @@ -35,39 +53,49 @@ julia> tail(iris)
│ 5 │ 6.2 │ 3.4 │ 5.4 │ 2.3 │ virginica │
│ 6 │ 5.9 │ 3.0 │ 5.1 │ 1.8 │ virginica │

julia> by(iris, :Species, size)
julia> by(iris, :Species, :PetalLength => mean)
3×2 DataFrame
│ Row │ Species │ x1 │
│ │ Categorical…⍰ │ Tuple… │
├─────┼───────────────┼─────────┤
│ 1 │ setosa │ (50, 5) │
│ 2 │ versicolor │ (50, 5) │
│ 3 │ virginica │ (50, 5) │

julia> by(iris, :Species, df -> mean(df.PetalLength))
3×2 DataFrame
│ Row │ Species │ x1 │
│ │ Categorical…⍰ │ Float64 │
├─────┼───────────────┼─────────┤
│ 1 │ setosa │ 1.462 │
│ 2 │ versicolor │ 4.26 │
│ 3 │ virginica │ 5.552 │

julia> by(iris, :Species, df -> DataFrame(N = size(df, 1)))
│ Row │ Species │ PetalLength_mean │
│ │ String⍰ │ Float64 │
├─────┼────────────┼──────────────────┤
│ 1 │ setosa │ 1.462 │
│ 2 │ versicolor │ 4.26 │
│ 3 │ virginica │ 5.552 │

julia> by(iris, :Species, N = :Species => length) # Chosen column is arbitrary
nalimilan marked this conversation as resolved.
Show resolved Hide resolved
3×2 DataFrame
│ Row │ Species │ N │
│ │ Categorical…⍰ │ Int64 │
├─────┼───────────────┼───────┤
│ 1 │ setosa │ 50 │
│ 2 │ versicolor │ 50 │
│ 3 │ virginica │ 50 │

julia> by(iris, :Species, N = :Species => length, mean = :PetalLength => mean) # Column for length is arbitrary
nalimilan marked this conversation as resolved.
Show resolved Hide resolved
3×3 DataFrame
│ Row │ Species │ N │ mean │
│ │ String⍰ │ Int64 │ Float64 │
├─────┼────────────┼───────┼─────────┤
│ 1 │ setosa │ 50 │ 1.462 │
│ 2 │ versicolor │ 50 │ 4.26 │
│ 3 │ virginica │ 50 │ 5.552 │

julia> by(iris, :Species, [:PetalLength, :SepalLength] =>
x -> (a=mean(x.PetalLength)/mean(x.SepalLength), b=sum(x.PetalLength)))
3×3 DataFrame
│ Row │ Species │ a │ b │
│ │ String⍰ │ Float64 │ Float64 │
├─────┼────────────┼──────────┼─────────┤
│ 1 │ setosa │ 0.29205 │ 73.1 │
│ 2 │ versicolor │ 0.717655 │ 213.0 │
│ 3 │ virginica │ 0.842744 │ 277.6 │
```

The `by` function also support the `do` block form:
The `by` function also supports the `do` block form. However, as noted above, this form is slow and should therefore be avoided when performance matters.

```jldoctest sac
julia> by(iris, :Species) do df
DataFrame(m = mean(df.PetalLength), s² = var(df.PetalLength))
(m = mean(df.PetalLength), s² = var(df.PetalLength))
end
3×3 DataFrame
│ Row │ Species │ m │ s² │
Expand All @@ -78,7 +106,7 @@ julia> by(iris, :Species) do df
│ 3 │ virginica │ 5.552 │ 0.304588 │
```

A second approach to the Split-Apply-Combine strategy is implemented in the `aggregate` function, which also takes three arguments: (1) a DataFrame, (2) one or more columns to split the DataFrame on, and (3) one or more functions that are used to compute a summary of each subset of the DataFrame. Each function is applied to each column that was not used to split the DataFrame, creating new columns of the form `$name_$function`. For named functions like `mean` this will produce columns with names like `SepalLength_mean`. For anonymous functions like `x -> sqrt(x)^e`, which Julia tracks and references by a numerical identifier e.g. `#12`, the produced columns will be `SepalLength_#12`. We show several examples of the `aggregate` function applied to the `iris` dataset below:
A second approach to the Split-Apply-Combine strategy is implemented in the `aggregate` function, which also takes three arguments: (1) a DataFrame, (2) one or more columns to split the DataFrame on, and (3) one or more functions that are used to compute a summary of each subset of the DataFrame. Each function is applied to each column that was not used to split the DataFrame, creating new columns of the form `$name_$function` like with `by` (see above). We show several examples of the `aggregate` function applied to the `iris` dataset below:

```jldoctest sac
julia> aggregate(iris, :Species, length)
Expand Down
6 changes: 6 additions & 0 deletions src/deprecated.jl
Original file line number Diff line number Diff line change
Expand Up @@ -1430,6 +1430,12 @@ function Base.getindex(itr::DataFrameColumns{<:SubDataFrame,AbstractVector}, j)
itr.df[:, j]
end

function Base.getindex(itr::DataFrameColumns{<:SubDataFrame,AbstractVector}, j::Int)
nalimilan marked this conversation as resolved.
Show resolved Hide resolved
Base.depwarn("Indexing into a return value of columns on SubDataFrame will return a" *
" view of column value", :getindex)
itr.df[:, j]
end

function Base.iterate(itr::DataFrameColumns{<:SubDataFrame,
Pair{Symbol, AbstractVector}}, j=1)
Base.depwarn("iterating over value of eachcol on SubDataFrame will return a" *
Expand Down
Loading