JuliaData · nalimilan · Dec 3, 2018 · Nov 17, 2018 · Nov 22, 2018 · Nov 23, 2018
diff --git a/docs/src/man/split_apply_combine.md b/docs/src/man/split_apply_combine.md
@@ -2,7 +2,25 @@
 
 Many data analysis tasks involve splitting a data set into groups, applying some functions to each of the groups and then combining the results. A standardized framework for handling this sort of computation is described in the paper "[The Split-Apply-Combine Strategy for Data Analysis](http://www.jstatsoft.org/v40/i01)", written by Hadley Wickham.
 
-The DataFrames package supports the Split-Apply-Combine strategy through the `by` function, which takes in three arguments: (1) a DataFrame, (2) one or more columns to split the DataFrame on, and (3) a function or expression to apply to each subset of the DataFrame.
+The DataFrames package supports the split-apply-combine strategy through the `by` function, which is a shorthand for `groupby` followed by `map` and/or `combine`. `by` takes in three arguments: (1) a `DataFrame`, (2) one or more columns to split the `DataFrame` on, and (3) a specification of one or more functions to apply to each subset of the `DataFrame`. This specification can be of the following forms:
+1. a `col => function` pair indicating that `function` should be called with the vector of values for column `col`, which can be a column name or index
+2. a `cols => function` pair indicating that `function` should be called with a named tuple holding columns `cols`, which can be a tuple or vector of names or indices
+3. several such pairs, either as positional arguments or as keyword arguments (mixing is not allowed), producing each a single separate column; keyword argument names are used as column names
+4. equivalently, a (named) tuple or vector of such pairs
+5. a function which will be called with a `SubDataFrame` corresponding to each group; this form should be avoided due to its poor performance
+
+In all of these cases, the function can return either a single row or multiple rows, with a single or multiple columns:
+- a single value produces a single row and column per group
+- a named tuple or `DataFrameRow` produces a single row and one column per field
+- a vector produces a single column with one row per entry
+- a named tuple of vectors produces one column per field with one row per entry in the vectors
+- a `DataFrame` or a matrix produces as many rows and columns as it contains; note that returning a `DataFrame` should be avoided due to its poor performance when the number of groups is large
+
+The kind of return value and the number and names of columns must be the same for all groups.
+
+As a special case, if multiple pairs or a tuple of vectors or pairs is passed (forms 3 and 4 above), each function is required to return a single value or vector, which will produce each a separate column.
+
+The name for the resulting column can be chosen either by passing a named tuple of pairs, or by returning a named tuple or a data frame. If no name is provided, it is generated automatically. For functions taking a single column (first form), the input column name is concatenated with the function name: for standard functions like `mean` this will produce columns with names like `SepalLength_mean`; for anonymous functions like `x -> sqrt(x)^e`, the produced columns will be `SepalLength_function`. For functions taking multiple columns (second form), names are `x1`, `x2`, etc.
 
 We show several examples of the `by` function applied to the `iris` dataset below:
 
@@ -35,39 +53,49 @@ julia> tail(iris)
 │ 5   │ 6.2         │ 3.4        │ 5.4         │ 2.3        │ virginica     │
 │ 6   │ 5.9         │ 3.0        │ 5.1         │ 1.8        │ virginica     │
 
-julia> by(iris, :Species, size)
+julia> by(iris, :Species, :PetalLength => mean)
 3×2 DataFrame
-│ Row │ Species       │ x1      │
-│     │ Categorical…⍰ │ Tuple…  │
-├─────┼───────────────┼─────────┤
-│ 1   │ setosa        │ (50, 5) │
-│ 2   │ versicolor    │ (50, 5) │
-│ 3   │ virginica     │ (50, 5) │
-
-julia> by(iris, :Species, df -> mean(df.PetalLength))
-3×2 DataFrame
-│ Row │ Species       │ x1      │
-│     │ Categorical…⍰ │ Float64 │
-├─────┼───────────────┼─────────┤
-│ 1   │ setosa        │ 1.462   │
-│ 2   │ versicolor    │ 4.26    │
-│ 3   │ virginica     │ 5.552   │
-
-julia> by(iris, :Species, df -> DataFrame(N = size(df, 1)))
+│ Row │ Species    │ PetalLength_mean │
+│     │ String⍰    │ Float64          │
+├─────┼────────────┼──────────────────┤
+│ 1   │ setosa     │ 1.462            │
+│ 2   │ versicolor │ 4.26             │
+│ 3   │ virginica  │ 5.552            │
+
+julia> by(iris, :Species, N = :Species => length) # Chosen column is arbitrary
 3×2 DataFrame
 │ Row │ Species       │ N     │
 │     │ Categorical…⍰ │ Int64 │
 ├─────┼───────────────┼───────┤
 │ 1   │ setosa        │ 50    │
 │ 2   │ versicolor    │ 50    │
 │ 3   │ virginica     │ 50    │
+
+julia> by(iris, :Species, N = :Species => length, mean = :PetalLength => mean) # Column for length is arbitrary
+3×3 DataFrame
+│ Row │ Species    │ N     │ mean    │
+│     │ String⍰    │ Int64 │ Float64 │
+├─────┼────────────┼───────┼─────────┤
+│ 1   │ setosa     │ 50    │ 1.462   │
+│ 2   │ versicolor │ 50    │ 4.26    │
+│ 3   │ virginica  │ 50    │ 5.552   │
+
+julia> by(iris, :Species, [:PetalLength, :SepalLength] =>
+              x -> (a=mean(x.PetalLength)/mean(x.SepalLength), b=sum(x.PetalLength)))
+3×3 DataFrame
+│ Row │ Species    │ a        │ b       │
+│     │ String⍰    │ Float64  │ Float64 │
+├─────┼────────────┼──────────┼─────────┤
+│ 1   │ setosa     │ 0.29205  │ 73.1    │
+│ 2   │ versicolor │ 0.717655 │ 213.0   │
+│ 3   │ virginica  │ 0.842744 │ 277.6   │
 ```
 
-The `by` function also support the `do` block form:
+The `by` function also supports the `do` block form. However, as noted above, this form is slow and should therefore be avoided when performance matters.
 
 ```jldoctest sac
 julia> by(iris, :Species) do df
-          DataFrame(m = mean(df.PetalLength), s² = var(df.PetalLength))
+          (m = mean(df.PetalLength), s² = var(df.PetalLength))
        end
 3×3 DataFrame
 │ Row │ Species       │ m       │ s²        │
@@ -78,7 +106,7 @@ julia> by(iris, :Species) do df
 │ 3   │ virginica     │ 5.552   │ 0.304588  │
 ```
 
-A second approach to the Split-Apply-Combine strategy is implemented in the `aggregate` function, which also takes three arguments: (1) a DataFrame, (2) one or more columns to split the DataFrame on, and (3) one or more functions that are used to compute a summary of each subset of the DataFrame. Each function is applied to each column that was not used to split the DataFrame, creating new columns of the form `$name_$function`. For named functions like `mean` this will produce columns with names like `SepalLength_mean`. For anonymous functions like `x -> sqrt(x)^e`, which Julia tracks and references by a numerical identifier e.g. `#12`, the produced columns will be `SepalLength_#12`. We show several examples of the `aggregate` function applied to the `iris` dataset below:
+A second approach to the Split-Apply-Combine strategy is implemented in the `aggregate` function, which also takes three arguments: (1) a DataFrame, (2) one or more columns to split the DataFrame on, and (3) one or more functions that are used to compute a summary of each subset of the DataFrame. Each function is applied to each column that was not used to split the DataFrame, creating new columns of the form `$name_$function` like with `by` (see above). We show several examples of the `aggregate` function applied to the `iris` dataset below:
 
 ```jldoctest sac
 julia> aggregate(iris, :Species, length)

diff --git a/src/deprecated.jl b/src/deprecated.jl
@@ -1430,6 +1430,12 @@ function Base.getindex(itr::DataFrameColumns{<:SubDataFrame,AbstractVector}, j)
     itr.df[:, j]
 end
 
+function Base.getindex(itr::DataFrameColumns{<:SubDataFrame,AbstractVector}, j::Int)
+    Base.depwarn("Indexing into a return value of columns on SubDataFrame will return a" *
+                 " view of column value", :getindex)
+    itr.df[:, j]
+end
+
 function Base.iterate(itr::DataFrameColumns{<:SubDataFrame,
                                             Pair{Symbol, AbstractVector}}, j=1)
     Base.depwarn("iterating over value of eachcol on SubDataFrame will return a" *