JuliaData · bkamins · Apr 14, 2020 · Apr 11, 2020 · Apr 11, 2020 · Apr 11, 2020
diff --git a/docs/src/lib/types.md b/docs/src/lib/types.md
@@ -52,6 +52,9 @@ but they are columns of a `DataFrame` returned by `stack` with `view=true`.
 The `ByRow` type is a special type used for selection operations to signal that the wrapped function should be applied
 to each element (row) of the selection.
 
+The `AsTable` type is a special type used for selection operations to signal that the columns selected by a wrapped
+selector should be passed as a `NamedTuple` to the function.
+
 ## [The design of handling of columns of a `DataFrame`](@id man-columnhandling)
 
 When a `DataFrame` is constructed columns are copied by default. You can disable

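To make the `AsTable` note in `docs/src/lib/types.md` concrete, here is a minimal plain-Julia sketch of the two calling conventions. It does not call DataFrames.jl; the `cols` value and the helpers `add_positional` and `add_named` are hypothetical stand-ins for the selected columns and the user function.

```julia
# Hypothetical stand-in for three selected columns of a data frame.
cols = (x1 = [1, 2], x2 = [3, 4], y = [5, 6])

# Default convention: the selected columns arrive as separate positional arguments.
add_positional(x1, x2, y) = x1 + x2 + y

# AsTable convention: the selected columns arrive as one NamedTuple,
# so the function can refer to them by name.
add_named(nt) = nt.x1 + nt.x2 + nt.y

positional_result = add_positional(values(cols)...)
named_result = add_named(cols)

println(positional_result)  # [9, 12]
println(named_result)       # [9, 12]
```

Both conventions give the same result here; the named form mainly helps when the function needs to know which column a value came from.
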
diff --git a/docs/src/man/getting_started.md b/docs/src/man/getting_started.md
@@ -634,7 +634,9 @@ julia> df
 ```
 
 `transform` and `transform!` functions work identically to `select` and `select!` with the only difference that
-they retain all columns that are present in the source data frame, for example:
+they retain all columns that are present in the source data frame. Here are some more advanced examples.
+
+First we show how to generate a column that is a sum of all other columns in the data frame using the `All()` selector:
 
 ```jldoctest dataframe
 julia> df = DataFrame(x1=[1, 2], x2=[3, 4], y=[5, 6])
@@ -653,8 +655,86 @@ julia> transform(df, All() => +)
 │ 1   │ 1     │ 3     │ 5     │ 9         │
 │ 2   │ 2     │ 4     │ 6     │ 12        │
 ```
+Here we convert each row of the data frame into a `NamedTuple` that retains
+the source column names.
+```
+julia> transform(df, AsTable(:) => ByRow(identity))
+2×4 DataFrame
+│ Row │ x1    │ x2    │ y     │ x1_x2_y_identity        │
+│     │ Int64 │ Int64 │ Int64 │ NamedTuple…             │
+├─────┼───────┼───────┼───────┼─────────────────────────┤
+│ 1   │ 1     │ 3     │ 5     │ (x1 = 1, x2 = 3, y = 5) │
+│ 2   │ 2     │ 4     │ 6     │ (x1 = 2, x2 = 4, y = 6) │
+```
+Note that the same column could be generated by using the `Tables.rowtable` function:
+```
+julia> Tables.rowtable(df)
+2-element Array{NamedTuple{(:x1, :x2, :y),Tuple{Int64,Int64,Int64}},1}:
+ (x1 = 1, x2 = 3, y = 5)
+ (x1 = 2, x2 = 4, y = 6)
+```
+Now assume that a data frame `df` contains predictions from a model producing scores
+for three levels `a`, `b` and `c` of a nominal target variable.
+For each row we want to get the level with the highest score.
+```
+julia> using Random
+
+julia> Random.seed!(1);
+
+julia> df = DataFrame(rand(10, 3), [:a, :b, :c])
+10×3 DataFrame
+│ Row │ a          │ b         │ c         │
+│     │ Float64    │ Float64   │ Float64   │
+├─────┼────────────┼───────────┼───────────┤
+│ 1   │ 0.236033   │ 0.555751  │ 0.0769509 │
+│ 2   │ 0.346517   │ 0.437108  │ 0.640396  │
+│ 3   │ 0.312707   │ 0.424718  │ 0.873544  │
+│ 4   │ 0.00790928 │ 0.773223  │ 0.278582  │
+│ 5   │ 0.488613   │ 0.28119   │ 0.751313  │
+│ 6   │ 0.210968   │ 0.209472  │ 0.644883  │
+│ 7   │ 0.951916   │ 0.251379  │ 0.0778264 │
+│ 8   │ 0.999905   │ 0.0203749 │ 0.848185  │
+│ 9   │ 0.251662   │ 0.287702  │ 0.0856352 │
+│ 10  │ 0.986666   │ 0.859512  │ 0.553206  │
+
+julia> transform(df, AsTable(:) => ByRow(argmax) => :prediction)
+10×4 DataFrame
+│ Row │ a          │ b         │ c         │ prediction │
+│     │ Float64    │ Float64   │ Float64   │ Symbol     │
+├─────┼────────────┼───────────┼───────────┼────────────┤
+│ 1   │ 0.236033   │ 0.555751  │ 0.0769509 │ b          │
+│ 2   │ 0.346517   │ 0.437108  │ 0.640396  │ c          │
+│ 3   │ 0.312707   │ 0.424718  │ 0.873544  │ c          │
+│ 4   │ 0.00790928 │ 0.773223  │ 0.278582  │ b          │
+│ 5   │ 0.488613   │ 0.28119   │ 0.751313  │ c          │
+│ 6   │ 0.210968   │ 0.209472  │ 0.644883  │ c          │
+│ 7   │ 0.951916   │ 0.251379  │ 0.0778264 │ a          │
+│ 8   │ 0.999905   │ 0.0203749 │ 0.848185  │ a          │
+│ 9   │ 0.251662   │ 0.287702  │ 0.0856352 │ b          │
+│ 10  │ 0.986666   │ 0.859512  │ 0.553206  │ a          │
+```
+In the most complex example below we compute the row-wise sum, number of elements, and mean,
+while ignoring missing values.
+```
+julia> using Statistics
+
+julia> df = DataFrame(x=[1, 2, missing], y=[1, missing, missing]);
+
+julia> transform(df, AsTable(:) .=>
+                     ByRow.([sum∘skipmissing,
+                             x -> count(!ismissing, x),
+                             mean∘skipmissing]) .=>
+                     [:sum, :n, :mean])
+3×5 DataFrame
+│ Row │ x       │ y       │ sum   │ n     │ mean    │
+│     │ Int64?  │ Int64?  │ Int64 │ Int64 │ Float64 │
+├─────┼─────────┼─────────┼───────┼───────┼─────────┤
+│ 1   │ 1       │ 1       │ 2     │ 2     │ 1.0     │
+│ 2   │ 2       │ missing │ 2     │ 1     │ 2.0     │
+│ 3   │ missing │ missing │ 0     │ 0     │ NaN     │
+```
 
-While the DataFrames package provides basic data manipulation capabilities,
+While the DataFrames.jl package provides basic data manipulation capabilities,
 users are encouraged to use querying frameworks for more convenient and powerful operations:
 - the [Query.jl](https://github.com/davidanthoff/Query.jl) package provides a
 [LINQ](https://msdn.microsoft.com/en-us/library/bb397926.aspx)-like interface to a large number of data sources

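The row-wise examples added to `docs/src/man/getting_started.md` rely on ordinary reductions accepting a `NamedTuple`. The following is a minimal plain-Julia sketch, assuming each row reaches the function as a `NamedTuple`; it does not call DataFrames.jl. The `rows` values are copied from rows 1, 2 and 7 of the printed scores, and `partial` is an illustrative row with one missing value.

```julia
using Statistics

# Three score rows taken from the printed data frame (rows 1, 2 and 7).
rows = [(a = 0.236033, b = 0.555751, c = 0.0769509),
        (a = 0.346517, b = 0.437108, c = 0.640396),
        (a = 0.951916, b = 0.251379, c = 0.0778264)]

# argmax on a NamedTuple returns the name of the field holding the largest value.
predictions = map(argmax, rows)
println(predictions)  # [:b, :c, :a]

# An illustrative row with one missing value.
partial = (x = 2, y = missing)
row_sum = sum(skipmissing(partial))      # sum of the non-missing values
row_n = count(!ismissing, partial)       # number of non-missing values
row_mean = mean(skipmissing(partial))    # mean of the non-missing values
println((row_sum, row_n, row_mean))  # (2, 1, 2.0)
```
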
diff --git a/docs/src/man/split_apply_combine.md b/docs/src/man/split_apply_combine.md
@@ -28,6 +28,12 @@ each subset of the `DataFrame`. This specification can be of the following forms
 
 All forms except 1 and 6 can be also passed as the first argument to `map`.
 
+There are two special rules that apply to the `cols => function` syntax:
+1. if `cols` is wrapped in an `AsTable` object, then a `NamedTuple` containing the columns
+   selected by `cols` is passed to `function`
+2. if `function` is wrapped in a `ByRow` object, then it will be passed values from single
+   rows of each group and will always return a vector of values produced by `function` applied to them
+
 In all of these cases, `function` can return either a single row or multiple rows.
 `function` can always generate a single column by returning a single value or a vector.
 Additionally, if `by` is passed exactly one `function` and `target_col` is not specified,
@@ -60,31 +66,27 @@ We show several examples of the `by` function applied to the `iris` dataset belo
 ```jldoctest sac
 julia> using DataFrames, CSV, Statistics
 
-julia> iris = DataFrame(CSV.File(joinpath(dirname(pathof(DataFrames)), "../docs/src/assets/iris.csv")));
-
-julia> first(iris, 6)
-6×6 DataFrame
-│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species     │ id    │
-│     │ Float64     │ Float64    │ Float64     │ Float64    │ String      │ Int64 │
-├─────┼─────────────┼────────────┼─────────────┼────────────┼─────────────┼───────┤
-│ 1   │ 5.1         │ 3.5        │ 1.4         │ 0.2        │ Iris-setosa │ 1     │
-│ 2   │ 4.9         │ 3.0        │ 1.4         │ 0.2        │ Iris-setosa │ 2     │
-│ 3   │ 4.7         │ 3.2        │ 1.3         │ 0.2        │ Iris-setosa │ 3     │
-│ 4   │ 4.6         │ 3.1        │ 1.5         │ 0.2        │ Iris-setosa │ 4     │
-│ 5   │ 5.0         │ 3.6        │ 1.4         │ 0.2        │ Iris-setosa │ 5     │
-│ 6   │ 5.4         │ 3.9        │ 1.7         │ 0.4        │ Iris-setosa │ 6     │
-
-julia> last(iris, 6)
-6×6 DataFrame
-│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species        │ id    │
-│     │ Float64     │ Float64    │ Float64     │ Float64    │ String         │ Int64 │
-├─────┼─────────────┼────────────┼─────────────┼────────────┼────────────────┼───────┤
-│ 1   │ 6.7         │ 3.3        │ 5.7         │ 2.5        │ Iris-virginica │ 145   │
-│ 2   │ 6.7         │ 3.0        │ 5.2         │ 2.3        │ Iris-virginica │ 146   │
-│ 3   │ 6.3         │ 2.5        │ 5.0         │ 1.9        │ Iris-virginica │ 147   │
-│ 4   │ 6.5         │ 3.0        │ 5.2         │ 2.0        │ Iris-virginica │ 148   │
-│ 5   │ 6.2         │ 3.4        │ 5.4         │ 2.3        │ Iris-virginica │ 149   │
-│ 6   │ 5.9         │ 3.0        │ 5.1         │ 1.8        │ Iris-virginica │ 150   │
+julia> iris = DataFrame(CSV.File(joinpath(dirname(pathof(DataFrames)), "../docs/src/assets/iris.csv")))
+150×5 DataFrame
+│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species        │
+│     │ Float64     │ Float64    │ Float64     │ Float64    │ String         │
+├─────┼─────────────┼────────────┼─────────────┼────────────┼────────────────┤
+│ 1   │ 5.1         │ 3.5        │ 1.4         │ 0.2        │ Iris-setosa    │
+│ 2   │ 4.9         │ 3.0        │ 1.4         │ 0.2        │ Iris-setosa    │
+│ 3   │ 4.7         │ 3.2        │ 1.3         │ 0.2        │ Iris-setosa    │
+│ 4   │ 4.6         │ 3.1        │ 1.5         │ 0.2        │ Iris-setosa    │
+│ 5   │ 5.0         │ 3.6        │ 1.4         │ 0.2        │ Iris-setosa    │
+│ 6   │ 5.4         │ 3.9        │ 1.7         │ 0.4        │ Iris-setosa    │
+│ 7   │ 4.6         │ 3.4        │ 1.4         │ 0.3        │ Iris-setosa    │
+⋮
+│ 143 │ 5.8         │ 2.7        │ 5.1         │ 1.9        │ Iris-virginica │
+│ 144 │ 6.8         │ 3.2        │ 5.9         │ 2.3        │ Iris-virginica │
+│ 145 │ 6.7         │ 3.3        │ 5.7         │ 2.5        │ Iris-virginica │
+│ 146 │ 6.7         │ 3.0        │ 5.2         │ 2.3        │ Iris-virginica │
+│ 147 │ 6.3         │ 2.5        │ 5.0         │ 1.9        │ Iris-virginica │
+│ 148 │ 6.5         │ 3.0        │ 5.2         │ 2.0        │ Iris-virginica │
+│ 149 │ 6.2         │ 3.4        │ 5.4         │ 2.3        │ Iris-virginica │
+│ 150 │ 5.9         │ 3.0        │ 5.1         │ 1.8        │ Iris-virginica │
 
 julia> by(iris, :Species, :PetalLength => mean)
 3×2 DataFrame
@@ -124,23 +126,25 @@ julia> by(iris, :Species,
 │ 2   │ Iris-versicolor │ 0.717655 │ 213.0   │
 │ 3   │ Iris-virginica  │ 0.842744 │ 277.6   │
 
-julia> by(iris, :Species, 1:2, 1:2 .=> mean, nrow)
-150×6 DataFrame
-│ Row │ Species        │ SepalLength │ SepalWidth │ SepalLength_mean │ SepalWidth_mean │ nrow  │
-│     │ String         │ Float64     │ Float64    │ Float64          │ Float64         │ Int64 │
-├─────┼────────────────┼─────────────┼────────────┼──────────────────┼─────────────────┼───────┤
-│ 1   │ Iris-setosa    │ 5.1         │ 3.5        │ 5.006            │ 3.418           │ 50    │
-│ 2   │ Iris-setosa    │ 4.9         │ 3.0        │ 5.006            │ 3.418           │ 50    │
-│ 3   │ Iris-setosa    │ 4.7         │ 3.2        │ 5.006            │ 3.418           │ 50    │
-│ 4   │ Iris-setosa    │ 4.6         │ 3.1        │ 5.006            │ 3.418           │ 50    │
-│ 5   │ Iris-setosa    │ 5.0         │ 3.6        │ 5.006            │ 3.418           │ 50    │
-⋮
-│ 145 │ Iris-virginica │ 6.7         │ 3.3        │ 6.588            │ 2.974           │ 50    │
-│ 146 │ Iris-virginica │ 6.7         │ 3.0        │ 6.588            │ 2.974           │ 50    │
-│ 147 │ Iris-virginica │ 6.3         │ 2.5        │ 6.588            │ 2.974           │ 50    │
-│ 148 │ Iris-virginica │ 6.5         │ 3.0        │ 6.588            │ 2.974           │ 50    │
-│ 149 │ Iris-virginica │ 6.2         │ 3.4        │ 6.588            │ 2.974           │ 50    │
-│ 150 │ Iris-virginica │ 5.9         │ 3.0        │ 6.588            │ 2.974           │ 50    │
+julia> by(iris, :Species,
+          AsTable([:PetalLength, :SepalLength]) =>
+          x -> std(x.PetalLength) / std(x.SepalLength)) # passing a NamedTuple
+3×2 DataFrame
+│ Row │ Species         │ PetalLength_SepalLength_function │
+│     │ String          │ Float64                          │
+├─────┼─────────────────┼──────────────────────────────────┤
+│ 1   │ Iris-setosa     │ 0.492245                         │
+│ 2   │ Iris-versicolor │ 0.910378                         │
+│ 3   │ Iris-virginica  │ 0.867923                         │
+
+julia> by(iris, :Species, 1:2 => cor, nrow)
+3×3 DataFrame
+│ Row │ Species         │ SepalLength_SepalWidth_cor │ nrow  │
+│     │ String          │ Float64                    │ Int64 │
+├─────┼─────────────────┼────────────────────────────┼───────┤
+│ 1   │ Iris-setosa     │ 0.74678                    │ 50    │
+│ 2   │ Iris-versicolor │ 0.525911                   │ 50    │
+│ 3   │ Iris-virginica  │ 0.457228                   │ 50    │
 
 ```
 

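The two special rules documented in `docs/src/man/split_apply_combine.md` can be sketched in plain Julia as follows. This is only an illustration of the calling conventions, not the grouping code in DataFrames.jl; the miniature data set and the `ratios` and `rowwise` containers are hypothetical.

```julia
using Statistics

# Hypothetical miniature data set with two groups of three rows each.
species     = ["setosa", "setosa", "setosa", "virginica", "virginica", "virginica"]
petallength = [1.0, 2.0, 3.0, 2.0, 4.0, 6.0]
sepallength = [2.0, 4.0, 6.0, 1.0, 2.0, 3.0]

ratios = Dict{String,Float64}()
rowwise = Dict{String,Vector{Float64}}()

for g in unique(species)
    idx = species .== g
    # Rule 1 (AsTable): the function sees one NamedTuple of column vectors per group.
    nt = (PetalLength = petallength[idx], SepalLength = sepallength[idx])
    ratios[g] = std(nt.PetalLength) / std(nt.SepalLength)
    # Rule 2 (ByRow): the function is applied to the values of each single row,
    # and the results are collected into a vector with one entry per row.
    rowwise[g] = map((p, s) -> p / s, nt.PetalLength, nt.SepalLength)
end

println(ratios["setosa"])     # 0.5
println(ratios["virginica"])  # 2.0
println(rowwise["setosa"])    # [0.5, 0.5, 0.5]
```
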
diff --git a/src/DataFrames.jl b/src/DataFrames.jl
@@ -16,6 +16,7 @@ import DataAPI,
 
 export AbstractDataFrame,
        All,
+       AsTable,
        Between,
        ByRow,
        DataFrame,

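The `AsTable` method that this change adds to `filter` in `src/abstractdataframe/abstractdataframe.jl` broadcasts the predicate over an iterator of `NamedTuple`s. A rough plain-Julia sketch of that idea, assuming the same data as the `filter!` docstring example; `cols`, `rows`, `predicate`, `keep` and `kept_rows` are illustrative names, and the comprehension stands in for `Tables.namedtupleiterator`:

```julia
# Hypothetical columns mirroring the data used in the `filter!` docstring example.
cols = (x = [3, 1, 2, 1], y = ["b", "c", "a", "b"])

# Build one NamedTuple per row (a stand-in for Tables.namedtupleiterator).
rows = [NamedTuple{keys(cols)}(vals) for vals in zip(values(cols)...)]

predicate = nt -> nt.x == 1 || nt.y == "b"

# Apply the predicate to every row; the type assertion mirrors the `::Bool` check in the diff.
keep = [predicate(row)::Bool for row in rows]

kept_rows = rows[keep]
println(keep)  # Bool[1, 1, 0, 1]
```

The three retained rows match the rows shown in the docstring output: `(3, "b")`, `(1, "c")` and `(1, "b")`.
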
diff --git a/src/abstractdataframe/abstractdataframe.jl b/src/abstractdataframe/abstractdataframe.jl
@@ -876,7 +876,8 @@ returns `true`.
 If `cols` is not specified then the function is passed `DataFrameRow`s.
 If `cols` is specified then it should be a valid column selector
 (column duplicates are allowed if a vector of `Int` or `Symbol` is passed),
-the function is passed elements of the selected columns as separate positional arguments.
+the function is passed elements of the selected columns as separate positional arguments,
+unless it is an `AsTable` selector, in which case a `NamedTuple` of these arguments is passed.
 
 Passing `cols` leads to a more efficient execution of the operation for large data frames.
 
@@ -918,6 +919,15 @@ julia> filter([:x, :y] => (x, y) -> x == 1 || y == "b", df)
 │ 1   │ 3     │ b      │
 │ 2   │ 1     │ c      │
 │ 3   │ 1     │ b      │
+
+julia> filter(AsTable(:) => nt -> nt.x == 1 || nt.y == "b", df)
+3×2 DataFrame
+│ Row │ x     │ y      │
+│     │ Int64 │ String │
+├─────┼───────┼────────┤
+│ 1   │ 3     │ b      │
+│ 2   │ 1     │ c      │
+│ 3   │ 1     │ b      │
 ```
 """
 Base.filter(f, df::AbstractDataFrame) = _filter_helper(df, f, eachrow(df))
@@ -930,7 +940,23 @@ Base.filter((cols, f)::Pair{<:AbstractVector{Symbol}}, df::AbstractDataFrame) =
 Base.filter((cols, f)::Pair, df::AbstractDataFrame) =
     filter(index(df)[cols] => f, df)
 
-_filter_helper(df, f, cols...) = df[((x...) -> f(x...)::Bool).(cols...), :]
+function _filter_helper(df::AbstractDataFrame, f, cols...)
+    if length(cols) == 0
+        throw(ArgumentError("At least one column must be passed to filter on"))
+    end
+    return df[((x...) -> f(x...)::Bool).(cols...), :]
+end
+
+function Base.filter((cols, f)::Pair{<:AsTable}, df::AbstractDataFrame)
+    dff = select(df, cols.cols, copycols=false)
+    if ncol(dff) == 0
+        throw(ArgumentError("At least one column must be passed to filter on"))
+    end
+    return _filter_helper_astable(df, Tables.namedtupleiterator(dff), f)
+end
+
+_filter_helper_astable(df::AbstractDataFrame, nti::Tables.NamedTupleIterator, f) =
+    df[(x -> f(x)::Bool).(nti), :]
 
 """
     filter!(function, df::AbstractDataFrame)
@@ -940,7 +966,8 @@ Remove rows from data frame `df` for which `function` returns `false`.
 If `cols` is not specified then the function is passed `DataFrameRow`s.
 If `cols` is specified then it should be a valid column selector
 (column duplicates are allowed if a vector of `Int` or `Symbol` is passed),
-the function is passed elements of the selected columns as separate positional arguments.
+the function is passed elements of the selected columns as separate positional arguments,
+unless it is an `AsTable` selector, in which case `NamedTuple`s of these arguments are passed.
 
 Passing `cols` leads to a more efficient execution of the operation for large data frames.
 
@@ -989,6 +1016,17 @@ julia> df
 │ 1   │ 3     │ b      │
 │ 2   │ 1     │ c      │
 │ 3   │ 1     │ b      │
+
+julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"]);
+
+julia> filter!(AsTable(:) => nt -> nt.x == 1 || nt.y == "b", df)
+3×2 DataFrame
+│ Row │ x     │ y      │
+│     │ Int64 │ String │
+├─────┼───────┼────────┤
+│ 1   │ 3     │ b      │
+│ 2   │ 1     │ c      │
+│ 3   │ 1     │ b      │
 ```
 """
 Base.filter!(f, df::AbstractDataFrame) = _filter!_helper(df, f, eachrow(df))
@@ -1001,8 +1039,23 @@ Base.filter!((cols, f)::Pair{<:AbstractVector{Symbol}}, df::AbstractDataFrame) =
 Base.filter!((cols, f)::Pair, df::AbstractDataFrame) =
     filter!(index(df)[cols] => f, df)
 
-_filter!_helper(df, f, cols...) =
-    deleterows!(df, findall(((x...) -> !(f(x...)::Bool)).(cols...)))
+function _filter!_helper(df::AbstractDataFrame, f, cols...)
+    if length(cols) == 0
+        throw(ArgumentError("At least one column must be passed to filter on"))
+    end
+    return deleterows!(df, findall(((x...) -> !(f(x...)::Bool)).(cols...)))
+end
+
+function Base.filter!((cols, f)::Pair{<:AsTable}, df::AbstractDataFrame)
+    dff = select(df, cols.cols, copycols=false)
+    if ncol(dff) == 0
+        throw(ArgumentError("At least one column must be passed to filter on"))
+    end
+    return _filter!_helper_astable(df, Tables.namedtupleiterator(dff), f)
+end
+
+_filter!_helper_astable(df::AbstractDataFrame, nti::Tables.NamedTupleIterator, f) =
+    deleterows!(df, findall((x -> !(f(x)::Bool)).(nti)))
 
 function Base.convert(::Type{Matrix}, df::AbstractDataFrame)
     T = reduce(promote_type, (eltype(v) for v in eachcol(df)))