add AsTable wrapper, disallow NamedTuple in ByRow #2183

Merged
merged 16 commits into from
Apr 14, 2020
3 changes: 3 additions & 0 deletions docs/src/lib/types.md
@@ -52,6 +52,9 @@ but they are columns of a `DataFrame` returned by `stack` with `view=true`.
The `ByRow` type is a special type used for selection operations to signal that the wrapped function should be applied
to each element (row) of the selection.

The `AsTable` type is a special type used for selection operations to signal that the columns selected by a wrapped
selector should be passed as a `NamedTuple` to the function.

## [The design of handling of columns of a `DataFrame`](@id man-columnhandling)

When a `DataFrame` is constructed columns are copied by default. You can disable
84 changes: 82 additions & 2 deletions docs/src/man/getting_started.md
@@ -634,7 +634,9 @@ julia> df
```

`transform` and `transform!` functions work identically to `select` and `select!` with the only difference that
-they retain all columns that are present in the source data frame, for example:
+they retain all columns that are present in the source data frame. Here are some more advanced examples.
+
+First we show how to generate a column that is a sum of all other columns in the data frame using the `All()` selector:

```jldoctest dataframe
julia> df = DataFrame(x1=[1, 2], x2=[3, 4], y=[5, 6])
@@ -653,8 +655,86 @@ julia> transform(df, All() => +)
│ 1 │ 1 │ 3 │ 5 │ 9 │
│ 2 │ 2 │ 4 │ 6 │ 12 │
```
Here we wrap each row of the data frame in a `NamedTuple`, which retains
the source column names.
```
julia> transform(df, AsTable(:) => ByRow(identity))
2×4 DataFrame
│ Row │ x1 │ x2 │ y │ x1_x2_y_identity │
│ │ Int64 │ Int64 │ Int64 │ NamedTuple… │
├─────┼───────┼───────┼───────┼─────────────────────────┤
│ 1 │ 1 │ 3 │ 5 │ (x1 = 1, x2 = 3, y = 5) │
│ 2 │ 2 │ 4 │ 6 │ (x1 = 2, x2 = 4, y = 6) │
```
Note that the same column could be generated by using the `Tables.rowtable` function:
```
julia> Tables.rowtable(df)
2-element Array{NamedTuple{(:x1, :x2, :y),Tuple{Int64,Int64,Int64}},1}:
(x1 = 1, x2 = 3, y = 5)
(x1 = 2, x2 = 4, y = 6)
```
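To make the equivalence concrete, here is a minimal sketch in plain Julia (no DataFrames.jl or Tables.jl needed) of turning a set of columns into a vector of row `NamedTuple`s. The name `cols` is a hypothetical stand-in for the columns of the data frame, and this is not a description of how `Tables.rowtable` is implemented.

```julia
# Hypothetical stand-in for the columns of `df`: a NamedTuple of vectors.
cols = (x1 = [1, 2], x2 = [3, 4], y = [5, 6])

# Build one NamedTuple per row, reusing the column names as field names.
rows = [NamedTuple{keys(cols)}(vals) for vals in zip(cols...)]

println(rows[1])   # (x1 = 1, x2 = 3, y = 5)
```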
Now assume that a data frame `df` contains predictions from a model producing scores
for three levels `a`, `b` and `c` of a nominal target variable.
For each row we want to get the level with the highest score.
```
julia> using Random

julia> Random.seed!(1);

julia> df = DataFrame(rand(10, 3), [:a, :b, :c])
10×3 DataFrame
│ Row │ a │ b │ c │
│ │ Float64 │ Float64 │ Float64 │
├─────┼────────────┼───────────┼───────────┤
│ 1 │ 0.236033 │ 0.555751 │ 0.0769509 │
│ 2 │ 0.346517 │ 0.437108 │ 0.640396 │
│ 3 │ 0.312707 │ 0.424718 │ 0.873544 │
│ 4 │ 0.00790928 │ 0.773223 │ 0.278582 │
│ 5 │ 0.488613 │ 0.28119 │ 0.751313 │
│ 6 │ 0.210968 │ 0.209472 │ 0.644883 │
│ 7 │ 0.951916 │ 0.251379 │ 0.0778264 │
│ 8 │ 0.999905 │ 0.0203749 │ 0.848185 │
│ 9 │ 0.251662 │ 0.287702 │ 0.0856352 │
│ 10 │ 0.986666 │ 0.859512 │ 0.553206 │

julia> transform(df, AsTable(:) => ByRow(argmax) => :prediction)
10×4 DataFrame
│ Row │ a │ b │ c │ prediction │
│ │ Float64 │ Float64 │ Float64 │ Symbol │
├─────┼────────────┼───────────┼───────────┼────────────┤
│ 1 │ 0.236033 │ 0.555751 │ 0.0769509 │ b │
│ 2 │ 0.346517 │ 0.437108 │ 0.640396 │ c │
│ 3 │ 0.312707 │ 0.424718 │ 0.873544 │ c │
│ 4 │ 0.00790928 │ 0.773223 │ 0.278582 │ b │
│ 5 │ 0.488613 │ 0.28119 │ 0.751313 │ c │
│ 6 │ 0.210968 │ 0.209472 │ 0.644883 │ c │
│ 7 │ 0.951916 │ 0.251379 │ 0.0778264 │ a │
│ 8 │ 0.999905 │ 0.0203749 │ 0.848185 │ a │
│ 9 │ 0.251662 │ 0.287702 │ 0.0856352 │ b │
│ 10 │ 0.986666 │ 0.859512 │ 0.553206 │ a │
```
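The `prediction` column holds `Symbol`s because `argmax` applied to a `NamedTuple` returns the key of the largest value rather than a position. A small sketch in plain Julia, using scores copied by hand from rows 1, 2 and 7 of the table above (the variable names are illustrative only):

```julia
# Each NamedTuple mimics one row as the wrapped function would see it.
rows = [(a = 0.236033, b = 0.555751, c = 0.0769509),
        (a = 0.346517, b = 0.437108, c = 0.640396),
        (a = 0.951916, b = 0.251379, c = 0.0778264)]

# `argmax` on a NamedTuple yields the field name of the maximum value.
predictions = map(argmax, rows)   # [:b, :c, :a]
```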
In the following, most complex, example we compute the row-wise sum, number of elements, and mean,
while ignoring missing values.
```
julia> using Statistics

julia> df = DataFrame(x=[1, 2, missing], y=[1, missing, missing]);

julia> transform(df, AsTable(:) .=>
                     ByRow.([sum∘skipmissing,
                             x -> count(!ismissing, x),
                             mean∘skipmissing]) .=>
                     [:sum, :n, :mean])
3×5 DataFrame
│ Row │ x │ y │ sum │ n │ mean │
│ │ Int64? │ Int64? │ Int64 │ Int64 │ Float64 │
├─────┼─────────┼─────────┼───────┼───────┼─────────┤
│ 1 │ 1 │ 1 │ 2 │ 2 │ 1.0 │
│ 2 │ 2 │ missing │ 2 │ 1 │ 2.0 │
│ 3 │ missing │ missing │ 0 │ 0 │ NaN │
```
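The broadcasted `.=>` in the call above expands into one `source => function => target` pair per function. Below is a sketch of that expansion in plain Julia. The `Symbol` `:x` stands in for `AsTable(:)` and bare functions stand in for the `ByRow` wrappers; both substitutions are assumptions made so that the snippet runs without DataFrames.jl.

```julia
# Functions and target column names, matched up element by element.
funs = [sum, length]
targets = [:sum, :n]

# `=>` is right-associative, so each element is `:x => (fun => target)`.
specs = :x .=> funs .=> targets

for spec in specs
    println(spec)
end
```

Each element of `specs` has the same shape as a single transformation written out by hand.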

-While the DataFrames package provides basic data manipulation capabilities,
+While the DataFrames.jl package provides basic data manipulation capabilities,
users are encouraged to use querying frameworks for more convenient and powerful operations:
- the [Query.jl](https://github.com/davidanthoff/Query.jl) package provides a
[LINQ](https://msdn.microsoft.com/en-us/library/bb397926.aspx)-like interface to a large number of data sources
88 changes: 46 additions & 42 deletions docs/src/man/split_apply_combine.md
@@ -28,6 +28,12 @@ each subset of the `DataFrame`. This specification can be of the following forms

All forms except 1 and 6 can also be passed as the first argument to `map`.

There are two special rules that apply to the `cols => function` syntax:
1. if `cols` is wrapped in an `AsTable` object, then a `NamedTuple` containing the columns
   selected by `cols` is passed to `function`
2. if `function` is wrapped in a `ByRow` object, then it is passed values from the single
   rows of each group and always returns a vector of the values produced by applying `function` to them
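
As a rough illustration of the two rules, here is a sketch in plain Julia. It does not use DataFrames.jl: `cols` and `byrow_sketch` are hypothetical names standing in for the selected columns of one group and for what a `ByRow` wrapper does conceptually.

```julia
# Hypothetical selected columns of one group.
cols = (x = [1, 2, 3], y = [10, 20, 30])

# Rule 1: with `AsTable`, the function receives a single NamedTuple
# and reaches the columns by name.
ratio(nt) = sum(nt.x) / sum(nt.y)
ratio(cols)                        # 0.1

# Rule 2: with `ByRow`, the function is called once per row and the
# results are collected into a vector.
byrow_sketch(f, columns...) = map(f, columns...)
byrow_sketch(+, cols.x, cols.y)    # [11, 22, 33]
```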

In all of these cases, `function` can return either a single row or multiple rows.
`function` can always generate a single column by returning a single value or a vector.
Additionally, if `by` is passed exactly one `function` and `target_col` is not specified,
@@ -60,31 +66,27 @@ We show several examples of the `by` function applied to the `iris` dataset belo
```jldoctest sac
julia> using DataFrames, CSV, Statistics

-julia> iris = DataFrame(CSV.File(joinpath(dirname(pathof(DataFrames)), "../docs/src/assets/iris.csv")));
-
-julia> first(iris, 6)
-6×6 DataFrame
-│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │ id │
-│ │ Float64 │ Float64 │ Float64 │ Float64 │ String │ Int64 │
-├─────┼─────────────┼────────────┼─────────────┼────────────┼─────────────┼───────┤
-│ 1 │ 5.1 │ 3.5 │ 1.4 │ 0.2 │ Iris-setosa │ 1 │
-│ 2 │ 4.9 │ 3.0 │ 1.4 │ 0.2 │ Iris-setosa │ 2 │
-│ 3 │ 4.7 │ 3.2 │ 1.3 │ 0.2 │ Iris-setosa │ 3 │
-│ 4 │ 4.6 │ 3.1 │ 1.5 │ 0.2 │ Iris-setosa │ 4 │
-│ 5 │ 5.0 │ 3.6 │ 1.4 │ 0.2 │ Iris-setosa │ 5 │
-│ 6 │ 5.4 │ 3.9 │ 1.7 │ 0.4 │ Iris-setosa │ 6 │
-
-julia> last(iris, 6)
-6×6 DataFrame
-│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │ id │
-│ │ Float64 │ Float64 │ Float64 │ Float64 │ String │ Int64 │
-├─────┼─────────────┼────────────┼─────────────┼────────────┼────────────────┼───────┤
-│ 1 │ 6.7 │ 3.3 │ 5.7 │ 2.5 │ Iris-virginica │ 145 │
-│ 2 │ 6.7 │ 3.0 │ 5.2 │ 2.3 │ Iris-virginica │ 146 │
-│ 3 │ 6.3 │ 2.5 │ 5.0 │ 1.9 │ Iris-virginica │ 147 │
-│ 4 │ 6.5 │ 3.0 │ 5.2 │ 2.0 │ Iris-virginica │ 148 │
-│ 5 │ 6.2 │ 3.4 │ 5.4 │ 2.3 │ Iris-virginica │ 149 │
-│ 6 │ 5.9 │ 3.0 │ 5.1 │ 1.8 │ Iris-virginica │ 150 │
+julia> iris = DataFrame(CSV.File(joinpath(dirname(pathof(DataFrames)), "../docs/src/assets/iris.csv")))
+150×5 DataFrame
+│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
+│ │ Float64 │ Float64 │ Float64 │ Float64 │ String │
+├─────┼─────────────┼────────────┼─────────────┼────────────┼────────────────┤
+│ 1 │ 5.1 │ 3.5 │ 1.4 │ 0.2 │ Iris-setosa │
+│ 2 │ 4.9 │ 3.0 │ 1.4 │ 0.2 │ Iris-setosa │
+│ 3 │ 4.7 │ 3.2 │ 1.3 │ 0.2 │ Iris-setosa │
+│ 4 │ 4.6 │ 3.1 │ 1.5 │ 0.2 │ Iris-setosa │
+│ 5 │ 5.0 │ 3.6 │ 1.4 │ 0.2 │ Iris-setosa │
+│ 6 │ 5.4 │ 3.9 │ 1.7 │ 0.4 │ Iris-setosa │
+│ 7 │ 4.6 │ 3.4 │ 1.4 │ 0.3 │ Iris-setosa │
+│ 143 │ 5.8 │ 2.7 │ 5.1 │ 1.9 │ Iris-virginica │
+│ 144 │ 6.8 │ 3.2 │ 5.9 │ 2.3 │ Iris-virginica │
+│ 145 │ 6.7 │ 3.3 │ 5.7 │ 2.5 │ Iris-virginica │
+│ 146 │ 6.7 │ 3.0 │ 5.2 │ 2.3 │ Iris-virginica │
+│ 147 │ 6.3 │ 2.5 │ 5.0 │ 1.9 │ Iris-virginica │
+│ 148 │ 6.5 │ 3.0 │ 5.2 │ 2.0 │ Iris-virginica │
+│ 149 │ 6.2 │ 3.4 │ 5.4 │ 2.3 │ Iris-virginica │
+│ 150 │ 5.9 │ 3.0 │ 5.1 │ 1.8 │ Iris-virginica │

julia> by(iris, :Species, :PetalLength => mean)
3×2 DataFrame
@@ -124,23 +126,25 @@ julia> by(iris, :Species,
│ 2 │ Iris-versicolor │ 0.717655 │ 213.0 │
│ 3 │ Iris-virginica │ 0.842744 │ 277.6 │

-julia> by(iris, :Species, 1:2, 1:2 .=> mean, nrow)
-150×6 DataFrame
-│ Row │ Species │ SepalLength │ SepalWidth │ SepalLength_mean │ SepalWidth_mean │ nrow │
-│ │ String │ Float64 │ Float64 │ Float64 │ Float64 │ Int64 │
-├─────┼────────────────┼─────────────┼────────────┼──────────────────┼─────────────────┼───────┤
-│ 1 │ Iris-setosa │ 5.1 │ 3.5 │ 5.006 │ 3.418 │ 50 │
-│ 2 │ Iris-setosa │ 4.9 │ 3.0 │ 5.006 │ 3.418 │ 50 │
-│ 3 │ Iris-setosa │ 4.7 │ 3.2 │ 5.006 │ 3.418 │ 50 │
-│ 4 │ Iris-setosa │ 4.6 │ 3.1 │ 5.006 │ 3.418 │ 50 │
-│ 5 │ Iris-setosa │ 5.0 │ 3.6 │ 5.006 │ 3.418 │ 50 │
-│ 145 │ Iris-virginica │ 6.7 │ 3.3 │ 6.588 │ 2.974 │ 50 │
-│ 146 │ Iris-virginica │ 6.7 │ 3.0 │ 6.588 │ 2.974 │ 50 │
-│ 147 │ Iris-virginica │ 6.3 │ 2.5 │ 6.588 │ 2.974 │ 50 │
-│ 148 │ Iris-virginica │ 6.5 │ 3.0 │ 6.588 │ 2.974 │ 50 │
-│ 149 │ Iris-virginica │ 6.2 │ 3.4 │ 6.588 │ 2.974 │ 50 │
-│ 150 │ Iris-virginica │ 5.9 │ 3.0 │ 6.588 │ 2.974 │ 50 │
+julia> by(iris, :Species,
+          AsTable([:PetalLength, :SepalLength]) =>
+              x -> std(x.PetalLength) / std(x.SepalLength)) # passing a NamedTuple
+3×2 DataFrame
+│ Row │ Species │ PetalLength_SepalLength_function │
+│ │ String │ Float64 │
+├─────┼─────────────────┼──────────────────────────────────┤
+│ 1 │ Iris-setosa │ 0.492245 │
+│ 2 │ Iris-versicolor │ 0.910378 │
+│ 3 │ Iris-virginica │ 0.867923 │
+
+julia> by(iris, :Species, 1:2 => cor, nrow)
+3×3 DataFrame
+│ Row │ Species │ SepalLength_SepalWidth_cor │ nrow │
+│ │ String │ Float64 │ Int64 │
+├─────┼─────────────────┼────────────────────────────┼───────┤
+│ 1 │ Iris-setosa │ 0.74678 │ 50 │
+│ 2 │ Iris-versicolor │ 0.525911 │ 50 │
+│ 3 │ Iris-virginica │ 0.457228 │ 50 │

```

1 change: 1 addition & 0 deletions src/DataFrames.jl
@@ -16,6 +16,7 @@ import DataAPI,

export AbstractDataFrame,
       All,
       AsTable,
       Between,
       ByRow,
       DataFrame,
63 changes: 58 additions & 5 deletions src/abstractdataframe/abstractdataframe.jl
@@ -876,7 +876,8 @@ returns `true`.
If `cols` is not specified then the function is passed `DataFrameRow`s.
If `cols` is specified then it should be a valid column selector
(column duplicates are allowed if a vector of `Int` or `Symbol` is passed),
-the function is passed elements of the selected columns as separate positional arguments.
+the function is passed elements of the selected columns as separate positional arguments,
+unless it is an `AsTable` selector, in which case a `NamedTuple` of these arguments is passed.

Passing `cols` leads to a more efficient execution of the operation for large data frames.

@@ -918,6 +919,15 @@ julia> filter([:x, :y] => (x, y) -> x == 1 || y == "b", df)
│ 1 │ 3 │ b │
│ 2 │ 1 │ c │
│ 3 │ 1 │ b │

julia> filter(AsTable(:) => nt -> nt.x == 1 || nt.y == "b", df)
3×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 1 │ c │
│ 3 │ 1 │ b │
```
"""
Base.filter(f, df::AbstractDataFrame) = _filter_helper(df, f, eachrow(df))
@@ -930,7 +940,23 @@ Base.filter((cols, f)::Pair{<:AbstractVector{Symbol}}, df::AbstractDataFrame) =
Base.filter((cols, f)::Pair, df::AbstractDataFrame) =
    filter(index(df)[cols] => f, df)

-_filter_helper(df, f, cols...) = df[((x...) -> f(x...)::Bool).(cols...), :]
+function _filter_helper(df::AbstractDataFrame, f, cols...)
+    if length(cols) == 0
+        throw(ArgumentError("At least one column must be passed to filter on"))
+    end
+    return df[((x...) -> f(x...)::Bool).(cols...), :]
+end
+
+function Base.filter((cols, f)::Pair{<:AsTable}, df::AbstractDataFrame)
+    dff = select(df, cols.cols, copycols=false)
+    if ncol(dff) == 0
+        throw(ArgumentError("At least one column must be passed to filter on"))
+    end
+    return _filter_helper_astable(df, Tables.namedtupleiterator(dff), f)
+end
+
+_filter_helper_astable(df::AbstractDataFrame, nti::Tables.NamedTupleIterator, f) =
+    df[(x -> f(x)::Bool).(nti), :]

"""
filter!(function, df::AbstractDataFrame)
@@ -940,7 +966,8 @@ Remove rows from data frame `df` for which `function` returns `false`.
If `cols` is not specified then the function is passed `DataFrameRow`s.
If `cols` is specified then it should be a valid column selector
(column duplicates are allowed if a vector of `Int` or `Symbol` is passed),
-the function is passed elements of the selected columns as separate positional arguments.
+the function is passed elements of the selected columns as separate positional arguments,
+unless it is an `AsTable` selector, in which case `NamedTuple`s of these arguments are passed.

Passing `cols` leads to a more efficient execution of the operation for large data frames.

@@ -989,6 +1016,17 @@ julia> df
│ 1 │ 3 │ b │
│ 2 │ 1 │ c │
│ 3 │ 1 │ b │

julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"]);

julia> filter!(AsTable(:) => nt -> nt.x == 1 || nt.y == "b", df)
3×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 1 │ c │
│ 3 │ 1 │ b │
```
"""
Base.filter!(f, df::AbstractDataFrame) = _filter!_helper(df, f, eachrow(df))
@@ -1001,8 +1039,23 @@ Base.filter!((cols, f)::Pair{<:AbstractVector{Symbol}}, df::AbstractDataFrame) =
Base.filter!((cols, f)::Pair, df::AbstractDataFrame) =
    filter!(index(df)[cols] => f, df)

-_filter!_helper(df, f, cols...) =
-    deleterows!(df, findall(((x...) -> !(f(x...)::Bool)).(cols...)))
+function _filter!_helper(df::AbstractDataFrame, f, cols...)
+    if length(cols) == 0
+        throw(ArgumentError("At least one column must be passed to filter on"))
+    end
+    return deleterows!(df, findall(((x...) -> !(f(x...)::Bool)).(cols...)))
+end
+
+function Base.filter!((cols, f)::Pair{<:AsTable}, df::AbstractDataFrame)
+    dff = select(df, cols.cols, copycols=false)
+    if ncol(dff) == 0
+        throw(ArgumentError("At least one column must be passed to filter on"))
+    end
+    return _filter!_helper_astable(df, Tables.namedtupleiterator(dff), f)
+end
+
+_filter!_helper_astable(df::AbstractDataFrame, nti::Tables.NamedTupleIterator, f) =
+    deleterows!(df, findall((x -> !(f(x)::Bool)).(nti)))

function Base.convert(::Type{Matrix}, df::AbstractDataFrame)
    T = reduce(promote_type, (eltype(v) for v in eachcol(df)))