add AsTable wrapper, disallow NamedTuple in ByRow #2183

Merged
merged 16 commits into from
Apr 14, 2020
3 changes: 3 additions & 0 deletions docs/src/lib/types.md
@@ -52,6 +52,9 @@ but they are columns of a `DataFrame` returned by `stack` with `view=true`.
The `ByRow` type is a special type used for selection operations to signal that the wrapped function should be applied
to each element (row) of the selection.

The `AsTable` type is a special type used for selection operations to signal that the columns selected by a wrapped
selector should be passed as a `NamedTuple` to the function.
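
As a rough illustration of the difference this wrapper makes, the sketch below contrasts positional passing with `AsTable` passing. The data frame, its column names, and the target column `:total` are hypothetical, and the sketch assumes a DataFrames.jl version in which `AsTable` is exported and `select` accepts the `cols => function => target_col` form.

```julia
using DataFrames

# Hypothetical toy data, not taken from the documentation.
df = DataFrame(a = [1, 2], b = [3, 4])

# Without AsTable: the function receives the selected columns
# as separate positional arguments.
positional = select(df, [:a, :b] => ((a, b) -> a .+ b) => :total)

# With AsTable: the function receives one NamedTuple of column vectors,
# so columns are accessed by name.
as_table = select(df, AsTable([:a, :b]) => (nt -> nt.a .+ nt.b) => :total)

# With AsTable and ByRow: the function receives one NamedTuple of
# scalars per row.
by_row = select(df, AsTable([:a, :b]) => ByRow(nt -> nt.a + nt.b) => :total)
```

All three calls should produce the same `:total` column; the choice between them is mostly about whether the function is easier to write against names or against positions.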

## [The design of handling of columns of a `DataFrame`](@id man-columnhandling)

When a `DataFrame` is constructed columns are copied by default. You can disable
65 changes: 63 additions & 2 deletions docs/src/man/getting_started.md
@@ -634,7 +634,10 @@ julia> df
```

`transform` and `transform!` functions work identically to `select` and `select!` with the only difference that
they retain all columns that are present in the source data frame, for example:
they retain all columns that are present in the source data frame. Here are some more advanced examples.

First we show how to generate a column that is a sum of all other columns in the data frame
using the `All()` selector:

```jldoctest dataframe
julia> df = DataFrame(x1=[1, 2], x2=[3, 4], y=[5, 6])
@@ -653,8 +656,66 @@ julia> transform(df, All() => +)
│ 1 │ 1 │ 3 │ 5 │ 9 │
│ 2 │ 2 │ 4 │ 6 │ 12 │
```
With this approach, we can easily compute for each row the name of the column with the highest score:
```
julia> using Random

julia> Random.seed!(1);

julia> df = DataFrame(rand(10, 3), [:a, :b, :c])
10×3 DataFrame
│ Row │ a │ b │ c │
│ │ Float64 │ Float64 │ Float64 │
├─────┼────────────┼───────────┼───────────┤
│ 1 │ 0.236033 │ 0.555751 │ 0.0769509 │
│ 2 │ 0.346517 │ 0.437108 │ 0.640396 │
│ 3 │ 0.312707 │ 0.424718 │ 0.873544 │
│ 4 │ 0.00790928 │ 0.773223 │ 0.278582 │
│ 5 │ 0.488613 │ 0.28119 │ 0.751313 │
│ 6 │ 0.210968 │ 0.209472 │ 0.644883 │
│ 7 │ 0.951916 │ 0.251379 │ 0.0778264 │
│ 8 │ 0.999905 │ 0.0203749 │ 0.848185 │
│ 9 │ 0.251662 │ 0.287702 │ 0.0856352 │
│ 10 │ 0.986666 │ 0.859512 │ 0.553206 │

julia> transform(df, AsTable(:) => ByRow(argmax) => :prediction)
10×4 DataFrame
│ Row │ a │ b │ c │ prediction │
│ │ Float64 │ Float64 │ Float64 │ Symbol │
├─────┼────────────┼───────────┼───────────┼────────────┤
│ 1 │ 0.236033 │ 0.555751 │ 0.0769509 │ b │
│ 2 │ 0.346517 │ 0.437108 │ 0.640396 │ c │
│ 3 │ 0.312707 │ 0.424718 │ 0.873544 │ c │
│ 4 │ 0.00790928 │ 0.773223 │ 0.278582 │ b │
│ 5 │ 0.488613 │ 0.28119 │ 0.751313 │ c │
│ 6 │ 0.210968 │ 0.209472 │ 0.644883 │ c │
│ 7 │ 0.951916 │ 0.251379 │ 0.0778264 │ a │
│ 8 │ 0.999905 │ 0.0203749 │ 0.848185 │ a │
│ 9 │ 0.251662 │ 0.287702 │ 0.0856352 │ b │
│ 10 │ 0.986666 │ 0.859512 │ 0.553206 │ a │
```
In the following, most complex example we compute the row-wise sum, number of elements, and mean,
while ignoring missing values.
```
julia> using Statistics

julia> df = DataFrame(x=[1, 2, missing], y=[1, missing, missing]);

julia> transform(df, AsTable(:) .=>
ByRow.([sum∘skipmissing,
x -> count(!ismissing, x),
mean∘skipmissing]) .=>
[:sum, :n, :mean])
3×5 DataFrame
│ Row │ x │ y │ sum │ n │ mean │
│ │ Int64? │ Int64? │ Int64 │ Int64 │ Float64 │
├─────┼─────────┼─────────┼───────┼───────┼─────────┤
│ 1 │ 1 │ 1 │ 2 │ 2 │ 1.0 │
│ 2 │ 2 │ missing │ 2 │ 1 │ 2.0 │
│ 3 │ missing │ missing │ 0 │ 0 │ NaN │
```

While the DataFrames package provides basic data manipulation capabilities,
While the DataFrames.jl package provides basic data manipulation capabilities,
users are encouraged to use querying frameworks for more convenient and powerful operations:
- the [Query.jl](https://github.com/davidanthoff/Query.jl) package provides a
[LINQ](https://msdn.microsoft.com/en-us/library/bb397926.aspx)-like interface to a large number of data sources
93 changes: 48 additions & 45 deletions docs/src/man/split_apply_combine.md
@@ -16,10 +16,11 @@ each subset of the `DataFrame`. This specification can be of the following forms
2. a `cols => function` pair indicating that `function` should be called with
positional arguments holding columns `cols`, which can be any valid column selector
3. a `cols => function => target_col` form additionally
specifying the name of the target column (this assumes that `function` returns a single value or a vector)
specifying the name of the target column (this assumes that `function` returns a single
value or a vector)
4. a `col => target_col` pair, which renames the column `col` to `target_col`
5. a `nrow` or `nrow => target_col` form which efficiently computes the number of rows in a group
(without `target_col` the new column is called `:nrow`)
5. a `nrow` or `nrow => target_col` form which efficiently computes the number of rows
in a group (without `target_col` the new column is called `:nrow`)
6. several arguments of the forms given above, or vectors thereof
7. a function which will be called with a `SubDataFrame` corresponding to each group;
this form should be avoided due to its poor performance unless a very large
@@ -28,6 +29,10 @@ each subset of the `DataFrame`. This specification can be of the following forms

All forms except 1 and 6 can also be passed as the first argument to `map`.

As a special rule that applies to `cols => function` syntax, if `cols` is wrapped
in an `AsTable` object then a `NamedTuple` containing columns selected by `cols` is
passed to `function`.
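
A minimal sketch of this rule on grouped data follows. It uses `combine(groupby(...))` rather than the `by` function shown in this document, on the assumption that a DataFrames.jl release supporting that call with `AsTable` is installed; the data frame, the column names, and the target column `:ratio` are hypothetical.

```julia
using DataFrames

# Hypothetical toy data with a grouping column :g.
df = DataFrame(g = [1, 1, 2, 2], a = [1, 3, 2, 6], b = [2, 2, 1, 1])

# AsTable wraps the column selector, so for each group the function receives
# one NamedTuple whose fields :a and :b hold that group's column vectors.
res = combine(groupby(df, :g),
              AsTable([:a, :b]) => (nt -> sum(nt.a) / sum(nt.b)) => :ratio)
# res.ratio == [1.0, 4.0]
```

Accessing columns by name inside the function can be easier to read than positional arguments when several columns are involved.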

In all of these cases, `function` can return either a single row or multiple rows.
`function` can always generate a single column by returning a single value or a vector.
Additionally, if `by` is passed exactly one `function` and `target_col` is not specified,
@@ -60,31 +65,27 @@ We show several examples of the `by` function applied to the `iris` dataset belo
```jldoctest sac
julia> using DataFrames, CSV, Statistics

julia> iris = DataFrame(CSV.File(joinpath(dirname(pathof(DataFrames)), "../docs/src/assets/iris.csv")));

julia> first(iris, 6)
6×6 DataFrame
│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │ id │
│ │ Float64 │ Float64 │ Float64 │ Float64 │ String │ Int64 │
├─────┼─────────────┼────────────┼─────────────┼────────────┼─────────────┼───────┤
│ 1 │ 5.1 │ 3.5 │ 1.4 │ 0.2 │ Iris-setosa │ 1 │
│ 2 │ 4.9 │ 3.0 │ 1.4 │ 0.2 │ Iris-setosa │ 2 │
│ 3 │ 4.7 │ 3.2 │ 1.3 │ 0.2 │ Iris-setosa │ 3 │
│ 4 │ 4.6 │ 3.1 │ 1.5 │ 0.2 │ Iris-setosa │ 4 │
│ 5 │ 5.0 │ 3.6 │ 1.4 │ 0.2 │ Iris-setosa │ 5 │
│ 6 │ 5.4 │ 3.9 │ 1.7 │ 0.4 │ Iris-setosa │ 6 │

julia> last(iris, 6)
6×6 DataFrame
│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │ id │
│ │ Float64 │ Float64 │ Float64 │ Float64 │ String │ Int64 │
├─────┼─────────────┼────────────┼─────────────┼────────────┼────────────────┼───────┤
│ 1 │ 6.7 │ 3.3 │ 5.7 │ 2.5 │ Iris-virginica │ 145 │
│ 2 │ 6.7 │ 3.0 │ 5.2 │ 2.3 │ Iris-virginica │ 146 │
│ 3 │ 6.3 │ 2.5 │ 5.0 │ 1.9 │ Iris-virginica │ 147 │
│ 4 │ 6.5 │ 3.0 │ 5.2 │ 2.0 │ Iris-virginica │ 148 │
│ 5 │ 6.2 │ 3.4 │ 5.4 │ 2.3 │ Iris-virginica │ 149 │
│ 6 │ 5.9 │ 3.0 │ 5.1 │ 1.8 │ Iris-virginica │ 150 │
julia> iris = DataFrame(CSV.File(joinpath(dirname(pathof(DataFrames)), "../docs/src/assets/iris.csv")))
150×5 DataFrame
│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
│ │ Float64 │ Float64 │ Float64 │ Float64 │ String │
├─────┼─────────────┼────────────┼─────────────┼────────────┼────────────────┤
│ 1 │ 5.1 │ 3.5 │ 1.4 │ 0.2 │ Iris-setosa │
│ 2 │ 4.9 │ 3.0 │ 1.4 │ 0.2 │ Iris-setosa │
│ 3 │ 4.7 │ 3.2 │ 1.3 │ 0.2 │ Iris-setosa │
│ 4 │ 4.6 │ 3.1 │ 1.5 │ 0.2 │ Iris-setosa │
│ 5 │ 5.0 │ 3.6 │ 1.4 │ 0.2 │ Iris-setosa │
│ 6 │ 5.4 │ 3.9 │ 1.7 │ 0.4 │ Iris-setosa │
│ 7 │ 4.6 │ 3.4 │ 1.4 │ 0.3 │ Iris-setosa │
│ 143 │ 5.8 │ 2.7 │ 5.1 │ 1.9 │ Iris-virginica │
│ 144 │ 6.8 │ 3.2 │ 5.9 │ 2.3 │ Iris-virginica │
│ 145 │ 6.7 │ 3.3 │ 5.7 │ 2.5 │ Iris-virginica │
│ 146 │ 6.7 │ 3.0 │ 5.2 │ 2.3 │ Iris-virginica │
│ 147 │ 6.3 │ 2.5 │ 5.0 │ 1.9 │ Iris-virginica │
│ 148 │ 6.5 │ 3.0 │ 5.2 │ 2.0 │ Iris-virginica │
│ 149 │ 6.2 │ 3.4 │ 5.4 │ 2.3 │ Iris-virginica │
│ 150 │ 5.9 │ 3.0 │ 5.1 │ 1.8 │ Iris-virginica │

julia> by(iris, :Species, :PetalLength => mean)
3×2 DataFrame
@@ -124,23 +125,25 @@ julia> by(iris, :Species,
│ 2 │ Iris-versicolor │ 0.717655 │ 213.0 │
│ 3 │ Iris-virginica │ 0.842744 │ 277.6 │

julia> by(iris, :Species, 1:2, 1:2 .=> mean, nrow)
150×6 DataFrame
│ Row │ Species │ SepalLength │ SepalWidth │ SepalLength_mean │ SepalWidth_mean │ nrow │
│ │ String │ Float64 │ Float64 │ Float64 │ Float64 │ Int64 │
├─────┼────────────────┼─────────────┼────────────┼──────────────────┼─────────────────┼───────┤
│ 1 │ Iris-setosa │ 5.1 │ 3.5 │ 5.006 │ 3.418 │ 50 │
│ 2 │ Iris-setosa │ 4.9 │ 3.0 │ 5.006 │ 3.418 │ 50 │
│ 3 │ Iris-setosa │ 4.7 │ 3.2 │ 5.006 │ 3.418 │ 50 │
│ 4 │ Iris-setosa │ 4.6 │ 3.1 │ 5.006 │ 3.418 │ 50 │
│ 5 │ Iris-setosa │ 5.0 │ 3.6 │ 5.006 │ 3.418 │ 50 │
│ 145 │ Iris-virginica │ 6.7 │ 3.3 │ 6.588 │ 2.974 │ 50 │
│ 146 │ Iris-virginica │ 6.7 │ 3.0 │ 6.588 │ 2.974 │ 50 │
│ 147 │ Iris-virginica │ 6.3 │ 2.5 │ 6.588 │ 2.974 │ 50 │
│ 148 │ Iris-virginica │ 6.5 │ 3.0 │ 6.588 │ 2.974 │ 50 │
│ 149 │ Iris-virginica │ 6.2 │ 3.4 │ 6.588 │ 2.974 │ 50 │
│ 150 │ Iris-virginica │ 5.9 │ 3.0 │ 6.588 │ 2.974 │ 50 │
julia> by(iris, :Species,
AsTable([:PetalLength, :SepalLength]) =>
x -> std(x.PetalLength) / std(x.SepalLength)) # passing a NamedTuple
3×2 DataFrame
│ Row │ Species │ PetalLength_SepalLength_function │
│ │ String │ Float64 │
├─────┼─────────────────┼──────────────────────────────────┤
│ 1 │ Iris-setosa │ 0.492245 │
│ 2 │ Iris-versicolor │ 0.910378 │
│ 3 │ Iris-virginica │ 0.867923 │

julia> by(iris, :Species, 1:2 => cor, nrow)
3×3 DataFrame
│ Row │ Species │ SepalLength_SepalWidth_cor │ nrow │
│ │ String │ Float64 │ Int64 │
├─────┼─────────────────┼────────────────────────────┼───────┤
│ 1 │ Iris-setosa │ 0.74678 │ 50 │
│ 2 │ Iris-versicolor │ 0.525911 │ 50 │
│ 3 │ Iris-virginica │ 0.457228 │ 50 │

```

1 change: 1 addition & 0 deletions src/DataFrames.jl
@@ -16,6 +16,7 @@ import DataAPI,

export AbstractDataFrame,
All,
AsTable,
Between,
ByRow,
DataFrame,
63 changes: 58 additions & 5 deletions src/abstractdataframe/abstractdataframe.jl
@@ -887,7 +887,8 @@ returns `true`.
If `cols` is not specified then the function is passed `DataFrameRow`s.
If `cols` is specified then it should be a valid column selector
(column duplicates are allowed if a vector of `Int` or `Symbol` is passed),
the function is passed elements of the selected columns as separate positional arguments.
the function is passed elements of the selected columns as separate positional arguments,
unless it is an `AsTable` selector, in which case a `NamedTuple` of these arguments is passed.

Passing `cols` leads to a more efficient execution of the operation for large data frames.

@@ -929,6 +930,15 @@ julia> filter([:x, :y] => (x, y) -> x == 1 || y == "b", df)
│ 1 │ 3 │ b │
│ 2 │ 1 │ c │
│ 3 │ 1 │ b │

julia> filter(AsTable(:) => nt -> nt.x == 1 || nt.y == "b", df)
3×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 1 │ c │
│ 3 │ 1 │ b │
```
"""
Base.filter(f, df::AbstractDataFrame) = _filter_helper(df, f, eachrow(df))
@@ -941,7 +951,23 @@ Base.filter((cols, f)::Pair{<:AbstractVector{Symbol}}, df::AbstractDataFrame) =
Base.filter((cols, f)::Pair, df::AbstractDataFrame) =
filter(index(df)[cols] => f, df)

_filter_helper(df, f, cols...) = df[((x...) -> f(x...)::Bool).(cols...), :]
function _filter_helper(df::AbstractDataFrame, f, cols...)
if length(cols) == 0
throw(ArgumentError("At least one column must be passed to filter on"))
end
return df[((x...) -> f(x...)::Bool).(cols...), :]
end

function Base.filter((cols, f)::Pair{<:AsTable}, df::AbstractDataFrame)
dff = select(df, cols.cols, copycols=false)
if ncol(dff) == 0
throw(ArgumentError("At least one column must be passed to filter on"))
end
return _filter_helper_astable(df, Tables.namedtupleiterator(dff), f)
end

_filter_helper_astable(df::AbstractDataFrame, nti::Tables.NamedTupleIterator, f) =
df[(x -> f(x)::Bool).(nti), :]

"""
filter!(function, df::AbstractDataFrame)
@@ -951,7 +977,8 @@ Remove rows from data frame `df` for which `function` returns `false`.
If `cols` is not specified then the function is passed `DataFrameRow`s.
If `cols` is specified then it should be a valid column selector
(column duplicates are allowed if a vector of `Int` or `Symbol` is passed),
the function is passed elements of the selected columns as separate positional arguments.
the function is passed elements of the selected columns as separate positional arguments,
unless it is an `AsTable` selector, in which case `NamedTuple`s of these arguments are passed.

Passing `cols` leads to a more efficient execution of the operation for large data frames.

@@ -1000,6 +1027,17 @@ julia> df
│ 1 │ 3 │ b │
│ 2 │ 1 │ c │
│ 3 │ 1 │ b │

julia> df = DataFrame(x = [3, 1, 2, 1], y = ["b", "c", "a", "b"]);

julia> filter!(AsTable(:) => nt -> nt.x == 1 || nt.y == "b", df)
3×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ String │
├─────┼───────┼────────┤
│ 1 │ 3 │ b │
│ 2 │ 1 │ c │
│ 3 │ 1 │ b │
```
"""
Base.filter!(f, df::AbstractDataFrame) = _filter!_helper(df, f, eachrow(df))
@@ -1012,8 +1050,23 @@ Base.filter!((cols, f)::Pair{<:AbstractVector{Symbol}}, df::AbstractDataFrame) =
Base.filter!((cols, f)::Pair, df::AbstractDataFrame) =
filter!(index(df)[cols] => f, df)

_filter!_helper(df, f, cols...) =
deleterows!(df, findall(((x...) -> !(f(x...)::Bool)).(cols...)))
function _filter!_helper(df::AbstractDataFrame, f, cols...)
if length(cols) == 0
throw(ArgumentError("At least one column must be passed to filter on"))
end
return deleterows!(df, findall(((x...) -> !(f(x...)::Bool)).(cols...)))
end

function Base.filter!((cols, f)::Pair{<:AsTable}, df::AbstractDataFrame)
dff = select(df, cols.cols, copycols=false)
if ncol(dff) == 0
throw(ArgumentError("At least one column must be passed to filter on"))
end
return _filter!_helper_astable(df, Tables.namedtupleiterator(dff), f)
end

_filter!_helper_astable(df::AbstractDataFrame, nti::Tables.NamedTupleIterator, f) =
deleterows!(df, findall((x -> !(f(x)::Bool)).(nti)))

function Base.convert(::Type{Matrix}, df::AbstractDataFrame)
T = reduce(promote_type, (eltype(v) for v in eachcol(df)))