add filter and subset to documentation #2900

Merged
merged 4 commits into from
Oct 10, 2021
86 changes: 46 additions & 40 deletions docs/src/man/working_with_dataframes.md
@@ -427,45 +427,17 @@ a function object that tests whether each value belongs to the subset
More details on copies, views, and references can be found
in the [`getindex` and `view`](@ref) section.

An alternative approach to row subsetting in a data frame is to use
[`filter`](@ref), [`filter!`](@ref), [`subset`](@ref), or [`subset!`](@ref)
functions (the functions with names ending with `!` are in-place variants).

The [`filter`](@ref) function can be applied using two alternative syntaxes.
The first one assumes that a predicate function taking a `DataFrameRow` is
passed as a first positional argument to it:

```jldoctest dataframe
julia> filter(row -> 5 < row.A < 10, df)
2×3 DataFrame
 Row │ A      B      C
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     7      1      4
   2 │     9      1      5
```

The second syntax assumes that the user passes a `Pair` of column name and
predicate function taking a single value as an argument:

```jldoctest dataframe
julia> filter(:A => a -> 5 < a < 10, df)
2×3 DataFrame
 Row │ A      B      C
     │ Int64  Int64  Int64
─────┼─────────────────────
   1 │     7      1      4
   2 │     9      1      5
```
### Subsetting functions

The latter syntax is faster because the performed operation is type stable,
so it is preferred for large data frames.
An alternative approach to row subsetting in a data frame is to use
the [`subset`](@ref) function, or the [`subset!`](@ref) function,
which is its in-place variant.

The [`subset`](@ref) function differs from the [`filter`](@ref) function in that
it takes a data frame as its first argument, operates on whole columns, and
allows passing more than one condition at a time. Each condition should be passed
as a `Pair` consisting of a source column and a function specifying the filtering
condition:
The [`subset`](@ref) function takes a data frame as its first argument. The
one or more positional arguments that follow are filtering conditions
that must all be met. Each condition should be passed as a
`Pair` consisting of the source column(s) and a function that specifies the
filtering condition and takes these column(s) as arguments:

```jldoctest dataframe
julia> subset(df, :A => a -> a .< 10, :C => c -> isodd.(c))
@@ -478,9 +450,43 @@ julia> subset(df, :A => a -> a .< 10, :C => c -> isodd.(c))
   3 │     9      1      5
```
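The "source column(s)" wording above can be illustrated with a condition that reads two columns at once. This is only a minimal sketch: the `people` data frame and its `height` and `limit` columns are hypothetical and are not the `df` used in the doctests, and `ByRow` is shown as one possible way to write the same condition row by row.

```julia
using DataFrames

# Hypothetical data, introduced only for this illustration.
people = DataFrame(height=[150, 180, 165, 172], limit=[160, 175, 165, 180])

# One condition over two source columns: the function receives both columns
# as vectors and must return a vector of Bool values.
taller = subset(people, [:height, :limit] => (h, l) -> h .> l)

# The same condition written with `ByRow`, so the function sees one row's
# values at a time and no broadcasting is needed.
taller_byrow = subset(people, [:height, :limit] => ByRow((h, l) -> h > l))

# Two separate conditions; a row is kept only if both are met.
both = subset(people, :height => h -> h .> 160, :limit => l -> l .>= 175)
```

Whether to broadcast by hand or wrap the function in `ByRow` is mostly a matter of taste; both forms produce the same selection of rows here.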

Please check the documentation strings of the [`filter`](@ref),
[`filter!`](@ref), [`subset`](@ref), and [`subset!`](@ref) functions to learn
about all options that these functions provide.
When filtering, `missing` values are often present in the filtered columns,
in which case a condition can evaluate to `missing` instead of the expected
`true` or `false`. To handle this situation, one can either use the `coalesce`
function or pass the `skipmissing=true` keyword argument to `subset`.
Here is an example:

```jldoctest dataframe
julia> df = DataFrame(x=[1, 2, missing, 4])
4×1 DataFrame
 Row │ x
     │ Int64?
─────┼─────────
   1 │       1
   2 │       2
   3 │ missing
   4 │       4

julia> subset(df, :x => x -> coalesce.(iseven.(x), false))
2×1 DataFrame
 Row │ x
     │ Int64?
─────┼────────
   1 │      2
   2 │      4

julia> subset(df, :x => x -> iseven.(x), skipmissing=true)
2×1 DataFrame
 Row │ x
     │ Int64?
─────┼────────
   1 │      2
   2 │      4
```
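To make the motivation for the two options more concrete, the sketch below shows what is expected to happen when neither is used, and how the default passed to `coalesce` decides whether rows with a `missing` condition are dropped or kept. The name `df_missing` is introduced here only for illustration, and the assumption that `subset` raises an error on a `missing` condition should be checked against the installed DataFrames.jl version.

```julia
using DataFrames

# Same shape of data as in the example above, under a separate name.
df_missing = DataFrame(x=[1, 2, missing, 4])

# Without `coalesce` or `skipmissing=true` the condition vector contains
# `missing`; `subset` is assumed to reject it by throwing an error.
raised = try
    subset(df_missing, :x => x -> iseven.(x))
    false
catch
    true
end

# `coalesce.(cond, true)` treats a `missing` condition as `true`,
# so the row with the missing value is kept rather than dropped.
kept = subset(df_missing, :x => x -> coalesce.(iseven.(x), true))
```

Using `false` as the default in `coalesce` drops such rows, which matches what `skipmissing=true` does in the example above.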

Additionally, DataFrames.jl extends the [`filter`](@ref) and [`filter!`](@ref)
functions provided in Julia Base, which can also be used to subset a data frame.
Please refer to their documentation for details.

It is worth mentioning that [`subset`](@ref) was designed in a way that is
consistent with how column transformations are specified in functions like