Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simple uniqueness checks for sorting-related functions #3312

Merged
merged 31 commits into from
Jun 19, 2023
Merged
Show file tree
Hide file tree
Changes from 28 commits
Commits
Show all changes
31 commits
Select commit Hold shift + click to select a range
376b569
Basic support for checkunique kwarg in sorting-related functions
alonsoC1s Apr 7, 2023
b9fc81d
Added tests for the checkunique kwarg
alonsoC1s Apr 8, 2023
c35b338
Fixing Tests + Changing Error to ArgumentError
alonsoC1s Apr 8, 2023
9de0421
Accounting for the case with order(...) clauses
alonsoC1s Apr 9, 2023
01761ee
Improved tests for selectors with checkunique
alonsoC1s Apr 11, 2023
add0ad6
Fixed broken uniqueness testing
alonsoC1s Apr 11, 2023
e310687
Merge branch 'JuliaData:main' into simple_uniqueness_sorting
alonsoC1s Apr 12, 2023
7de6e70
Added missing tests for sort
alonsoC1s Apr 12, 2023
262db54
Fixed Docstrings + Replace == with ===
alonsoC1s Apr 13, 2023
d433cf1
Apply suggestions from code review
alonsoC1s Apr 19, 2023
f7d2ad9
Fixed sorting with order kwarg
alonsoC1s Apr 21, 2023
bdff445
Fixed complex sorting order detection
alonsoC1s Apr 21, 2023
d68dda9
Improved tests for complex order clauses
alonsoC1s Apr 22, 2023
651eab6
Merge branch 'main' into simple_uniqueness_sorting
alonsoC1s Apr 22, 2023
f49023d
Fixed logic error in is_complex + added tests
alonsoC1s Apr 24, 2023
73a31bb
Apply suggestions from code review
alonsoC1s Apr 28, 2023
5ef1e4d
More robust complexity checks for Order.Orderings + tests
alonsoC1s May 18, 2023
e5bd016
Update src/abstractdataframe/sort.jl
alonsoC1s May 18, 2023
bac221b
Merge branch 'main' into simple_uniqueness_sorting
alonsoC1s May 18, 2023
b672135
Added error message for unsupported Order types
alonsoC1s May 18, 2023
8746370
Added complexity checks and tests for other subtypes of Order
alonsoC1s May 21, 2023
1576c07
Merge branch 'main' into simple_uniqueness_sorting
alonsoC1s Jun 1, 2023
2a71db8
Finalizing checks for multiple Ordering subtypes + tests
alonsoC1s Jun 6, 2023
ce79d07
Update src/abstractdataframe/sort.jl
bkamins Jun 12, 2023
536926a
Update src/abstractdataframe/sort.jl
alonsoC1s Jun 13, 2023
cf59d15
Merge branch 'main' into simple_uniqueness_sorting
alonsoC1s Jun 13, 2023
d333474
Stylistic changes
alonsoC1s Jun 13, 2023
ac25bb5
Merge branch 'main' into simple_uniqueness_sorting
alonsoC1s Jun 16, 2023
e002b7b
Update src/abstractdataframe/sort.jl
bkamins Jun 16, 2023
56c228e
Removed explanation from sort docstring
alonsoC1s Jun 16, 2023
929cacd
Apply suggestions from code review
bkamins Jun 18, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,10 @@
treated as if they were wrapped in `Cols` and does not throw an error
when a vector of duplicate indices is passed when doing column selection
([#3302](https://github.com/JuliaData/DataFrames.jl/pull/3302))
* Added the kwarg `checkunique` to sorting related functions (`issorted`,
`sort`, `sort!` and `sortperm`) that throws an error when duplicate elements
make multiple sort orders valid
([#3312](https://github.com/JuliaData/DataFrames.jl/pull/3312))
* `reduce` performing `vcat` on a collection of data frames
now accepts `init` keyword argument
([#3310](https://github.com/JuliaData/DataFrames.jl/pull/3310))
Expand All @@ -28,6 +32,7 @@
from a data frame is its column or might alias with some of its columns
([#3304](https://github.com/JuliaData/DataFrames.jl/pull/3304))


# DataFrames.jl v1.5 Release Notes

## New functionalities
Expand Down
154 changes: 140 additions & 14 deletions src/abstractdataframe/sort.jl
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,9 @@
# which allows a user to specify column specific orderings
# with "order(column, rev=true, ...)"

import SortingAlgorithms.DataStructures.FasterForward,
SortingAlgorithms.DataStructures.FasterReverse

struct UserColOrdering{T<:ColumnIndex}
col::T
kwargs
Expand Down Expand Up @@ -341,23 +344,33 @@ If `rev` is `true`, reverse sorting is performed. To enable reverse sorting only
for some columns, pass `order(c, rev=true)` in `cols`, with `c` the
corresponding column index (see example below).

Since having repeated elements makes multiple sorting orders valid, the
`checkunique` keyword allows for the situation to be caught. If `checkunique` is
`true` and duplicate elements are found an error will be thrown. The
use of the `checkunique` keyword is only supported when neither the `by` nor
`lt` keywords are being used, as their application can create duplicate items
inadvertently. Similarly, the use of `order(...)` clauses that specify either
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the reasoning behind "as their application can create duplicate items inadvertently". Doesn't that make that argument precisely useful to protect against inadvertent introduction of duplicate elements? We don't have to support this right away, but no need to provide misleading reasons for that. ;-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the comments!
My goal was to explain that the disallowed combination is not unsupported because we ran out of time, but rather than the process of sorting after the use of the by and lt functions is a bit unintuitive, as the actual sorting happens on numbers that the user never sees. This would make it harder(er) to reason about why a particular dataframe and an exact copy look different even after being sorted the same way (which I think isn't supposed to happen because sorting is stable, but was something brought up on the issue that motivates this PR)

In other words, the message tries to say "we assume you are concerned about duplicates, and it isn't always obvious why the use of by and lt can create "invisible duplicates" at the time of sorting, and that is why we still don't support it at the moment"

I don't know if I convey that properly in the current docstring, I'd love to hear your take

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just propose to remove the justification part. It is not strictly needed in the docstring (it needs to describe precisely what is being supported). It is just not supported now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good idea. I'll remove them and rework the docstrings a bit

bkamins marked this conversation as resolved.
Show resolved Hide resolved
`by` or `lt` are not supported, but specifying `rev` by itself is allowed.

The `by` keyword allows providing a function that will be applied to each
cell before comparison; the `lt` keyword allows providing a custom "less
than" function. If both `by` and `lt` are specified, the `lt` function is
applied to the result of the `by` function.

All the keyword arguments can be either a single value, which is applied to
all columns, or a vector of length equal to the number of columns that the
operation is performed on. In such a case each entry is used for the
column in the corresponding position in `cols`.
Keyword arguments specifying sorting order (`rev`, `lt` or `by`) can either be
a single value, or a vector of length equal to the number of columns the
operation is performed on. When a single value is passed, it applies to all
columns. When a vector is passed, each entry applies to the column in the
corresponding position in `cols`.
"""

"""
issorted(df::AbstractDataFrame, cols=All();
lt::Union{Function, AbstractVector{<:Function}}=isless,
by::Union{Function, AbstractVector{<:Function}}=identity,
rev::Union{Bool, AbstractVector{Bool}}=false,
order::Union{Ordering, AbstractVector{<:Ordering}}=Forward)
order::Union{Ordering, AbstractVector{<:Ordering}}=Forward,
alonsoC1s marked this conversation as resolved.
Show resolved Hide resolved
checkunique::Bool=false)

Test whether data frame `df` sorted by column(s) `cols`. Checking against
multiple columns is done lexicographically.
Expand Down Expand Up @@ -397,14 +410,16 @@ function Base.issorted(df::AbstractDataFrame, cols=All();
lt::Union{Function, AbstractVector{<:Function}}=isless,
by::Union{Function, AbstractVector{<:Function}}=identity,
rev::Union{Bool, AbstractVector{Bool}}=false,
order::Union{Ordering, AbstractVector{<:Ordering}}=Forward)
order::Union{Ordering, AbstractVector{<:Ordering}}=Forward,
checkunique::Bool=false)
to_scalar(x::AbstractVector) = only(x)
to_scalar(x::Any) = x

# exclude AbstractVector as in that case cols can contain order(...) clauses
if cols isa MultiColumnIndex && !(cols isa AbstractVector)
cols = index(df)[cols]
end
checkunique && _perform_uniqueness_checks(df, cols, lt, by, order)
if cols isa ColumnIndex
return issorted(df[!, cols], lt=to_scalar(lt), by=to_scalar(by),
rev=to_scalar(rev), order=to_scalar(order))
Expand All @@ -427,7 +442,8 @@ Base.issorted(::AbstractDataFrame, ::Base.Order.Ordering) =
by::Union{Function, AbstractVector{<:Function}}=identity,
rev::Union{Bool, AbstractVector{Bool}}=false,
order::Union{Ordering, AbstractVector{<:Ordering}}=Forward,
view::Bool=false)
view::Bool=false,
checkunique::Bool=false)

Return a data frame containing the rows in `df` sorted by column(s) `cols`.
Sorting on multiple columns is done lexicographically.
Expand Down Expand Up @@ -506,8 +522,10 @@ julia> sort(df, [:x, order(:y, rev=true)])
by::Union{Function, AbstractVector{<:Function}}=identity,
rev::Union{Bool, AbstractVector{Bool}}=false,
order::Union{Ordering, AbstractVector{<:Ordering}}=Forward,
view::Bool=false)
rowidxs = sortperm(df, cols, alg=alg, lt=lt, by=by, rev=rev, order=order)
view::Bool=false,
checkunique::Bool=false)
rowidxs = sortperm(df, cols, alg=alg, lt=lt, by=by, rev=rev, order=order,
checkunique=checkunique)
return view ? Base.view(df, rowidxs, :) : df[rowidxs, :]
end

Expand All @@ -517,7 +535,8 @@ end
lt::Union{Function, AbstractVector{<:Function}}=isless,
by::Union{Function, AbstractVector{<:Function}}=identity,
rev::Union{Bool, AbstractVector{Bool}}=false,
order::Union{Ordering, AbstractVector{<:Ordering}}=Forward)
order::Union{Ordering, AbstractVector{<:Ordering}}=Forward,
checkunique::Bool=false)

Return a permutation vector of row indices of data frame `df` that puts them in
sorted order according to column(s) `cols`.
Expand Down Expand Up @@ -579,13 +598,15 @@ function Base.sortperm(df::AbstractDataFrame, cols=All();
lt::Union{Function, AbstractVector{<:Function}}=isless,
by::Union{Function, AbstractVector{<:Function}}=identity,
rev::Union{Bool, AbstractVector{Bool}}=false,
order::Union{Ordering, AbstractVector{<:Ordering}}=Forward)
order::Union{Ordering, AbstractVector{<:Ordering}}=Forward,
checkunique::Bool=false)
# exclude AbstractVector as in that case cols can contain order(...) clauses
if cols isa MultiColumnIndex && !(cols isa AbstractVector)
cols = index(df)[cols]
end
ord = ordering(df, cols, lt, by, rev, order)
_alg = Sort.defalg(df, ord; alg=alg, cols=cols)
checkunique && _perform_uniqueness_checks(df, cols, lt, by, order)
return _sortperm(df, _alg, ord)
end

Expand All @@ -601,7 +622,8 @@ _sortperm(df::AbstractDataFrame, a::Algorithm, o::Ordering) =
lt::Union{Function, AbstractVector{<:Function}}=isless,
by::Union{Function, AbstractVector{<:Function}}=identity,
rev::Union{Bool, AbstractVector{Bool}}=false,
order::Union{Ordering, AbstractVector{<:Ordering}}=Forward)
order::Union{Ordering, AbstractVector{<:Ordering}}=Forward,
checkunique::Bool=false)

Sort data frame `df` by column(s) `cols` by permuting its rows in-place.
Sorting on multiple columns is done lexicographicallly.
Expand Down Expand Up @@ -682,16 +704,120 @@ function Base.sort!(df::AbstractDataFrame, cols=All();
lt::Union{Function, AbstractVector{<:Function}}=isless,
by::Union{Function, AbstractVector{<:Function}}=identity,
rev::Union{Bool, AbstractVector{Bool}}=false,
order::Union{Ordering, AbstractVector{<:Ordering}}=Forward)
order::Union{Ordering, AbstractVector{<:Ordering}}=Forward,
checkunique::Bool=false)

# exclude AbstractVector as in that case cols can contain order(...) clauses
if cols isa MultiColumnIndex && !(cols isa AbstractVector)
cols = index(df)[cols]
end
ord = ordering(df, cols, lt, by, rev, order)
_alg = Sort.defalg(df, ord; alg=alg, cols=cols)
checkunique && _perform_uniqueness_checks(df, cols, lt, by, order)
alonsoC1s marked this conversation as resolved.
Show resolved Hide resolved
return sort!(df, _alg, ord)
end

Base.sort!(df::AbstractDataFrame, a::Base.Sort.Algorithm, o::Base.Sort.Ordering) =
# this sort! method does not support uniqueness checks since they can't be carried out
# without knowledge of which columns are to be sorted.
function Base.sort!(df::AbstractDataFrame, a::Base.Sort.Algorithm,
nalimilan marked this conversation as resolved.
Show resolved Hide resolved
o::Base.Sort.Ordering)
permute!(df, _sortperm(df, a, o))
end

# Functions to verify whether an Ordering has user-defined `by` or `lt` functions
# The complexity checks exploit `Base.Order`'s way of constructing `Order` objects.
# `DirectOrdering` is by definition an order that uses `isless` and `identity` as its
# `lt` and `by` functions, so they are not complex. The `By` and `Lt` types have
# attributes storing the defined `by` and `lt` functions respectively, simple identity
# checks are enough. `ReverseOrdering`s wrap another type of ordering, so we
# perform the check on the wrapped type.
is_complex(o::Union{DirectOrdering, FasterForward, FasterReverse}) = false
is_complex(o::By) = o.by !== identity
is_complex(o::Lt) = o.lt !== isless
is_complex(o::ReverseOrdering) = is_complex(o.fwd)
bkamins marked this conversation as resolved.
Show resolved Hide resolved
is_complex(o::Perm) = is_complex(o.order)

function is_complex(o::DFPerm)
if o.ord isa Ordering
return is_complex(o.ord)
elseif o.ord isa Tuple
return any(is_complex(ordering) for ordering = o.ord)
bkamins marked this conversation as resolved.
Show resolved Hide resolved
end
end
bkamins marked this conversation as resolved.
Show resolved Hide resolved

function is_complex(o::Ordering)
throw(ArgumentError("The use of the keyword `checkunique` is currently " *
"not supported with Ordering type $(typeof(o))"))
end

function is_complex(o::UserColOrdering)
has_lt = haskey(o.kwargs, :lt)
has_by = haskey(o.kwargs, :by)
if !has_lt && !has_by
return false
elseif has_lt && !has_by
return o.kwargs[:lt] !== isless
elseif has_by && !has_lt
return o.kwargs[:by] !== identity
else
@assert has_lt && has_by
return o.kwargs[:by] !== identity || o.kwargs[:lt] !== isless
end
end

# Internal function that aids in uniqueness checks
# Converts column selectors to indices and checks necessary conditions for uniqueness
function _perform_uniqueness_checks(df::AbstractDataFrame, cols,
lt::Union{Function, AbstractVector{<:Function}},
by::Union{Function, AbstractVector{<:Function}},
order::Union{Ordering, AbstractVector{<:Ordering}})

if !(lt === isless && by === identity)
throw(ArgumentError("Passing either lt or by along with checkunique=" *
"true is not supported."))
end

# Validating the order argument
if order isa Ordering
order = [order]
end
for o in order
if is_complex(o)
throw(ArgumentError("Using either lt or by functions through the " *
"order keyword argument simultaneously with " *
"checkunique=true is not supported."))
end
end

# Easiest case, cols contains column indexes already
if cols isa AbstractVector{<:ColumnIndex}
by_or_lt_set = false
col_idxs = cols
# Second easiest, multicol index (no vector with orders clauses mixed in)
elseif (cols isa MultiColumnIndex && !(cols isa AbstractVector)) || cols isa ColumnIndex
by_or_lt_set = false
col_idxs = index(df)[cols]
elseif cols isa UserColOrdering
by_or_lt_set = is_complex(cols)
col_idxs = [index(df)[_getcol(cols)]]
# Mix of ColOrdering and other ColumnSelectors
else
@assert cols isa AbstractVector
newcols = Int[]
by_or_lt_set = false
for col in cols
if col isa UserColOrdering
by_or_lt_set = is_complex(col) || by_or_lt_set
end

alonsoC1s marked this conversation as resolved.
Show resolved Hide resolved
push!(newcols, index(df)[_getcol(col)])
end
col_idxs = newcols
end
if by_or_lt_set
throw(ArgumentError("Order clauses with either by or lt set in combination " *
"with checkunique=true are not supported"))
end
allunique(df, col_idxs) ||
throw(ArgumentError("Non-unique elements found. Multiple orders are valid."))
end
Loading