Add sorting options to groupby #3253

bkamins · 2022-12-19T13:33:03Z

Fixes #3251

docs/src/man/split_apply_combine.md

src/groupeddataframe/groupeddataframe.jl

test/grouping.jl

Co-authored-by: Milan Bouchet-Valat <[email protected]>

… into bk/groupby

src/groupeddataframe/groupeddataframe.jl

Co-authored-by: Milan Bouchet-Valat <[email protected]>

test/grouping.jl

Co-authored-by: Milan Bouchet-Valat <[email protected]>

bkamins · 2022-12-27T13:20:35Z

Thank you!

jariji · 2022-12-27T19:11:50Z

test/grouping.jl

+@testset "sorting API" begin
+    # simple tests
+    df = DataFrame(x=["b", "c", "b", "a", "c"])
+    @test getindex.(keys(groupby(df, :x)), 1) == ["b", "c", "a"]


I think these can use only instead of getindex(_, 1).

Indeed it could. I initially implemented this test on groups rather than on keys, and getindex works on both.
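
For readers following along, here is a minimal sketch of the two accessors on group keys. It assumes a recent DataFrames.jl and that GroupKey is iterable (which only relies on); the variable names are illustrative, and sort=false is passed so that groups follow their order of appearance.

```julia
using DataFrames

df = DataFrame(x=["b", "c", "b", "a", "c"])

# One GroupKey per group; sort=false keeps groups in order of appearance.
ks = keys(groupby(df, :x, sort=false))

# Value of the first grouping column of each key.
first_values = getindex.(ks, 1)

# Same values, but `only` throws if a key has more than one column.
only_values = only.(ks)
```

The practical difference is that only also asserts there is exactly one grouping column, while getindex(_, 1) silently takes the first of several.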

jariji · 2022-12-27T19:18:49Z

test/grouping.jl

+    @test getindex.(keys(groupby(df, :x, sort=true)), 1) == [1, 2, 100]
+    @test getindex.(keys(groupby(df, :x, sort=NamedTuple())), 1) == [1, 2, 100]
+    @test getindex.(keys(groupby(df, :x, sort=false)), 1) == [2, 100, 1]
+    @test getindex.(keys(groupby(df, order(:x))), 1) == [1, 2, 100]


Having so many equivalent ways to specify sorting does seem a bit much? Not sure if it's worth doing anything about.

The issue is that sort and related functions already provide this many ways to specify sort order, so we cannot do anything about it here.
The rationale behind it is:

Normally people will use the "global" settings like (rev=true,), which apply to all columns.

However, there are cases when you want to specify the sorting order per column, e.g. [order(:x, rev=true), :y], where you reverse :x but sort on :y in ascending order. Therefore a "per column" order is needed.

In general, this complexity is needed when one has several grouping columns.
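
A small sketch of the difference between the "global" and "per column" settings, assuming DataFrames.jl 1.5 or later (where this PR landed); the data and variable names are made up for illustration.

```julia
using DataFrames

df = DataFrame(x=[2, 1, 2, 1, 3], y=["b", "a", "a", "b", "a"])

# Global setting: rev=true applies to all grouping columns.
gdf_global = groupby(df, [:x, :y], sort=(rev=true,))

# Per-column setting: reverse :x, keep :y ascending.
gdf_percol = groupby(df, [order(:x, rev=true), :y])

# Group order, read off the group keys as tuples.
keys_global = Tuple.(keys(gdf_global))
keys_percol = Tuple.(keys(gdf_percol))
```

With a single grouping column the two forms give the same order, which is why the variants in the tests above look redundant.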

bkamins · 2022-12-27T21:29:48Z

@jariji - if you are willing, improving the https://dataframes.juliadata.org/stable/man/sorting/ section of the manual would be welcome. I planned to do it at some point, but maybe you would be willing to give it a shot and provide more in-depth coverage of all sorting options (this PR just inherits the complexity we allow there).

jariji · 2023-02-05T20:50:43Z

Is there an advantage to sorting during the groupby versus sorting the groups afterwards?

bkamins · 2023-02-05T21:00:55Z

There is no convenient way to sort the groups afterwards AFAICT. To get a desired order you would need to sort the data frame before grouping it, which is not always desirable.

jariji · 2023-02-05T21:08:21Z

Sorting before: I'm not sure what you mean it's not desired. Simply sort by the group columns before grouping.
Sorting after: Introduce sortgroups(df). This is more flexible since it doesn't require regrouping in order to resort.

bkamins · 2023-02-05T21:16:54Z

I'm not sure what you mean it's not desired. Simply sort by the group columns before grouping.

It is expensive in terms of both time and memory.

Introduce sortgroups(df). This is more flexible since it doesn't require regrouping in order to resort.

Sorting while grouping will be faster and more convenient if the user wants the groups sorted (and this is most likely the typical use case, where the user knows upfront how the groups should be sorted).

sortgroups(df) could indeed be added as a separate function. The question is how often re-sorting of groups is actually needed (i.e. I would avoid adding a function that would hardly ever be used; the option to specify sorting order in groupby was added because users needed it).
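
To make the two approaches concrete, here is a hedged sketch comparing sorting before grouping with sorting while grouping. It assumes DataFrames.jl 1.5 or later; the data and variable names are illustrative, and no timing claims are made here.

```julia
using DataFrames

df = DataFrame(x=[2, 100, 1, 2, 1], v=1:5)

# Sorting before: sort(df, :x) allocates a sorted copy of the whole data frame,
# then the groups are taken in order of appearance.
gdf_before = groupby(sort(df, :x), :x, sort=false)

# Sorting while grouping: only the group order is sorted; df itself is untouched.
gdf_during = groupby(df, :x, sort=true)

keys_before = only.(keys(gdf_before))
keys_during = only.(keys(gdf_during))

# The grouped data frame still wraps the original, unsorted df.
same_parent = parent(gdf_during) === df
```

Both approaches yield the same group order, but only the second keeps the original data frame as the parent of the result.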

bkamins added 2 commits December 19, 2022 14:31

add sorting order to groupby

870c0e6

add tests

806f878

bkamins requested a review from nalimilan December 19, 2022 13:33

bkamins added the feature label Dec 19, 2022

bkamins added this to the 1.5 milestone Dec 19, 2022

bkamins added 4 commits December 19, 2022 14:34

fix typo

4bee9dc

fix wrong logic

dfe9460

make a correct choice when using nt

669e439

minor fixes

35e7f04

nalimilan reviewed Dec 23, 2022

bkamins and others added 4 commits December 25, 2022 09:46

Apply suggestions from code review

3b6ad89

Co-authored-by: Milan Bouchet-Valat <[email protected]>

changes after code review

aee91ec

Merge branch 'bk/groupby' of https://github.com/JuliaData/DataFrames.jl…

ae9fc23

… into bk/groupby

Merge branch 'main' into bk/groupby

ecf1d1b

nalimilan approved these changes Dec 25, 2022

bkamins and others added 4 commits December 25, 2022 12:26

Update src/groupeddataframe/groupeddataframe.jl

c8759e7

Co-authored-by: Milan Bouchet-Valat <[email protected]>

automatically set sort=NamedTuple() when order is passed

ca766d9

update manual

7bfaea0

fix test

4948ae8

nalimilan approved these changes Dec 27, 2022

bkamins and others added 2 commits December 27, 2022 12:32

Update test/grouping.jl

cc9b8e8

Co-authored-by: Milan Bouchet-Valat <[email protected]>

improve docstring

7e72fc6

bkamins merged commit f7b4769 into main Dec 27, 2022

bkamins deleted the bk/groupby branch December 27, 2022 13:20

jariji reviewed Dec 27, 2022
