Feedback on tidyup 006: Ordering of `dplyr::group_by()` #21

DavisVaughan · 2021-09-01T19:53:19Z

We’d love to get your thoughts on https://github.com/tidyverse/tidyups/blob/main/006-dplyr-group-by-ordering.md, a proposal to swap out the internal algorithm used by group_by() with one that is often more performant.

The main potential issue with this adjustment is that we would be switching to using the C locale when ordering character grouping columns, where previously we used the system locale (through order()). As a reminder, group_by() internally orders the group keys, so that the next summarize() returns sorted group keys along with the computed summary columns. A result of the change proposed in this tidyup is that the order of those group keys would be different than before, since they would no longer respect the system locale.

Please feel free to contribute however you feel comfortable — you're welcome to make small fixes via PR, comment here, or open bigger discussion topics in an new issue. If there are things you’d prefer to discuss in private, please feel free to email me. I’ll plan to close the discussion on September 15, so we can review and make adjustments as needed.

@markfairbanks, @mgirlich, @eutwt, @dgrtwo, @mine-cetinkaya-rundel

The text was updated successfully, but these errors were encountered:

mgirlich · 2021-09-13T08:25:31Z

Some thoughts:

I very much look forward to the speed up!
Arranging after grouping is more similar to SQL, so many people might already be familiar with this behaviour.
Whenever you need to arrange in descending order you need to explicitly call arrange() anyway after group_by() + summarise(). This taught me to use arrange() at the end.
Quite often arrange() is only used for better readability. In my opinion it then also makes sense to arrange() as one of the last steps in the pipeline anyway.

Overall, I think this makes dplyr much better with minor drawbacks. Thanks a lot for the effort!

mine-cetinkaya-rundel · 2021-09-13T19:12:30Z

Overall, I think this seems like a great improvement, with minor drawbacks as @mgirlich mentioned.

My two reactions are as follows:

To me the ordering in the following example from the tidyup is unexpected, but it's something I'm sure I'd get used to. I do think that if we had all LETTERS as groups, it would be even more awkward to have a come after Z.
```
library(dplyr)

df <- tibble(g = c("a", "A", "B", "b", "A", "b"))
gdf <- group_by(df, g)

# With CRAN dplyr
summarise(gdf, n = n())
#> # A tibble: 4 × 2
#>   g         n
#>   <chr> <int>
#> 1 a         1
#> 2 A         2
#> 3 b         2
#> 4 B         1

# With the changes in this tidyup
summarise(gdf, n = n())
#> # A tibble: 4 × 2
#>   g         n
#>   <chr> <int>
#> 1 A         2
#> 2 B         1
#> 3 a         1
#> 4 b         2
```
That being said, I don't think I've written code that relies on the order returned by summarise() so this change is visually unexpected, but wouldn't break existing code for me. I also haven't seen many teaching materials where the pipeline relies on the order returned by summarise(), so I agree that backwards compatibility risk may be mitigated, for the most part, with the strategies outlined.
I agree that documentation for summarise() and group_by() will need to be updated to mention this behaviour, as stated in the tidyup.

Documentation for both summarise() and group_by() would be updated to mention the new behavior with character vectors, and would encourage the usage of arrange() after summarise() if the returned ordering is important.

I think it will be important to do this in a way that won't confuse those who are reading the documentation for the first time (i.e., people who are just learning dplyr). So maybe it goes under Details or something like that, instead of being featured prominently.

DavisVaughan · 2021-09-13T19:38:11Z

@mine-cetinkaya-rundel yea I'm definitely thinking it will either go under Details or get it's own sub-section header like "Row Ordering", depending on how much we'd like to say about it. I agree that we wouldn't focus too much on it to avoid potentially confusing first time users, and I probably would not even include an example in the Examples section of the help page.

DavisVaughan · 2021-09-15T21:06:31Z

Thanks all for your feedback! We will move forward with the proposed changes, and I will finalize the tidyup shortly.

DavisVaughan closed this as completed Sep 15, 2021

DavisVaughan mentioned this issue Sep 16, 2021

Finalize and accept tidyup 006 #22

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feedback on tidyup 006: Ordering of `dplyr::group_by()` #21

Feedback on tidyup 006: Ordering of `dplyr::group_by()` #21

DavisVaughan commented Sep 1, 2021

mgirlich commented Sep 13, 2021

mine-cetinkaya-rundel commented Sep 13, 2021

DavisVaughan commented Sep 13, 2021

DavisVaughan commented Sep 15, 2021

Feedback on tidyup 006: Ordering of dplyr::group_by() #21

Feedback on tidyup 006: Ordering of dplyr::group_by() #21

Comments

DavisVaughan commented Sep 1, 2021

mgirlich commented Sep 13, 2021

mine-cetinkaya-rundel commented Sep 13, 2021

DavisVaughan commented Sep 13, 2021

DavisVaughan commented Sep 15, 2021

Feedback on tidyup 006: Ordering of `dplyr::group_by()` #21

Feedback on tidyup 006: Ordering of `dplyr::group_by()` #21