Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feedback on tidyup 006: Ordering of dplyr::group_by() #21

Closed
DavisVaughan opened this issue Sep 1, 2021 · 4 comments · Fixed by #22
Closed

Feedback on tidyup 006: Ordering of dplyr::group_by() #21

DavisVaughan opened this issue Sep 1, 2021 · 4 comments · Fixed by #22

Comments

@DavisVaughan
Copy link
Member

We’d love to get your thoughts on https://github.com/tidyverse/tidyups/blob/main/006-dplyr-group-by-ordering.md, a proposal to swap out the internal algorithm used by group_by() with one that is often more performant.

The main potential issue with this adjustment is that we would be switching to using the C locale when ordering character grouping columns, where previously we used the system locale (through order()). As a reminder, group_by() internally orders the group keys, so that the next summarize() returns sorted group keys along with the computed summary columns. A result of the change proposed in this tidyup is that the order of those group keys would be different than before, since they would no longer respect the system locale.

Please feel free to contribute however you feel comfortable — you're welcome to make small fixes via PR, comment here, or open bigger discussion topics in an new issue. If there are things you’d prefer to discuss in private, please feel free to email me. I’ll plan to close the discussion on September 15, so we can review and make adjustments as needed.

@markfairbanks, @mgirlich, @eutwt, @dgrtwo, @mine-cetinkaya-rundel

@mgirlich
Copy link

Some thoughts:

  • I very much look forward to the speed up!
  • Arranging after grouping is more similar to SQL, so many people might already be familiar with this behaviour.
  • Whenever you need to arrange in descending order you need to explicitly call arrange() anyway after group_by() + summarise(). This taught me to use arrange() at the end.
  • Quite often arrange() is only used for better readability. In my opinion it then also makes sense to arrange() as one of the last steps in the pipeline anyway.

Overall, I think this makes dplyr much better with minor drawbacks. Thanks a lot for the effort!

@mine-cetinkaya-rundel
Copy link
Member

Overall, I think this seems like a great improvement, with minor drawbacks as @mgirlich mentioned.

My two reactions are as follows:

  1. To me the ordering in the following example from the tidyup is unexpected, but it's something I'm sure I'd get used to. I do think that if we had all LETTERS as groups, it would be even more awkward to have a come after Z.

    library(dplyr)
    
    df <- tibble(g = c("a", "A", "B", "b", "A", "b"))
    gdf <- group_by(df, g)
    
    # With CRAN dplyr
    summarise(gdf, n = n())
    #> # A tibble: 4 × 2
    #>   g         n
    #>   <chr> <int>
    #> 1 a         1
    #> 2 A         2
    #> 3 b         2
    #> 4 B         1
    
    # With the changes in this tidyup
    summarise(gdf, n = n())
    #> # A tibble: 4 × 2
    #>   g         n
    #>   <chr> <int>
    #> 1 A         2
    #> 2 B         1
    #> 3 a         1
    #> 4 b         2

    That being said, I don't think I've written code that relies on the order returned by summarise() so this change is visually unexpected, but wouldn't break existing code for me. I also haven't seen many teaching materials where the pipeline relies on the order returned by summarise(), so I agree that backwards compatibility risk may be mitigated, for the most part, with the strategies outlined.

  2. I agree that documentation for summarise() and group_by() will need to be updated to mention this behaviour, as stated in the tidyup.

    Documentation for both summarise() and group_by() would be updated to mention the new behavior with character vectors, and would encourage the usage of arrange() after summarise() if the returned ordering is important.

    I think it will be important to do this in a way that won't confuse those who are reading the documentation for the first time (i.e., people who are just learning dplyr). So maybe it goes under Details or something like that, instead of being featured prominently.

@DavisVaughan
Copy link
Member Author

@mine-cetinkaya-rundel yea I'm definitely thinking it will either go under Details or get it's own sub-section header like "Row Ordering", depending on how much we'd like to say about it. I agree that we wouldn't focus too much on it to avoid potentially confusing first time users, and I probably would not even include an example in the Examples section of the help page.

@DavisVaughan
Copy link
Member Author

Thanks all for your feedback! We will move forward with the proposed changes, and I will finalize the tidyup shortly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants