Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a .locale argument to arrange() and use radix ordering #5868

Closed

Conversation

DavisVaughan
Copy link
Member

@DavisVaughan DavisVaughan commented Apr 28, 2021

Closes #4962
Closes #5090
Part of #5808

Requires this vctrs PR r-lib/vctrs#1375

The above vctrs PR switches vec_order() to always use radix ordering. This propagates through to dplyr in 3 places:

  • arrange()
  • group_by()
  • with_order()

This PR tackles arrange() by providing a new .locale argument with 3 possibilities:

  • NULL, the default, for C locale
  • A single string, like "en_US", to use a specific locale, backed by and requiring stringi
  • A formula/function to generate the sort key (i.e. stringi::stri_sort_key(), or tolower())

This PR is currently focused on the implementation, I've left TODOs for the news bullet and for documentation of locale handling.

The snapshot tests are a little noisy. Dev vctrs requires dev rlang, which tweaked how deparsing with ~ works, changing a few snapshot tests here.

Build errors seem to be from the fact that rlang now exports ellipsis functions, causing some warnings


This would be a meaningful breaking change, as the system locale is no longer being respected. However, there are multiple benefits from this change:

  • Sorting character vectors in the C locale is much much faster
  • Even generating a sort key and then sorting in the C locale is faster than the current approach
  • This removes dependence on an environment variable (LC_COLLATE), which we generally advocate in favor of explicitly specifying arguments. In other words, this gets us better consistency across platforms.
  • This increases consistency with data.table, who always uses the C locale
  • This increases consistency with stringi (and you could make an argument for stringr too)

For English, the biggest change is that uppercase letters now sort before any lowercase letters. Previously it used a natural ordering of c("a", "A", "b", "B").

For other languages, sorting directly in the C locale is often not very meaningful, so they would have to supply a locale identifier.

@DavisVaughan DavisVaughan requested review from hadley and lionel- April 28, 2021 18:46
Copy link
Member

@hadley hadley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — we'll just need to leave this sit until we figure out a unified strategy.

R/arrange.R Outdated Show resolved Hide resolved
tests/testthat/_snaps/across.md Show resolved Hide resolved
tests/testthat/test-arrange.r Outdated Show resolved Hide resolved
R/arrange.R Outdated Show resolved Hide resolved
DavisVaughan and others added 2 commits April 29, 2021 08:22
Co-authored-by: Lionel Henry <[email protected]>
So now we see the actual code, along with the error output and class
R/arrange.R Outdated Show resolved Hide resolved
This constrains the UI to start with, instead encouraging users to apply transformations in the `arrange()` call itself
Falls back to C with a warning otherwise. Explicitly specifying `"C"` is now also an option.
@DavisVaughan
Copy link
Member Author

Updated so that:

  • .locale = NULL defaults to "en" if stringi is available, otherwise it falls back to the C locale with a warning
  • .locale = "C" is now a valid option

Even though these are superseded, this should help ease the transition a little, since without this argument it would be difficult to choose a different locale, and if stringi wasn't installed then you'd unconditionally get a warning you couldn't silence
@DavisVaughan
Copy link
Member Author

Updated so that:

  • We have a new exported dplyr_locale() helper. It defaults to "en" if stringi is installed, otherwise it falls back to "C" with a warning. Users can set options(tidyverse.locale_collation) if they want to globally override the default locale.
  • .locale in arrange() now defaults to dplyr_locale()
  • .locale no longer allows NULL
  • The scoped variants of arrange() have gained .locale to ease the transition a little

@DavisVaughan
Copy link
Member Author

Closing in favor of the cleaner #5942

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Performance drop-off for arrange()
3 participants