Performance of group_by on glue column #5423

dkutner · 2020-07-19T03:12:31Z

Grouping by a glue column is roughly 100 times slower than grouping by the equivalent character column (a median of 2.24s versus 19.3ms for 10,000 rows in the benchmark below). I'm using dplyr 1.0.0, rlang 0.4.7, tibble 3.0.3, and vctrs 0.3.2.

frame <- tibble::tibble(glue_col = glue::glue("{1:10000}"),
                        character_col = as.character(glue_col))
bench::mark(grouped_frame = dplyr::group_by(frame, glue_col))
#> # A tibble: 1 x 13
#>   expression      min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#>   <bch:expr>    <bch> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
#> 1 grouped_frame 2.24s  2.24s     0.446    4.45MB     24.5     1    55      2.24s
#> Warning message:
#> Some expressions had a GC in every iteration; so filtering is disabled.
bench::mark(grouped_frame = dplyr::group_by(frame, character_col))
#> # A tibble: 1 x 13
#>   expression       min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#>   <bch:expr>    <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#> 1 grouped_frame 18.7ms 19.3ms      51.8    1.32MB     4.32    24     2

Created on 2020-07-18 by the reprex package (v0.3.0)

Running the same code with vctrs 0.3.1 does not show the slowdown:

frame <- tibble::tibble(glue_col = glue::glue("{1:10000}"),
                        character_col = as.character(glue_col))
bench::mark(grouped_frame = dplyr::group_by(frame, glue_col))
#> # A tibble: 1 x 13
#>   expression       min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#>   <bch:expr>    <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#> 1 grouped_frame 18.3ms   19ms      51.9    4.01MB     9.44    22     4
bench::mark(grouped_frame = dplyr::group_by(frame, character_col))
#> # A tibble: 1 x 13
#>   expression       min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#>   <bch:expr>    <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#> 1 grouped_frame 16.9ms 17.3ms      56.9    1.32MB     12.4    23     5

Created on 2020-07-18 by the reprex package (v0.3.0)
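Until a fixed vctrs release is installed, one possible workaround is to drop the glue class before grouping, since the character column in the benchmarks above groups at normal speed. This is a minimal sketch that was not proposed in the thread; `plain_frame` and `grouped` are illustrative names, and it assumes dplyr, tibble, and glue are installed.

```r
library(dplyr)

frame <- tibble::tibble(glue_col = glue::glue("{1:10000}"))

# Convert the glue column to a plain character vector so that grouping
# takes the same path as `character_col` in the benchmarks above.
plain_frame <- mutate(frame, glue_col = as.character(glue_col))
grouped <- group_by(plain_frame, glue_col)

# The values are unchanged; only the "glue" class has been removed.
class(plain_frame$glue_col)
n_groups(grouped)
```

The trade-off is that the column no longer carries the glue class, which only matters if later code relies on glue-specific methods.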

lionel- · 2020-07-23T07:20:43Z

Fixed in vctrs. I'll send a new release to CRAN next week.

(pkgsrc changes)
- Add TEST_DEPENDS+, but still one missing is there

(upstream changes)

# vctrs 0.3.8

* Compatibility with next version of rlang.

# vctrs 0.3.7

* `vec_ptype_abbr()` gains arguments to control whether to indicate named vectors with a prefix (`prefix_named`) and indicate shaped vectors with a suffix (`suffix_shape`) (#781, @krlmlr).
* `vec_ptype()` is now an optional _performance_ generic. It is not necessary to implement, but if your class has a static prototype, you might consider implementing a custom `vec_ptype()` method that returns a constant to improve performance in some cases (such as common type imputation).
* New `vec_detect_complete()`, inspired by `stats::complete.cases()`. For most vectors, this is identical to `!vec_equal_na()`. For data frames and matrices, this detects rows that only contain non-missing values.
* `vec_order()` can now order complex vectors (#1330).
* Removed dependency on digest in favor of `rlang::hash()`.
* Fixed an issue where `vctrs_rcrd` objects were not being proxied correctly when used as a data frame column (#1318).
* `register_s3()` is now licensed with the "unlicense" which makes it very clear that it's fine to copy and paste into your own package (@maxheld83, #1254).

# vctrs 0.3.6

* Fixed an issue with tibble 3.0.0 where removing column names with `names(x) <- NULL` is now deprecated (#1298).
* Fixed a GCC 11 issue revealed by CRAN checks.

# vctrs 0.3.5

* New experimental `vec_fill_missing()` for filling in missing values with the previous or following value. It is similar to `tidyr::fill()`, but also works with data frames and has an additional `max_fill` argument to limit the number of sequential missing values to fill.
* New `vec_unrep()` to compress a vector with repeated values. It is very similar to run length encoding, and works nicely alongside `vec_rep_each()` as a way to invert the compression.
* `vec_cbind()` with only empty data frames now preserves the common size of the inputs in the result (#1281).
* `vec_c()` now correctly returns a named result with named empty inputs (#1263).
* vctrs has been relicensed as MIT (#1259).
* Functions that make comparisons within a single vector, such as `vec_unique()`, or between two vectors, such as `vec_match()`, now convert all character input to UTF-8 before making comparisons (#1246).
* New `vec_identify_runs()` which returns a vector of identifiers for the elements of `x` that indicate which run of repeated values they fall in (#1081).
* Fixed an encoding translation bug with lists containing data frames which have columns where `vec_size()` is different from the low level `Rf_length()` (#1233).

# vctrs 0.3.4

* Fixed a GCC sanitiser error revealed by CRAN checks.

# vctrs 0.3.3

* The `table` class is now implemented as a wrapper type that delegates its coercion methods. It used to be restricted to integer tables (#1190).
* Named one-dimensional arrays now behave consistently with simple vectors in `vec_names()` and `vec_rbind()`.
* `new_rcrd()` now uses `df_list()` to validate the fields. This makes it more flexible as the fields can now be of any type supported by vctrs, including data frames.
* Thanks to the previous change the `[[` method of records now preserves list fields (#1205).
* `vec_data()` now preserves data frames. This is consistent with the notion that data frames are a primitive vector type in vctrs. This shouldn't affect code that uses `[[` and `length()` to manipulate the data. On the other hand, the vctrs primitives like `vec_slice()` will now operate rowwise when `vec_data()` returns a data frame.
* `outer` is now passed unrecycled to name specifications. Instead, the return value is recycled (#1099).
* Name specifications can now return `NULL`. The names vector will only be allocated if the spec function returns non-`NULL` during the concatenation. This makes it possible to ignore outer names without having to create an empty names vector when there are no inner names:

  ```
  zap_outer_spec <- function(outer, inner) if (is_character(inner)) inner

  # `NULL` names rather than a vector of ""
  names(vec_c(a = 1:2, .name_spec = zap_outer_spec))
  #> NULL

  # Names are allocated when inner names exist
  names(vec_c(a = 1:2, c(b = 3L), .name_spec = zap_outer_spec))
  #> [1] "" "" "b"
  ```

* Fixed several performance issues in `vec_c()` and `vec_unchop()` with named vectors.
* The restriction that S3 lists must have a list-based proxy to be considered lists by `vec_is_list()` has been removed (#1208).
* New performant `data_frame()` constructor for creating data frames in a way that follows tidyverse semantics. Among other things, inputs are recycled using tidyverse recycling rules, strings are never converted to factors, list-columns are easier to create, and unnamed data frame input is automatically spliced.
* New `df_list()` for safely and consistently constructing the data structure underlying a data frame, a named list of equal-length vectors. It is useful in combination with `new_data_frame()` for creating user-friendly constructors for data frame subclasses that use the tidyverse rules for recycling and determining types.
* Fixed performance issue with `vec_order()` on classed vectors which affected `dplyr::group_by()` (tidyverse/dplyr#5423).
* `vec_set_names()` no longer alters the input in-place (#1194).
* New `vec_proxy_order()` that provides an ordering proxy for use in `vec_order()` and `vec_sort()`. The default method falls through to `vec_proxy_compare()`. Lists are special cased, and return an integer vector proxy that orders by first appearance.
* List columns in data frames are no longer comparable through `vec_compare()`.
* The experimental `relax` argument has been removed from `vec_proxy_compare()`.
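The changelog quoted above lists the `vec_order()` fix under vctrs 0.3.3, so a quick way to tell whether an installation should include it is to compare version numbers. The following is a small sketch; `has_vec_order_fix()` is a hypothetical helper name, not part of vctrs, and the 0.3.3 threshold is taken from that changelog.

```r
# Hypothetical helper: TRUE when `version` is at least 0.3.3, the release
# under which the changelog above lists the vec_order() fix.
has_vec_order_fix <- function(version) {
  package_version(version) >= package_version("0.3.3")
}

has_vec_order_fix("0.3.2")  # version used in the original report
has_vec_order_fix("0.3.3")

# Check the locally installed vctrs, if there is one.
if (requireNamespace("vctrs", quietly = TRUE)) {
  has_vec_order_fix(as.character(utils::packageVersion("vctrs")))
}
```

Note that the report found vctrs 0.3.1 unaffected, so a `FALSE` result means only that the fix is absent, not that the slowdown is necessarily present.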

lionel- self-assigned this Jul 19, 2020

lionel- added a commit to lionel-/vctrs that referenced this issue Jul 19, 2020

Fix performance issue in vec_order() with classed vectors

2ed3820

Closes tidyverse/dplyr#5423 Introduced in r-lib#1142

lionel- mentioned this issue Jul 19, 2020

Fix performance issue in vec_order() with classed vectors r-lib/vctrs#1197

Merged

lionel- added a commit to lionel-/vctrs that referenced this issue Jul 23, 2020

Fix performance issue in vec_order() with classed vectors

2bf7c91

Closes tidyverse/dplyr#5423 Introduced in r-lib#1142

lionel- closed this as completed in r-lib/vctrs#1197 Jul 23, 2020

lionel- added a commit to r-lib/vctrs that referenced this issue Jul 23, 2020

Fix performance issue in vec_order() with classed vectors (#1197)

8855bdd

Closes tidyverse/dplyr#5423 Introduced in #1142
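Because the commits above locate the problem in `vec_order()` on classed vectors, the ordering step can be timed on its own, without dplyr. This is a rough sketch that assumes vctrs and glue are installed and that calling `vctrs::vec_order()` directly reproduces the slowdown, which the thread does not confirm; the variable names are illustrative.

```r
# Assumes the vctrs and glue packages are installed.
x_glue <- glue::glue("{1:10000}")   # classed character vector
x_chr  <- as.character(x_glue)      # plain character vector

# Time the ordering step directly, without going through dplyr.
time_glue <- system.time(ord_glue <- vctrs::vec_order(x_glue))
time_chr  <- system.time(ord_chr  <- vctrs::vec_order(x_chr))

# Compare elapsed seconds for the classed and the plain input.
c(glue = time_glue[["elapsed"]], character = time_chr[["elapsed"]])
```

On an affected installation the glue timing would be expected to be much larger than the character timing; `bench::mark()` as used in the report gives more reliable numbers than a single `system.time()` call.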
