Performance of group_by on glue column #5423

Closed
dkutner opened this issue Jul 19, 2020 · 1 comment · Fixed by r-lib/vctrs#1197

dkutner commented Jul 19, 2020

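Grouping by a glue column is much slower than grouping by a character column holding the same values: on a 10,000-row tibble, `group_by()` takes about 2.24 s on the glue column versus about 19 ms on the character column. I'm using dplyr 1.0.0, rlang 0.4.7, tibble 3.0.3, and vctrs 0.3.2.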

```
frame <- tibble::tibble(glue_col = glue::glue("{1:10000}"),
                        character_col = as.character(glue_col))
bench::mark(grouped_frame = dplyr::group_by(frame, glue_col))
#> # A tibble: 1 x 13
#>   expression      min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#>   <bch:expr>    <bch> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
#> 1 grouped_frame 2.24s  2.24s     0.446    4.45MB     24.5     1    55      2.24s
#> Warning message:
#> Some expressions had a GC in every iteration; so filtering is disabled.
bench::mark(grouped_frame = dplyr::group_by(frame, character_col))
#> # A tibble: 1 x 13
#>   expression       min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#>   <bch:expr>    <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#> 1 grouped_frame 18.7ms 19.3ms      51.8    1.32MB     4.32    24     2
```

Created on 2020-07-18 by the reprex package (v0.3.0)

Running the same code with vctrs 0.3.1 does not show the slowdown:

```
frame <- tibble::tibble(glue_col = glue::glue("{1:10000}"),
                        character_col = as.character(glue_col))
bench::mark(grouped_frame = dplyr::group_by(frame, glue_col))
#> # A tibble: 1 x 13
#>   expression       min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#>   <bch:expr>    <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#> 1 grouped_frame 18.3ms   19ms      51.9    4.01MB     9.44    22     4
bench::mark(grouped_frame = dplyr::group_by(frame, character_col))
#> # A tibble: 1 x 13
#>   expression       min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#>   <bch:expr>    <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#> 1 grouped_frame 16.9ms 17.3ms      56.9    1.32MB     12.4    23     5
```

Created on 2020-07-18 by the reprex package (v0.3.0)
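
Until a fixed vctrs is installed, one possible workaround is to strip the glue class before grouping, since the benchmarks above show that the plain character column groups quickly. This is only a sketch and was not proposed in this thread; the names `frame_chr` and `grouped` are illustrative, and it assumes nothing downstream needs the column to stay a glue object.

```r
library(dplyr)

frame <- tibble::tibble(glue_col = glue::glue("{1:10000}"))

# Assumption: downstream code does not rely on the column keeping its
# glue class, so converting it to a plain character vector is acceptable.
frame_chr <- mutate(frame, glue_col = as.character(glue_col))

# Grouping now takes the character-vector path.
grouped <- group_by(frame_chr, glue_col)
n_groups(grouped)
```

The grouping result is the same as before; only the class of the grouping column changes.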

lionel- commented Jul 23, 2020

Fixed in vctrs; I'll send a new release to CRAN next week.
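
To check whether a given installation already has the fix, the vctrs version can be compared against 0.3.3, the release whose changelog entry (quoted further down) references tidyverse/dplyr#5423. A minimal sketch in base R; `has_group_by_fix()` is a hypothetical helper, not part of any package.

```r
# Assumption: 0.3.3 is the first release containing the fix, per the
# vctrs changelog entry that references tidyverse/dplyr#5423.
fixed_in <- package_version("0.3.3")

# Hypothetical helper: TRUE when the given vctrs version includes the fix.
has_group_by_fix <- function(version) {
  as.package_version(version) >= fixed_in
}

has_group_by_fix("0.3.2")
has_group_by_fix("0.3.3")

# Check the installed copy, if there is one.
if (requireNamespace("vctrs", quietly = TRUE)) {
  has_group_by_fix(utils::packageVersion("vctrs"))
}
```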

netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this issue May 30, 2021
(pkgsrc changes)
 - Add TEST_DEPENDS+, but one dependency is still missing

(upstream changes)

# vctrs 0.3.8

* Compatibility with next version of rlang.


# vctrs 0.3.7

* `vec_ptype_abbr()` gains arguments to control whether to indicate
  named vectors with a prefix (`prefix_named`) and indicate shaped
  vectors with a suffix (`suffix_shape`) (#781, @krlmlr).

* `vec_ptype()` is now an optional _performance_ generic. It is not necessary
  to implement, but if your class has a static prototype, you might consider
  implementing a custom `vec_ptype()` method that returns a constant to
  improve performance in some cases (such as common type imputation).

* New `vec_detect_complete()`, inspired by `stats::complete.cases()`. For most
  vectors, this is identical to `!vec_equal_na()`. For data frames and
  matrices, this detects rows that only contain non-missing values.

* `vec_order()` can now order complex vectors (#1330).

* Removed dependency on digest in favor of `rlang::hash()`.

* Fixed an issue where `vctrs_rcrd` objects were not being proxied correctly
  when used as a data frame column (#1318).

* `register_s3()` is now licensed with the "unlicense" which makes it very
  clear that it's fine to copy and paste into your own package
  (@maxheld83, #1254).
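
As an illustration of the `vec_detect_complete()` entry above, the following sketch compares it with base R on a plain vector and on a data frame. The example data is made up, and the comparison against `is.na()` and `stats::complete.cases()` is only meant to show the described behavior, assuming vctrs 0.3.7 or later is installed.

```r
library(vctrs)

# Atomic vector: "complete" simply means "not missing".
x <- c(1, NA, 3)
complete_x <- vec_detect_complete(x)

# Data frame: a row is complete only if every column is non-missing.
df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
complete_rows <- vec_detect_complete(df)

# Both results should agree with their base R counterparts.
identical(complete_x, !is.na(x))
identical(complete_rows, stats::complete.cases(df))
```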

# vctrs 0.3.6

* Fixed an issue with tibble 3.0.0 where removing column names with
  `names(x) <- NULL` is now deprecated (#1298).

* Fixed a GCC 11 issue revealed by CRAN checks.


# vctrs 0.3.5

* New experimental `vec_fill_missing()` for filling in missing values with
  the previous or following value. It is similar to `tidyr::fill()`, but
  also works with data frames and has an additional `max_fill` argument to
  limit the number of sequential missing values to fill.

* New `vec_unrep()` to compress a vector with repeated values. It is very
  similar to run length encoding, and works nicely alongside `vec_rep_each()`
  as a way to invert the compression.

* `vec_cbind()` with only empty data frames now preserves the common size of
  the inputs in the result (#1281).

* `vec_c()` now correctly returns a named result with named empty inputs
  (#1263).

* vctrs has been relicensed as MIT (#1259).

* Functions that make comparisons within a single vector, such as
  `vec_unique()`, or between two vectors, such as `vec_match()`, now
  convert all character input to UTF-8 before making comparisons (#1246).

* New `vec_identify_runs()` which returns a vector of identifiers for the
  elements of `x` that indicate which run of repeated values they fall in
  (#1081).

* Fixed an encoding translation bug with lists containing data frames which
  have columns where `vec_size()` is different from the low level
  `Rf_length()` (#1233).
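
The run-related helpers in this release can be tried together. The sketch below uses made-up input and assumes vctrs 0.3.5 or later; it relies on `vec_unrep()` returning a data frame with `key` and `times` columns.

```r
library(vctrs)

x <- c("a", "a", "b", "c", "c", "c")

# Compress: one row per run, holding the value (`key`) and its length (`times`).
runs <- vec_unrep(x)

# Invert the compression with vec_rep_each().
restored <- vec_rep_each(runs$key, runs$times)
identical(restored, x)

# Label each element with the index of the run it falls in.
run_ids <- vec_identify_runs(x)

# Fill downward, but fill at most one missing value in each gap.
filled <- vec_fill_missing(c(NA, 1, NA, NA, 2), direction = "down", max_fill = 1L)
```

With `max_fill = 1L`, the second of two consecutive missing values stays missing, which is the difference from an unlimited fill.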


# vctrs 0.3.4

* Fixed a GCC sanitiser error revealed by CRAN checks.


# vctrs 0.3.3

* The `table` class is now implemented as a wrapper type that
  delegates its coercion methods. It used to be restricted to integer
  tables (#1190).

* Named one-dimensional arrays now behave consistently with simple
  vectors in `vec_names()` and `vec_rbind()`.

* `new_rcrd()` now uses `df_list()` to validate the fields. This makes
  it more flexible as the fields can now be of any type supported by
  vctrs, including data frames.

* Thanks to the previous change the `[[` method of records now
  preserves list fields (#1205).

* `vec_data()` now preserves data frames. This is consistent with the
  notion that data frames are a primitive vector type in vctrs. This
  shouldn't affect code that uses `[[` and `length()` to manipulate
  the data. On the other hand, the vctrs primitives like `vec_slice()`
  will now operate rowwise when `vec_data()` returns a data frame.

* `outer` is now passed unrecycled to name specifications. Instead,
  the return value is recycled (#1099).

* Name specifications can now return `NULL`. The names vector will
  only be allocated if the spec function returns non-`NULL` during the
  concatenation. This makes it possible to ignore outer names without
  having to create an empty names vector when there are no inner
  names:

  ```
  library(vctrs)

  # `is_character()` comes from rlang, so it is namespaced here
  zap_outer_spec <- function(outer, inner) if (rlang::is_character(inner)) inner

  # `NULL` names rather than a vector of ""
  names(vec_c(a = 1:2, .name_spec = zap_outer_spec))
  #> NULL

  # Names are allocated when inner names exist
  names(vec_c(a = 1:2, c(b = 3L), .name_spec = zap_outer_spec))
  #> [1] ""  ""  "b"
  ```

* Fixed several performance issues in `vec_c()` and `vec_unchop()`
  with named vectors.

* The restriction that S3 lists must have a list-based proxy to be considered
  lists by `vec_is_list()` has been removed (#1208).

* New performant `data_frame()` constructor for creating data frames in a way
  that follows tidyverse semantics. Among other things, inputs are recycled
  using tidyverse recycling rules, strings are never converted to factors,
  list-columns are easier to create, and unnamed data frame input is
  automatically spliced.

* New `df_list()` for safely and consistently constructing the data structure
  underlying a data frame, a named list of equal-length vectors. It is useful
  in combination with `new_data_frame()` for creating user-friendly
  constructors for data frame subclasses that use the tidyverse rules for
  recycling and determining types.

* Fixed performance issue with `vec_order()` on classed vectors which
  affected `dplyr::group_by()` (tidyverse/dplyr#5423).

* `vec_set_names()` no longer alters the input in-place (#1194).

* New `vec_proxy_order()` that provides an ordering proxy for use in
  `vec_order()` and `vec_sort()`. The default method falls through to
  `vec_proxy_compare()`. Lists are special cased, and return an integer
  vector proxy that orders by first appearance.

* List columns in data frames are no longer comparable through `vec_compare()`.

* The experimental `relax` argument has been removed from
  `vec_proxy_compare()`.
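
The combination of `df_list()` and `new_data_frame()` described above might be used along these lines. This is a sketch only: `new_example_frame()` and the class name `"example_frame"` are hypothetical, and vctrs 0.3.3 or later is assumed.

```r
library(vctrs)

# Hypothetical constructor for a data frame subclass.
# "example_frame" is an illustrative class name, not a real class.
new_example_frame <- function(...) {
  # df_list() recycles inputs to a common size and keeps strings as character.
  cols <- df_list(...)
  new_data_frame(cols, class = "example_frame")
}

# The length-1 `label` is recycled to match the three ids.
ef <- new_example_frame(id = 1:3, label = "a")
class(ef)
ef$label
```

Keeping the recycling and type rules in `df_list()` means the constructor itself stays a thin wrapper around `new_data_frame()`.
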