Add helper to get `vec_unique` and `vec_group_id` in one go. #1857
I think you may want something like #1851. So I think it would look like something along these lines (see the sketch below). Not 100% sure as I didn't work an example out, but I think that's right.
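A minimal sketch of what such a combined helper could look like, using `vctrs::vec_group_loc()` to get the unique keys and their locations in a single pass. The function name and body are illustrative guesses, not the actual snippet from #1851:

```r
library(vctrs)

# One pass over `x`: vec_group_loc() returns a data frame with the unique
# keys (in order of first appearance) and, per key, the locations of its
# occurrences. A group id vector falls straight out of that.
vec_unique_and_group_id <- function(x) {
  groups <- vec_group_loc(x)
  id <- integer(vec_size(x))
  for (i in seq_along(groups$loc)) {
    id[groups$loc[[i]]] <- i   # every occurrence of key i gets group id i
  }
  list(unique = groups$key, group_id = id)   # `id` should match vec_group_id(x)
}

vec_unique_and_group_id(c("a", "b", "a", "c", "b"))
# unique:   "a" "b" "c"
# group_id:  1   2   1   3   2
```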
I had assumed that … What you've suggested (…) … I don't know if the added complexity is worth the time savings, though. One case in which I think it might be is when using …
@DavisVaughan What do you think of #1882? I've added an implementation there.
@DavisVaughan After testing the performance of #1882, I realized that the gain isn't sufficiently large compared to using the separate functions. And if performance is the priority, it's better to use …
I ended up implementing this in a separate package, deduped:

```r
# if (!requireNamespace("deduped")) install.packages("deduped")
library(deduped)

N_TOTAL <- 1e4

repeated_paths <- fs::path("base", stringr::str_glue("dir{d}", d = 1:10), "inner") |>
  rep(N_TOTAL / 10) |>
  sample()

bench::mark(
  direct = repeated_paths |> fs::path_dir(),
  indirect = repeated_paths |> deduped(fs::path_dir)(),
  iterations = 10
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 direct       51.8ms   52.5ms      18.9  749.02KB     2.10
#> 2 indirect    206.3µs  213.5µs    4574.     6.13MB     0

all_unique_paths <- fs::path("base", stringr::str_glue("dir{d}", d = 1:N_TOTAL), "inner")

bench::mark(
  direct = all_unique_paths |> fs::path_dir(),
  indirect = all_unique_paths |> deduped(fs::path_dir)(),
  iterations = 10
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 direct       53.6ms   54.6ms      18.3  901.88KB     0
#> 2 indirect     53.6ms   54.9ms      18.2    1.03MB     2.02
```

Created on 2023-10-26 with reprex v2.0.2
I discovered that `fs::path_file` and `fs::path_dir` run very slowly on Windows (see r-lib/fs#424), and since most of my use of these functions comes right after `readr::read_csv(files, .id = "file_path")`, most of the vector is duplicated values. As such, I found that I could save a significant amount of time by deduplicating the vector (2x on Mac, 40x on Windows). This isn't just limited to the `fs::path_*` functions, however.

The most straightforward approach is:
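A minimal sketch of that naive dedup-then-expand pattern (the `dedup_apply` wrapper name is illustrative, not from the original post):

```r
# Naive dedup-then-expand: compute f() once per distinct value of x,
# then expand the results back to full length with match().
dedup_apply <- function(f, x, ...) {
  ux <- unique(x)
  f(ux, ...)[match(x, ux)]
}

paths <- rep(c("base/dir1/inner/a.csv", "base/dir2/inner/b.csv"), 5000)
identical(dedup_apply(fs::path_dir, paths), fs::path_dir(paths))
#> [1] TRUE
```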
However, calculating both `unique(x)` and `match(x, ux)` is duplicated effort, since it could be done in one go by combining the implementations of `vctrs::vec_unique_loc` and `vctrs::vec_duplicate_id`. This makes the deduplicated path faster, but it also means the overhead of running it on an already-unique vector is significantly reduced.

In the particular case below, using this implementation is ~2x faster than the naive implementation above. This is critical for an entirely unique vector, where this approach essentially removes the overhead altogether. When the vector is smaller, the benefit is smaller, but still an improvement.
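As a sketch, such a combined helper can be emulated at the R level today with the two existing vctrs functions, though that still costs two hash passes where a single C-level helper could use one. `vec_unique_and_id` is a hypothetical name, not part of vctrs:

```r
library(vctrs)

vec_unique_and_id <- function(x) {
  loc <- vec_unique_loc(x)         # locations of first occurrences, in order of appearance
  dup_id <- vec_duplicate_id(x)    # for each element, the location of its value's first occurrence
  list(
    unique = vec_slice(x, loc),
    group_id = match(dup_id, loc)  # map first-occurrence locations to 1..n_unique
  )
}

res <- vec_unique_and_id(c("a", "b", "a", "c"))
res$unique    # "a" "b" "c"
res$group_id  # 1 2 1 3

# Dedup-then-expand with the helper, e.g.:
# fs::path_dir(res$unique)[res$group_id]
```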
Created on 2023-07-08 with reprex v2.0.2