Add helper to get `vec_unique` and `vec_group_id` in one go. #1857
I think you may want something like #1851. So I think it would look like something along these lines (see the sketch below). Not 100% sure as I didn't work an example out, but I think that's right.
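A minimal sketch of what such a combined helper could look like, using `vctrs::vec_group_loc()` to get the unique keys and their locations in a single pass. The function name and body are illustrative guesses, not the actual snippet from #1851:

```r
library(vctrs)

# One pass over `x`: vec_group_loc() returns a data frame with the unique
# keys (in order of first appearance) and, per key, the locations of its
# occurrences. A group id vector falls straight out of that.
vec_unique_and_group_id <- function(x) {
  groups <- vec_group_loc(x)
  id <- integer(vec_size(x))
  for (i in seq_along(groups$loc)) {
    id[groups$loc[[i]]] <- i   # every occurrence of key i gets group id i
  }
  list(unique = groups$key, group_id = id)   # `id` should match vec_group_id(x)
}

vec_unique_and_group_id(c("a", "b", "a", "c", "b"))
# unique:   "a" "b" "c"
# group_id:  1   2   1   3   2
```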
I had assumed that … What you've suggested (…) … I don't know if the added complexity is worth the time savings, though. One case in which I think it might be is when using …
@DavisVaughan What do you think of #1882? I've added an implementation there.
@DavisVaughan After testing the performance of #1882, I realized that the gain isn't sufficiently large compared to using the separate functions. And if performance is the priority, it's better to use …
I ended up implementing this in a separate package, deduped:

```r
# if (!requireNamespace("deduped")) install.packages("deduped")
library(deduped)

N_TOTAL <- 1e4

repeated_paths <- fs::path("base", stringr::str_glue("dir{d}", d = 1:10), "inner") |>
  rep(N_TOTAL / 10) |>
  sample()

bench::mark(
  direct = repeated_paths |> fs::path_dir(),
  indirect = repeated_paths |> deduped(fs::path_dir)(),
  iterations = 10
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 direct       51.8ms   52.5ms      18.9  749.02KB     2.10
#> 2 indirect    206.3µs  213.5µs    4574.     6.13MB     0

all_unique_paths <- fs::path("base", stringr::str_glue("dir{d}", d = 1:N_TOTAL), "inner")

bench::mark(
  direct = all_unique_paths |> fs::path_dir(),
  indirect = all_unique_paths |> deduped(fs::path_dir)(),
  iterations = 10
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 direct       53.6ms   54.6ms      18.3  901.88KB     0
#> 2 indirect     53.6ms   54.9ms      18.2    1.03MB     2.02
```

Created on 2023-10-26 with reprex v2.0.2
I discovered that `fs::path_file` and `fs::path_dir` run very slowly on Windows (see r-lib/fs#424), and since most of my use of these functions comes right after `readr::read_csv(files, .id = "file_path")`, most of the vector is duplicated values. As such, I found that I could save a significant amount of time by deduplicating the vector (2x on Mac, 40x on Windows). This isn't just limited to the `fs::path_*` functions, however.

The most straightforward approach is:
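A minimal sketch of that naive dedup-then-expand pattern (the `dedup_apply` wrapper name is illustrative, not from the original post):

```r
# Naive dedup-then-expand: compute f() once per distinct value of x,
# then expand the results back to full length with match().
dedup_apply <- function(f, x, ...) {
  ux <- unique(x)
  f(ux, ...)[match(x, ux)]
}

paths <- rep(c("base/dir1/inner/a.csv", "base/dir2/inner/b.csv"), 5000)
identical(dedup_apply(fs::path_dir, paths), fs::path_dir(paths))
#> [1] TRUE
```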
However, calculating both `unique(x)` and `match(x, ux)` is duplicated effort, since it could be done in one go by combining the implementations of `vctrs::vec_unique_loc` and `vctrs::vec_duplicate_id`. This makes the deduplicated path faster, but it also means the overhead of running it on an already-unique vector is significantly reduced.

In the particular case below, using this implementation is ~2x faster than the naive implementation above. This is critical for an entirely unique vector, where this approach essentially removes the overhead altogether. When the vector is smaller, the benefit is smaller, but still an improvement.
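As a sketch, such a combined helper can be emulated at the R level today with the two existing vctrs functions, though that still costs two hash passes where a single C-level helper could use one. `vec_unique_and_id` is a hypothetical name, not part of vctrs:

```r
library(vctrs)

vec_unique_and_id <- function(x) {
  loc <- vec_unique_loc(x)         # locations of first occurrences, in order of appearance
  dup_id <- vec_duplicate_id(x)    # for each element, the location of its value's first occurrence
  list(
    unique = vec_slice(x, loc),
    group_id = match(dup_id, loc)  # map first-occurrence locations to 1..n_unique
  )
}

res <- vec_unique_and_id(c("a", "b", "a", "c"))
res$unique    # "a" "b" "c"
res$group_id  # 1 2 1 3

# Dedup-then-expand with the helper, e.g.:
# fs::path_dir(res$unique)[res$group_id]
```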
Created on 2023-07-08 with reprex v2.0.2