fs::path_dir is significantly slower on Windows vs Mac #424
Comments
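Obviously, even faster without going through a tibble:

fast_fs <- function(x, func) {
  ux <- unique(x)    # distinct inputs only
  uy <- func(ux)     # run the slow function once per distinct input
  uy[match(x, ux)]   # map the results back onto the original positions
}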
repeated_paths <- fs::path("base", stringr::str_glue("dir{d}", d=1:10), "inner") |>
rep(1e3)
bench::mark(
direct = repeated_paths |> fs::path_dir(),
indirect = repeated_paths |> fast_fs(fs::path_dir),
iterations = 50
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 direct 54.1ms 56.3ms 17.7 749KB 0.361
#> 2 indirect 283.6µs 395.8µs 2242. 365KB 0
unique_paths <- fs::path("base", stringr::str_glue("dir{d}", d=1:1e4), "inner")
bench::mark(
direct = unique_paths |> fs::path_dir(),
indirect = unique_paths |> fast_fs(fs::path_dir),
iterations = 50
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 direct 56.1ms 57.1ms 17.4 901.88KB 0.726
#> 2 indirect 56.4ms 57.5ms 17.3 1.36MB 0.723

Created on 2023-06-12 with reprex v2.0.2

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.3.0 (2023-04-21 ucrt)
#> os Windows 10 x64 (build 19045)
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate English_United States.utf8
#> ctype English_United States.utf8
#> tz America/Los_Angeles
#> date 2023-06-12
#> pandoc 2.19.2 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> ! package * version date (UTC) lib source
#> P bench 1.1.3 2023-05-04 [?] CRAN (R 4.3.0)
#> cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)
#> digest 0.6.31 2022-12-11 [1] CRAN (R 4.3.0)
#> evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)
#> fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.0)
#> fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)
#> fs 1.6.2 2023-04-25 [1] CRAN (R 4.3.0)
#> glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)
#> htmltools 0.5.5 2023-03-23 [1] CRAN (R 4.3.0)
#> knitr 1.42 2023-01-25 [1] CRAN (R 4.3.0)
#> lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)
#> pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)
#> P profmem 0.6.0 2020-12-13 [?] CRAN (R 4.3.0)
#> reprex 2.0.2 2022-08-17 [1] CRAN (R 4.3.0)
#> rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)
#> rmarkdown 2.21 2023-03-26 [1] CRAN (R 4.3.0)
#> rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.3.0)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)
#> stringi 1.7.12 2023-01-11 [1] CRAN (R 4.3.0)
#> stringr 1.5.0 2022-12-02 [1] CRAN (R 4.3.0)
#> tibble 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)
#> utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)
#> vctrs 0.6.2 2023-04-19 [1] CRAN (R 4.3.0)
#> withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)
#> xfun 0.39 2023-04-20 [1] CRAN (R 4.3.0)
#> yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)
#>
#> [1] C:/Users/LAB/Desktop/2023-05 Sequence Recovery Analysis/renv/library/R-4.3/x86_64-w64-mingw32
#> [2] C:/Users/LAB/AppData/Local/R/cache/R/renv/sandbox/R-4.3/x86_64-w64-mingw32/830ce55b
#> [3] C:/Program Files/R/R-4.3.0/library
#>
#> P ── Loaded and on-disk path mismatch.
#>
#> ──────────────────────────────────────────────────────────────────────────────
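The speedup from fast_fs relies on the wrapped function being element-wise, meaning each output depends only on the matching input. Below is a minimal sketch of that condition. It uses base R's dirname() as a stand-in for fs::path_dir and rev() as a counterexample; both are choices made for illustration, not part of the original benchmark.

```r
# fast_fs as posted in the comment above
fast_fs <- function(x, func) {
  ux <- unique(x)
  uy <- func(ux)
  uy[match(x, ux)]
}

paths <- c("base/dir1/inner", "base/dir2/inner", "base/dir1/inner")

# Element-wise function: the shortcut returns the same values in the same order
same_as_direct <- identical(dirname(paths), fast_fs(paths, dirname))

# Not element-wise: rev() depends on position, so the shortcut gives a different answer
x <- c("a", "b", "a")
rev_matches <- identical(rev(x), fast_fs(x, rev))
```

Here same_as_direct is TRUE and rev_matches is FALSE, so the trick should only be applied to functions that treat each element independently.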
Interestingly,

Given that these functions are essentially just wrappers around the equivalent

As such, I am closing this since this issue is not

Mac Reprex

N <- 10000
generate_vec <- function(l) lapply(1:N, \(dummy) sample(LETTERS, l, replace=T) |>
paste0(collapse="/")) |>
unlist()
vec3 <- generate_vec(3)
vec100 <- generate_vec(100)
bench::mark(
tolower(vec3),
tolower(vec100),
check=FALSE,
iterations=10
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 tolower(vec3) 2.9ms 3.08ms 319. 78.17KB 0
#> 2 tolower(vec100) 41.6ms 43.84ms 22.1 2.44MB 0
bench::mark(
basename(vec3),
basename(vec100),
check=FALSE,
iterations=10
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 basename(vec3) 1.67ms 1.86ms 535. 78.2KB 0
#> 2 basename(vec100) 3.01ms 3.13ms 302. 78.2KB 0

Created on 2023-09-15 with reprex v2.0.2

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.2.2 (2022-10-31)
#> os macOS Big Sur ... 10.16
#> system x86_64, darwin17.0
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz America/New_York
#> date 2023-09-15
#> pandoc 3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> bench 1.1.2 2021-11-30 [1] CRAN (R 4.2.0)
#> cli 3.6.1 2023-03-23 [1] CRAN (R 4.2.0)
#> digest 0.6.31 2022-12-11 [1] CRAN (R 4.2.0)
#> evaluate 0.19 2022-12-13 [1] CRAN (R 4.2.0)
#> fansi 1.0.4 2023-01-22 [1] CRAN (R 4.2.0)
#> fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.2.0)
#> fs 1.6.2 2023-04-25 [1] CRAN (R 4.2.0)
#> glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.0)
#> highr 0.10 2022-12-22 [1] CRAN (R 4.2.0)
#> htmltools 0.5.3 2022-07-18 [1] CRAN (R 4.2.0)
#> knitr 1.41 2022-11-18 [1] CRAN (R 4.2.0)
#> lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.2.0)
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.0)
#> pillar 1.9.0 2023-03-22 [1] CRAN (R 4.2.0)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.0)
#> profmem 0.6.0 2020-12-13 [1] CRAN (R 4.2.0)
#> purrr 1.0.1 2023-01-10 [1] CRAN (R 4.2.0)
#> R.cache 0.16.0 2022-07-21 [1] CRAN (R 4.2.0)
#> R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.2.0)
#> R.oo 1.25.0 2022-06-12 [1] CRAN (R 4.2.0)
#> R.utils 2.12.2 2022-11-11 [1] CRAN (R 4.2.0)
#> reprex 2.0.2 2022-08-17 [1] CRAN (R 4.2.0)
#> rlang 1.1.1 2023-04-28 [1] CRAN (R 4.2.0)
#> rmarkdown 2.14 2022-04-25 [1] CRAN (R 4.2.0)
#> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.2.0)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.0)
#> stringi 1.7.12 2023-01-11 [1] CRAN (R 4.2.0)
#> stringr 1.5.0 2022-12-02 [1] CRAN (R 4.2.0)
#> styler 1.8.1 2022-11-07 [1] CRAN (R 4.2.0)
#> tibble 3.2.1 2023-03-20 [1] CRAN (R 4.2.0)
#> utf8 1.2.3 2023-01-31 [1] CRAN (R 4.2.0)
#> vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.2.0)
#> withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.0)
#> xfun 0.40 2023-08-09 [1] CRAN (R 4.2.2)
#> yaml 2.3.6 2022-10-18 [1] CRAN (R 4.2.0)
#>
#> [1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library
#>
#> ──────────────────────────────────────────────────────────────────────────────

Windows Reprex

N <- 10000
generate_vec <- function(l) lapply(1:N, \(dummy) sample(LETTERS, l, replace=T) |>
paste0(collapse="/")) |>
unlist()
vec3 <- generate_vec(3)
vec100 <- generate_vec(100)
bench::mark(
tolower(vec3),
tolower(vec100),
check=FALSE,
iterations=10
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 tolower(vec3) 5.54ms 6.06ms 165. 78.17KB 0
#> 2 tolower(vec100) 77.67ms 79.17ms 12.6 2.44MB 0
bench::mark(
basename(vec3),
basename(vec100),
check=FALSE,
iterations=10
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 basename(vec3) 52.1ms 53.2ms 18.7 78.17KB 0
#> 2 basename(vec100) 99.6ms 100.7ms 9.89 4.42MB 1.10

Created on 2023-09-15 with reprex v2.0.2

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.3.0 (2023-04-21 ucrt)
#> os Windows 10 x64 (build 19045)
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate English_United States.utf8
#> ctype English_United States.utf8
#> tz America/Los_Angeles
#> date 2023-09-15
#> pandoc 2.19.2 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> ! package * version date (UTC) lib source
#> P bench 1.1.3 2023-05-04 [?] CRAN (R 4.3.0)
#> cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.0)
#> digest 0.6.31 2022-12-11 [1] CRAN (R 4.3.0)
#> evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)
#> fansi 1.0.4 2023-01-22 [1] CRAN (R 4.3.0)
#> fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)
#> fs 1.6.2 2023-04-25 [1] CRAN (R 4.3.0)
#> glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)
#> htmltools 0.5.5 2023-03-23 [1] CRAN (R 4.3.0)
#> knitr 1.42 2023-01-25 [1] CRAN (R 4.3.0)
#> lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.3.0)
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)
#> pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)
#> P profmem 0.6.0 2020-12-13 [?] CRAN (R 4.3.0)
#> purrr 1.0.1 2023-01-10 [1] CRAN (R 4.3.0)
#> R.cache 0.16.0 2022-07-21 [1] CRAN (R 4.3.1)
#> R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.3.0)
#> R.oo 1.25.0 2022-06-12 [1] CRAN (R 4.3.0)
#> R.utils 2.12.2 2022-11-11 [1] CRAN (R 4.3.1)
#> reprex 2.0.2 2022-08-17 [1] CRAN (R 4.3.0)
#> rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)
#> rmarkdown 2.21 2023-03-26 [1] CRAN (R 4.3.0)
#> rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.3.0)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)
#> styler 1.10.2 2023-08-29 [1] CRAN (R 4.3.1)
#> tibble 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)
#> utf8 1.2.3 2023-01-31 [1] CRAN (R 4.3.0)
#> vctrs 0.6.2 2023-04-19 [1] CRAN (R 4.3.0)
#> withr 2.5.0 2022-03-03 [1] CRAN (R 4.3.0)
#> xfun 0.39 2023-04-20 [1] CRAN (R 4.3.0)
#> yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.0)
#>
#> [1] C:/Users/LAB/Desktop/2023-05 Sequence Recovery Analysis/renv/library/R-4.3/x86_64-w64-mingw32
#> [2] C:/Users/LAB/AppData/Local/R/cache/R/renv/sandbox/R-4.3/x86_64-w64-mingw32/830ce55b
#> [3] C:/Program Files/R/R-4.3.0/library
#>
#> P ── Loaded and on-disk path mismatch.
#>
#> ──────────────────────────────────────────────────────────────────────────────
To anyone that comes across this and is looking for a solution, I've written a package to do the general case of

# if(!requireNamespace("deduped")) install.packages("deduped")
library(deduped)
N_TOTAL <- 1e4
repeated_paths <- fs::path("base", stringr::str_glue("dir{d}", d=1:10), "inner") |>
rep(N_TOTAL/10) |>
sample()
bench::mark(
direct = repeated_paths |> fs::path_dir(),
indirect = repeated_paths |> deduped(fs::path_dir)(),
iterations = 10
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 direct 51.8ms 52.5ms 18.9 749.02KB 2.10
#> 2 indirect 206.3µs 213.5µs 4574. 6.13MB 0
all_unique_paths <- fs::path("base", stringr::str_glue("dir{d}", d=1:N_TOTAL), "inner")
bench::mark(
direct = all_unique_paths |> fs::path_dir(),
indirect = all_unique_paths |> deduped(fs::path_dir)(),
iterations = 10
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 direct 53.6ms 54.6ms 18.3 901.88KB 0
#> 2 indirect 53.6ms 54.9ms 18.2 1.03MB 2.02

Created on 2023-10-26 with reprex v2.0.2
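For readers who want the same call shape without an extra dependency, here is a rough base R sketch of a function factory. The name dedupe_fn is hypothetical, and this is not the deduped package's implementation, only an illustration of the idea of wrapping a function so that it runs on distinct inputs. dirname() again stands in for fs::path_dir.

```r
# Hypothetical helper (an assumption for illustration, not the deduped package's code):
# returns a version of f that is evaluated on the distinct inputs only.
dedupe_fn <- function(f) {
  function(x, ...) {
    ux <- unique(x)
    f(ux, ...)[match(x, ux)]
  }
}

paths <- rep(c("base/dir1/inner", "base/dir2/inner"), times = 3)

fast_dirname <- dedupe_fn(dirname)  # dirname() stands in for fs::path_dir
out <- fast_dirname(paths)
```

As the benchmarks above suggest, a wrapper like this only pays off when the input has many repeats. With all-unique input it does the same work plus the cost of unique() and match().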
When you use read_csv(..., id="file_path") you end up with a file_path column that has lots of repeats. Manipulations on this kind of column are somewhat slow. I have found that it's much faster to do a left_join(x, distinct(x, file_path) |> mutate(...), by="file_path") than it is to do the mutate directly.

Below I timed using fs::path_dir directly (with no tibble involvement at all) compared to going via a tibble to do the above. Even though the latter is significantly more indirect, it's more than 13X faster:

Created on 2023-06-12 with reprex v2.0.2
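The two approaches described above can be sketched roughly as follows. The toy tibble x and the dir column name are assumptions made for illustration; the original benchmark code is not reproduced here.

```r
library(dplyr)

# Toy stand-in for the result of read_csv(..., id = "file_path")
x <- tibble(
  file_path = rep(c("base/dir1/inner/a.csv", "base/dir2/inner/b.csv"), each = 3),
  value     = 1:6
)

# Direct: fs::path_dir is evaluated once per row
direct <- x |>
  mutate(dir = as.character(fs::path_dir(file_path)))

# Indirect: fs::path_dir is evaluated once per distinct path, then joined back
lookup <- x |>
  distinct(file_path) |>
  mutate(dir = as.character(fs::path_dir(file_path)))
indirect <- left_join(x, lookup, by = "file_path")

# Both routes should produce the same table
all.equal(direct, indirect)
```

The join route does more bookkeeping, but the expensive call runs only as many times as there are distinct paths, which is where the reported speedup would come from.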