summarize performance drop off with many groups #5017
Comments
We know about this and we have a plan, but it won't happen in 1.0.0.
Yes, what happens is that some functions previously benefited from some ad hoc handling we called "hybrid evaluation". With the major rewrite for 1.0.0, those functions lost that special handling and are now evaluated as ordinary R calls for each group.
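As a rough sketch (not dplyr's actual internals), hybrid evaluation recognized common calls such as n() or max(x) and computed them in a single vectorized pass, instead of evaluating the R expression once per group:

g <- rep(1:1e5, length.out = 1e6) # 100k groups
x <- runif(1e6)

counts <- tabulate(g)                          # "hybrid" style: one pass over the data, no per-group R calls
maxes <- vapply(split(x, g), max, numeric(1))  # generic style: one R-level call per group

With that fast path gone, each group pays the cost of ordinary R evaluation, which is why the slowdown scales with the number of groups.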
Hi, my organization upgraded to dplyr v1.0.0 and we are noticing a significant increase in computation time (e.g., from 30 seconds to an hour for a data.frame with 20k columns). Just wanted to chime in and underscore that addressing this in a new version ASAP will be much appreciated! Thanks!
@pfreese is there any chance you could provide a reprex based on random data? It's possible that your issue is different from the others on the thread.
Using @pfreese's code example I am seeing a >100x difference in computation time on my MacBook for the last step (creation of m2) between dplyr versions 0.8.5 and 1.0.0. Is that expected?
@fullmetalomelette obviously that’s not expected.
Was just wondering if there are any updates on this, or when dplyr 1.1.0 (if that's when this will be fixed) will be released? Overall I'm finding dplyr 1.+ much slower than 0.8.+, in particular for operations on grouped data.
I'm curious about this too. I'm doing some straightforward grouped summarize operations and seeing similar slowdowns. This sort of calculation isn't really amenable to a reprex even though the data size is small (the input tibble is <200MB in memory), but if there are log files or other diagnostics I can collect I'm happy to provide those. I believe I can also make my data files and a fairly simple script available via email and cloud storage if someone wants to take a closer look.
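One way to collect such diagnostics without sharing data is base R's sampling profiler; the data frame and pipeline here are stand-ins, not the poster's actual workload:

library(dplyr, warn.conflicts = FALSE)

df <- tibble(g = rep(1:1e5, length.out = 1e6), x = runif(1e6)) # stand-in data

Rprof("summarize-profile.out")                       # start the sampling profiler
res <- df |> group_by(g) |> summarize(y = max(x))    # the slow step to diagnose
Rprof(NULL)                                          # stop profiling
head(summaryRprof("summarize-profile.out")$by.self)  # hotspots ranked by self time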
The issues with many groups haven't gone away. For the original reprex, and the current dev version, I see:

library(dplyr, warn.conflicts = FALSE)
n <- 1e6
# seq_n(): m distinct group ids recycled to length n
seq_n <- function(m, n) rep(1:m, length.out = n)
df_mny <- tibble(g = seq_n(1e5, n), x = runif(n)) |> group_by(g) # many groups (100k)
df_few <- tibble(g = seq_n(100, n), x = runif(n)) |> group_by(g) # few groups (100)
bench::mark(
df_mny |> summarize(y = n()),
df_mny |> summarize(y = max(x)),
df_few |> summarize(y = n()),
df_few |> summarize(y = max(x)),
check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 4 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 summarize(df_mny, y = n()) 652.46ms 652.46ms 1.53 6.47MB 18.4
#> 2 summarize(df_mny, y = max(x)) 132.74ms 143.84ms 6.33 8.4MB 14.2
#> 3 summarize(df_few, y = n()) 953.79µs 987.54µs 877. 7.28KB 14.0
#> 4 summarize(df_few, y = max(x)) 4.66ms 5.43ms 167. 7.64MB 31.9

Created on 2022-07-31 by the reprex package (v2.0.1)

And for 0.8.5:

#> # A tibble: 4 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 summarize(df_mny, y = n()) 1.8ms 1.91ms 508. 414KB 10.4
#> 2 summarize(df_mny, y = max(x)) 5.19ms 5.33ms 186. 781KB 8.64
#> 3 summarize(df_few, y = n()) 46µs 48µs 20451. 448B 31.8
#> 4 summarize(df_few, y = max(x)) 4.26ms 4.61ms 207. 848B 2.01

Created on 2022-07-31 by the reprex package (v2.0.1)

So we certainly haven't made any performance improvements by accident. This is a good place to use dtplyr:
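The dtplyr calls here are reconstructed from the expressions in the benchmark output below, assuming the df_mny and df_few data frames from above; lazy_dt() translates the pipeline to data.table and collect() executes it:

bench::mark(
  collect(summarize(dtplyr::lazy_dt(df_mny), y = n())),
  collect(summarize(dtplyr::lazy_dt(df_mny), y = max(x))),
  collect(summarize(dtplyr::lazy_dt(df_few), y = n())),
  collect(summarize(dtplyr::lazy_dt(df_few), y = max(x))),
  check = FALSE
)

#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.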
#> # A tibble: 4 × 6
#> expression min median itr/s…¹
#> <bch:expr> <bch:t> <bch:> <dbl>
#> 1 collect(summarize(dtplyr::lazy_dt(df_mny), y = n())) 36.8ms 43.7ms 18.7
#> 2 collect(summarize(dtplyr::lazy_dt(df_mny), y = max(x))) 58.7ms 63.2ms 13.5
#> 3 collect(summarize(dtplyr::lazy_dt(df_few), y = n())) 18.9ms 20.7ms 36.8
#> 4 collect(summarize(dtplyr::lazy_dt(df_few), y = max(x))) 33.9ms 35.5ms 28.0
#> # … with 2 more variables: mem_alloc <bch:byt>, `gc/sec` <dbl>, and abbreviated
#> # variable name ¹`itr/sec`
#> # ℹ Use `colnames()` to see all variable names

Created on 2022-07-31 by the reprex package (v2.0.1)

dplyr's computational engine is showing its limitations for larger datasets, and it seems every time we improve performance in one place we either make it worse somewhere else or introduce a buggy edge case. I think we will probably need to fix this by reconsidering the built-in backend altogether, rather than patching it with more band-aids.

So, I'm going to close this issue. While we're certainly still generally thinking about performance in dplyr, tracking this specific issue isn't particularly useful for us.
Original issue description:

I am testing out the dev version of dplyr and have noticed some performance regressions when using summarize with a large number of groups. Calling n() with a large number of groups produces a ~400x increased runtime, whereas using max() has a ~10x increased runtime.

Performance on 0.8.5

Created on 2020-03-21 by the reprex package (v0.3.0)

Performance on current dev version

Created on 2020-03-21 by the reprex package (v0.3.0)