summarize performance drop off with many groups #5017

Closed
kriemo opened this issue Mar 21, 2020 · 13 comments

@kriemo

kriemo commented Mar 21, 2020

I am testing out the dev version of dplyr and have noticed some performance regressions when using summarize() with a large number of groups. Calling n() with a large number of groups produces a ~400x increase in runtime, whereas max() shows a ~10x increase.

Performance on 0.8.5

library(dplyr, warn.conflicts = FALSE)
packageVersion("dplyr")
#> [1] '0.8.5'

set.seed(42)
many_grps <- data.frame(grp = sample(1:1e5,
                                     1e6,
                                     replace = TRUE),
                        val = runif(1e6)) %>% 
  group_by(grp)
n_groups(many_grps)
#> [1] 99997

set.seed(42)
few_grps <- data.frame(grp = sample(1:100,
                                    1e6,
                                    replace = TRUE),
                       val = runif(1e6)) %>% 
  group_by(grp)
n_groups(few_grps)
#> [1] 100

microbenchmark::microbenchmark(summarize(many_grps, n = n()),
                               summarize(many_grps, m = max(val)),
                               summarize(few_grps, n = n()),
                               summarize(few_grps, m = max(val)),
                               times = 5,
                               unit = 'ms')
#> Unit: milliseconds
#>                                expr       min        lq      mean    median
#>       summarize(many_grps, n = n())  2.474665  2.531869  2.786606  2.743778
#>  summarize(many_grps, m = max(val)) 17.693114 19.297248 22.774355 20.640482
#>        summarize(few_grps, n = n())  0.144234  0.154476  0.182271  0.175776
#>   summarize(few_grps, m = max(val))  8.792012 10.393226 11.908963 10.482098
#>         uq       max neval cld
#>   3.037315  3.145401     5 a  
#>  27.791042 28.449888     5   c
#>   0.190168  0.246701     5 a  
#>  14.835186 15.042294     5  b

Created on 2020-03-21 by the reprex package (v0.3.0)

Performance on current dev version

library(dplyr, warn.conflicts = FALSE)
packageVersion("dplyr")
#> [1] '0.8.99.9002'

set.seed(42)
many_grps <- data.frame(grp = sample(1:1e5,
                                     1e6,
                                     replace = TRUE),
                        val = runif(1e6)) %>% 
  group_by(grp)
n_groups(many_grps)
#> [1] 99997

set.seed(42)
few_grps <- data.frame(grp = sample(1:100,
                                    1e6,
                                    replace = TRUE),
                       val = runif(1e6)) %>% 
  group_by(grp)
n_groups(few_grps)
#> [1] 100

microbenchmark::microbenchmark(summarize(many_grps, n = n()),
                               summarize(many_grps, m = max(val)),
                               summarize(few_grps, n = n()),
                               summarize(few_grps, m = max(val)),
                               times = 5,
                               unit = 'ms')
#> Unit: milliseconds
#>                                expr         min          lq        mean
#>       summarize(many_grps, n = n()) 1129.489705 1170.905902 1177.632328
#>  summarize(many_grps, m = max(val))  164.942559  180.437278  212.037870
#>        summarize(few_grps, n = n())    2.286928    2.307006    2.408776
#>   summarize(few_grps, m = max(val))   13.396918   14.531448   15.026723
#>       median         uq        max neval cld
#>  1177.479608 1188.72916 1221.55727     5   c
#>   205.139373  218.82458  290.84556     5  b 
#>     2.324175    2.40646    2.71931     5 a  
#>    14.850954   15.62192   16.73238     5 a

Created on 2020-03-21 by the reprex package (v0.3.0)

@romainfrancois
Member

We know about this and we have a plan, it won't happen in 1.0.0 though.

@romainfrancois
Member

Yes, what happens is that some functions previously benefited from some ad hoc handling we called "hybrid evaluation".

With the major rewrite of dplyr we have not yet brought back such a mechanism. We plan to address it in version 1.1.0; this is one of my challenges for the post-1.0.0 release.
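
(For context: hybrid evaluation meant that a handful of common calls such as n(), sum(), mean(), min(), and max() were recognised and computed directly in C across all groups, rather than evaluating the R expression once per group. A rough conceptual sketch of the difference, not dplyr internals, with made-up variable names:)

# Conceptual illustration only, not how dplyr is implemented
g <- sample(1:1e5, 1e6, replace = TRUE)

# Fallback path: one R call per group (~1e5 calls), which is what hurts here
sizes_slow <- vapply(split(seq_along(g), g), length, integer(1))

# Hybrid-style path: all group sizes computed in a single vectorised pass
sizes_fast <- tabulate(match(g, sort(unique(g))))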

@pfreese

pfreese commented Jul 29, 2020

Hi, my organization upgraded to dplyr v1.0.0 and we are noticing a significant increase in computation time (e.g., from 30 seconds to an hour for a data.frame with 20k columns). Just wanted to chime in and underscore that addressing this in a new version ASAP will be much appreciated! Thanks!

@hadley
Member

hadley commented Jul 29, 2020

@pfreese is there any chance you could provide a reprex based on random data? It's possible that your issue is different from the others on the thread.

@pfreese

pfreese commented Jul 29, 2020

library(dplyr, warn.conflicts = FALSE)

# 50 rows x 20,000 columns of random 0/1 values, plus a grouping column
m0 <- matrix(0, 50, 20000)
groups <- sample(1:10, 50, replace = TRUE)

m1 <- apply(m0, c(1, 2), function(x) sample(c(0, 1), 1)) %>%
  as.data.frame() %>%
  mutate(groups = groups)

# The slow step: column-wise sums within each group
m2 <- m1 %>%
  group_by(groups) %>%
  summarise_all(list(sum))
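
(Side note: summarise_all() is superseded in dplyr 1.0.0; the across() spelling below is equivalent and shown for reference only, and doesn't by itself change the grouped-summary overhead discussed in this thread.)

# Equivalent written with across(); illustration only
m2 <- m1 %>%
  group_by(groups) %>%
  summarise(across(everything(), sum))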

@twu13

twu13 commented Aug 4, 2020

Using @pfreese's code example I am seeing a >100x difference in computation time on my Macbook for the last step (creation of m2) between dplyr versions 0.8.5 and 1.0.0. Is that expected?
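
(A minimal way to time just that step, assuming m1 from the example above has been built under each dplyr version:)

# Rough timing of the grouped summary step; run once under 0.8.5 and once under 1.0.0
system.time(
  m1 %>% group_by(groups) %>% summarise_all(list(sum))
)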

@hadley
Member

hadley commented Aug 4, 2020

@fullmetalomelette obviously that’s not expected.

@Fablepongiste

Fablepongiste commented Aug 12, 2021

Was just wondering if there were any updates on this, or when dplyr 1.1.0 (if that's when this will be fixed) would be released?

Overall I'm finding dplyr 1.+ much slower than 0.8.+, in particular for operations on grouped data.

@twest820

I'm curious about this too. I'm doing some straightforward %>% summarize() %>% operations which are taking over two minutes, even though they're running on an Intel 8th-gen CPU at 3.8 GHz with only ~700k observations in ~200k groups. Worse, the output tibble often isn't added to the R workspace, leading to script failures because it's missing, or because recalculating it with updates didn't actually update the workspace to the new version of the tibble. Curiously, the first call to summarize() is noticeably more reliable than recalculations, meaning that, so far at least, I can mostly mitigate the issue by restarting the R session and caching the summarized results in a data file. While that's something of a hassle due to the two-minute calculation grind and the need to keep files synchronized, it's better than the alternative.

This sort of calculation isn't really amenable to a reprex even though the data size is small (the input tibble is <200MB in memory), but if there are log files or other diagnostics I can collect I'm happy to provide those. I believe I can also make my data files and a fairly simple script available via email and cloud storage if someone wants to take a closer look.

@hadley
Member

hadley commented Jul 31, 2022

The issues with n_distinct() have been mostly handled because we've considerably improved the performance of n_distinct().

For the original reprex, on the current dev version, I see:

library(dplyr, warn.conflicts = FALSE)
n <- 1e6
seq_n <- function(m, n) rep(1:m, length.out = n)

df_mny <- tibble(g = seq_n(1e5, n), x = runif(n)) |> group_by(g)
df_few <- tibble(g = seq_n(100, n), x = runif(n)) |> group_by(g)

bench::mark(
  df_mny |> summarize(y = n()),
  df_mny |> summarize(y = max(x)),
  df_few |> summarize(y = n()),
  df_few |> summarize(y = max(x)),
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 4 × 6
#>   expression                         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 summarize(df_mny, y = n())    652.46ms 652.46ms      1.53    6.47MB     18.4
#> 2 summarize(df_mny, y = max(x)) 132.74ms 143.84ms      6.33     8.4MB     14.2
#> 3 summarize(df_few, y = n())    953.79µs 987.54µs    877.      7.28KB     14.0
#> 4 summarize(df_few, y = max(x))   4.66ms   5.43ms    167.      7.64MB     31.9

Created on 2022-07-31 by the reprex package (v2.0.1)

And for 0.8.5:

#> # A tibble: 4 × 6
#>   expression                         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 summarize(df_mny, y = n())       1.8ms   1.91ms      508.     414KB    10.4 
#> 2 summarize(df_mny, y = max(x))   5.19ms   5.33ms      186.     781KB     8.64
#> 3 summarize(df_few, y = n())        46µs     48µs    20451.      448B    31.8 
#> 4 summarize(df_few, y = max(x))   4.26ms   4.61ms      207.      848B     2.01

Created on 2022-07-31 by the reprex package (v2.0.1)

So we certainly haven't made any performance improvements by accident.

This is a good place to use dtplyr:
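
(The timings below correspond to expressions like the following, reconstructed from the bench::mark output rather than taken from the exact script; it assumes dtplyr is installed and df_mny / df_few from the reprex above:)

# Same summaries routed through data.table via dtplyr
bench::mark(
  df_mny |> dtplyr::lazy_dt() |> summarize(y = n()) |> collect(),
  df_mny |> dtplyr::lazy_dt() |> summarize(y = max(x)) |> collect(),
  df_few |> dtplyr::lazy_dt() |> summarize(y = n()) |> collect(),
  df_few |> dtplyr::lazy_dt() |> summarize(y = max(x)) |> collect(),
  check = FALSE
)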

#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 4 × 6
#>   expression                                                  min median itr/s…¹
#>   <bch:expr>                                              <bch:t> <bch:>   <dbl>
#> 1 collect(summarize(dtplyr::lazy_dt(df_mny), y = n()))     36.8ms 43.7ms    18.7
#> 2 collect(summarize(dtplyr::lazy_dt(df_mny), y = max(x)))  58.7ms 63.2ms    13.5
#> 3 collect(summarize(dtplyr::lazy_dt(df_few), y = n()))     18.9ms 20.7ms    36.8
#> 4 collect(summarize(dtplyr::lazy_dt(df_few), y = max(x)))  33.9ms 35.5ms    28.0
#> # … with 2 more variables: mem_alloc <bch:byt>, `gc/sec` <dbl>, and abbreviated
#> #   variable name ¹​`itr/sec`
#> # ℹ Use `colnames()` to see all variable names

Created on 2022-07-31 by the reprex package (v2.0.1)

dplyr's computational engine is showing its limitations for larger datasets, and it seems that every time we improve performance in one place we either make it worse somewhere else or introduce a buggy edge case. I think we will probably need to fix this by reconsidering the built-in backend altogether, rather than patching it with more band-aids. So, I'm going to close this issue. While we're certainly still thinking about performance in dplyr generally, tracking this specific issue isn't particularly useful for us.

hadley closed this as not planned on Jul 31, 2022