summarize performance drop off with many groups #5017

Closed
kriemo opened this issue Mar 21, 2020 · 13 comments

@kriemo

kriemo commented Mar 21, 2020

I am testing out the dev version of dplyr and have noticed some performance regressions when using summarize() with a large number of groups. Calling n() with a large number of groups produces a ~400x increase in runtime, whereas max() shows a ~10x increase.

Performance on 0.8.5

library(dplyr, warn.conflicts = FALSE)
packageVersion("dplyr")
#> [1] '0.8.5'

set.seed(42)
many_grps <- data.frame(grp = sample(1:1e5,
                                     1e6,
                                     replace = TRUE),
                        val = runif(1e6)) %>% 
  group_by(grp)
n_groups(many_grps)
#> [1] 99997

set.seed(42)
few_grps <- data.frame(grp = sample(1:100,
                                    1e6,
                                    replace = TRUE),
                       val = runif(1e6)) %>% 
  group_by(grp)
n_groups(few_grps)
#> [1] 100

microbenchmark::microbenchmark(summarize(many_grps, n = n()),
                               summarize(many_grps, m = max(val)),
                               summarize(few_grps, n = n()),
                               summarize(few_grps, m = max(val)),
                               times = 5,
                               unit = 'ms')
#> Unit: milliseconds
#>                                expr       min        lq      mean    median
#>       summarize(many_grps, n = n())  2.474665  2.531869  2.786606  2.743778
#>  summarize(many_grps, m = max(val)) 17.693114 19.297248 22.774355 20.640482
#>        summarize(few_grps, n = n())  0.144234  0.154476  0.182271  0.175776
#>   summarize(few_grps, m = max(val))  8.792012 10.393226 11.908963 10.482098
#>         uq       max neval cld
#>   3.037315  3.145401     5 a  
#>  27.791042 28.449888     5   c
#>   0.190168  0.246701     5 a  
#>  14.835186 15.042294     5  b

Created on 2020-03-21 by the reprex package (v0.3.0)

Performance on current dev version

library(dplyr, warn.conflicts = FALSE)
packageVersion("dplyr")
#> [1] '0.8.99.9002'

set.seed(42)
many_grps <- data.frame(grp = sample(1:1e5,
                                     1e6,
                                     replace = TRUE),
                        val = runif(1e6)) %>% 
  group_by(grp)
n_groups(many_grps)
#> [1] 99997

set.seed(42)
few_grps <- data.frame(grp = sample(1:100,
                                    1e6,
                                    replace = TRUE),
                       val = runif(1e6)) %>% 
  group_by(grp)
n_groups(few_grps)
#> [1] 100

microbenchmark::microbenchmark(summarize(many_grps, n = n()),
                               summarize(many_grps, m = max(val)),
                               summarize(few_grps, n = n()),
                               summarize(few_grps, m = max(val)),
                               times = 5,
                               unit = 'ms')
#> Unit: milliseconds
#>                                expr         min          lq        mean
#>       summarize(many_grps, n = n()) 1129.489705 1170.905902 1177.632328
#>  summarize(many_grps, m = max(val))  164.942559  180.437278  212.037870
#>        summarize(few_grps, n = n())    2.286928    2.307006    2.408776
#>   summarize(few_grps, m = max(val))   13.396918   14.531448   15.026723
#>       median         uq        max neval cld
#>  1177.479608 1188.72916 1221.55727     5   c
#>   205.139373  218.82458  290.84556     5  b 
#>     2.324175    2.40646    2.71931     5 a  
#>    14.850954   15.62192   16.73238     5 a

Created on 2020-03-21 by the reprex package (v0.3.0)

@romainfrancois
Member

We know about this and we have a plan, it won't happen in 1.0.0 though.

@romainfrancois
Member

Yes, what happens is that some functions previously benefited from some ad hoc handling we called "hybrid evaluation".

With the major rewrite of dplyr we have not yet brought back such a mechanism. We plan to address it in version 1.1.0; this is one of my challenges for the post-1.0.0 release.
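
(For context: hybrid evaluation meant that a handful of common calls such as n(), sum(), mean(), min(), and max() were recognised and computed directly in C across all groups, rather than evaluating the R expression once per group. A rough conceptual sketch of the difference, not dplyr internals, with made-up variable names:)

# Conceptual illustration only, not how dplyr is implemented
g <- sample(1:1e5, 1e6, replace = TRUE)

# Fallback path: one R call per group (~1e5 calls), which is what hurts here
sizes_slow <- vapply(split(seq_along(g), g), length, integer(1))

# Hybrid-style path: all group sizes computed in a single vectorised pass
sizes_fast <- tabulate(match(g, sort(unique(g))))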

@pfreese

pfreese commented Jul 29, 2020

Hi, my organization upgraded to dplyr v1.0.0 and we are noticing a significant increase in computation time (e.g., from 30 seconds to an hour for a data.frame with 20k columns). Just wanted to chime in and underscore that addressing this in a new version ASAP will be much appreciated! Thanks!

@hadley
Member

hadley commented Jul 29, 2020

@pfreese is there any chance you could provide a reprex based on random data? It's possible that your issue is different from the others on the thread.

@pfreese

pfreese commented Jul 29, 2020

library(dplyr, warn.conflicts = FALSE)

# 50 rows x 20,000 columns of random 0/1 values, plus a grouping column
m0 <- matrix(0, 50, 20000)
groups <- sample(1:10, 50, replace = TRUE)

m1 <- apply(m0, c(1, 2), function(x) sample(c(0, 1), 1)) %>%
  as.data.frame() %>%
  mutate(groups = groups)

# The slow step: column-wise sums within each group
m2 <- m1 %>%
  group_by(groups) %>%
  summarise_all(list(sum))
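
(Side note: summarise_all() is superseded in dplyr 1.0.0; the across() spelling below is equivalent and shown for reference only, and doesn't by itself change the grouped-summary overhead discussed in this thread.)

# Equivalent written with across(); illustration only
m2 <- m1 %>%
  group_by(groups) %>%
  summarise(across(everything(), sum))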

@twu13

twu13 commented Aug 4, 2020

Using @pfreese's code example I am seeing a >100x difference in computation time on my Macbook for the last step (creation of m2) between dplyr versions 0.8.5 and 1.0.0. Is that expected?
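
(A minimal way to time just that step, assuming m1 from the example above has been built under each dplyr version:)

# Rough timing of the grouped summary step; run once under 0.8.5 and once under 1.0.0
system.time(
  m1 %>% group_by(groups) %>% summarise_all(list(sum))
)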

@hadley
Member

hadley commented Aug 4, 2020

@fullmetalomelette obviously that’s not expected.

@Fablepongiste

Fablepongiste commented Aug 12, 2021

Was just wondering if there were any updates on this, or when dplyr 1.1.0 (if that's when this will be fixed) would be released?

Overall I'm finding dplyr 1.+ much slower than 0.8.+, in particular for operations on grouped data.

@twest820

I'm curious about this too. I'm doing some straightforward %>% summarize() %>% operations which are taking over two minutes, even though they're running on an Intel 8th-gen CPU at 3.8 GHz with only ~700k observations in ~200k groups. Worse, the output tibble often isn't added to the R workspace, leading to script failures because it's missing, or because recalculating it with updates didn't actually update the workspace to the new version of the tibble. Curiously, the first call to summarize() is noticeably more reliable than recalculations, meaning that, so far at least, I can mostly mitigate the issue by restarting the R session and caching the summarized results in a data file. While that's something of a hassle due to the two-minute calculation grind and the need to keep files synchronized, it's better than the alternative.

This sort of calculation isn't really amenable to a reprex even though the data size is small (the input tibble is <200MB in memory), but if there are log files or other diagnostics I can collect I'm happy to provide those. I believe I can also make my data files and a fairly simple script available via email and cloud storage if someone wants to take a closer look.

@hadley
Member

hadley commented Jul 31, 2022

The issues with n_distinct() have been mostly handled because we've considerably improved the performance of n_distinct().

For the original reprex, on the current dev version, I see:

library(dplyr, warn.conflicts = FALSE)
n <- 1e6
seq_n <- function(m, n) rep(1:m, length.out = n)

df_mny <- tibble(g = seq_n(1e5, n), x = runif(n)) |> group_by(g)
df_few <- tibble(g = seq_n(100, n), x = runif(n)) |> group_by(g)

bench::mark(
  df_mny |> summarize(y = n()),
  df_mny |> summarize(y = max(x)),
  df_few |> summarize(y = n()),
  df_few |> summarize(y = max(x)),
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 4 × 6
#>   expression                         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 summarize(df_mny, y = n())    652.46ms 652.46ms      1.53    6.47MB     18.4
#> 2 summarize(df_mny, y = max(x)) 132.74ms 143.84ms      6.33     8.4MB     14.2
#> 3 summarize(df_few, y = n())    953.79µs 987.54µs    877.      7.28KB     14.0
#> 4 summarize(df_few, y = max(x))   4.66ms   5.43ms    167.      7.64MB     31.9

Created on 2022-07-31 by the reprex package (v2.0.1)

And for 0.8.5:

#> # A tibble: 4 × 6
#>   expression                         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 summarize(df_mny, y = n())       1.8ms   1.91ms      508.     414KB    10.4 
#> 2 summarize(df_mny, y = max(x))   5.19ms   5.33ms      186.     781KB     8.64
#> 3 summarize(df_few, y = n())        46µs     48µs    20451.      448B    31.8 
#> 4 summarize(df_few, y = max(x))   4.26ms   4.61ms      207.      848B     2.01

Created on 2022-07-31 by the reprex package (v2.0.1)

So we certainly haven't made any performance improvements by accident.

This is a good place to use dtplyr:
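
(The timings below correspond to expressions like the following, reconstructed from the bench::mark output rather than taken from the exact script; it assumes dtplyr is installed and df_mny / df_few from the reprex above:)

# Same summaries routed through data.table via dtplyr
bench::mark(
  df_mny |> dtplyr::lazy_dt() |> summarize(y = n()) |> collect(),
  df_mny |> dtplyr::lazy_dt() |> summarize(y = max(x)) |> collect(),
  df_few |> dtplyr::lazy_dt() |> summarize(y = n()) |> collect(),
  df_few |> dtplyr::lazy_dt() |> summarize(y = max(x)) |> collect(),
  check = FALSE
)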

#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 4 × 6
#>   expression                                                  min median itr/s…¹
#>   <bch:expr>                                              <bch:t> <bch:>   <dbl>
#> 1 collect(summarize(dtplyr::lazy_dt(df_mny), y = n()))     36.8ms 43.7ms    18.7
#> 2 collect(summarize(dtplyr::lazy_dt(df_mny), y = max(x)))  58.7ms 63.2ms    13.5
#> 3 collect(summarize(dtplyr::lazy_dt(df_few), y = n()))     18.9ms 20.7ms    36.8
#> 4 collect(summarize(dtplyr::lazy_dt(df_few), y = max(x)))  33.9ms 35.5ms    28.0
#> # … with 2 more variables: mem_alloc <bch:byt>, `gc/sec` <dbl>, and abbreviated
#> #   variable name ¹​`itr/sec`
#> # ℹ Use `colnames()` to see all variable names

Created on 2022-07-31 by the reprex package (v2.0.1)

dplyr's computational engine is showing its limitations for larger datasets, and it seems that every time we improve performance in one place we either make it worse somewhere else or introduce a buggy edge case. I think we will probably need to fix this by reconsidering the built-in backend altogether, rather than patching it with more band-aids. So, I'm going to close this issue. While we're certainly still thinking about performance in dplyr generally, tracking this specific issue isn't particularly useful for us.

hadley closed this as not planned on Jul 31, 2022