summarise_all requested #3711

mattdowle · 2019-07-18T20:14:44Z

Users can already use dplyr::summarise_all on .SD. But two people said it would still be nice to see built in to data.table :
https://twitter.com/Pertplus1/status/1151090599648550912
https://twitter.com/ftvalen/status/1151099177402818561
Next step would be to benchmark: https://twitter.com/MattDowle/status/1151959251029389312

The text was updated successfully, but these errors were encountered:

franknarf1 · 2019-07-18T20:34:20Z

Related (if I understand what summarise_all does): #782 and #1063 as colwise()

ftvalentini · 2019-07-18T23:06:47Z

Here goes an example using
A. dplyr with summarise_at
B. datatable with lapply
C. datatable with summarise_all

B seems to do the trick -- C is by far the slowest solution. An example:

library(magrittr)
library(purrr)
library(data.table)
library(dplyr)
library(tictoc)
library(devtools)

# nrow of data
n = 1e6
# number of columns by type
p_int = 100
p_double = 100
p_char = 30

# create columns
set.seed(100)
# grouping columns
id1 = sample(letters, n, rep=T)
id2 = sample(letters, n, rep=T)
# filter columns
f = sample(12, n, rep=T)
# integer columns
ints = matrix(sample(100L, p_int*n, rep=T), ncol=p_int) %>% as.data.table()
names(ints) = paste0("int",seq_along(ints))
# double columns
doubles = (matrix(sample(100, p_double*n, rep=T), ncol=p_double)/7) %>% as.data.table() 
names(doubles) = paste0("double",seq_along(doubles))
# character columns
chars = matrix(sample(letters, p_char*n, rep=T), ncol=p_char) %>% as.data.table() 
names(chars) = paste0("char",seq_along(chars))

# bind columns
dat = cbind(id1, id2, f, ints, doubles, chars) 

# columns to summarise
cols_sum = map_lgl(dat, is.numeric) %>% which() %>% names()


# summarise with mean,max,sum

tic.clearlog()
# A: dplyr
tic("A")
out_a = as_tibble(dat) %>% 
  dplyr::filter(f < 6) %>% 
  group_by(id1, id2) %>%
  summarise_at(cols_sum, list(max = max, sum = sum, mean = mean))
toc(log=T)

# B: datatable
tic("B")
func_sum = function(x) c(max = max(x), sum = sum(x), mean = mean(x))
out_b = dat[f < 6,
          by = .(id1,id2),
          as.list(unlist(lapply(.SD, func_sum))),
          .SDcols = cols_sum]
toc(log=T)

# C: datatable with summarise_all
tic("C")
out_c = dat[f < 6,
          by = .(id1,id2),
          summarise_all(.SD, list(max = max, sum = sum, mean = mean)),
          .SDcols = cols_sum]
toc(log=T)

# time
tic.log(format = TRUE)

Session info:

> devtools::session_info()
Session info ---------------------------------------------------
 setting  value                       
 version  R version 3.6.0 (2019-04-26)
 system   x86_64, mingw32             
 ui       RStudio (1.2.1335)          
 language (EN)                        
 collate  Spanish_Argentina.1252      
 tz       America/Buenos_Aires        
 date     2019-07-18                  

Packages -------------------------------------------------------
 package   * version date       source        
 base      * 3.6.0   2019-04-26 local         
 compiler    3.6.0   2019-04-26 local         
 datasets  * 3.6.0   2019-04-26 local         
 devtools  * 1.13.6  2018-06-27 CRAN (R 3.5.1)
 digest      0.6.18  2018-10-10 CRAN (R 3.5.1)
 graphics  * 3.6.0   2019-04-26 local         
 grDevices * 3.6.0   2019-04-26 local         
 memoise     1.1.0   2017-04-21 CRAN (R 3.5.1)
 methods   * 3.6.0   2019-04-26 local         
 packrat     0.4.9-3 2018-06-01 CRAN (R 3.5.1)
 stats     * 3.6.0   2019-04-26 local         
 tools       3.6.0   2019-04-26 local         
 utils     * 3.6.0   2019-04-26 local         
 withr       2.1.2   2018-03-15 CRAN (R 3.5.0)
>

mattdowle · 2019-07-18T23:34:45Z

Thanks @ftvalentini - that's beyond my wildest dreams of reproducibility! Just one thing: can you add the timings that you see too please. That way, when we reproduce it in some weeks time, we know we get the same ballpark timings as you.

ftvalentini · 2019-07-18T23:59:06Z

running times:

> tic.log(format = TRUE)
[[1]]
[1] "A: 21.08 sec elapsed"

[[2]]
[1] "B: 5.25 sec elapsed"

[[3]]
[1] "C: 868.97 sec elapsed"

Note: I had to stop execution of C before it ended :/

jangorecki · 2020-06-27T22:36:57Z

dplyr seems to be deprecating summarise_all so not sure if we still want to have this... we could eventually use their API for same functionality, whatever it is now, but it is a little bit risky as it may eventually get deprecated as well.

mattdowle added the feature request label Jul 18, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

summarise_all requested #3711

summarise_all requested #3711

mattdowle commented Jul 18, 2019 •

edited

Loading

franknarf1 commented Jul 18, 2019

ftvalentini commented Jul 18, 2019

mattdowle commented Jul 18, 2019

ftvalentini commented Jul 18, 2019

jangorecki commented Jun 27, 2020 •

edited

Loading

summarise_all requested #3711

summarise_all requested #3711

Comments

mattdowle commented Jul 18, 2019 • edited Loading

franknarf1 commented Jul 18, 2019

ftvalentini commented Jul 18, 2019

mattdowle commented Jul 18, 2019

ftvalentini commented Jul 18, 2019

jangorecki commented Jun 27, 2020 • edited Loading

mattdowle commented Jul 18, 2019 •

edited

Loading

jangorecki commented Jun 27, 2020 •

edited

Loading