Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

summarise_all requested #3711

Open
mattdowle opened this issue Jul 18, 2019 · 5 comments
Open

summarise_all requested #3711

mattdowle opened this issue Jul 18, 2019 · 5 comments

Comments

@mattdowle
Copy link
Member

mattdowle commented Jul 18, 2019

Users can already use dplyr::summarise_all on .SD. But two people said it would still be nice to see built in to data.table :
https://twitter.com/Pertplus1/status/1151090599648550912
https://twitter.com/ftvalen/status/1151099177402818561
Next step would be to benchmark: https://twitter.com/MattDowle/status/1151959251029389312

@franknarf1
Copy link
Contributor

Related (if I understand what summarise_all does): #782 and #1063 as colwise()

@ftvalentini
Copy link

Here goes an example using
A. dplyr with summarise_at
B. datatable with lapply
C. datatable with summarise_all

B seems to do the trick -- C is by far the slowest solution. An example:

library(magrittr)
library(purrr)
library(data.table)
library(dplyr)
library(tictoc)
library(devtools)

# nrow of data
n = 1e6
# number of columns by type
p_int = 100
p_double = 100
p_char = 30

# create columns
set.seed(100)
# grouping columns
id1 = sample(letters, n, rep=T)
id2 = sample(letters, n, rep=T)
# filter columns
f = sample(12, n, rep=T)
# integer columns
ints = matrix(sample(100L, p_int*n, rep=T), ncol=p_int) %>% as.data.table()
names(ints) = paste0("int",seq_along(ints))
# double columns
doubles = (matrix(sample(100, p_double*n, rep=T), ncol=p_double)/7) %>% as.data.table() 
names(doubles) = paste0("double",seq_along(doubles))
# character columns
chars = matrix(sample(letters, p_char*n, rep=T), ncol=p_char) %>% as.data.table() 
names(chars) = paste0("char",seq_along(chars))

# bind columns
dat = cbind(id1, id2, f, ints, doubles, chars) 

# columns to summarise
cols_sum = map_lgl(dat, is.numeric) %>% which() %>% names()


# summarise with mean,max,sum

tic.clearlog()
# A: dplyr
tic("A")
out_a = as_tibble(dat) %>% 
  dplyr::filter(f < 6) %>% 
  group_by(id1, id2) %>%
  summarise_at(cols_sum, list(max = max, sum = sum, mean = mean))
toc(log=T)

# B: datatable
tic("B")
func_sum = function(x) c(max = max(x), sum = sum(x), mean = mean(x))
out_b = dat[f < 6,
          by = .(id1,id2),
          as.list(unlist(lapply(.SD, func_sum))),
          .SDcols = cols_sum]
toc(log=T)

# C: datatable with summarise_all
tic("C")
out_c = dat[f < 6,
          by = .(id1,id2),
          summarise_all(.SD, list(max = max, sum = sum, mean = mean)),
          .SDcols = cols_sum]
toc(log=T)

# time
tic.log(format = TRUE)

Session info:

> devtools::session_info()
Session info ---------------------------------------------------
 setting  value                       
 version  R version 3.6.0 (2019-04-26)
 system   x86_64, mingw32             
 ui       RStudio (1.2.1335)          
 language (EN)                        
 collate  Spanish_Argentina.1252      
 tz       America/Buenos_Aires        
 date     2019-07-18                  

Packages -------------------------------------------------------
 package   * version date       source        
 base      * 3.6.0   2019-04-26 local         
 compiler    3.6.0   2019-04-26 local         
 datasets  * 3.6.0   2019-04-26 local         
 devtools  * 1.13.6  2018-06-27 CRAN (R 3.5.1)
 digest      0.6.18  2018-10-10 CRAN (R 3.5.1)
 graphics  * 3.6.0   2019-04-26 local         
 grDevices * 3.6.0   2019-04-26 local         
 memoise     1.1.0   2017-04-21 CRAN (R 3.5.1)
 methods   * 3.6.0   2019-04-26 local         
 packrat     0.4.9-3 2018-06-01 CRAN (R 3.5.1)
 stats     * 3.6.0   2019-04-26 local         
 tools       3.6.0   2019-04-26 local         
 utils     * 3.6.0   2019-04-26 local         
 withr       2.1.2   2018-03-15 CRAN (R 3.5.0)
> 

@mattdowle
Copy link
Member Author

Thanks @ftvalentini - that's beyond my wildest dreams of reproducibility! Just one thing: can you add the timings that you see too please. That way, when we reproduce it in some weeks time, we know we get the same ballpark timings as you.

@ftvalentini
Copy link

running times:

> tic.log(format = TRUE)
[[1]]
[1] "A: 21.08 sec elapsed"

[[2]]
[1] "B: 5.25 sec elapsed"

[[3]]
[1] "C: 868.97 sec elapsed"

Note: I had to stop execution of C before it ended :/

@jangorecki
Copy link
Member

jangorecki commented Jun 27, 2020

dplyr seems to be deprecating summarise_all so not sure if we still want to have this... we could eventually use their API for same functionality, whatever it is now, but it is a little bit risky as it may eventually get deprecated as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants