summarise_all requested #3711
Here is an example. B seems to do the trick; C is by far the slowest solution.
library(magrittr)
library(purrr)
library(data.table)
library(dplyr)
library(tictoc)
library(devtools)
# nrow of data
n = 1e6
# number of columns by type
p_int = 100
p_double = 100
p_char = 30
# create columns
set.seed(100)
# grouping columns
id1 = sample(letters, n, replace = TRUE)
id2 = sample(letters, n, replace = TRUE)
# filter columns
f = sample(12, n, replace = TRUE)
# integer columns
ints = matrix(sample(100L, p_int*n, replace = TRUE), ncol = p_int) %>% as.data.table()
names(ints) = paste0("int",seq_along(ints))
# double columns
doubles = (matrix(sample(100, p_double*n, replace = TRUE), ncol = p_double)/7) %>% as.data.table()
names(doubles) = paste0("double",seq_along(doubles))
# character columns
chars = matrix(sample(letters, p_char*n, replace = TRUE), ncol = p_char) %>% as.data.table()
names(chars) = paste0("char",seq_along(chars))
# bind columns
dat = cbind(id1, id2, f, ints, doubles, chars)
# columns to summarise
cols_sum = map_lgl(dat, is.numeric) %>% which() %>% names()
# summarise with mean,max,sum
tic.clearlog()
# A: dplyr
tic("A")
out_a = as_tibble(dat) %>%
  dplyr::filter(f < 6) %>%
  group_by(id1, id2) %>%
  summarise_at(cols_sum, list(max = max, sum = sum, mean = mean))
toc(log = TRUE)
# B: datatable
tic("B")
# per-column summary: returns a named vector with max, sum and mean
func_sum = function(x) c(max = max(x), sum = sum(x), mean = mean(x))
out_b = dat[f < 6,
            as.list(unlist(lapply(.SD, func_sum))),
            by = .(id1, id2),
            .SDcols = cols_sum]
toc(log = TRUE)
# C: datatable with summarise_all
tic("C")
out_c = dat[f < 6,
            summarise_all(.SD, list(max = max, sum = sum, mean = mean)),
            by = .(id1, id2),
            .SDcols = cols_sum]
toc(log = TRUE)
# time
tic.log(format = TRUE)
Session info:
Thanks @ftvalentini - that's beyond my wildest dreams of reproducibility! Just one thing: can you add the timings that you see too, please? That way, when we reproduce it in a few weeks' time, we'll know whether we get the same ballpark timings as you.
running times:
Note: I had to stop execution of C before it finished :/
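To see what approach B returns without building the full 1e6-row table, here is a minimal sketch on a toy table. The table `toy` and its columns `x` and `y` are made up for illustration; only `func_sum` and the shape of the call come from the benchmark above.

```r
library(data.table)

# Toy data (hypothetical, not part of the benchmark)
toy = data.table(
  id1 = c("a", "a", "b", "b"),
  f   = c(1L, 2L, 3L, 9L),
  x   = c(1, 2, 3, 4),
  y   = c(10L, 20L, 30L, 40L)
)

# Same helper as in approach B: a named vector per column
func_sum = function(x) c(max = max(x), sum = sum(x), mean = mean(x))

# Filter, then summarise every column in .SDcols within each group.
# unlist() flattens the per-column vectors, so the output columns are
# named <column>.<statistic>, e.g. x.max, x.sum, x.mean.
out_toy = toy[f < 6,
              as.list(unlist(lapply(.SD, func_sum))),
              by = id1,
              .SDcols = c("x", "y")]

print(out_toy)
```

One side effect of this pattern: because `c()` combines max, sum and mean into a single vector, the max and sum of integer columns come back as doubles.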
dplyr seems to be deprecating
Users can already use dplyr::summarise_all on .SD. But two people said it would still be nice to see it built in to data.table:
https://twitter.com/Pertplus1/status/1151090599648550912
https://twitter.com/ftvalen/status/1151099177402818561
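For reference, the existing workaround of calling dplyr::summarise_all on .SD might look like the sketch below. The toy table and its column names are hypothetical, and it assumes a dplyr version in which `summarise_all` is still available.

```r
library(data.table)
library(dplyr)

# Toy data (hypothetical)
toy = data.table(
  id1 = c("a", "a", "b"),
  x   = c(1, 2, 3),
  y   = c(10, 20, 30)
)

# summarise_all() is applied to .SD once per group; with a named list of
# functions, dplyr names the results <column>_<function>, e.g. x_max.
out_sd = toy[,
             summarise_all(.SD, list(max = max, mean = mean)),
             by = id1,
             .SDcols = c("x", "y")]

print(out_sd)
```

Since the dplyr verb is invoked separately for every group, its per-call overhead adds up with many groups, which is consistent with C being the slowest variant in the comment above.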
Next step would be to benchmark: https://twitter.com/MattDowle/status/1151959251029389312