
across() needs to ignore grouping vars #173

Closed

hadley opened this issue Jan 30, 2021 · 4 comments

Comments

hadley (Member) commented Jan 30, 2021

library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

mtcars %>% 
  lazy_dt() %>% 
  group_by(cyl) %>% 
  summarise(across(everything(), mean))
#> Source: local data table [3 x 12]
#> Call:   `_DT1`[, .(mpg = mean(mpg), cyl = mean(cyl), disp = mean(disp), 
#>     hp = mean(hp), drat = mean(drat), wt = mean(wt), qsec = mean(qsec), 
#>     vs = mean(vs), am = mean(am), gear = mean(gear), carb = mean(carb)), 
#>     keyby = .(cyl)]
#> Error: Column name `cyl` must not be duplicated.

Created on 2021-01-30 by the reprex package (v0.3.0.9001)
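For clarity, the translation that works here needs to drop the grouping variable from the summarised columns, so that cyl appears only once (via keyby). A plain data.table sketch of the intended result, not the code dtplyr currently generates:

library(data.table)

DT <- as.data.table(mtcars)

# Same summary, but with the grouping variable excluded from j so the
# result has a single cyl column coming from keyby.
DT[, .(mpg = mean(mpg), disp = mean(disp), hp = mean(hp), drat = mean(drat),
       wt = mean(wt), qsec = mean(qsec), vs = mean(vs), am = mean(am),
       gear = mean(gear), carb = mean(carb)),
   keyby = .(cyl)]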

myoung3 commented Jan 31, 2021

Translating to an lapply(.SD, FUN) call would fix this, because .SD doesn't contain the .BY columns. Performance-wise, lapplying over .SD is just as good as building each call manually: data.table internally rewrites lapply(.SD, FUN) into repeated FUN calls, so it is optimized. (Michael's tweet on this was a bit hard to parse and somewhat equivocal; the optimization was missing about a decade ago, and you may still see outdated warnings about it in some places.)

You can see the translation printed with the datatable.verbose option turned on:

library(data.table)
options(datatable.verbose=TRUE)
mtcars <- as.data.table(mtcars)
mtcars[, lapply(.SD,mean),by="cyl"]

#Finding groups using forderv ... forder.c received 32 rows and 1 columns
#0.000s elapsed (0.000s cpu) 
#Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#Getting back original order ... forder.c received a vector type 'integer' length 3
#0.000s elapsed (0.000s cpu) 
#lapply optimization changed j from 'lapply(.SD, mean)' to 'list(mean(mpg), mean(disp), mean(hp), mean(drat), mean(wt), mean(qsec), mean(vs), mean(am), mean(gear), mean(carb))'
#GForce optimized j to 'list(gmean(mpg), gmean(disp), gmean(hp), gmean(drat), gmean(wt), gmean(qsec), gmean(vs), gmean(am), gmean(gear), gmean(carb))'
#Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.000
#gforce assign high and low took 0.020
#This gsum took (narm=FALSE) ... gather took ... 0.000s
#0.000s
#This gsum took (narm=FALSE) ... gather took ... 0.000s
#0.000s
#This gsum took (narm=FALSE) ... gather took ... 0.000s
#0.000s
#This gsum took (narm=FALSE) ... gather took ... 0.000s
#0.000s
#This gsum took (narm=FALSE) ... gather took ... 0.000s
#0.000s
#This gsum took (narm=FALSE) ... gather took ... 0.000s
#0.000s
#This gsum took (narm=FALSE) ... gather took ... 0.000s
#0.000s
#This gsum took (narm=FALSE) ... gather took ... 0.000s
#0.000s
#This gsum took (narm=FALSE) ... gather took ... 0.000s
#0.000s
#This gsum took (narm=FALSE) ... gather took ... 0.000s
#0.000s
#gforce eval took 0.002
#0.022s elapsed (0.204s cpu)

It's true that someone might turn this optimization off via options(datatable.optimize), which would make lapply(.SD) slow, but it's the lowest level of optimization and has been around for a long time. I don't see most users messing with options(datatable.optimize).

See https://www.rdocumentation.org/packages/data.table/versions/1.13.6/topics/datatable.optimize
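For illustration, here is what toggling that option looks like (a sketch; per the documentation linked above, 0 disables all j optimization and Inf is the default):

library(data.table)
dt <- as.data.table(mtcars)

# With optimization fully off, lapply(.SD, mean) is evaluated literally
# for every group (the slow path mentioned above).
options(datatable.optimize = 0L)
dt[, lapply(.SD, mean), by = "cyl"]

# Restore the default so the lapply() rewrite and GForce apply again.
options(datatable.optimize = Inf)
dt[, lapply(.SD, mean), by = "cyl"]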

myoung3 commented Jan 31, 2021

Plus, translating to lapply(.SD, FUN) means dtplyr::show_query() displays the idiomatic data.table approach, rather than code nobody would actually write by hand.
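For example, you can already inspect the generated query without evaluating it (a sketch; the exact printed call depends on the dtplyr version):

library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

# show_query() prints the data.table call dtplyr builds; at the moment this is
# the fully expanded .(disp = mean(disp), ...) form rather than lapply(.SD, mean).
mtcars %>%
  lazy_dt() %>%
  group_by(cyl) %>%
  summarise(across(disp:drat, mean)) %>%
  show_query()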

myoung3 commented Jan 31, 2021

Actually, the current implementation is probably the best approach, since lapply(.SD) doesn't generalize well to an arbitrary number of functions.

This is the closest that multiple functions come to working. It is optimized, which is good, but the column names don't get prefixed with the function names:

library(data.table)
mtcarsdt <- as.data.table(mtcars)
mtcarsdt[, c(lapply(.SD,mean),lapply(.SD,sum)), by="cyl",.SDcols=3:5]
#>    cyl     disp        hp     drat   disp   hp  drat
#> 1:   6 183.3143 122.28571 3.585714 1283.2  856 25.10
#> 2:   4 105.1364  82.63636 4.070909 1156.5  909 44.78
#> 3:   8 353.1000 209.21429 3.229286 4943.4 2929 45.21

Created on 2021-01-30 by the reprex package (v0.3.0)

This could be fixed by following it up with a fiddly setnames() call (roughly like the sketch below), but it's not a good solution.
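Just to spell that out, the follow-up would look something like this (a sketch; the positions and naming scheme are arbitrary choices):

library(data.table)
mtcarsdt <- as.data.table(mtcars)

cols <- names(mtcarsdt)[3:5]   # disp, hp, drat
res <- mtcarsdt[, c(lapply(.SD, mean), lapply(.SD, sum)), by = "cyl", .SDcols = cols]

# Rename the six duplicated value columns (positions 2:7) to disp_mean, ..., drat_sum.
setnames(res, 2:7,
         paste(rep(cols, times = 2), rep(c("mean", "sum"), each = length(cols)), sep = "_"))
res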

In theory the following should produce column names such as mean.disp, etc., but it doesn't even do that because of how the lapply calls get substituted (turn on datatable.verbose to see the substitution):

library(data.table)
mtcarsdt <- as.data.table(mtcars)
mtcarsdt[, c("mean"=lapply(.SD,mean),"sum"=lapply(.SD,sum)), by="cyl",.SDcols=3:5]
#>    cyl     disp        hp     drat   disp   hp  drat
#> 1:   6 183.3143 122.28571 3.585714 1283.2  856 25.10
#> 2:   4 105.1364  82.63636 4.070909 1156.5  909 44.78
#> 3:   8 353.1000 209.21429 3.229286 4943.4 2929 45.21

#how it should work according to base R named concatenation of named lists
c(A=list(a=1:3,b=1:3),B=list(a=1:3,b=1:3))
#> $A.a
#> [1] 1 2 3
#> 
#> $A.b
#> [1] 1 2 3
#> 
#> $B.a
#> [1] 1 2 3
#> 
#> $B.b
#> [1] 1 2 3

Created on 2021-01-30 by the reprex package (v0.3.0)

Maybe someday these rowwise/colwise functions will get implemented...
Rdatatable/data.table#1063

hadley (Member, Author) commented Jan 31, 2021

Yeah, I think the generality of across() will be difficult to simulate using lapply() etc.
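For context, the generality in question includes things like applying several named functions in one call, where across() generates the result names automatically via its .names argument (the default gives names like disp_mean, disp_sum, ...):

library(dplyr, warn.conflicts = FALSE)

# One across() call applying multiple named functions per column;
# the output columns are disp_mean, disp_sum, hp_mean, hp_sum, drat_mean, drat_sum.
mtcars %>%
  group_by(cyl) %>%
  summarise(across(c(disp, hp, drat), list(mean = mean, sum = sum)))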
