
across() needs to ignore grouping vars #173

Closed

hadley opened this issue Jan 30, 2021 · 4 comments

Comments

hadley (Member) commented Jan 30, 2021

library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

mtcars %>% 
  lazy_dt() %>% 
  group_by(cyl) %>% 
  summarise(across(everything(), mean))
#> Source: local data table [3 x 12]
#> Call:   `_DT1`[, .(mpg = mean(mpg), cyl = mean(cyl), disp = mean(disp), 
#>     hp = mean(hp), drat = mean(drat), wt = mean(wt), qsec = mean(qsec), 
#>     vs = mean(vs), am = mean(am), gear = mean(gear), carb = mean(carb)), 
#>     keyby = .(cyl)]
#> Error: Column name `cyl` must not be duplicated.

Created on 2021-01-30 by the reprex package (v0.3.0.9001)
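For clarity, the translation that works here needs to drop the grouping variable from the summarised columns, so that cyl appears only once (via keyby). A plain data.table sketch of the intended result, not the code dtplyr currently generates:

library(data.table)

DT <- as.data.table(mtcars)

# Same summary, but with the grouping variable excluded from j so the
# result has a single cyl column coming from keyby.
DT[, .(mpg = mean(mpg), disp = mean(disp), hp = mean(hp), drat = mean(drat),
       wt = mean(wt), qsec = mean(qsec), vs = mean(vs), am = mean(am),
       gear = mean(gear), carb = mean(carb)),
   keyby = .(cyl)]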

myoung3 commented Jan 31, 2021

Translating to an lapply(.SD, FUN) call would fix this, because .SD doesn't contain the .BY columns. Performance-wise, lapplying over .SD is just as good as building each call manually: data.table internally rewrites lapply(.SD, FUN) into repeated FUN calls, so it is optimized. (Michael's tweet on this was a bit hard to parse and somewhat equivocal; the optimization was missing about a decade ago, and you may still see outdated warnings about it in some places.)

You can see the translation printed with the datatable.verbose option turned on:

library(data.table)
options(datatable.verbose=TRUE)
mtcars <- as.data.table(mtcars)
mtcars[, lapply(.SD,mean),by="cyl"]

#Finding groups using forderv ... forder.c received 32 rows and 1 columns
#0.000s elapsed (0.000s cpu) 
#Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#Getting back original order ... forder.c received a vector type 'integer' length 3
#0.000s elapsed (0.000s cpu) 
#lapply optimization changed j from 'lapply(.SD, mean)' to 'list(mean(mpg), mean(disp), mean(hp), mean(drat), mean(wt), mean(qsec), mean(vs), mean(am), mean(gear), mean(carb))'
#GForce optimized j to 'list(gmean(mpg), gmean(disp), gmean(hp), gmean(drat), gmean(wt), gmean(qsec), gmean(vs), gmean(am), gmean(gear), gmean(carb))'
#Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.000
#gforce assign high and low took 0.020
#This gsum took (narm=FALSE) ... gather took ... 0.000s
#0.000s
#This gsum took (narm=FALSE) ... gather took ... 0.000s
#0.000s
#This gsum took (narm=FALSE) ... gather took ... 0.000s
#0.000s
#This gsum took (narm=FALSE) ... gather took ... 0.000s
#0.000s
#This gsum took (narm=FALSE) ... gather took ... 0.000s
#0.000s
#This gsum took (narm=FALSE) ... gather took ... 0.000s
#0.000s
#This gsum took (narm=FALSE) ... gather took ... 0.000s
#0.000s
#This gsum took (narm=FALSE) ... gather took ... 0.000s
#0.000s
#This gsum took (narm=FALSE) ... gather took ... 0.000s
#0.000s
#This gsum took (narm=FALSE) ... gather took ... 0.000s
#0.000s
#gforce eval took 0.002
#0.022s elapsed (0.204s cpu)

It's true that someone might turn this optimization off via options(datatable.optimize), which would make lapply(.SD) slow, but it's the lowest level of optimization and has been around for a long time. I don't see most users messing with options(datatable.optimize).

See https://www.rdocumentation.org/packages/data.table/versions/1.13.6/topics/datatable.optimize
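For illustration, here is what toggling that option looks like (a sketch; per the documentation linked above, 0 disables all j optimization and Inf is the default):

library(data.table)
dt <- as.data.table(mtcars)

# With optimization fully off, lapply(.SD, mean) is evaluated literally
# for every group (the slow path mentioned above).
options(datatable.optimize = 0L)
dt[, lapply(.SD, mean), by = "cyl"]

# Restore the default so the lapply() rewrite and GForce apply again.
options(datatable.optimize = Inf)
dt[, lapply(.SD, mean), by = "cyl"]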

myoung3 commented Jan 31, 2021

Plus, translating to lapply(.SD, FUN) means dtplyr::show_query() displays the idiomatic data.table approach, rather than code nobody would actually write by hand.
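For example, you can already inspect the generated query without evaluating it (a sketch; the exact printed call depends on the dtplyr version):

library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

# show_query() prints the data.table call dtplyr builds; at the moment this is
# the fully expanded .(disp = mean(disp), ...) form rather than lapply(.SD, mean).
mtcars %>%
  lazy_dt() %>%
  group_by(cyl) %>%
  summarise(across(disp:drat, mean)) %>%
  show_query()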

myoung3 commented Jan 31, 2021

Actually, the current implementation is probably the best approach, since lapply(.SD) doesn't generalize well to an arbitrary number of functions.

This is the closest that multiple functions come to working. It is optimized, which is good, but the column names don't get prefixed with the function names:

library(data.table)
mtcarsdt <- as.data.table(mtcars)
mtcarsdt[, c(lapply(.SD,mean),lapply(.SD,sum)), by="cyl",.SDcols=3:5]
#>    cyl     disp        hp     drat   disp   hp  drat
#> 1:   6 183.3143 122.28571 3.585714 1283.2  856 25.10
#> 2:   4 105.1364  82.63636 4.070909 1156.5  909 44.78
#> 3:   8 353.1000 209.21429 3.229286 4943.4 2929 45.21

Created on 2021-01-30 by the reprex package (v0.3.0)

This could be fixed by following it up with a fiddly setnames() call (roughly like the sketch below), but it's not a good solution.
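Just to spell that out, the follow-up would look something like this (a sketch; the positions and naming scheme are arbitrary choices):

library(data.table)
mtcarsdt <- as.data.table(mtcars)

cols <- names(mtcarsdt)[3:5]   # disp, hp, drat
res <- mtcarsdt[, c(lapply(.SD, mean), lapply(.SD, sum)), by = "cyl", .SDcols = cols]

# Rename the six duplicated value columns (positions 2:7) to disp_mean, ..., drat_sum.
setnames(res, 2:7,
         paste(rep(cols, times = 2), rep(c("mean", "sum"), each = length(cols)), sep = "_"))
res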

In theory the following should produce column names such as mean.disp, etc., but it doesn't even do that because of how the lapply calls get substituted (turn on datatable.verbose to see the substitution):

library(data.table)
mtcarsdt <- as.data.table(mtcars)
mtcarsdt[, c("mean"=lapply(.SD,mean),"sum"=lapply(.SD,sum)), by="cyl",.SDcols=3:5]
#>    cyl     disp        hp     drat   disp   hp  drat
#> 1:   6 183.3143 122.28571 3.585714 1283.2  856 25.10
#> 2:   4 105.1364  82.63636 4.070909 1156.5  909 44.78
#> 3:   8 353.1000 209.21429 3.229286 4943.4 2929 45.21

#how it should work according to base R named concatenation of named lists
c(A=list(a=1:3,b=1:3),B=list(a=1:3,b=1:3))
#> $A.a
#> [1] 1 2 3
#> 
#> $A.b
#> [1] 1 2 3
#> 
#> $B.a
#> [1] 1 2 3
#> 
#> $B.b
#> [1] 1 2 3

Created on 2021-01-30 by the reprex package (v0.3.0)

Maybe someday these rowwise/colwise functions will get implemented...
Rdatatable/data.table#1063

hadley (Member, Author) commented Jan 31, 2021

Yeah, I think the generality of across() will be difficult to simulate using lapply() etc.
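For context, the generality in question includes things like applying several named functions in one call, where across() generates the result names automatically via its .names argument (the default gives names like disp_mean, disp_sum, ...):

library(dplyr, warn.conflicts = FALSE)

# One across() call applying multiple named functions per column;
# the output columns are disp_mean, disp_sum, hp_mean, hp_sum, drat_mean, drat_sum.
mtcars %>%
  group_by(cyl) %>%
  summarise(across(c(disp, hp, drat), list(mean = mean, sum = sum)))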
