Set up rowwise() and colwise() functions for use in .SD #1063
Comments
I'm not convinced this is needed, especially not col-wise.
@izahn nobody is reinventing anything here. It is a necessity. I'm not sure why you're getting worked up.
Out of curiosity, do you use
@arunsrinivasan I'm not worked up at all, sorry if I did something to give you that impression. As background (and because you asked about dplyr), I was a happy user of plyr, but when dplyr came around I was dismayed by the direction it took of completely re-implementing data manipulation in R. I much prefer the data.table approach of sticking with idioms that are applicable everywhere rather than the walled garden approach of dplyr. I'm just trying to encourage data.table developers not to go down the dplyr road of creating suites of functions that only work well inside the workflow dictated by the package, and to encourage reuse of existing functions and idioms.
@izahn thanks for clarifying. I really read your previous message differently. Sorry about that. I agree with you entirely on (not forcing) the walled garden approach. Unfortunately there's not a clean equivalent for applying multiple functions to multiple columns each (or at least I'm not aware of one). For example: dt[, c(lapply(.SD, mean), lapply(.SD, sum)), by=z] requires good knowledge of base R (which I don't think everyone cares for these days). And then there's the readability issue that people are quite worried about these days (as opposed to understandability). I'll try asking on SO or R-help if there's a way using base R (or let me know if you can think of one). Otherwise, I'm not sure it's possible to avoid this, since users really require this functionality.
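For concreteness, the base-R-style idiom quoted above can be run as follows. This is only a minimal sketch on made-up data: the table `dt`, its columns `x`, `y`, `z`, and the `_mean`/`_sum` suffixes are assumptions for illustration. `setNames()` is added so that the two result sets get distinct column names.

```r
library(data.table)

# Toy data, invented for illustration
dt <- data.table(z = c("a", "a", "b"), x = c(1, 3, 5), y = c(2, 4, 6))

# Apply mean and sum to every non-grouping column, per group of z.
# setNames() disambiguates the outputs, which would otherwise both
# be named after the input columns (x and y).
res <- dt[, c(setNames(lapply(.SD, mean), paste0(names(.SD), "_mean")),
              setNames(lapply(.SD, sum),  paste0(names(.SD), "_sum"))),
          by = z]
print(res)
```

The amount of wrapping needed just to name the outputs is arguably part of the readability concern raised in this comment.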
Thinking a bit more about this, I think, function like
dt[, lapply(.SD, colwise(sum, mean)), by=z]
for example. The extra arguments can go after In this case, the
dt[, lapply(.SD, rowwise(...))] # Edit: hm.. this isn't quite right, really.
and perhaps these are easy to query optimise internally.
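The `colwise(sum, mean)` call in this comment is a proposal, not an existing data.table function. One possible reading of it, sketched in plain R under that assumption: a hypothetical helper (called `apply_each` here, a name invented for this example) bundles several functions into one, and `unlist(..., recursive = FALSE)` flattens the per-column results into the flat list that `j` expects.

```r
library(data.table)

# Hypothetical helper (NOT part of data.table): bundle several functions
# into one that returns a named list with one result per function.
apply_each <- function(...) {
  fns <- list(...)
  function(col, ...) lapply(fns, function(f) f(col, ...))
}

# Toy data, invented for illustration
dt <- data.table(z = c(1, 1, 2), x = c(1, 2, 3), y = c(4, 5, 6))

# lapply() yields one named list per column; unlist(recursive = FALSE)
# flattens them into columns named <column>.<function>.
res <- dt[, unlist(lapply(.SD, apply_each(sum = sum, mean = mean)),
                   recursive = FALSE),
          by = z]
print(res)
```

Extra arguments such as `na.rm = TRUE` could be passed through the inner `...`, which may be what "the extra arguments can go after" refers to, though the original wording is incomplete.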
Thank you for following up @arunsrinivasan. I like the
I somewhat agree with @izahn. I think I would prefer if you could use
@ecoRoland I'd rather see (more) optimization behind the scenes (as is already done for
Dear Arun,
this is a matter of substituting
what if In terms of API the simplest but still usable would be to add a new function, let's say How rowwise/rowapply should work when no
dt[, rowapply(.SD, sum), .SDcols=v1:v2]
dt[, v1+v2] ? IMO most of
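`rowapply` above is a proposed name, not an existing function. As a rough sketch of what the comparison is getting at, a row-wise sum over selected columns can already be written with existing tools. The table and the column names `id`, `v1`, `v2` below are made up for illustration.

```r
library(data.table)

# Toy data, invented for illustration
dt <- data.table(id = 1:3, v1 = c(1, 2, 3), v2 = c(10, 20, 30))
cols <- c("v1", "v2")

# Direct, vectorised expression
direct <- dt[, v1 + v2]

# Row-wise sum over the columns selected through .SDcols
via_rowsums <- dt[, rowSums(.SD), .SDcols = cols]

# Same result by folding `+` over the columns of .SD
via_reduce <- dt[, Reduce(`+`, .SD), .SDcols = cols]
```

For functions that are already vectorised across columns (`+`, `pmax`, `pmin`), the `Reduce()` or `do.call()` forms avoid the matrix conversion that `rowSums()` performs.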
Hey @jangorecki how easy would it be to just wrap this into
Another example for colwise from SO: https://stackoverflow.com/questions/57386580/what-is-the-equivalent-of-mutate-at-dplyr-in-data-table OP wants to apply multiple functions to a set of columns and have the results appear in a particular order with a particular naming convention
One suggestion from a user: It would be super convenient if one could apply one function to a group of variables and another function to another group, similar to this Stata syntax:
collapse (mean) var1 var2 (sum) var3 var4, by(group)
My suggestion would be to allow the .SDcols argument to take a list, similar to the measure.vars argument in melt.data.table. Maybe this could also work with the patterns function, i.e., like this:
D[, lapply(.SD, colwise(mean, sum)), .SDcols = .(patterns("x"), patterns("y"))]
Your proposed syntax diverges from base R syntax too much IMO.
x_cols = grep(...)
y_cols = grep(...)
D[, c(lapply(.SD[,x_cols], mean), lapply(.SD[,y_cols], sum)), .SDcols = c(x_cols, y_cols)]
No new magic needed inside DT, and magic usually is at the cost of consistency.
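A runnable version of this suggestion might look like the sketch below. The data, the `"^x"`/`"^y"` patterns, and the `group` column are assumptions made for the example. Note one adjustment: in current data.table versions, a character vector held in a variable needs `with = FALSE` (or the `..x_cols` prefix) to be treated as column names inside `.SD[, ...]`, and `grep(..., value = TRUE)` is used so the selection is by name rather than by position.

```r
library(data.table)

# Toy data, invented for illustration
D <- data.table(
  group = c("a", "a", "b", "b"),
  x1 = c(1, 3, 5, 7),
  x2 = c(2, 4, 6, 8),
  y1 = c(10, 20, 30, 40)
)

# Select columns by name so they remain valid inside .SD
x_cols <- grep("^x", names(D), value = TRUE)
y_cols <- grep("^y", names(D), value = TRUE)

# mean over the x columns, sum over the y columns, per group
res <- D[, c(lapply(.SD[, x_cols, with = FALSE], mean),
             lapply(.SD[, y_cols, with = FALSE], sum)),
         by = group, .SDcols = c(x_cols, y_cols)]
print(res)
```

This mirrors the Stata `collapse (mean) ... (sum) ..., by(group)` request using only existing syntax, at the cost of subsetting `.SD` once per function and per group.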
Yes, but I thought subsetting columns in .SD is supposed to be avoided, and this might be cumbersome for more than 2 functions. However, I don't feel qualified to comment on the consistency issue :0
For more functions, this can be done with a helper function.
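The helper itself is not shown in the thread, so the following is only one way such a helper could look, not the commenter's actual code. The name `agg_by_spec`, the `spec` format (function name mapped to column names), and the `<column>_<function>` naming scheme are all hypothetical.

```r
library(data.table)

# Hypothetical helper (not a data.table function): `spec` maps a function
# name to the columns that function should be applied to.
agg_by_spec <- function(sd, spec) {
  out <- lapply(names(spec), function(fname) {
    f <- match.fun(fname)
    cols <- spec[[fname]]
    # One result per column, named <column>_<function>
    setNames(lapply(cols, function(col) f(sd[[col]])),
             paste(cols, fname, sep = "_"))
  })
  do.call(c, out)  # concatenate the per-function lists into one flat list
}

# Toy data, invented for illustration
D <- data.table(
  group = c("a", "a", "b", "b"),
  x1 = c(1, 3, 5, 7), x2 = c(2, 4, 6, 8),
  y1 = c(10, 20, 30, 40), y2 = c(1, 1, 2, 2)
)

spec <- list(mean = c("x1", "x2"), sum = c("y1", "y2"), max = "x1")
sd_cols <- unique(unlist(spec))

res <- D[, agg_by_spec(.SD, spec), by = group, .SDcols = sd_cols]
print(res)
```

Because the helper reads columns with `sd[[col]]`, it avoids repeated `.SD[, ...]` subsetting, and adding a third or fourth function only means adding an entry to `spec`.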
The idea is to implement all possible functions in C to directly operate on lists. To begin with, maybe:
Implementations should be for both row- and column-wise for lists/data.tables.
This'll enable us to:
This will:
lapply
which makes it tedious to aggregate using multiple functions.