Set up rowwise() and colwise() functions for use in .SD #1063
Open
arunsrinivasan opened this issue Mar 4, 2015 · 20 comments
Labels: feature request, top request (One of our most-requested issues)

Comments

@arunsrinivasan (Member)

The idea is to implement all possible functions in C to directly operate on lists. To begin with, maybe:

  • rowmins
  • rowmaxs
  • rowmeans
  • rowsums
  • rowvars
  • rowsds
  • ...

Implementations should exist for both row-wise and column-wise operations on lists/data.tables.

This'll enable us to:

DT[, rowwise(.SD, <functions>)]
DT[, colwise(.SD, <functions>)]

This will:

  • overcome the current limitation of lapply, which makes it tedious to aggregate using multiple functions (see the sketch below).
  • provide both row- and column-wise operations for data.tables, where the most common functions will be implemented and queries optimised to use them automatically.
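
For context, here is the kind of idiom the first bullet refers to (combining several lapply() calls by hand), next to what the proposed call might look like. A minimal sketch; DT and the grouping column g are made up, and colwise() is not implemented:

library(data.table)
DT = data.table(g = c("a", "a", "b"), x = 1:3, y = 4:6)

# today: one lapply() per aggregation function, glued together with c()
DT[, c(lapply(.SD, min), lapply(.SD, max)), by = g]

# proposed equivalent (hypothetical): DT[, colwise(.SD, min, max), by = g]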

arunsrinivasan added this to the v1.9.8 milestone Jul 7, 2015
@izahn commented Jul 7, 2015

I'm not convinced this is needed, especially not col-wise. lapply already works well; I don't think we need to reinvent that wheel. For row-wise we already have apply, rowStats in the fUtilities package, the row* functions in the matrixStats package, etc. I don't think this needs to be reinvented in data.table.

@arunsrinivasan (Member, Author)

@izahn nobody is reinventing anything here. It is a necessity. I'm not sure why you're getting worked up.

  1. rowwise() - apply() doesn't cut it because it converts the input object to a matrix first (illustrated after this list). That's an absolute waste: we want to be able to do things quite efficiently, and almost anything is more efficient than allocating memory just to create a matrix copy of the same data!

  2. rowStats() - I'm not aware of this package; good to know. But if it works on matrices, then it's a no-go as well, because it comes back to (1). And even if it works on data.frames, the issue is that we won't be able to escape eval() on the C side by using functions from other packages. Evaluating a function for each row is costly, and we'd most certainly want to avoid it.

  3. colwise() - data.table has always tried to use base R functions whenever possible. That is the reason why we have not had any such functions implemented until now. But there have been feature requests / questions for functionality like dplyr::summarise_each(), and there's no equivalent for this in base R. No, lapply() and mapply() / Map() don't cut it either. Unless there is a way to apply multiple functions, each to multiple columns, using base R's apply family that looks as clean as a simple lapply(), there seems to be no reason not to implement this functionality.

    Even here, avoiding the eval() cost is a priority. The GForce family of functions in data.table (inspired by dplyr's hybrid evaluation) and the hybrid evaluation functions in dplyr do precisely that. Having our own versions also helps with parallelising at a later point (something that Romain touched upon in his keynote at UseR'15). We'd like to extend it to more common functions so that the implementations are as efficient as possible. This is tied closely to the philosophy of data.table.
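
To illustrate point (1): apply() copies its input into a matrix first, so a single character column forces every value to character (df below is just a toy example).

df = data.frame(id = c("a", "b"), x = c(1.5, 2.5), y = c(3L, 4L))
# as.matrix() is called internally, so each row arrives as a character vector
apply(df, 1, class)  # "character" for both rows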

Out of curiosity, do you use the data.table package directly or through dplyr? If you use it directly, I can't think of a reason why you wouldn't want this feature.. :-O

@izahn commented Jul 7, 2015

@arunsrinivasan I'm not worked up at all, sorry if I did something to give you that impression. As background (and because you asked about dplyr), I was a happy user of plyr, but when dplyr came around I was dismayed by the direction it took of completely re-implementing data manipulation in R. I much prefer the data.table approach of sticking with idioms that are applicable everywhere rather than the walled-garden approach of dplyr. I'm just trying to encourage data.table developers not to go down the dplyr road of creating suites of functions that only work well inside the workflow dictated by the package, and to encourage reuse of existing functions and idioms.

@arunsrinivasan (Member, Author)

@izahn thanks for clarifying. I really did read your previous message differently. Sorry about that.

I agree with you entirely on (not forcing) the walled garden approach. Unfortunately there's not a clean equivalent for applying multiple functions to multiple columns each.. (or I'm at least not aware of it). For example:

dt[, c(lapply(.SD, mean), lapply(.SD, sum)), by=z]

requires good knowledge of base R (which I don't think everyone cares for). And then there's the readability issue that people are quite worried about these days (as opposed to understandability).

I'll try asking on SO or R-help if there's a way using base-R (or let me know if you can think of it). Otherwise, I'm not sure if it's possible to avoid it, since users really require this functionality.

@arunsrinivasan (Member, Author)

Thinking a bit more about this, I think a function like each() (from plyr) might do the job..

dt[, lapply(.SD, colwise(sum, mean)), by=z]

for example. The extra arguments can go after the functions and they'll be passed to all of them.

In this case, the rowwise() reduces to:

dt[, lapply(.SD, rowwise(...))] # Edit: hm.. this isn't quite right, really.

and perhaps these are easy to query optimise internally.
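
A minimal sketch of what such a helper might look like (a simplified stand-in, not plyr's actual each(); dt and z below are made up):

library(data.table)

# illustrative helper: returns a function that applies every supplied
# function to its input and returns the results as a named list
each = function(...) {
  fns = list(...)
  names(fns) = as.character(substitute(list(...)))[-1L]
  function(x, ...) lapply(fns, function(f) f(x, ...))
}

dt = data.table(z = c(1, 1, 2), a = 1:3, b = 4:6)

# sum and mean of every column in .SD, per group
dt[, unlist(lapply(.SD, each(sum, mean)), recursive = FALSE), by = z]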

@izahn commented Jul 7, 2015

Thank you for following up @arunsrinivasan. I like the each idea. I'll follow up with some other ideas when I get back to a computer in the morning.
On Jul 7, 2015 5:06 PM, "Arun" [email protected] wrote:

Thinking a bit more about this, I think, function like each() (from plyr) might do the job..

dt[, lapply(.SD, each(sum, mean)), by=z]

for example. The extra arguments can go after each and that'll be passed to all functions.

In this case, the rowwise() reduces to:

dt[, lapply(unlist(.SD), each(...)), by=1:nrow(dt)]

and perhaps these are easy to query optimise internally.



@ecoRoland

I somewhat agree with @izahn. I think I would prefer being able to use apply syntax, e.g., dt[, apply(.SD, 2, function(x) c(mean(x), sd(x))), by = z] and dt[, apply(.SD, 1, function(x) c(mean(x), sd(x))), by = z]. apply would need to become a generic with a data.table method, which could then be optimized for specific functions. A slight syntax improvement would be dt[, apply(.SD, 2, list(mean, sd)), by = z], which would deviate only slightly from base `apply`. I don't know how difficult this would be to implement, though.
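
A rough sketch of that dispatch idea, purely as an illustration: base::apply is not generic, so a hypothetical wrapper would have to mask it, and only the column-wise branch is filled in here (dt and z are made up).

library(data.table)

apply = function(X, MARGIN, FUN, ...) UseMethod("apply")
apply.default = function(X, MARGIN, FUN, ...) base::apply(X, MARGIN, FUN, ...)

apply.data.table = function(X, MARGIN, FUN, ...) {
  if (identical(as.integer(MARGIN), 2L)) {
    # column-wise: apply each supplied function to each column, no matrix copy
    fns = if (is.list(FUN)) FUN else list(FUN)
    unlist(lapply(X, function(col) lapply(fns, function(f) f(col, ...))),
           recursive = FALSE)
  } else {
    base::apply(X, MARGIN, FUN, ...)  # row-wise branch left unoptimised here
  }
}

dt = data.table(z = c(1, 1, 2), a = 1:3, b = 4:6)
dt[, apply(.SD, 2, list(mean, sd)), by = z]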

@franknarf1 (Contributor)

@ecoRoland apply(.SD,1,function(x) ...) is somewhat limiting, since it implies that all columns are converted to the same class (so you can't have x[1] be character and x[-1] numeric, for example). Even if that were left out of apply.data.table (so that the function could act on a list instead of an atomic vector), I feel like that would be too big a departure from base R.

I'd rather see (more) optimization behind the scenes (as is already done for mean, etc.) and fewer new functions. I agree that a multiple-function version of lapply would be great, but

  • colwise essentially is lapply and
  • rowwise could be triggered by by=.I (see the sketch below).
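
A minimal sketch of that per-row grouping, using the by=1:nrow(dt) idiom that already works today (dt is made up; there is no special optimisation behind this):

library(data.table)
dt = data.table(a = 1:3, b = 4:6, c = 7:9)

# one group per row: a row-wise sum without converting to a matrix first
dt[, .(row_sum = sum(unlist(.SD))), by = 1:nrow(dt)]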

arunsrinivasan removed this from the v1.9.8 milestone Apr 10, 2016
@FSantosCodes

Dear Arun,
These functions in data.table would definitely be a great asset. I used matrixStats to compute medians from a remote-sensing time series, but the conversion to a matrix is computationally demanding and memory costly (especially if you consider parallelization). Moreover, matrixStats does not compute quantiles or interquartile ranges very well, so I had to use another library from the Bioconductor repository called WGCNA. Its function 'rowQuantileC' is quite fast and efficient (it can manage NA values), but again the conversion to a matrix is a pitfall. In my view, these row-wise functions should be programmed in C, since base R functions can't manage this efficiently (i.e. millions of rows and columns, multiplied by the data's several dimensions), which data.table can.

@jangorecki (Member) commented Feb 7, 2019

apply would need to become a generic with a data.table method

this is a matter of substituting the apply call with our internal rowwise.

rowwise could be triggered by by=.I.

What if j does not use multiple columns in a single function call, like j=.(v1=sum(v1), v2=mean(v2))? I know it doesn't make sense for by=.I, but it is still a valid query that should not be optimized to rowwise, whereas one that should be is j=.(v1_v2=sum(v1, v2)).

In terms of API, the simplest but still usable option would be to add a new function, let's say rowapply (we could catch apply(MARGIN=1) and redirect), which would be well optimized for common functions. It would be tricky to make it work for an arbitrary R function, as we don't know whether that function accepts a vector or a list. In the first case all values have to be coerced to the same type and copied into a new vector; in the latter case they could eventually be referenced. But how can we know whether a function expects a list or a vector? lapply doesn't have to deal with different data types.

How should rowwise/rowapply work when no by is specified?

dt[, rowapply(.SD, sum), .SDcols=v1:v2]
dt[, v1+v2]

?

IMO most rowwise questions could be better answered by melt followed by grouping.
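
For illustration, the melt-then-group pattern might look like this (dt and the id column are made up):

library(data.table)
dt = data.table(id = 1:3, a = 1:3, b = 4:6, c = 7:9)

# reshape to long form, then compute per-row statistics grouped by the row id
melt(dt, id.vars = "id")[, .(row_mean = mean(value), row_max = max(value)), by = id]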

@MichaelChirico (Member)

Hey @jangorecki, how easy would it be to just wrap this into the roll functionalities? With window size 0, maybe?

@franknarf1 (Contributor)

Another example for colwise from SO: https://stackoverflow.com/questions/57386580/what-is-the-equivalent-of-mutate-at-dplyr-in-data-table. The OP wants to apply multiple functions to a set of columns and have the results appear in a particular order with a particular naming convention.

jangorecki mentioned this issue May 15, 2020
jangorecki removed the High label Jun 3, 2020
@matthiaskaeding commented Dec 18, 2020

One suggestion from a user: it would be super convenient if one could apply one function to a group of variables and another function to another group, similar to this Stata syntax:

collapse (mean) var1 var2 (sum) var3 var4, by(group)

My suggestion would be to allow the .SDcols argument to take a list, similar to the measure.vars argument in melt.data.table. Maybe this could also work with the patterns function.

I.e., like this:

D[, lapply(.SD, colwise(mean, sum)), .SDcols = .(patterns("x"), patterns("y"))]

@jangorecki (Member) commented Dec 18, 2020

Your proposed syntax diverges from base R syntax too much, IMO.
Wouldn't this do?

x_cols = grep(...)
y_cols = grep(...)
D[, c(lapply(.SD[,x_cols], mean), lapply(.SD[,y_cols], sum)), .SDcols = c(x_cols, y_cols)]

No new magic needed inside data.table, and magic usually comes at the cost of consistency.

@matthiaskaeding commented Dec 18, 2020

Yes, but I thought subsetting columns in .SD is supposed to be avoided, and this might be cumbersome for more than two functions.

However, I don't feel qualified to comment on the consistency issue :0

@jangorecki (Member)

For more functions it can be handled with a helper function (sketched below).
If you are worried about the overhead of subsetting .SD, you can do it like this:

D[, {
  sd = unclass(.SD)
  c(lapply(sd[x_cols], mean), lapply(sd[y_cols], sum))
}, .SDcols = c(x_cols, y_cols)]
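
And the helper function mentioned above might look something like this (agg_groups is a hypothetical name; it assumes x_cols and y_cols are character vectors of column names, e.g. from grep(..., value = TRUE)):

# hypothetical helper: apply a different function to each named group of columns
agg_groups = function(cols, spec) {
  # cols: a plain list of columns; spec: a named list, e.g. list(mean = x_cols, sum = y_cols)
  unlist(lapply(names(spec), function(fn)
    lapply(cols[spec[[fn]]], match.fun(fn))
  ), recursive = FALSE)
}

D[, agg_groups(unclass(.SD), list(mean = x_cols, sum = y_cols)),
  .SDcols = c(x_cols, y_cols)]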


