Set up rowwise() and colwise() functions for use in .SD #1063
Open
arunsrinivasan opened this issue Mar 4, 2015 · 20 comments
Labels: feature request, top request (One of our most-requested issues)

Comments

@arunsrinivasan (Member)

The idea is to implement all possible functions in C to directly operate on lists. To begin with, maybe:

  • rowmins
  • rowmaxs
  • rowmeans
  • rowsums
  • rowvars
  • rowsds
  • ...

Implementations should exist for both row-wise and column-wise operations on lists/data.tables.

This'll enable us to:

DT[, rowwise(.SD, <functions>)]
DT[, colwise(.SD, <functions>)]

This will:

  • overcome the current limitation of lapply, which makes it tedious to aggregate using multiple functions (see the sketch below).
  • provide both row- and column-wise operations for data.tables, where the most common functions will be implemented and queries optimised to use them automatically.
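
For context, here is the kind of idiom the first bullet refers to (combining several lapply() calls by hand), next to what the proposed call might look like. A minimal sketch; DT and the grouping column g are made up, and colwise() is not implemented:

library(data.table)
DT = data.table(g = c("a", "a", "b"), x = 1:3, y = 4:6)

# today: one lapply() per aggregation function, glued together with c()
DT[, c(lapply(.SD, min), lapply(.SD, max)), by = g]

# proposed equivalent (hypothetical): DT[, colwise(.SD, min, max), by = g]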

arunsrinivasan added this to the v1.9.8 milestone Jul 7, 2015
@izahn commented Jul 7, 2015

I'm not convinced this is needed, especially not col-wise. lapply already works well; I don't think we need to reinvent that wheel. For row-wise we already have apply, rowStats in the fUtilities package, the row* functions in the matrixStats package, etc. I don't think this needs to be reinvented in data.table.

@arunsrinivasan (Member, Author)

@izahn nobody is reinventing anything here. It is a necessity. I'm not sure why you're getting worked up.

  1. rowwise() - apply() doesn't cut it because it converts the input object to a matrix first (illustrated after this list). That's an absolute waste: we want to be able to do things quite efficiently, and almost anything is more efficient than allocating memory just to create a matrix copy of the same data!

  2. rowStats() - I'm not aware of this package; good to know. But if it works on matrices, then it's a no-go as well, because it comes back to (1). And even if it works on data.frames, the issue is that we won't be able to escape eval() on the C side by using functions from other packages. Evaluating a function for each row is costly, and we'd most certainly want to avoid it.

  3. colwise() - data.table has always tried to use base R functions whenever possible. That is the reason why we have not had any such functions implemented until now. But there have been feature requests / questions for functionality like dplyr::summarise_each(), and there's no equivalent for this in base R. No, lapply() and mapply() / Map() don't cut it either. Unless there is a way to apply multiple functions, each to multiple columns, using base R's apply family that looks as clean as a simple lapply(), there seems to be no reason not to implement this functionality.

    Even here, avoiding the eval() cost is a priority. The GForce family of functions in data.table (inspired by dplyr's hybrid evaluation) and the hybrid evaluation functions in dplyr do precisely that. Having our own versions also helps with parallelising at a later point (something that Romain touched upon in his keynote at UseR'15). We'd like to extend it to more common functions so that the implementations are as efficient as possible. This is tied closely to the philosophy of data.table.
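
To illustrate point (1): apply() copies its input into a matrix first, so a single character column forces every value to character (df below is just a toy example).

df = data.frame(id = c("a", "b"), x = c(1.5, 2.5), y = c(3L, 4L))
# as.matrix() is called internally, so each row arrives as a character vector
apply(df, 1, class)  # "character" for both rows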

Out of curiosity, do you use the data.table package directly or through dplyr? If you use it directly, I can't think of a reason why you wouldn't want this feature.. :-O

@izahn commented Jul 7, 2015

@arunsrinivasan I'm not worked up at all, sorry if I did something to give you that impression. As background (and because you asked about dplyr), I was a happy user of plyr, but when dplyr came around I was dismayed by the direction it took of completely re-implementing data manipulation in R. I much prefer the data.table approach of sticking with idioms that are applicable everywhere rather than the walled-garden approach of dplyr. I'm just trying to encourage data.table developers not to go down the dplyr road of creating suites of functions that only work well inside the workflow dictated by the package, and to encourage reuse of existing functions and idioms.

@arunsrinivasan (Member, Author)

@izahn thanks for clarifying. I really did read your previous message differently. Sorry about that.

I agree with you entirely on (not forcing) the walled garden approach. Unfortunately there's not a clean equivalent for applying multiple functions to multiple columns each.. (or I'm at least not aware of it). For example:

dt[, c(lapply(.SD, mean), lapply(.SD, sum)), by=z]

requires good knowledge of base R (which I don't think everyone cares for). And then there's the readability issue that people are quite worried about these days (as opposed to understandability).

I'll try asking on SO or R-help if there's a way using base-R (or let me know if you can think of it). Otherwise, I'm not sure if it's possible to avoid it, since users really require this functionality.

@arunsrinivasan (Member, Author)

Thinking a bit more about this, I think a function like each() (from plyr) might do the job..

dt[, lapply(.SD, colwise(sum, mean)), by=z]

for example. The extra arguments can go after the functions and they'll be passed to all of them.

In this case, the rowwise() reduces to:

dt[, lapply(.SD, rowwise(...))] # Edit: hm.. this isn't quite right, really.

and perhaps these are easy to query optimise internally.
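
A minimal sketch of what such a helper might look like (a simplified stand-in, not plyr's actual each(); dt and z below are made up):

library(data.table)

# illustrative helper: returns a function that applies every supplied
# function to its input and returns the results as a named list
each = function(...) {
  fns = list(...)
  names(fns) = as.character(substitute(list(...)))[-1L]
  function(x, ...) lapply(fns, function(f) f(x, ...))
}

dt = data.table(z = c(1, 1, 2), a = 1:3, b = 4:6)

# sum and mean of every column in .SD, per group
dt[, unlist(lapply(.SD, each(sum, mean)), recursive = FALSE), by = z]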

@izahn commented Jul 7, 2015

Thank you for following up @arunsrinivasan. I like the each idea. I'll follow up with some other ideas when I get back to a computer in the morning.
On Jul 7, 2015 5:06 PM, "Arun" [email protected] wrote:

Thinking a bit more about this, I think, function like each() (from plyr) might do the job..

dt[, lapply(.SD, each(sum, mean)), by=z]

for example. The extra arguments can go after each and that'll be passed to all functions.

In this case, the rowwise() reduces to:

dt[, lapply(unlist(.SD), each(...)), by=1:nrow(dt)]

and perhaps these are easy to query optimise internally.



@ecoRoland

I somewhat agree with @izahn. I think I would prefer being able to use apply syntax, e.g., dt[, apply(.SD, 2, function(x) c(mean(x), sd(x))), by = z] and dt[, apply(.SD, 1, function(x) c(mean(x), sd(x))), by = z]. apply would need to become a generic with a data.table method, which could then be optimized for specific functions. A slight syntax improvement would be dt[, apply(.SD, 2, list(mean, sd)), by = z], which would deviate only slightly from base `apply`. I don't know how difficult this would be to implement, though.
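
A rough sketch of that dispatch idea, purely as an illustration: base::apply is not generic, so a hypothetical wrapper would have to mask it, and only the column-wise branch is filled in here (dt and z are made up).

library(data.table)

apply = function(X, MARGIN, FUN, ...) UseMethod("apply")
apply.default = function(X, MARGIN, FUN, ...) base::apply(X, MARGIN, FUN, ...)

apply.data.table = function(X, MARGIN, FUN, ...) {
  if (identical(as.integer(MARGIN), 2L)) {
    # column-wise: apply each supplied function to each column, no matrix copy
    fns = if (is.list(FUN)) FUN else list(FUN)
    unlist(lapply(X, function(col) lapply(fns, function(f) f(col, ...))),
           recursive = FALSE)
  } else {
    base::apply(X, MARGIN, FUN, ...)  # row-wise branch left unoptimised here
  }
}

dt = data.table(z = c(1, 1, 2), a = 1:3, b = 4:6)
dt[, apply(.SD, 2, list(mean, sd)), by = z]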

@franknarf1 (Contributor)

@ecoRoland apply(.SD,1,function(x) ...) is somewhat limiting, since it implies that all columns are converted to the same class (so you can't have x[1] be character and x[-1] numeric, for example). Even if that were left out of apply.data.table (so that the function could act on a list instead of an atomic vector), I feel like that would be too big a departure from base R.

I'd rather see (more) optimization behind the scenes (as is already done for mean, etc.) and fewer new functions. I agree that a multiple-function version of lapply would be great, but

  • colwise essentially is lapply and
  • rowwise could be triggered by by=.I (see the sketch below).
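
A minimal sketch of that per-row grouping, using the by=1:nrow(dt) idiom that already works today (dt is made up; there is no special optimisation behind this):

library(data.table)
dt = data.table(a = 1:3, b = 4:6, c = 7:9)

# one group per row: a row-wise sum without converting to a matrix first
dt[, .(row_sum = sum(unlist(.SD))), by = 1:nrow(dt)]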

arunsrinivasan removed this from the v1.9.8 milestone Apr 10, 2016
@FSantosCodes

Dear Arun,
These functions in data.table would definitely be a great asset. I used matrixStats to compute medians from a remote-sensing time series, but the conversion to a matrix is computationally demanding and memory costly (especially if you consider parallelization). Moreover, matrixStats does not compute quantiles or interquartile ranges very well, so I had to use another library from the Bioconductor repository called WGCNA. Its function 'rowQuantileC' is quite fast and efficient (it can manage NA values), but again the conversion to a matrix is a pitfall. In my view, these row-wise functions should be programmed in C, since base R functions can't manage this efficiently (i.e. millions of rows and columns, multiplied by the data's several dimensions), which data.table can.

@jangorecki (Member) commented Feb 7, 2019

apply would need to become a generic with a data.table method

this is a matter of substituting the apply call with our internal rowwise.

rowwise could be triggered by by=.I.

What if j does not use multiple columns in a single function call, like j=.(v1=sum(v1), v2=mean(v2))? I know it doesn't make sense for by=.I, but it is still a valid query that should not be optimized to rowwise, whereas one that should be is j=.(v1_v2=sum(v1, v2)).

In terms of API, the simplest but still usable option would be to add a new function, let's say rowapply (we could catch apply(MARGIN=1) and redirect), which would be well optimized for common functions. It would be tricky to make it work for an arbitrary R function, as we don't know whether that function accepts a vector or a list. In the first case all values have to be coerced to the same type and copied into a new vector; in the latter case they could eventually be referenced. But how can we know whether a function expects a list or a vector? lapply doesn't have to deal with different data types.

How should rowwise/rowapply work when no by is specified?

dt[, rowapply(.SD, sum), .SDcols=v1:v2]
dt[, v1+v2]

?

IMO most rowwise questions could be better answered by melt followed by grouping.
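
For illustration, the melt-then-group pattern might look like this (dt and the id column are made up):

library(data.table)
dt = data.table(id = 1:3, a = 1:3, b = 4:6, c = 7:9)

# reshape to long form, then compute per-row statistics grouped by the row id
melt(dt, id.vars = "id")[, .(row_mean = mean(value), row_max = max(value)), by = id]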

@MichaelChirico (Member)

Hey @jangorecki, how easy would it be to just wrap this into the roll functionalities? With window size 0, maybe?

@franknarf1 (Contributor)

Another example for colwise from SO: https://stackoverflow.com/questions/57386580/what-is-the-equivalent-of-mutate-at-dplyr-in-data-table. The OP wants to apply multiple functions to a set of columns and have the results appear in a particular order with a particular naming convention.

jangorecki mentioned this issue May 15, 2020
jangorecki removed the High label Jun 3, 2020
@matthiaskaeding commented Dec 18, 2020

One suggestion from a user: it would be super convenient if one could apply one function to a group of variables and another function to another group, similar to this Stata syntax:

collapse (mean) var1 var2 (sum) var3 var4, by(group)

My suggestion would be to allow the .SDcols argument to take a list, similar to the measure.vars argument in melt.data.table. Maybe this could also work with the patterns function.

I.e., like this:

D[, lapply(.SD, colwise(mean, sum)), .SDcols = .(patterns("x"), patterns("y"))]

@jangorecki (Member) commented Dec 18, 2020

Your proposed syntax diverges from base R syntax too much, IMO.
Wouldn't this do?

x_cols = grep(...)
y_cols = grep(...)
D[, c(lapply(.SD[,x_cols], mean), lapply(.SD[,y_cols], sum)), .SDcols = c(x_cols, y_cols)]

No new magic needed inside data.table, and magic usually comes at the cost of consistency.

@matthiaskaeding commented Dec 18, 2020

Yes, but I thought subsetting columns in .SD is supposed to be avoided, and this might be cumbersome for more than two functions.

However, I don't feel qualified to comment on the consistency issue :0

@jangorecki (Member)

For more functions it can be handled with a helper function (sketched below).
If you are worried about the overhead of subsetting .SD, you can do it like this:

D[, {
  sd = unclass(.SD)
  c(lapply(sd[x_cols], mean), lapply(sd[y_cols], sum))
}, .SDcols = c(x_cols, y_cols)]
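
And the helper function mentioned above might look something like this (agg_groups is a hypothetical name; it assumes x_cols and y_cols are character vectors of column names, e.g. from grep(..., value = TRUE)):

# hypothetical helper: apply a different function to each named group of columns
agg_groups = function(cols, spec) {
  # cols: a plain list of columns; spec: a named list, e.g. list(mean = x_cols, sum = y_cols)
  unlist(lapply(names(spec), function(fn)
    lapply(cols[spec[[fn]]], match.fun(fn))
  ), recursive = FALSE)
}

D[, agg_groups(unclass(.SD), list(mean = x_cols, sum = y_cols)),
  .SDcols = c(x_cols, y_cols)]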


