Update comparisons with data.table info #2725

eloualiche · 2021-04-17T17:15:09Z

Update to #2388, with comparison with data.table syntax. Focus on in place changes (!)

docs/src/man/comparisons.md

bkamins · 2021-04-17T17:46:19Z

Thank you!

Soem adjustments for data.table

docs/src/man/comparisons.md

fix table spacing and response to general comments.

bkamins · 2021-04-18T07:25:20Z

@eloualiche + @grantmcdermott : as a general comment my preference would be to use the syntax that data.table users would find natural (I guess this follows your recommendations, but I just wanted to highlight this). This section is intended to help users who know data.tables understand how to write an equivalent code in DataFrames.jl. Thank you for working on this!

Semijoin. Also added snippet of R code for creating the relevant datasets (following panda and dplyr)

docs/src/man/comparisons.md

Spaces. Spaces. You are telling me about spaces? Spaces? Spaces!

docs/src/man/comparisons.md

More spaces!

pstorozenko · 2021-04-20T17:04:36Z

Have you considered writing e.g. .(m = mean(x)) instead of list(m = mean(x)) in data.table's j and by?
Even though .() is an alias for list(), I find it more how I'd write it and that's how they use it in tutorial.
Just an idea :)

eloualiche · 2021-04-20T19:10:01Z

Have you considered writing e.g. .(m = mean(x)) instead of list(m = mean(x)) in data.table's j and by?
Even though .() is an alias for list(), I find it more how I'd write it and that's how they use it in tutorial.
Just an idea :)

We discussed this above. I just want to avoid aliases. This is not a guide for people wanting to switch to data.table, but rather targeted at people knowledgeable with data.table who would like to also use to DataFrames.
I personally never use list, but I also find the dot syntax confusing when I am not the one writing it.

Co-authored-by: Milan Bouchet-Valat <[email protected]>

eloualiche · 2021-04-27T12:37:21Z

Is there still something in the way of merging this? lmk.

docs/src/man/comparisons.md

eloualiche · 2021-04-27T16:09:56Z

OK - looks good, so let us wait for data.table experts to comment if they have anything to add. Thank you!

Maybe the main developers of data.table will like chime in.

cc: @mattdowle @arunsrinivasan @jangorecki @MichaelChirico

mattdowle

Many thanks for pinging us. LGTM. Just a few suggested changes in this review.

Aside: on spacing, I tend to think that operators like = already visually separate alphanumeric characters so adding space around operators isn't necessary. In fact, to my eyes at least, the spaces lose some utility when used too much (i.e. in the way that is popular these days). I increase the utility of spaces by placing them between logical parts, and not always following a single rule. How I use spaces depends on the particular line of code, but I'm trying to group together parts. And sometimes vertically aligning common parts of two lines so you can see the differences between the two lines. So I've removed some spaces in the suggestions in this review, but the same thinking could be applied to the other queries too.
For example df[, x:=NULL] uses one space to separate the filter part (before the comma) from the doing part (the j). There isn't any need for spaces around := here because the non-alphanumeric characters in := are already are distinguished from x and NULL. I think the whole spaces around operators thing came about because "use <-" was promoted, but I really prefer =.
Anyway, the spaces thinking is why I prefer .() to list(). I saw your comments about list and .() so I'm adding a bit more background. list is a set of 4 alphabetic characters which my brain has to process and realize that it's just there to permit a set of things; i.e., it's not a function or anything import. Whereas .() is more like a space that groups things together. When list() is used in all three parts inside [...] then that's 9 extra characters vs using .(). Those extra characters are alphabetic characters too, which make it slightly harder for the actual functions to jump out at the reader.
I made the changes in this review from list() to .() in these suggestions where there is another suggestion too, but all the list() could be changed to .().

Thanks again for pinging us! Much appreciated.

docs/src/man/comparisons.md

mattdowle · 2021-04-28T18:11:51Z

docs/src/man/comparisons.md

+| Transform several columns         | `df[, list(max(x), min(y)) ]`                                                              | `combine(df, :x => maximum, :y => minimum)`                                 |
+|                                   | `df[, lapply(.SD, mean), .SDcols = c("x", "y") ]`                                          | `combine(df, [:x, :y] .=> mean)`                                            |
+|                                   | `df[, lapply(.SD, mean), .SDcols = patterns("x*") ]`                                       | `combine(df, names(df, r"^x") .=> mean)`                                    |
+|                                   | `df[, unlist(lapply(.SD, function(x) c(max=max(x), min=min(x)))), .SDcols = c("x", "y") ]` | `combine(df, ([:x, :y] .=> [maximum minimum])...)`                          |


Looks like we have some work to do on this one. I can't think of an easier way right now. There may be an outstanding issue or pull request, maybe @jangorecki @MichaelChirico recall. I never wanted to encourage wide data, so my focus was on long. But I know people like to go wide like this, perhaps for presenting results in a paper or web page, so this task should be easier.

It's cheating a bit — or maybe not — but I'd probably use dcast here.

dcast(df, .~., fun=list(min, max), value.var = c('x', 'y'))

The advantage of this approach is that it also scales well to cases where you want to collapse by group. I think the 'unlist' approach would struggle here.

dcast(df, grp~., fun=list(min, max), value.var = c('x', 'y'))

Mind you, grouping is something that the DataFrames.jl implementation automatically supports (and, to @mattdowle's point, might be conceptually simpler than my dcast workflow).

combine(groupby(df, :grp), ([:x, :y] .=> [minimum maximum])...)

I would do df[, c(lapply(.SD, max), lapply(.SD, min)), .SDcols = c("x", "y")]. That should GForce as well where the unlist one will not.

I guess it doesn't use GForce, and also, it results in duplicate names! ouch

mattdowle · 2021-04-28T18:13:15Z

docs/src/man/comparisons.md

+| Multivariate function             | `df[, list(cor(x,y)) ]`                                                                    | `transform(df, [:x, :y] => cor)`                                            |
+| Row-wise                          | `df[, min_xy := min(x, y), by = 1:nrow(df)]`                                               | `transform(df, [:x, :y] => ByRow(min))`                                     |
+|                                   | `df[, argmax_xy := which.max(.SD) , .SDcols = patterns("x*"), by = 1:nrow(df) ]`           | `transform(df, AsTable(r"^x") => ByRow(argmax))`                            |
+| DataFrame as output               | `df[, .SD[, list(value = c(min(x), max(x)) )] ]`                                           | `combine(df, :x => (x -> (value = [minimum(x), maximum(x)],)) => AsTable)`  |


df[, .(c(min(x), max(x)))]

I was trying to show a typical use case for .SD
I typically find it useful for non-standard function that I have to execute by groups (a regression for example). But that's the best example I could come up with.
If you have an idea, let me know; if not I will likely remove this row because this has generated some confusion on the goal before.

Ok I see. One use-case for .SD that springs to mind is DT[, .SD[1], by=group]; i.e. return the first row of every group (change 1 to .N for last row instead). Or, DT[, .SD[which.max(someCol)], by=group]; i.e. return the row in each group that has the biggest value in some column.
.SD is really for use together with grouping to do a sub-query within each group. I've noticed people using .SD with no grouping present, and that's generally a red flag that a much simpler way is idiomatic. Originally, iirc, .SD only even worked if grouping clause was present, by design. But then folk had programmatic code that sometimes grouped and sometimes didn't, and for learning purposes too, so for consistency they asked for .SD to work even when no grouping was present.

Thanks. I think I confused myself with the dataframes syntax. Here are your examples, it would be great if some of the dataframes could tell me if this is idiomatic

First element by group (usually comes after a sort)

df[, .SD[1], by=grp]

combine(groupby(df, :grp), names(df) .=> (x->x[1]) .=> names(df))

Or max by group:

df[, .SD[which.max(x)], by=grp]

subset(groupby(df, :grp), :x => (x->(x.==maximum(x))) )

I am not sure the last one is the best way of doing this. But that is also a use case for subset by group (which @matthieugomez raised here recently)

combine(groupby(df, :grp), names(df) .=> (x->x[1]) .=> names(df))

is just combine(groupby(df, :grp), first)

subset(groupby(df, :grp), :x => (x->(x.==maximum(x))) )

is probably more natural to write as combine(groupby(df, :grp), sdf -> sdf[argmax(sdf.x), :])

(note that in both cases the approach in data.table and in DataFrames.jl is conceptually the same (I have learned something 😄) only the syntax is a bit different)

That's amazing. I had no idea about sdf. Is this a reserved variable that stands for the df for each group within the scope of combine? I did not see it documented.
That's exactly the kind of things I wanted to show. This will make the guide extra useful!

Just to be clear on the syntax. Does it mean I could have:
combine(groupby(df, :grp), sdf -> sdf[1, :]) for the first example?

No, the syntax x -> f(x) is used for creating an anonymous function. So sdf is argument to a lamdba.

combine(groupby(df, :grp), sdf -> sdf[1, :]) is OK and it is the same as combine(groupby(df, :grp), first), as first is just defined as:

Base.first(df::AbstractDataFrame) = df[1, :]

Just to expand on what @pdeffebach commented. We often use sdf as a variable name to signal that this is a view of an original data frame, which is of SubDataFrame type.

Thanks. I got totally confused by the lambda argument name and thought somehow this was a special variable much like .SD
I am still amazed that this ends up being so easy. Kudos to DataFrames!

Change list to dot notation. Better example of .SD in data.table and how this could be converted in julia.

showcase sdf in comparison with data.table

pstorozenko · 2021-04-29T21:05:53Z

docs/src/man/comparisons.md

+| Add new columns                    | `df[, x_mean:=mean(x) ]`                         | `transform(df, :x => mean => :x_mean)`       |
+| Rename column (in place)           | `setnames(df, "x", "x_new")`                     | `rename!(df, :x => :x_new)`                  |
+| Rename multiple columns (in place) | `setnames(df, c("x", "y"), c("x_new", "y_new"))` | `rename!(df, [:x, :y] .=> [:x_new, :y_new])` |
+| Pick columns                       | `df[, .(x, y)]`                                  | `select(df, :x, :y)`                         |


How about adding a way of taking a single column as a vector?
I think it is sometimes useful and nicely corresponds between data.table and DataFrames.jl.
I'm only not sure either we should use here ! or :, but I think view should be preferred.

| Pick columns as dataframe | `df[, .(x, y)]` | `select(df, :x, :y)` | | Pick column as a vector | `df[, x]` | `df[!, :x]` |

Great suggestion! Implemented.

Pick columns as vector

docs/src/man/comparisons.md

MichaelChirico · 2021-05-02T09:15:34Z

docs/src/man/comparisons.md

+| Row-wise                          | `df[, min_xy := min(x, y), by = 1:nrow(df)]`                                               | `transform(df, [:x, :y] => ByRow(min))`                                     |
+|                                   | `df[, argmax_xy := which.max(.SD) , .SDcols = patterns("x*"), by = 1:nrow(df) ]`           | `transform(df, AsTable(r"^x") => ByRow(argmax))`                            |
+| DataFrame as output               | `df[, .SD[1], by=grp]`                                                                     | `combine(groupby(df, :grp), first)`                                         |
+| DataFrame as output               | `df[, .SD[which.max(x)], by=grp]`                                                          | `combine(groupby(df, :grp), sdf -> sdf[argmax(sdf.x), :])`                  |


FWIW I would usually do this as df[order(-x), .SD[1], by=grp]

Agreed. For this use case, it is probably more idiomatic.
However, the goal here is to showcase a function that uses subdataframes. I am afraid that if we only use first and .SD[1], this might seem more limited than using actual function on .SD.

If you have an other example to showcase using functions on .SD, I will be happy to take it!

great! FWIW I ran into the exact same issue writing the .SD vignette:

https://stackoverflow.com/a/47406952/3576984

fix in place typo and regex

floswald · 2021-05-10T09:38:29Z

Hi all,
here is a late entry that one could think about adding to those examples. For consistency, I guess would have to be added to the other sections (stata etc) as well. It's one of my favourite data.table use-cases. How to add a lagged variable by group to a dataframe , for instance to create a panel dataset which tracks units over time, and you want to have unit i's observation from period t-1. (comes up all the time on SO and discourse)

> dt = data.table(id = c(rep(1,3),rep(2,2)),it = c(1:3,1:2), y = rnorm(n=5))
> dt
   id it          y
1:  1  1 -0.9275339
2:  1  2 -0.3396562
3:  1  3 -1.2770334
4:  2  1 -1.2831495
5:  2  2 -0.2610641
> dt[ , Ly := shift(y, type = "lag"), by = id]
> dt
   id it          y         Ly
1:  1  1 -0.9275339         NA
2:  1  2 -0.3396562 -0.9275339
3:  1  3 -1.2770334 -0.3396562
4:  2  1 -1.2831495         NA
5:  2  2 -0.2610641 -1.2831495

julia> df = DataFrame(id = [1,1,1,2,2],it = [1,2,3,1,2], y = rand(5))
5×3 DataFrame
 Row │ id     it     y         
     │ Int64  Int64  Float64   
─────┼─────────────────────────
   1 │     1      1  0.0700678
   2 │     1      2  0.0675774
   3 │     1      3  0.926906
   4 │     2      1  0.300906
   5 │     2      2  0.641277

julia> using ShiftedArrays

julia> transform!(groupby(df,:id), :y => lag => :Ly)
5×4 DataFrame
 Row │ id     it     y          Ly              
     │ Int64  Int64  Float64    Float64?        
─────┼──────────────────────────────────────────
   1 │     1      1  0.0700678  missing         
   2 │     1      2  0.0675774        0.0700678
   3 │     1      3  0.926906         0.0675774
   4 │     2      1  0.300906   missing         
   5 │     2      2  0.641277         0.300906

bkamins · 2021-05-10T10:59:16Z

@eloualiche - I think it is high time to merge this PR and let the community review the merged docs on main (so that other contributors can work on their proposals without introducing merge conflicts).

Given this if you could please kindly:

resolve merge conflicts
comment if you agree to add the lag example (and if yes add it)

Then I will do a final review and merge. OK?

eloualiche · 2021-05-10T12:42:46Z

How to add a lagged variable by group to a dataframe

Thanks for chiming in, @floswald.

I don't use data.table shift because I was never sure how it handles different date formats (monthly/quarterly etc.), something that @matthieugomez solved in statar.
Please correct me if I am wrong here as I have not used shift for dates for a long while and this might have changed.
I think we had a PR on ShiftedArrays at some point to work around dates but I don't think this was incorporated.

I am happy to include this but we need to solve two things:

dates: perhaps we should not insist on the panel feature because panels commonly involve dates.
how to include this new feature: a new section with new dataset, in which case it would probably best to include another language as well (surely adding this to stata can't be hard).

bkamins · 2021-05-10T12:51:41Z

Given the complications - let us leave this out. I propose to resolve merge conflicts and then I merge this PR as is. Then I propose to shift a discussion on lagging to a separate PR as it looks as a more complex issue. OK?

floswald · 2021-05-10T12:54:26Z

sounds good ! will you open an issue @bkamins or how shall this proceed?
thanks

bkamins · 2021-05-10T12:57:59Z

Yes - please open a new PR after this is merged (so that you do not have to rebase it later). Thank you.

eloualiche · 2021-05-13T00:48:59Z

I think I resolved merge conflicts. This should be all set now. Waiting for @floswald PR to revisit it!

docs/src/man/comparisons.md

bkamins · 2021-05-13T06:32:05Z

I have made two fixes as I assume the data.table operation is in-place in these places. If this is correct please commit the changes and I will merge. Thank you!

bkamins · 2021-05-13T13:01:04Z

Thank you!

bkamins · 2021-05-13T13:01:57Z

@floswald @pdeffebach @nalimilan - I have merged this to move forward. If you find something on the "comparisons" page on main that should be fixed could you please open a PR fixing this? Thank you!

Update comparisons with data.table info

0e3f29c

Update to #2388, with comparison with data.table syntax. Focus on in place changes (!)

eloualiche mentioned this pull request Apr 17, 2021

Update comparisons.md #2388

Closed

bkamins reviewed Apr 17, 2021

View reviewed changes

docs/src/man/comparisons.md Outdated Show resolved Hide resolved

bkamins reviewed Apr 17, 2021

View reviewed changes

docs/src/man/comparisons.md Outdated Show resolved Hide resolved

bkamins reviewed Apr 17, 2021

View reviewed changes

docs/src/man/comparisons.md Outdated Show resolved Hide resolved

bkamins reviewed Apr 17, 2021

View reviewed changes

docs/src/man/comparisons.md Outdated Show resolved Hide resolved

bkamins reviewed Apr 17, 2021

View reviewed changes

docs/src/man/comparisons.md Outdated Show resolved Hide resolved

bkamins reviewed Apr 17, 2021

View reviewed changes

docs/src/man/comparisons.md Outdated Show resolved Hide resolved

bkamins reviewed Apr 17, 2021

View reviewed changes

docs/src/man/comparisons.md Outdated Show resolved Hide resolved

bkamins reviewed Apr 17, 2021

View reviewed changes

docs/src/man/comparisons.md Outdated Show resolved Hide resolved

bkamins reviewed Apr 17, 2021

View reviewed changes

docs/src/man/comparisons.md Outdated Show resolved Hide resolved

bkamins reviewed Apr 17, 2021

View reviewed changes

docs/src/man/comparisons.md Show resolved Hide resolved

Update comparisons.md

f92cd07

Soem adjustments for data.table

nalimilan reviewed Apr 17, 2021

View reviewed changes

docs/src/man/comparisons.md Outdated Show resolved Hide resolved

docs/src/man/comparisons.md Outdated Show resolved Hide resolved

docs/src/man/comparisons.md Outdated Show resolved Hide resolved

docs/src/man/comparisons.md Outdated Show resolved Hide resolved

Update comparisons.md

aa7e445

fix table spacing and response to general comments.

Update comparisons.md

6c3e4e0

Semijoin. Also added snippet of R code for creating the relevant datasets (following panda and dplyr)

nalimilan reviewed Apr 18, 2021

View reviewed changes

docs/src/man/comparisons.md Outdated Show resolved Hide resolved

Update comparisons.md

3a494dc

Spaces. Spaces. You are telling me about spaces? Spaces? Spaces!

bkamins added the doc label Apr 18, 2021

bkamins added this to the 1.x milestone Apr 18, 2021

nalimilan reviewed Apr 19, 2021

View reviewed changes

docs/src/man/comparisons.md Outdated Show resolved Hide resolved

docs/src/man/comparisons.md Outdated Show resolved Hide resolved

Update comparisons.md

2daee86

More spaces!

Update docs/src/man/comparisons.md

6d9e07f

Co-authored-by: Milan Bouchet-Valat <[email protected]>

bkamins reviewed Apr 27, 2021

View reviewed changes

docs/src/man/comparisons.md Outdated Show resolved Hide resolved

bkamins reviewed Apr 27, 2021

View reviewed changes

docs/src/man/comparisons.md Outdated Show resolved Hide resolved

mattdowle reviewed Apr 28, 2021

View reviewed changes

mattdowle mentioned this pull request Apr 29, 2021

Implement DT[, across(.SD, fun1, fun2, fun3), by=group] Rdatatable/data.table#4970

Open

eloualiche added 2 commits April 29, 2021 12:55

Update comparisons.md

90343c2

Change list to dot notation. Better example of .SD in data.table and how this could be converted in julia.

Update comparisons.md

a8a08f6

showcase sdf in comparison with data.table

pstorozenko reviewed Apr 29, 2021

View reviewed changes

Update comparisons.md

10bd7c6

Pick columns as vector

MichaelChirico reviewed May 2, 2021

View reviewed changes

docs/src/man/comparisons.md Outdated Show resolved Hide resolved

MichaelChirico reviewed May 2, 2021

View reviewed changes

docs/src/man/comparisons.md Outdated Show resolved Hide resolved

MichaelChirico reviewed May 2, 2021

View reviewed changes

Update comparisons.md

7375117

fix in place typo and regex

grantmcdermott mentioned this pull request May 6, 2021

Style all code samples LOST-STATS/lost-stats.github.io#125

Open

bkamins mentioned this pull request May 8, 2021

Review comparisons with R/Python #2737

Closed

Update comparisons.md

9f9e74e

bkamins reviewed May 13, 2021

View reviewed changes

docs/src/man/comparisons.md Outdated Show resolved Hide resolved

bkamins reviewed May 13, 2021

View reviewed changes

docs/src/man/comparisons.md Outdated Show resolved Hide resolved

Fix ! on inplace operations

bcbd70e

bkamins approved these changes May 13, 2021

View reviewed changes

bkamins merged commit 59e9589 into JuliaData:main May 13, 2021

eloualiche deleted the patch-doc-R branch May 13, 2021 14:56

Update comparisons with data.table info #2725

Update comparisons with data.table info #2725

Conversation

eloualiche commented Apr 17, 2021

bkamins commented Apr 17, 2021

bkamins commented Apr 18, 2021

pstorozenko commented Apr 20, 2021

eloualiche commented Apr 20, 2021

eloualiche commented Apr 27, 2021

eloualiche commented Apr 27, 2021 • edited Loading

mattdowle left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

grantmcdermott Apr 29, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mattdowle Apr 28, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bkamins Apr 29, 2021 • edited Loading

Choose a reason for hiding this comment

eloualiche Apr 29, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bkamins Apr 29, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

eloualiche May 2, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

floswald commented May 10, 2021

bkamins commented May 10, 2021

eloualiche commented May 10, 2021

bkamins commented May 10, 2021

floswald commented May 10, 2021

bkamins commented May 10, 2021

eloualiche commented May 13, 2021 • edited Loading

bkamins commented May 13, 2021

bkamins commented May 13, 2021

bkamins commented May 13, 2021

eloualiche commented Apr 27, 2021 •

edited

Loading

mattdowle left a comment •

edited

Loading

grantmcdermott Apr 29, 2021 •

edited

Loading

mattdowle Apr 28, 2021 •

edited

Loading

bkamins Apr 29, 2021 •

edited

Loading

eloualiche Apr 29, 2021 •

edited

Loading

bkamins Apr 29, 2021 •

edited

Loading

eloualiche May 2, 2021 •

edited

Loading

eloualiche commented May 13, 2021 •

edited

Loading