Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update comparisons with data.table info #2725

Merged
merged 15 commits into from
May 13, 2021
Merged

Update comparisons with data.table info #2725

merged 15 commits into from
May 13, 2021

Conversation

eloualiche
Copy link
Contributor

Update to #2388, with comparison with data.table syntax. Focus on in place changes (!)

Update to #2388, with comparison with data.table syntax. Focus on in place changes (!)
@bkamins
Copy link
Member

bkamins commented Apr 17, 2021

Thank you!

Soem adjustments for data.table
docs/src/man/comparisons.md Outdated Show resolved Hide resolved
docs/src/man/comparisons.md Outdated Show resolved Hide resolved
docs/src/man/comparisons.md Outdated Show resolved Hide resolved
docs/src/man/comparisons.md Outdated Show resolved Hide resolved
fix table spacing and response to general comments.
@bkamins
Copy link
Member

bkamins commented Apr 18, 2021

@eloualiche + @grantmcdermott : as a general comment my preference would be to use the syntax that data.table users would find natural (I guess this follows your recommendations, but I just wanted to highlight this). This section is intended to help users who know data.tables understand how to write an equivalent code in DataFrames.jl. Thank you for working on this!

Semijoin. Also added snippet of R code for creating the relevant datasets (following panda and dplyr)
Spaces. Spaces. You are telling me about spaces? Spaces? Spaces!
@bkamins bkamins added the doc label Apr 18, 2021
@bkamins bkamins added this to the 1.x milestone Apr 18, 2021
docs/src/man/comparisons.md Outdated Show resolved Hide resolved
docs/src/man/comparisons.md Outdated Show resolved Hide resolved
@pstorozenko
Copy link
Contributor

Have you considered writing e.g. .(m = mean(x)) instead of list(m = mean(x)) in data.table's j and by?
Even though .() is an alias for list(), I find it more how I'd write it and that's how they use it in tutorial.
Just an idea :)

@eloualiche
Copy link
Contributor Author

Have you considered writing e.g. .(m = mean(x)) instead of list(m = mean(x)) in data.table's j and by?
Even though .() is an alias for list(), I find it more how I'd write it and that's how they use it in tutorial.
Just an idea :)

We discussed this above. I just want to avoid aliases. This is not a guide for people wanting to switch to data.table, but rather targeted at people knowledgeable with data.table who would like to also use to DataFrames.
I personally never use list, but I also find the dot syntax confusing when I am not the one writing it.

Co-authored-by: Milan Bouchet-Valat <[email protected]>
@eloualiche
Copy link
Contributor Author

Is there still something in the way of merging this? lmk.

@eloualiche
Copy link
Contributor Author

eloualiche commented Apr 27, 2021

OK - looks good, so let us wait for data.table experts to comment if they have anything to add. Thank you!

Maybe the main developers of data.table will like chime in.

cc: @mattdowle @arunsrinivasan @jangorecki @MichaelChirico

Copy link

@mattdowle mattdowle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many thanks for pinging us. LGTM. Just a few suggested changes in this review.

Aside: on spacing, I tend to think that operators like = already visually separate alphanumeric characters so adding space around operators isn't necessary. In fact, to my eyes at least, the spaces lose some utility when used too much (i.e. in the way that is popular these days). I increase the utility of spaces by placing them between logical parts, and not always following a single rule. How I use spaces depends on the particular line of code, but I'm trying to group together parts. And sometimes vertically aligning common parts of two lines so you can see the differences between the two lines. So I've removed some spaces in the suggestions in this review, but the same thinking could be applied to the other queries too.
For example df[, x:=NULL] uses one space to separate the filter part (before the comma) from the doing part (the j). There isn't any need for spaces around := here because the non-alphanumeric characters in := are already are distinguished from x and NULL. I think the whole spaces around operators thing came about because "use <-" was promoted, but I really prefer =.
Anyway, the spaces thinking is why I prefer .() to list(). I saw your comments about list and .() so I'm adding a bit more background. list is a set of 4 alphabetic characters which my brain has to process and realize that it's just there to permit a set of things; i.e., it's not a function or anything import. Whereas .() is more like a space that groups things together. When list() is used in all three parts inside [...] then that's 9 extra characters vs using .(). Those extra characters are alphabetic characters too, which make it slightly harder for the actual functions to jump out at the reader.
I made the changes in this review from list() to .() in these suggestions where there is another suggestion too, but all the list() could be changed to .().

Thanks again for pinging us! Much appreciated.

docs/src/man/comparisons.md Outdated Show resolved Hide resolved
docs/src/man/comparisons.md Outdated Show resolved Hide resolved
docs/src/man/comparisons.md Outdated Show resolved Hide resolved
docs/src/man/comparisons.md Outdated Show resolved Hide resolved
docs/src/man/comparisons.md Outdated Show resolved Hide resolved
docs/src/man/comparisons.md Outdated Show resolved Hide resolved
| Transform several columns | `df[, list(max(x), min(y)) ]` | `combine(df, :x => maximum, :y => minimum)` |
| | `df[, lapply(.SD, mean), .SDcols = c("x", "y") ]` | `combine(df, [:x, :y] .=> mean)` |
| | `df[, lapply(.SD, mean), .SDcols = patterns("x*") ]` | `combine(df, names(df, r"^x") .=> mean)` |
| | `df[, unlist(lapply(.SD, function(x) c(max=max(x), min=min(x)))), .SDcols = c("x", "y") ]` | `combine(df, ([:x, :y] .=> [maximum minimum])...)` |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we have some work to do on this one. I can't think of an easier way right now. There may be an outstanding issue or pull request, maybe @jangorecki @MichaelChirico recall. I never wanted to encourage wide data, so my focus was on long. But I know people like to go wide like this, perhaps for presenting results in a paper or web page, so this task should be easier.

Copy link

@grantmcdermott grantmcdermott Apr 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's cheating a bit — or maybe not — but I'd probably use dcast here.

dcast(df, .~., fun=list(min, max), value.var = c('x', 'y'))

The advantage of this approach is that it also scales well to cases where you want to collapse by group. I think the 'unlist' approach would struggle here.

dcast(df, grp~., fun=list(min, max), value.var = c('x', 'y'))

Mind you, grouping is something that the DataFrames.jl implementation automatically supports (and, to @mattdowle's point, might be conceptually simpler than my dcast workflow).

combine(groupby(df, :grp), ([:x, :y] .=> [minimum maximum])...)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would do df[, c(lapply(.SD, max), lapply(.SD, min)), .SDcols = c("x", "y")]. That should GForce as well where the unlist one will not.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it doesn't use GForce, and also, it results in duplicate names! ouch

| Multivariate function | `df[, list(cor(x,y)) ]` | `transform(df, [:x, :y] => cor)` |
| Row-wise | `df[, min_xy := min(x, y), by = 1:nrow(df)]` | `transform(df, [:x, :y] => ByRow(min))` |
| | `df[, argmax_xy := which.max(.SD) , .SDcols = patterns("x*"), by = 1:nrow(df) ]` | `transform(df, AsTable(r"^x") => ByRow(argmax))` |
| DataFrame as output | `df[, .SD[, list(value = c(min(x), max(x)) )] ]` | `combine(df, :x => (x -> (value = [minimum(x), maximum(x)],)) => AsTable)` |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

df[, .(c(min(x), max(x)))]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to show a typical use case for .SD
I typically find it useful for non-standard function that I have to execute by groups (a regression for example). But that's the best example I could come up with.
If you have an idea, let me know; if not I will likely remove this row because this has generated some confusion on the goal before.

Copy link

@mattdowle mattdowle Apr 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I see. One use-case for .SD that springs to mind is DT[, .SD[1], by=group]; i.e. return the first row of every group (change 1 to .N for last row instead). Or, DT[, .SD[which.max(someCol)], by=group]; i.e. return the row in each group that has the biggest value in some column.
.SD is really for use together with grouping to do a sub-query within each group. I've noticed people using .SD with no grouping present, and that's generally a red flag that a much simpler way is idiomatic. Originally, iirc, .SD only even worked if grouping clause was present, by design. But then folk had programmatic code that sometimes grouped and sometimes didn't, and for learning purposes too, so for consistency they asked for .SD to work even when no grouping was present.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I think I confused myself with the dataframes syntax. Here are your examples, it would be great if some of the dataframes could tell me if this is idiomatic

First element by group (usually comes after a sort)

df[, .SD[1], by=grp]
combine(groupby(df, :grp), names(df) .=> (x->x[1]) .=> names(df))

Or max by group:

df[, .SD[which.max(x)], by=grp]
subset(groupby(df, :grp), :x => (x->(x.==maximum(x))) )

I am not sure the last one is the best way of doing this. But that is also a use case for subset by group (which @matthieugomez raised here recently)

Copy link
Member

@bkamins bkamins Apr 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

combine(groupby(df, :grp), names(df) .=> (x->x[1]) .=> names(df))

is just combine(groupby(df, :grp), first)

subset(groupby(df, :grp), :x => (x->(x.==maximum(x))) )

is probably more natural to write as combine(groupby(df, :grp), sdf -> sdf[argmax(sdf.x), :])

(note that in both cases the approach in data.table and in DataFrames.jl is conceptually the same (I have learned something 😄) only the syntax is a bit different)

Copy link
Contributor Author

@eloualiche eloualiche Apr 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's amazing. I had no idea about sdf. Is this a reserved variable that stands for the df for each group within the scope of combine? I did not see it documented.
That's exactly the kind of things I wanted to show. This will make the guide extra useful!

Just to be clear on the syntax. Does it mean I could have:
combine(groupby(df, :grp), sdf -> sdf[1, :]) for the first example?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the syntax x -> f(x) is used for creating an anonymous function. So sdf is argument to a lamdba.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

combine(groupby(df, :grp), sdf -> sdf[1, :]) is OK and it is the same as combine(groupby(df, :grp), first), as first is just defined as:

Base.first(df::AbstractDataFrame) = df[1, :]

Copy link
Member

@bkamins bkamins Apr 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to expand on what @pdeffebach commented. We often use sdf as a variable name to signal that this is a view of an original data frame, which is of SubDataFrame type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I got totally confused by the lambda argument name and thought somehow this was a special variable much like .SD
I am still amazed that this ends up being so easy. Kudos to DataFrames!

Change list to dot notation.
Better example of .SD in data.table and how this could be converted in julia.
showcase sdf in comparison with data.table
| Add new columns | `df[, x_mean:=mean(x) ]` | `transform(df, :x => mean => :x_mean)` |
| Rename column (in place) | `setnames(df, "x", "x_new")` | `rename!(df, :x => :x_new)` |
| Rename multiple columns (in place) | `setnames(df, c("x", "y"), c("x_new", "y_new"))` | `rename!(df, [:x, :y] .=> [:x_new, :y_new])` |
| Pick columns | `df[, .(x, y)]` | `select(df, :x, :y)` |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding a way of taking a single column as a vector?
I think it is sometimes useful and nicely corresponds between data.table and DataFrames.jl.
I'm only not sure either we should use here ! or :, but I think view should be preferred.

| Pick columns as dataframe | `df[, .(x, y)]` | `select(df, :x, :y)` |
| Pick column as a vector   | `df[, x]`       | `df[!, :x]`          |

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great suggestion! Implemented.

Pick columns as vector
| Row-wise | `df[, min_xy := min(x, y), by = 1:nrow(df)]` | `transform(df, [:x, :y] => ByRow(min))` |
| | `df[, argmax_xy := which.max(.SD) , .SDcols = patterns("x*"), by = 1:nrow(df) ]` | `transform(df, AsTable(r"^x") => ByRow(argmax))` |
| DataFrame as output | `df[, .SD[1], by=grp]` | `combine(groupby(df, :grp), first)` |
| DataFrame as output | `df[, .SD[which.max(x)], by=grp]` | `combine(groupby(df, :grp), sdf -> sdf[argmax(sdf.x), :])` |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I would usually do this as df[order(-x), .SD[1], by=grp]

Copy link
Contributor Author

@eloualiche eloualiche May 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. For this use case, it is probably more idiomatic.
However, the goal here is to showcase a function that uses subdataframes. I am afraid that if we only use first and .SD[1], this might seem more limited than using actual function on .SD.

If you have an other example to showcase using functions on .SD, I will be happy to take it!

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great! FWIW I ran into the exact same issue writing the .SD vignette:

https://stackoverflow.com/a/47406952/3576984

fix in place typo and regex
@floswald
Copy link
Contributor

Hi all,
here is a late entry that one could think about adding to those examples. For consistency, I guess would have to be added to the other sections (stata etc) as well. It's one of my favourite data.table use-cases. How to add a lagged variable by group to a dataframe , for instance to create a panel dataset which tracks units over time, and you want to have unit i's observation from period t-1. (comes up all the time on SO and discourse)

> dt = data.table(id = c(rep(1,3),rep(2,2)),it = c(1:3,1:2), y = rnorm(n=5))
> dt
   id it          y
1:  1  1 -0.9275339
2:  1  2 -0.3396562
3:  1  3 -1.2770334
4:  2  1 -1.2831495
5:  2  2 -0.2610641
> dt[ , Ly := shift(y, type = "lag"), by = id]
> dt
   id it          y         Ly
1:  1  1 -0.9275339         NA
2:  1  2 -0.3396562 -0.9275339
3:  1  3 -1.2770334 -0.3396562
4:  2  1 -1.2831495         NA
5:  2  2 -0.2610641 -1.2831495
julia> df = DataFrame(id = [1,1,1,2,2],it = [1,2,3,1,2], y = rand(5))
5×3 DataFrame
 Row │ id     it     y         
     │ Int64  Int64  Float64   
─────┼─────────────────────────
   11      1  0.0700678
   21      2  0.0675774
   31      3  0.926906
   42      1  0.300906
   52      2  0.641277

julia> using ShiftedArrays

julia> transform!(groupby(df,:id), :y => lag => :Ly)
5×4 DataFrame
 Row │ id     it     y          Ly              
     │ Int64  Int64  Float64    Float64?        
─────┼──────────────────────────────────────────
   11      1  0.0700678  missing         
   21      2  0.0675774        0.0700678
   31      3  0.926906         0.0675774
   42      1  0.300906   missing         
   52      2  0.641277         0.300906

@bkamins
Copy link
Member

bkamins commented May 10, 2021

@eloualiche - I think it is high time to merge this PR and let the community review the merged docs on main (so that other contributors can work on their proposals without introducing merge conflicts).

Given this if you could please kindly:

  1. resolve merge conflicts
  2. comment if you agree to add the lag example (and if yes add it)

Then I will do a final review and merge. OK?

@eloualiche
Copy link
Contributor Author

How to add a lagged variable by group to a dataframe

Thanks for chiming in, @floswald.

I don't use data.table shift because I was never sure how it handles different date formats (monthly/quarterly etc.), something that @matthieugomez solved in statar.
Please correct me if I am wrong here as I have not used shift for dates for a long while and this might have changed.
I think we had a PR on ShiftedArrays at some point to work around dates but I don't think this was incorporated.

I am happy to include this but we need to solve two things:

  1. dates: perhaps we should not insist on the panel feature because panels commonly involve dates.
  2. how to include this new feature: a new section with new dataset, in which case it would probably best to include another language as well (surely adding this to stata can't be hard).

@bkamins
Copy link
Member

bkamins commented May 10, 2021

Given the complications - let us leave this out. I propose to resolve merge conflicts and then I merge this PR as is. Then I propose to shift a discussion on lagging to a separate PR as it looks as a more complex issue. OK?

@floswald
Copy link
Contributor

sounds good ! will you open an issue @bkamins or how shall this proceed?
thanks

@bkamins
Copy link
Member

bkamins commented May 10, 2021

Yes - please open a new PR after this is merged (so that you do not have to rebase it later). Thank you.

@eloualiche
Copy link
Contributor Author

eloualiche commented May 13, 2021

I think I resolved merge conflicts. This should be all set now. Waiting for @floswald PR to revisit it!

@bkamins
Copy link
Member

bkamins commented May 13, 2021

I have made two fixes as I assume the data.table operation is in-place in these places. If this is correct please commit the changes and I will merge. Thank you!

@bkamins bkamins merged commit 59e9589 into JuliaData:main May 13, 2021
@bkamins
Copy link
Member

bkamins commented May 13, 2021

Thank you!

@bkamins
Copy link
Member

bkamins commented May 13, 2021

@floswald @pdeffebach @nalimilan - I have merged this to move forward. If you find something on the "comparisons" page on main that should be fixed could you please open a PR fixing this? Thank you!

@eloualiche eloualiche deleted the patch-doc-R branch May 13, 2021 14:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants