-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update comparisons with data.table info #2725
Conversation
Update to #2388, with comparison with data.table syntax. Focus on in place changes (!)
Thank you! |
Soem adjustments for data.table
fix table spacing and response to general comments.
@eloualiche + @grantmcdermott : as a general comment my preference would be to use the syntax that data.table users would find natural (I guess this follows your recommendations, but I just wanted to highlight this). This section is intended to help users who know data.tables understand how to write an equivalent code in DataFrames.jl. Thank you for working on this! |
Semijoin. Also added snippet of R code for creating the relevant datasets (following panda and dplyr)
Spaces. Spaces. You are telling me about spaces? Spaces? Spaces!
More spaces!
Have you considered writing e.g. |
We discussed this above. I just want to avoid aliases. This is not a guide for people wanting to switch to data.table, but rather targeted at people knowledgeable with data.table who would like to also use to DataFrames. |
Co-authored-by: Milan Bouchet-Valat <[email protected]>
Is there still something in the way of merging this? lmk. |
Maybe the main developers of data.table will like chime in. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Many thanks for pinging us. LGTM. Just a few suggested changes in this review.
Aside: on spacing, I tend to think that operators like =
already visually separate alphanumeric characters so adding space around operators isn't necessary. In fact, to my eyes at least, the spaces lose some utility when used too much (i.e. in the way that is popular these days). I increase the utility of spaces by placing them between logical parts, and not always following a single rule. How I use spaces depends on the particular line of code, but I'm trying to group together parts. And sometimes vertically aligning common parts of two lines so you can see the differences between the two lines. So I've removed some spaces in the suggestions in this review, but the same thinking could be applied to the other queries too.
For example df[, x:=NULL]
uses one space to separate the filter part (before the comma) from the doing part (the j
). There isn't any need for spaces around :=
here because the non-alphanumeric characters in :=
are already are distinguished from x
and NULL
. I think the whole spaces around operators thing came about because "use <-
" was promoted, but I really prefer =
.
Anyway, the spaces thinking is why I prefer .()
to list()
. I saw your comments about list
and .()
so I'm adding a bit more background. list
is a set of 4 alphabetic characters which my brain has to process and realize that it's just there to permit a set of things; i.e., it's not a function or anything import. Whereas .()
is more like a space that groups things together. When list()
is used in all three parts inside [...]
then that's 9 extra characters vs using .()
. Those extra characters are alphabetic characters too, which make it slightly harder for the actual functions to jump out at the reader.
I made the changes in this review from list()
to .()
in these suggestions where there is another suggestion too, but all the list()
could be changed to .()
.
Thanks again for pinging us! Much appreciated.
| Transform several columns | `df[, list(max(x), min(y)) ]` | `combine(df, :x => maximum, :y => minimum)` | | ||
| | `df[, lapply(.SD, mean), .SDcols = c("x", "y") ]` | `combine(df, [:x, :y] .=> mean)` | | ||
| | `df[, lapply(.SD, mean), .SDcols = patterns("x*") ]` | `combine(df, names(df, r"^x") .=> mean)` | | ||
| | `df[, unlist(lapply(.SD, function(x) c(max=max(x), min=min(x)))), .SDcols = c("x", "y") ]` | `combine(df, ([:x, :y] .=> [maximum minimum])...)` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we have some work to do on this one. I can't think of an easier way right now. There may be an outstanding issue or pull request, maybe @jangorecki @MichaelChirico recall. I never wanted to encourage wide data, so my focus was on long. But I know people like to go wide like this, perhaps for presenting results in a paper or web page, so this task should be easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's cheating a bit — or maybe not — but I'd probably use dcast
here.
dcast(df, .~., fun=list(min, max), value.var = c('x', 'y'))
The advantage of this approach is that it also scales well to cases where you want to collapse by group. I think the 'unlist' approach would struggle here.
dcast(df, grp~., fun=list(min, max), value.var = c('x', 'y'))
Mind you, grouping is something that the DataFrames.jl implementation automatically supports (and, to @mattdowle's point, might be conceptually simpler than my dcast
workflow).
combine(groupby(df, :grp), ([:x, :y] .=> [minimum maximum])...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would do df[, c(lapply(.SD, max), lapply(.SD, min)), .SDcols = c("x", "y")]
. That should GForce
as well where the unlist
one will not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it doesn't use GForce, and also, it results in duplicate names! ouch
docs/src/man/comparisons.md
Outdated
| Multivariate function | `df[, list(cor(x,y)) ]` | `transform(df, [:x, :y] => cor)` | | ||
| Row-wise | `df[, min_xy := min(x, y), by = 1:nrow(df)]` | `transform(df, [:x, :y] => ByRow(min))` | | ||
| | `df[, argmax_xy := which.max(.SD) , .SDcols = patterns("x*"), by = 1:nrow(df) ]` | `transform(df, AsTable(r"^x") => ByRow(argmax))` | | ||
| DataFrame as output | `df[, .SD[, list(value = c(min(x), max(x)) )] ]` | `combine(df, :x => (x -> (value = [minimum(x), maximum(x)],)) => AsTable)` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
df[, .(c(min(x), max(x)))]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying to show a typical use case for .SD
I typically find it useful for non-standard function that I have to execute by groups (a regression for example). But that's the best example I could come up with.
If you have an idea, let me know; if not I will likely remove this row because this has generated some confusion on the goal before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I see. One use-case for .SD
that springs to mind is DT[, .SD[1], by=group]
; i.e. return the first row of every group (change 1
to .N
for last row instead). Or, DT[, .SD[which.max(someCol)], by=group]
; i.e. return the row in each group that has the biggest value in some column.
.SD
is really for use together with grouping to do a sub-query within each group. I've noticed people using .SD
with no grouping present, and that's generally a red flag that a much simpler way is idiomatic. Originally, iirc, .SD
only even worked if grouping clause was present, by design. But then folk had programmatic code that sometimes grouped and sometimes didn't, and for learning purposes too, so for consistency they asked for .SD
to work even when no grouping was present.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I think I confused myself with the dataframes syntax. Here are your examples, it would be great if some of the dataframes could tell me if this is idiomatic
First element by group (usually comes after a sort)
df[, .SD[1], by=grp]
combine(groupby(df, :grp), names(df) .=> (x->x[1]) .=> names(df))
Or max by group:
df[, .SD[which.max(x)], by=grp]
subset(groupby(df, :grp), :x => (x->(x.==maximum(x))) )
I am not sure the last one is the best way of doing this. But that is also a use case for subset by group (which @matthieugomez raised here recently)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
combine(groupby(df, :grp), names(df) .=> (x->x[1]) .=> names(df))
is just combine(groupby(df, :grp), first)
subset(groupby(df, :grp), :x => (x->(x.==maximum(x))) )
is probably more natural to write as combine(groupby(df, :grp), sdf -> sdf[argmax(sdf.x), :])
(note that in both cases the approach in data.table and in DataFrames.jl is conceptually the same (I have learned something 😄) only the syntax is a bit different)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's amazing. I had no idea about sdf
. Is this a reserved variable that stands for the df for each group within the scope of combine? I did not see it documented.
That's exactly the kind of things I wanted to show. This will make the guide extra useful!
Just to be clear on the syntax. Does it mean I could have:
combine(groupby(df, :grp), sdf -> sdf[1, :])
for the first example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the syntax x -> f(x)
is used for creating an anonymous function. So sdf
is argument to a lamdba.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
combine(groupby(df, :grp), sdf -> sdf[1, :])
is OK and it is the same as combine(groupby(df, :grp), first)
, as first
is just defined as:
Base.first(df::AbstractDataFrame) = df[1, :]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to expand on what @pdeffebach commented. We often use sdf
as a variable name to signal that this is a view of an original data frame, which is of SubDataFrame
type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I got totally confused by the lambda argument name and thought somehow this was a special variable much like .SD
I am still amazed that this ends up being so easy. Kudos to DataFrames!
Change list to dot notation. Better example of .SD in data.table and how this could be converted in julia.
showcase sdf in comparison with data.table
docs/src/man/comparisons.md
Outdated
| Add new columns | `df[, x_mean:=mean(x) ]` | `transform(df, :x => mean => :x_mean)` | | ||
| Rename column (in place) | `setnames(df, "x", "x_new")` | `rename!(df, :x => :x_new)` | | ||
| Rename multiple columns (in place) | `setnames(df, c("x", "y"), c("x_new", "y_new"))` | `rename!(df, [:x, :y] .=> [:x_new, :y_new])` | | ||
| Pick columns | `df[, .(x, y)]` | `select(df, :x, :y)` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about adding a way of taking a single column as a vector?
I think it is sometimes useful and nicely corresponds between data.table
and DataFrames.jl
.
I'm only not sure either we should use here !
or :
, but I think view should be preferred.
| Pick columns as dataframe | `df[, .(x, y)]` | `select(df, :x, :y)` |
| Pick column as a vector | `df[, x]` | `df[!, :x]` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great suggestion! Implemented.
Pick columns as vector
| Row-wise | `df[, min_xy := min(x, y), by = 1:nrow(df)]` | `transform(df, [:x, :y] => ByRow(min))` | | ||
| | `df[, argmax_xy := which.max(.SD) , .SDcols = patterns("x*"), by = 1:nrow(df) ]` | `transform(df, AsTable(r"^x") => ByRow(argmax))` | | ||
| DataFrame as output | `df[, .SD[1], by=grp]` | `combine(groupby(df, :grp), first)` | | ||
| DataFrame as output | `df[, .SD[which.max(x)], by=grp]` | `combine(groupby(df, :grp), sdf -> sdf[argmax(sdf.x), :])` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW I would usually do this as df[order(-x), .SD[1], by=grp]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. For this use case, it is probably more idiomatic.
However, the goal here is to showcase a function that uses subdataframes. I am afraid that if we only use first
and .SD[1]
, this might seem more limited than using actual function on .SD
.
If you have an other example to showcase using functions on .SD
, I will be happy to take it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great! FWIW I ran into the exact same issue writing the .SD vignette:
fix in place typo and regex
Hi all, > dt = data.table(id = c(rep(1,3),rep(2,2)),it = c(1:3,1:2), y = rnorm(n=5))
> dt
id it y
1: 1 1 -0.9275339
2: 1 2 -0.3396562
3: 1 3 -1.2770334
4: 2 1 -1.2831495
5: 2 2 -0.2610641
> dt[ , Ly := shift(y, type = "lag"), by = id]
> dt
id it y Ly
1: 1 1 -0.9275339 NA
2: 1 2 -0.3396562 -0.9275339
3: 1 3 -1.2770334 -0.3396562
4: 2 1 -1.2831495 NA
5: 2 2 -0.2610641 -1.2831495
julia> df = DataFrame(id = [1,1,1,2,2],it = [1,2,3,1,2], y = rand(5))
5×3 DataFrame
Row │ id it y
│ Int64 Int64 Float64
─────┼─────────────────────────
1 │ 1 1 0.0700678
2 │ 1 2 0.0675774
3 │ 1 3 0.926906
4 │ 2 1 0.300906
5 │ 2 2 0.641277
julia> using ShiftedArrays
julia> transform!(groupby(df,:id), :y => lag => :Ly)
5×4 DataFrame
Row │ id it y Ly
│ Int64 Int64 Float64 Float64?
─────┼──────────────────────────────────────────
1 │ 1 1 0.0700678 missing
2 │ 1 2 0.0675774 0.0700678
3 │ 1 3 0.926906 0.0675774
4 │ 2 1 0.300906 missing
5 │ 2 2 0.641277 0.300906
|
@eloualiche - I think it is high time to merge this PR and let the community review the merged docs on Given this if you could please kindly:
Then I will do a final review and merge. OK? |
Thanks for chiming in, @floswald. I don't use data.table shift because I was never sure how it handles different date formats (monthly/quarterly etc.), something that @matthieugomez solved in statar. I am happy to include this but we need to solve two things:
|
Given the complications - let us leave this out. I propose to resolve merge conflicts and then I merge this PR as is. Then I propose to shift a discussion on lagging to a separate PR as it looks as a more complex issue. OK? |
sounds good ! will you open an issue @bkamins or how shall this proceed? |
Yes - please open a new PR after this is merged (so that you do not have to rebase it later). Thank you. |
I think I resolved merge conflicts. This should be all set now. Waiting for @floswald PR to revisit it! |
I have made two fixes as I assume the |
Thank you! |
@floswald @pdeffebach @nalimilan - I have merged this to move forward. If you find something on the "comparisons" page on |
Update to #2388, with comparison with data.table syntax. Focus on in place changes (!)