Update comparisons with data.table info #2725
Looks like we have some work to do on this one. I can't think of an easier way right now. There may be an outstanding issue or pull request, maybe @jangorecki @MichaelChirico recall. I never wanted to encourage wide data, so my focus was on long. But I know people like to go wide like this, perhaps for presenting results in a paper or web page, so this task should be easier.
It's cheating a bit, or maybe not, but I'd probably use `dcast` here. The advantage of this approach is that it also scales well to cases where you want to collapse by group. I think the 'unlist' approach would struggle here.
Mind you, grouping is something that the DataFrames.jl implementation automatically supports (and, to @mattdowle's point, might be conceptually simpler than my `dcast` workflow).
I would do `df[, c(lapply(.SD, max), lapply(.SD, min)), .SDcols = c("x", "y")]`. That should GForce as well, where the `unlist` one will not.
I guess it doesn't use GForce, and also, it results in duplicate names! ouch
df[, .(c(min(x), max(x)))]
I was trying to show a typical use case for `.SD`.
I typically find it useful for non-standard functions that I have to execute by group (a regression, for example). But that's the best example I could come up with.
If you have an idea, let me know; if not I will likely remove this row because this has generated some confusion on the goal before.
Ok I see. One use-case for `.SD` that springs to mind is `DT[, .SD[1], by=group]`; i.e. return the first row of every group (change `1` to `.N` for last row instead). Or, `DT[, .SD[which.max(someCol)], by=group]`; i.e. return the row in each group that has the biggest value in some column.
`.SD` is really for use together with grouping to do a sub-query within each group. I've noticed people using `.SD` with no grouping present, and that's generally a red flag that a much simpler way is idiomatic. Originally, iirc, `.SD` only even worked if a grouping clause was present, by design. But then folk had programmatic code that sometimes grouped and sometimes didn't, and for learning purposes too, so for consistency they asked for `.SD` to work even when no grouping was present.
Thanks. I think I confused myself with the DataFrames syntax. Here are your examples; it would be great if some of the DataFrames folks could tell me whether this is idiomatic.
First element by group (usually comes after a sort)
Or max by group:
I am not sure the last one is the best way of doing this. But that is also a use case for subset by group (which @matthieugomez raised here recently)
The first one is just
combine(groupby(df, :grp), first)
and the second is probably more natural to write as
combine(groupby(df, :grp), sdf -> sdf[argmax(sdf.x), :])
(Note that in both cases the approach in data.table and in DataFrames.jl is conceptually the same (I have learned something 😄); only the syntax is a bit different.)
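For readers following along, here is a minimal runnable sketch of the two `combine` calls above. The toy data frame and the names `gd`, `first_rows`, and `max_rows` are assumptions made for illustration, not part of the pull request.

```julia
using DataFrames

# Hypothetical toy data: `grp` is the grouping column, `x` the value column.
df = DataFrame(grp = ["a", "a", "b", "b", "b"], x = [1, 3, 2, 9, 4])

gd = groupby(df, :grp)

# First row of each group (data.table: DT[, .SD[1], by = grp]).
first_rows = combine(gd, first)

# Row holding the largest `x` within each group
# (data.table: DT[, .SD[which.max(x)], by = grp]).
max_rows = combine(gd, sdf -> sdf[argmax(sdf.x), :])
```

With this data, `first_rows.x` is `[1, 2]` and `max_rows.x` is `[3, 9]`, each with one row per group.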
That's amazing. I had no idea about `sdf`. Is this a reserved variable that stands for the df for each group within the scope of combine? I did not see it documented. That's exactly the kind of thing I wanted to show. This will make the guide extra useful!
Just to be clear on the syntax: does it mean I could have `combine(groupby(df, :grp), sdf -> sdf[1, :])` for the first example?
No, the syntax `x -> f(x)` is used for creating an anonymous function. So `sdf` is the argument to a lambda.
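A small sketch of that point: since `sdf` is only the argument name of an anonymous function, renaming it or using a named function instead should give the same result. The toy data and the helper name `first_row` are hypothetical.

```julia
using DataFrames

df = DataFrame(grp = ["a", "a", "b"], x = [1, 3, 2])  # hypothetical toy data
gd = groupby(df, :grp)

# An anonymous function; `sdf` is just the name chosen for its single argument.
r1 = combine(gd, sdf -> sdf[1, :])

# Renaming the argument changes nothing.
r2 = combine(gd, g -> g[1, :])

# A named function (hypothetical helper) can be passed the same way.
first_row(group) = group[1, :]
r3 = combine(gd, first_row)

same = (r1 == r2) && (r2 == r3)  # true when all three results agree
```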
`combine(groupby(df, :grp), sdf -> sdf[1, :])` is OK and it is the same as `combine(groupby(df, :grp), first)`, as `first` is just defined as:
Just to expand on what @pdeffebach commented. We often use `sdf` as a variable name to signal that this is a view of an original data frame, which is of `SubDataFrame` type.
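To make the "view" remark concrete, the sketch below inspects one group taken from a `GroupedDataFrame` by indexing, which is assumed here to be the same kind of object that `combine` hands to the function. The toy data is hypothetical.

```julia
using DataFrames

df = DataFrame(grp = ["a", "a", "b"], x = [1, 3, 2])  # hypothetical toy data
gd = groupby(df, :grp)

sdf = gd[1]                          # the rows belonging to the first group

is_view = sdf isa SubDataFrame       # a SubDataFrame, not a fresh DataFrame
same_parent = parent(sdf) === df     # it points back at the original `df`

# Because it is a view, writing through it modifies the parent data frame.
sdf[1, :x] = 100
parent_changed = df.x[1] == 100
```

This is why selecting a row with `sdf[1, :]` inside `combine` does not need to copy each group first.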
Thanks. I got totally confused by the lambda argument name and thought somehow this was a special variable much like `.SD`.
I am still amazed that this ends up being so easy. Kudos to DataFrames!