Update comparisons with data.table info #2725

Merged
merged 15 commits into from
May 13, 2021
62 changes: 61 additions & 1 deletion docs/src/man/comparisons.md
@@ -235,6 +235,66 @@ The table below compares more advanced commands:
| DataFrame as input | `summarize(df, head(across(), 2))` | `combine(d -> first(d, 2), df)` |
| DataFrame as output | `summarize(df, tibble(value = c(min(x), max(x))))` | `combine(:x => x -> (value = [minimum(x), maximum(x)],), df)` |

## Comparison with the R package data.table

The following table compares the main functions of DataFrames.jl with the R package data.table (version 1.14.1).

```R
library(data.table)
df <- data.table(grp = rep(1:2, 3), x = 6:1, y = 4:9,
z = c(3:7, NA), id = letters[1:6])
df2 <- data.table(grp=c(1,3), w = c(10,11))
```
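The DataFrames.jl column of the tables below assumes a Julia data frame with the same contents. A minimal sketch of an equivalent setup follows; the construction is an assumption and not part of this diff. `id` is built as strings to mirror `letters[1:6]`, and `missing` stands in for `NA`.

```julia
using DataFrames, Statistics  # Statistics provides mean and cor

# Mirror the R objects above; `missing` plays the role of NA.
df = DataFrame(grp = repeat(1:2, 3), x = 6:-1:1, y = 4:9,
               z = [3:7; missing], id = string.('a':'f'))
df2 = DataFrame(grp = [1, 3], w = [10, 11])
```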

| Operation | data.table | DataFrames.jl |
|:-----------------------------------|:-------------------------------------------------|:---------------------------------------------|
| Reduce multiple values | `df[, list(mean(x))]` | `combine(df, :x => mean)` |
| Add new columns | `df[, x_mean := mean(x) ]` | `transform(df, :x => mean => :x_mean)` |
| Rename column (in place) | `setnames(df, "x", "x_new")` | `rename!(df, :x => :x_new)` |
| Rename multiple columns (in place) | `setnames(df, c("x", "y"), c("x_new", "y_new"))` | `rename!(df, [:x, :y] .=> [:x_new, :y_new])` |
| Pick columns | `df[, list(x, y)]` | `select(df, :x, :y)` |
| Remove columns | `df[, -"x" ]` | `select(df, Not(:x))` |
| Remove columns (in place) | `df[, c("x") := NULL ]` | `select!(df, Not(:x))` |
| Pick & transform columns | `df[, list(mean(x), y)]` | `select(df, :x => mean, :y)` |
| Pick rows | `df[ x >= 1 ]` | `filter(:x => >=(1), df)` |
| Sort rows (in place) | `setorder(df, x)` | `sort!(df, :x)` |
| Sort rows | `df[ order(x) ]` | `sort(df, :x)` |
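As a rough illustration of the DataFrames.jl column above, the sketch below runs a few of the non-mutating calls on a small data frame. The variable names and the threshold `3` are chosen for the example only.

```julia
using DataFrames, Statistics

df = DataFrame(x = 6:-1:1, y = 4:9)

reduced = combine(df, :x => mean)               # one row; column is auto-named x_mean
added   = transform(df, :x => mean => :x_mean)  # the scalar mean is repeated for every row
dropped = select(df, Not(:x))                   # new data frame without x
picked  = filter(:x => >=(3), df)               # keeps rows where x >= 3
sorted  = sort(df, :x)                          # ascending by x; df itself is untouched
```

None of these calls modifies `df`; the `!` variants listed in the table (`rename!`, `select!`, `sort!`) do.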

### Grouping data and aggregation

| Operation | data.table | DataFrames.jl |
|:----------------------------|:--------------------------------------------------|:------------------------------------------|
| Reduce multiple values | `df[, list(mean(x)), by = list(id) ]` | `combine(groupby(df, :id), :x => mean)` |
| Add new columns (in place) | `df[, x_mean := mean(x), by = list(id) ]` | `transform!(groupby(df, :id), :x => mean)`|
| Pick & transform columns | `df[, list(x_mean = mean(x), y), by = list(id) ]` | `select(groupby(df, :id), :x => mean, :y)`|
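A short sketch of the grouped calls may help here. It groups by `grp` rather than `id`, which is an assumption made for the example: `id` is unique per row in the sample data, so grouping by `grp` shows the aggregation more clearly.

```julia
using DataFrames, Statistics

df = DataFrame(grp = repeat(1:2, 3), x = 6:-1:1, y = 4:9)
gdf = groupby(df, :grp)

# combine collapses each group to one row; the new column is auto-named x_mean.
means = combine(gdf, :x => mean)

# transform keeps every original row, in the parent's row order, and adds x_mean.
with_mean = transform(gdf, :x => mean)
```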
eloualiche marked this conversation as resolved.
Show resolved Hide resolved

### More advanced commands

| Operation | data.table | DataFrames.jl |
|:----------------------------------|:-------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------|
| Complex Function | `df[, list(mean(x, na.rm = T)) ]` | `combine(df, :x => x -> mean(skipmissing(x)))` |
| Transform certain rows (in place) | `df[x<=0, x:=0 ]` | `df.x[df.x .<= 0] .= 0` |
| Transform several columns | `df[, list(max(x), min(y)) ]` | `combine(df, :x => maximum, :y => minimum)` |
| | `df[, lapply(.SD, mean), .SDcols = c("x", "y") ]` | `combine(df, [:x, :y] .=> mean)` |
| | `df[, lapply(.SD, mean), .SDcols = patterns("^x") ]` | `combine(df, names(df, r"^x") .=> mean)` |
| | `df[, unlist(lapply(.SD, function(x) c(max=max(x), min=min(x)))), .SDcols = c("x", "y") ]` | `combine(df, ([:x, :y] .=> [maximum minimum])...)` |

Looks like we have some work to do on this one. I can't think of an easier way right now. There may be an outstanding issue or pull request, maybe @jangorecki @MichaelChirico recall. I never wanted to encourage wide data, so my focus was on long. But I know people like to go wide like this, perhaps for presenting results in a paper or web page, so this task should be easier.

@grantmcdermott grantmcdermott Apr 29, 2021

It's cheating a bit — or maybe not — but I'd probably use dcast here.

dcast(df, .~., fun=list(min, max), value.var = c('x', 'y'))

The advantage of this approach is that it also scales well to cases where you want to collapse by group. I think the 'unlist' approach would struggle here.

dcast(df, grp~., fun=list(min, max), value.var = c('x', 'y'))

Mind you, grouping is something that the DataFrames.jl implementation automatically supports (and, to @mattdowle's point, might be conceptually simpler than my dcast workflow).

combine(groupby(df, :grp), ([:x, :y] .=> [minimum maximum])...)

I would do df[, c(lapply(.SD, max), lapply(.SD, min)), .SDcols = c("x", "y")]. That should GForce as well where the unlist one will not.

I guess it doesn't use GForce, and also, it results in duplicate names! ouch
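For readers following this thread, here is a minimal sketch of how the DataFrames.jl expression in the row under discussion expands. It relies only on ordinary Julia broadcasting; the names `ops` and `wide` are hypothetical and used for illustration.

```julia
using DataFrames

df = DataFrame(x = 6:-1:1, y = 4:9)

# A column vector broadcast against a 1×2 row matrix gives a 2×2 matrix of Pairs:
#   :x => maximum   :x => minimum
#   :y => maximum   :y => minimum
ops = [:x, :y] .=> [maximum minimum]

# Splatting passes the four pairs as separate arguments, in column-major order.
wide = combine(df, ops...)
```

The result is a single row whose columns are auto-named from the source column and the function, so duplicate names do not arise on the DataFrames.jl side.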

| Multivariate function | `df[, list(cor(x,y)) ]` | `transform(df, [:x, :y] => cor)` |
| Row-wise | `df[, min_xy := min(x, y), by = 1:nrow(df)]` | `transform(df, [:x, :y] => ByRow(min))` |
| | `df[, argmax_xy := which.max(.SD) , .SDcols = patterns("^x"), by = 1:nrow(df) ]` | `transform(df, AsTable(r"^x") => ByRow(argmax))` |
| DataFrame as output | `df[, .SD[, list(value = c(min(x), max(x)) )] ]` | `combine(df, :x => (x -> (value = [minimum(x), maximum(x)],)) => AsTable)` |

df[, .(c(min(x), max(x)))]

Contributor Author

I was trying to show a typical use case for `.SD`.
I typically find it useful for a non-standard function that I have to execute by group (a regression, for example), but that's the best example I could come up with.
If you have an idea, let me know; if not, I will likely remove this row, because it has caused some confusion about the goal before.

@mattdowle mattdowle Apr 28, 2021

Ok I see. One use-case for .SD that springs to mind is DT[, .SD[1], by=group]; i.e. return the first row of every group (change 1 to .N for last row instead). Or, DT[, .SD[which.max(someCol)], by=group]; i.e. return the row in each group that has the biggest value in some column.
.SD is really for use together with grouping to do a sub-query within each group. I've noticed people using .SD with no grouping present, and that's generally a red flag that a much simpler way is idiomatic. Originally, iirc, .SD only even worked if grouping clause was present, by design. But then folk had programmatic code that sometimes grouped and sometimes didn't, and for learning purposes too, so for consistency they asked for .SD to work even when no grouping was present.

Contributor Author

Thanks. I think I confused myself with the DataFrames.jl syntax. Here are your examples; it would be great if someone from DataFrames.jl could tell me whether this is idiomatic.

First element by group (usually comes after a sort)

df[, .SD[1], by=grp]
combine(groupby(df, :grp), names(df) .=> (x->x[1]) .=> names(df))

Or max by group:

df[, .SD[which.max(x)], by=grp]
subset(groupby(df, :grp), :x => (x->(x.==maximum(x))) )

I am not sure the last one is the best way of doing this. But that is also a use case for subset by group (which @matthieugomez raised here recently)

Member

@bkamins bkamins Apr 29, 2021

combine(groupby(df, :grp), names(df) .=> (x->x[1]) .=> names(df))

is just combine(groupby(df, :grp), first)

subset(groupby(df, :grp), :x => (x->(x.==maximum(x))) )

is probably more natural to write as combine(groupby(df, :grp), sdf -> sdf[argmax(sdf.x), :])

(note that in both cases the approach in data.table and in DataFrames.jl is conceptually the same (I have learned something 😄) only the syntax is a bit different)

Contributor Author

@eloualiche eloualiche Apr 29, 2021

That's amazing. I had no idea about sdf. Is this a reserved variable that stands for the df for each group within the scope of combine? I did not see it documented.
That's exactly the kind of thing I wanted to show. This will make the guide extra useful!

Just to be clear on the syntax. Does it mean I could have:
combine(groupby(df, :grp), sdf -> sdf[1, :]) for the first example?

Contributor

No, the syntax `x -> f(x)` creates an anonymous function, so `sdf` is just the argument of a lambda.

Member

combine(groupby(df, :grp), sdf -> sdf[1, :]) is OK and it is the same as combine(groupby(df, :grp), first), as first is just defined as:

Base.first(df::AbstractDataFrame) = df[1, :]

Member

@bkamins bkamins Apr 29, 2021

Just to expand on what @pdeffebach commented. We often use sdf as a variable name to signal that this is a view of an original data frame, which is of SubDataFrame type.

Contributor Author

Thanks. I got totally confused by the lambda argument name and thought somehow this was a special variable much like .SD
I am still amazed that this ends up being so easy. Kudos to DataFrames!
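To summarize the thread, here is a minimal sketch of the per-group sub-query pattern discussed above. It uses `argmax(sdf.y)` instead of the thread's `argmax(sdf.x)`, an assumption made so the result differs from the first-row case: `x` is already in descending order in the sample data.

```julia
using DataFrames

df = DataFrame(grp = repeat(1:2, 3), x = 6:-1:1, y = 4:9)
gdf = groupby(df, :grp)

# `sdf` is an ordinary argument name: combine calls the function once per
# group and passes that group's SubDataFrame.
first_rows = combine(gdf, first)                          # like df[, .SD[1], by=grp]
also_first = combine(gdf, sdf -> sdf[1, :])               # same result, written out
max_y_rows = combine(gdf, sdf -> sdf[argmax(sdf.y), :])   # like .SD[which.max(y)]
```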


### Joining data frames

| Operation | data.table | DataFrames.jl |
|:----------------------|:------------------------------------------------|:--------------------------------|
| Inner join | `merge(df, df2, by = "grp")` | `innerjoin(df, df2, on = :grp)` |
| Outer join | `merge(df, df2, all = TRUE, by = "grp")` | `outerjoin(df, df2, on = :grp)` |
| Left join | `merge(df, df2, all.x = TRUE, by = "grp")` | `leftjoin(df, df2, on = :grp)` |
| Right join | `merge(df, df2, all.y = TRUE, by = "grp")` | `rightjoin(df, df2, on = :grp)` |
| Anti join (filtering) | `df[!df2, on = "grp" ]` | `antijoin(df, df2, on = :grp)` |
| Semi join (filtering) | `merge(df, df2[, list(grp)])` | `semijoin(df, df2, on = :grp)` |
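The row counts each join produces on the sample data can be checked with a sketch like the one below. The data frames are cut down to the columns needed for the join, which is a simplification made for the example.

```julia
using DataFrames

df  = DataFrame(grp = repeat(1:2, 3), x = 6:-1:1)
df2 = DataFrame(grp = [1, 3], w = [10, 11])

inner = innerjoin(df, df2, on = :grp)   # only grp == 1 appears in both
outer = outerjoin(df, df2, on = :grp)   # all rows of df plus the unmatched grp == 3
left  = leftjoin(df, df2, on = :grp)    # all rows of df; w is missing where grp == 2
right = rightjoin(df, df2, on = :grp)   # matches for grp == 1 plus the unmatched grp == 3
anti  = antijoin(df, df2, on = :grp)    # rows of df with no match in df2
semi  = semijoin(df, df2, on = :grp)    # rows of df with a match; no columns from df2
```

The filtering joins (`antijoin`, `semijoin`) return only the columns of the first data frame.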


## Comparison with Stata (version 8 and above)

The following table compares the main functions of DataFrames.jl with Stata:
@@ -245,7 +305,7 @@ The following table compares the main functions of DataFrames.jl with Stata:
| Add new columns | `egen x_mean = mean(x)` | `transform!(df, :x => mean => :x_mean)` |
| Rename columns | `rename x x_new` | `rename!(df, :x => :x_new)` |
| Pick columns | `keep x y` | `select!(df, :x, :y)` |
| Pick rows | `keep if x >= 1` | `subset!(df, :x => ByRow(x -> x >= 1))` |
| Sort rows | `sort x` | `sort!(df, :x)` |

Note that the suffix `!` (e.g. `transform!`, `select!`) means that the operation modifies the data frame in place, as in Stata.
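A small sketch of the difference between the mutating and non-mutating variants, using a made-up three-row data frame:

```julia
using DataFrames, Statistics

df = DataFrame(x = [0, 1, 2])

# Without `!`, a new data frame is returned and df keeps its single column.
copied = transform(df, :x => mean => :x_mean)
cols_before = names(df)

# With `!`, df itself gains the column and then loses the filtered-out row.
transform!(df, :x => mean => :x_mean)
subset!(df, :x => ByRow(x -> x >= 1))
```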