Update comparisons with data.table info #2725

Merged
merged 15 commits into from
May 13, 2021
66 changes: 65 additions & 1 deletion docs/src/man/comparisons.md
@@ -235,6 +235,70 @@ The table below compares more advanced commands:
| DataFrame as input | `summarize(df, head(across(), 2))` | `combine(d -> first(d, 2), df)` |
| DataFrame as output | `summarize(df, tibble(value = c(min(x), max(x))))` | `combine(:x => x -> (value = [minimum(x), maximum(x)],), df)` |


## Comparison with the R package data.table

The following table compares the main functions of DataFrames.jl with the R package data.table (version 1.14.1).

```R
library(data.table)
df <- data.table(grp = rep(1:2, 3), x = 6:1, y = 4:9,
                 z = c(3:7, NA), id = letters[1:6])
df2 <- data.table(grp = c(1, 3), w = c(10, 11))
```
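
For side-by-side experiments, the R setup above can be mirrored in Julia. This is a minimal sketch, assuming DataFrames.jl is installed; it is not part of the diff itself. Note that `missing` stands in for R's `NA`, and `'a':'f'` produces `Char` values rather than the strings that `letters[1:6]` gives in R.

```julia
using DataFrames

# Julia counterpart of the data.table setup (illustrative only).
df = DataFrame(grp = repeat(1:2, 3),   # 1, 2, 1, 2, 1, 2 like rep(1:2, 3)
               x = 6:-1:1,             # R's 6:1 counts down; Julia needs an explicit step
               y = 4:9,
               z = [3:7; missing],     # `missing` plays the role of NA
               id = 'a':'f')
df2 = DataFrame(grp = [1, 3], w = [10, 11])
```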

| Operation | data.table | DataFrames.jl |
|:-----------------------------------|:-------------------------------------------------|:---------------------------------------------|
| Reduce multiple values | `df[, .(mean(x))]` | `combine(df, :x => mean)` |
| Add new columns | `df[, x_mean:=mean(x) ]` | `transform(df, :x => mean => :x_mean)` |
| Rename column (in place) | `setnames(df, "x", "x_new")` | `rename!(df, :x => :x_new)` |
| Rename multiple columns (in place) | `setnames(df, c("x", "y"), c("x_new", "y_new"))` | `rename!(df, [:x, :y] .=> [:x_new, :y_new])` |
| Pick columns as dataframe | `df[, .(x, y)]` | `select(df, :x, :y)` |
| Pick column as a vector | `df[, x]` | `df[!, :x]` |
| Remove columns | `df[, -"x"]` | `select(df, Not(:x))` |
| Remove columns (in place) | `df[, x:=NULL]` | `select!(df, Not(:x))` |
| Remove columns (in place) | `df[, c("x", "y"):=NULL]` | `select!(df, Not([:x, :y]))` |
| Pick & transform columns | `df[, .(mean(x), y)]` | `select(df, :x => mean, :y)` |
| Pick rows | `df[ x >= 1 ]` | `filter(:x => >=(1), df)` |
| Sort rows (in place) | `setorder(df, x)` | `sort!(df, :x)` |
| Sort rows | `df[ order(x) ]` | `sort(df, :x)` |
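
A few rows of the table above can be tried directly. The following is a small sketch on a cut-down data frame (the variable names `means`, `wide`, `no_x`, `big`, and `xs` are illustrative, not from the docs), assuming DataFrames.jl and the `Statistics` standard library are available.

```julia
using DataFrames, Statistics

df = DataFrame(x = 6:-1:1, y = 4:9)

means = combine(df, :x => mean)               # one row, column named :x_mean
wide  = transform(df, :x => mean => :x_mean)  # copy of df plus a constant :x_mean column
no_x  = select(df, Not(:x))                   # copy of df without :x
big   = filter(:x => >=(3), df)               # rows where x >= 3
xs    = df[!, :x]                             # the stored column vector, without copying
```

`combine` collapses to one row, while `transform` keeps every row and broadcasts the scalar result, which is the closest analogue to `:=` in data.table (except that `transform` returns a new data frame).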

### Grouping data and aggregation

| Operation | data.table | DataFrames.jl |
|:----------------------------|:--------------------------------------------------|:------------------------------------------|
| Reduce multiple values | `df[, mean(x), by=id ]` | `combine(groupby(df, :id), :x => mean)` |
| Add new columns (in place) | `df[, x_mean:=mean(x), by=id]` | `transform!(groupby(df, :id), :x => mean)`|
| Pick & transform columns | `df[, .(x_mean = mean(x), y), by=id]` | `select(groupby(df, :id), :x => mean, :y)`|
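
The grouped operations can be sketched as follows. This example groups by `:grp` rather than `:id` (an assumption made here so that each group has more than one row), and the names `gdf`, `per_group`, and `per_row` are illustrative.

```julia
using DataFrames, Statistics

df = DataFrame(grp = repeat(1:2, 3), x = 6:-1:1, y = 4:9)
gdf = groupby(df, :grp)

per_group = combine(gdf, :x => mean)     # one row per group
per_row   = select(gdf, :x => mean, :y)  # one row per original row, in original order
transform!(gdf, :x => mean)              # adds :x_mean to the parent `df` in place
```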

### More advanced commands

| Operation | data.table | DataFrames.jl |
|:----------------------------------|:-------------------------------------------------------------------------------------------|:----------------------------------------------------------------------------|
| Complex Function | `df[, .(mean(x, na.rm=TRUE)) ]` | `combine(df, :x => x -> mean(skipmissing(x)))` |
| Transform certain rows (in place) | `df[x<=0, x:=0 ]` | `df.x[df.x .<= 0] .= 0` |
| Transform several columns | `df[, .(max(x), min(y)) ]` | `combine(df, :x => maximum, :y => minimum)` |
| | `df[, lapply(.SD, mean), .SDcols = c("x", "y") ]` | `combine(df, [:x, :y] .=> mean)` |
| | `df[, lapply(.SD, mean), .SDcols = patterns("^x") ]` | `combine(df, names(df, r"^x") .=> mean)` |
| | `df[, unlist(lapply(.SD, function(x) c(max=max(x), min=min(x)))), .SDcols = c("x", "y") ]` | `combine(df, ([:x, :y] .=> [maximum minimum])...)` |


Looks like we have some work to do on this one. I can't think of an easier way right now. There may be an outstanding issue or pull request, maybe @jangorecki @MichaelChirico recall. I never wanted to encourage wide data, so my focus was on long. But I know people like to go wide like this, perhaps for presenting results in a paper or web page, so this task should be easier.


@grantmcdermott grantmcdermott Apr 29, 2021


It's cheating a bit — or maybe not — but I'd probably use dcast here.

dcast(df, .~., fun=list(min, max), value.var = c('x', 'y'))

The advantage of this approach is that it also scales well to cases where you want to collapse by group. I think the 'unlist' approach would struggle here.

dcast(df, grp~., fun=list(min, max), value.var = c('x', 'y'))

Mind you, grouping is something that the DataFrames.jl implementation automatically supports (and, to @mattdowle's point, might be conceptually simpler than my dcast workflow).

combine(groupby(df, :grp), ([:x, :y] .=> [minimum maximum])...)


I would do `df[, c(lapply(.SD, max), lapply(.SD, min)), .SDcols = c("x", "y")]`. That should GForce as well where the unlist one will not.


I guess it doesn't use GForce, and also, it results in duplicate names! ouch

| Multivariate function | `df[, .(cor(x,y)) ]` | `transform(df, [:x, :y] => cor)` |
| Row-wise | `df[, min_xy := min(x, y), by = 1:nrow(df)]` | `transform(df, [:x, :y] => ByRow(min))` |
| | `df[, argmax_xy := which.max(.SD) , .SDcols = patterns("^x"), by = 1:nrow(df) ]` | `transform(df, AsTable(r"^x") => ByRow(argmax))` |
| DataFrame as output | `df[, .SD[1], by=grp]` | `combine(groupby(df, :grp), first)` |
| DataFrame as output | `df[, .SD[which.max(x)], by=grp]` | `combine(groupby(df, :grp), sdf -> sdf[argmax(sdf.x), :])` |
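
The `[:x, :y] .=> [maximum minimum]` idiom relies on ordinary Julia broadcasting: a column vector paired against a 1×2 row matrix gives a 2×2 matrix of `Pair`s, which is then splatted into `combine`. A minimal sketch follows (the names `ops`, `overall`, `by_grp`, and `rowmin` are illustrative), assuming DataFrames.jl is installed.

```julia
using DataFrames

df = DataFrame(grp = repeat(1:2, 3), x = 6:-1:1, y = 4:9)

# 2×2 matrix of pairs; splatting walks it column by column:
# :x => maximum, :y => maximum, :x => minimum, :y => minimum
ops = [:x, :y] .=> [maximum minimum]

overall = combine(df, ops...)                 # one row, four summary columns
by_grp  = combine(groupby(df, :grp), ops...)  # the same summaries per group

# ByRow applies the function to each row's values instead of whole columns.
rowmin = transform(df, [:x, :y] => ByRow(min))  # adds a column named :x_y_min
```

The same `ops` works unchanged on the grouped data frame, which is the scaling property discussed in the review thread above.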


FWIW I would usually do this as `df[order(-x), .SD[1], by=grp]`

Contributor Author

@eloualiche eloualiche May 2, 2021


Agreed. For this use case, it is probably more idiomatic.
However, the goal here is to showcase a function that uses subdataframes. I am afraid that if we only use `first` and `.SD[1]`, this might seem more limited than using an actual function on `.SD`.

If you have another example to showcase using functions on `.SD`, I will be happy to take it!


great! FWIW I ran into the exact same issue writing the .SD vignette:

https://stackoverflow.com/a/47406952/3576984


### Joining data frames

| Operation | data.table | DataFrames.jl |
|:----------------------|:------------------------------------------------|:--------------------------------|
| Inner join | `merge(df, df2, by = "grp")` | `innerjoin(df, df2, on = :grp)` |
| Outer join | `merge(df, df2, all = TRUE, by = "grp")` | `outerjoin(df, df2, on = :grp)` |
| Left join | `merge(df, df2, all.x = TRUE, by = "grp")` | `leftjoin(df, df2, on = :grp)` |
| Right join | `merge(df, df2, all.y = TRUE, by = "grp")` | `rightjoin(df, df2, on = :grp)` |
| Anti join (filtering) | `df[!df2, on = "grp" ]` | `antijoin(df, df2, on = :grp)` |
| Semi join (filtering) | `merge(df, df2[, .(grp)])` | `semijoin(df, df2, on = :grp)` |
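
To see how the join types differ on the example data, here is a small sketch (cut down to the columns that matter; the result variable names are illustrative), assuming DataFrames.jl is installed. `df` has groups 1 and 2, while `df2` has groups 1 and 3.

```julia
using DataFrames

df  = DataFrame(grp = repeat(1:2, 3), x = 6:-1:1)
df2 = DataFrame(grp = [1, 3], w = [10, 11])

inner = innerjoin(df, df2, on = :grp)  # 3 rows: only grp == 1 is in both
left  = leftjoin(df, df2, on = :grp)   # 6 rows: w is missing where grp == 2
right = rightjoin(df, df2, on = :grp)  # 4 rows: x is missing where grp == 3
outer = outerjoin(df, df2, on = :grp)  # 7 rows: everything from both sides
semi  = semijoin(df, df2, on = :grp)   # 3 rows of df that have a match; no w column
anti  = antijoin(df, df2, on = :grp)   # 3 rows of df that have no match
```

The semi and anti joins only filter `df`, so they never add columns from `df2`.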


## Comparison with Stata (version 8 and above)

The following table compares the main functions of DataFrames.jl with Stata:
@@ -245,7 +309,7 @@ The following table compares the main functions of DataFrames.jl with Stata:
| Add new columns | `egen x_mean = mean(x)` | `transform!(df, :x => mean => :x_mean)` |
| Rename columns | `rename x x_new` | `rename!(df, :x => :x_new)` |
| Pick columns | `keep x y` | `select!(df, :x, :y)` |
| Pick rows | `keep if x >= 1` | `subset!(df, :x => ByRow(x -> x >= 1))` |
| Sort rows | `sort x` | `sort!(df, :x)` |

Note that the suffix `!` (as in `transform!`, `select!`, etc.) indicates that the operation modifies the data frame in place, as in Stata.
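
The difference between the two forms can be checked with a short sketch (variable names such as `df_new` and `names_before` are illustrative), assuming DataFrames.jl and the `Statistics` standard library are available.

```julia
using DataFrames, Statistics

df = DataFrame(x = [1, 2, 3])

# Without `!`: a new data frame is returned and `df` is left untouched.
df_new = transform(df, :x => mean => :x_mean)
names_before = names(df)

# With `!`: `df` itself gains the new column.
transform!(df, :x => mean => :x_mean)
names_after = names(df)
```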