reviewed and fixed comparisons with other frameworks (#2755)
RohitRathore1 authored May 9, 2021
1 parent d5e0957 commit 7b41a0b
Showing 1 changed file with 33 additions and 28 deletions.
61 changes: 33 additions & 28 deletions docs/src/man/comparisons.md
@@ -12,6 +12,10 @@ df = DataFrame(grp = repeat(1:2, 3), x = 6:-1:1, y = 4:9, z = [3:7; missing], id
df2 = DataFrame(grp = [1, 3], w = [10, 11])
```

!!! note

    Some of the operations mutate the tables, so every operation is assumed to be performed on the original data frame.

Note that in the comparisons presented below predicates like `x -> x >= 1` can
be more compactly written as `>=(1)`. The latter form has an additional benefit
that it is compiled only once per Julia session (as opposed to `x -> x >= 1`
@@ -120,6 +124,7 @@ over each group independently. The result of `groupby` is a `GroupedDataFrame` o
which may be processed using the `combine`, `transform`, or `select` functions.
The following table illustrates some common grouping and aggregation usages.


| Operation | pandas | DataFrames.jl |
|:--------------------------------|:---------------------------------------------------------------------------------------|:-----------------------------------------------------|
| Aggregate by groups | `df.groupby('grp')['x'].mean()` | `combine(groupby(df, :grp), :x => mean)` |
@@ -144,11 +149,11 @@ For DataFrames.jl, it looks like this:
```julia
julia> combine(groupby(df, :grp), :x => mean)
2×2 DataFrame
 Row │ grp    x_mean
     │ Int64  Float64
─────┼────────────────
   1 │     1      4.0
   2 │     2      3.0
 Row │ grp    x_mean
     │ Int64  Float64
─────┼────────────────
   1 │     1      4.0
   2 │     2      3.0
```

In DataFrames.jl, the `GroupedDataFrame` object supports an efficient key lookup.
@@ -159,16 +164,16 @@ Hence, it performs well when you need to perform lookups repeatedly.
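
To illustrate the key lookup described above, here is a minimal sketch that indexes a `GroupedDataFrame` by group key. It assumes DataFrames.jl 1.x and uses a reduced version of the example `df` from the top of the page; the names `gdf`, `g2`, and `g2_again` are illustrative only.

```julia
using DataFrames

# Reduced version of the example data frame used on this page.
df = DataFrame(grp = repeat(1:2, 3), x = 6:-1:1, y = 4:9)

# Group once; the resulting GroupedDataFrame can then be queried repeatedly.
gdf = groupby(df, :grp)

# Look up a group by its key, given either as a NamedTuple or as a plain Tuple.
g2 = gdf[(grp = 2,)]
g2_again = gdf[(2,)]

# Both lookups return a view (SubDataFrame) of the rows where grp == 2.
println(g2.x)
```

Because the grouping is computed once in `groupby`, later lookups only index into the existing groups instead of scanning `df` again.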
This section includes more complex examples.

| Operation | pandas | DataFrames.jl |
|:---------------------------------------|:-----------------------------------------------------------------------------|:----------------------------------------------------------|
| Complex Function | `df[['z']].agg(lambda v: np.mean(np.cos(v)))` | `combine(df, :z => v -> mean(cos, skipmissing(v)))` |
| Aggregate multiple columns | `df.agg({'x': max, 'y': min})` | `combine(df, :x => maximum, :y => minimum)` |
| | `df[['x', 'y']].mean()` | `combine(df, [:x, :y] .=> mean)` |
| | `df.filter(regex=("^x")).mean()` | `combine(df, names(df, r"^x") .=> mean)` |
| Apply function over multiple variables | `df.assign(x_y_cor = np.corrcoef(df.x, df.y)[0, 1])` | `transform(df, [:x, :y] => cor)` |
| Row-wise operation | `df.assign(x_y_min = df.apply(lambda v: min(v.x, v.y), axis=1))` | `transform(df, [:x, :y] => ByRow(min))` |
| | `df.assign(x_y_argmax = df.apply(lambda v: df.columns[v.argmax()], axis=1))` | `transform(df, AsTable([:x, :y]) => ByRow(argmax))` |
| DataFrame as input | `df.groupby('grp').head(2)` | `combine(d -> first(d, 2), groupby(df, :grp))` |
| DataFrame as output | `df[['x']].agg(lambda x: [min(x), max(x)])` | `combine(:x => x -> (x = [minimum(x), maximum(x)],), df)` |
|:---------------------------------------|:-----------------------------------------------------------------------------|:------------------------------------------------------------------------|
| Complex Function | `df[['z']].agg(lambda v: np.mean(np.cos(v)))` | `combine(df, :z => v -> mean(cos, skipmissing(v)))` |
| Aggregate multiple columns | `df.agg({'x': max, 'y': min})` | `combine(df, :x => maximum, :y => minimum)` |
| | `df[['x', 'y']].mean()` | `combine(df, [:x, :y] .=> mean)` |
| | `df.filter(regex=("^x")).mean()` | `combine(df, names(df, r"^x") .=> mean)` |
| Apply function over multiple variables | `df.assign(x_y_cor = np.corrcoef(df.x, df.y)[0, 1])` | `transform(df, [:x, :y] => cor)` |
| Row-wise operation | `df.assign(x_y_min = df.apply(lambda v: min(v.x, v.y), axis=1))` | `transform(df, [:x, :y] => ByRow(min))` |
| | `df.assign(x_y_argmax = df.apply(lambda v: df.columns[v.argmax()], axis=1))` | `transform(df, AsTable([:x, :y]) => ByRow(argmax))` |
| DataFrame as input | `df.groupby('grp').head(2)` | `combine(d -> first(d, 2), groupby(df, :grp))` |
| DataFrame as output | `df[['x']].agg(lambda x: [min(x), max(x)])` | `combine(df, :x => (x -> (x = [minimum(x), maximum(x)],)) => AsTable)` |
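
The "DataFrame as output" row relies on the `=> AsTable` target to expand a returned `NamedTuple` into columns. The following sketch, assuming DataFrames.jl 1.x and a reduced version of the example `df`, shows the ungrouped form from the table and a grouped variant; the column names `lo` and `hi` are made up for illustration.

```julia
using DataFrames

# Reduced version of the example data frame used on this page.
df = DataFrame(grp = repeat(1:2, 3), x = 6:-1:1, y = 4:9)

# The function returns a NamedTuple holding a vector; `=> AsTable` tells
# `combine` to turn each field into a column of the result.
res = combine(df, :x => (x -> (x = [minimum(x), maximum(x)],)) => AsTable)

# The same idea per group: a NamedTuple of scalars yields one row per group,
# with one column per field (`lo` and `hi` are hypothetical names).
by_grp = combine(groupby(df, :grp),
                 :x => (x -> (lo = minimum(x), hi = maximum(x))) => AsTable)

println(res)
println(by_grp)
```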

Note that pandas preserves the same row order after `groupby` whereas DataFrames.jl
shows them grouped by the provided keys after the `combine` operation,
@@ -222,18 +227,18 @@ As in dplyr, some of these functions can be applied to grouped data frames, in w

The table below compares more advanced commands:

| Operation | dplyr | DataFrames.jl |
|:--------------------------|:----------------------------------------------------------|:--------------------------------------------------------------|
| Complex Function | `summarize(df, mean(x, na.rm = T))` | `combine(df, :x => x -> mean(skipmissing(x)))` |
| Transform several columns | `summarize(df, max(x), min(y))` | `combine(df, :x => maximum, :y => minimum)` |
| | `summarize(df, across(c(x, y), mean))` | `combine(df, [:x, :y] .=> mean)` |
| | `summarize(df, across(starts_with("x"), mean))` | `combine(df, names(df, r"^x") .=> mean)` |
| | `summarize(df, across(c(x, y), list(max, min)))` | `combine(df, ([:x, :y] .=> [maximum minimum])...)` |
| Multivariate function | `mutate(df, cor(x, y))` | `transform(df, [:x, :y] => cor)` |
| Row-wise | `mutate(rowwise(df), min(x, y))` | `transform(df, [:x, :y] => ByRow(min))` |
| | `mutate(rowwise(df), which.max(c_across(matches("^x"))))` | `transform(df, AsTable(r"^x") => ByRow(argmax))` |
| DataFrame as input | `summarize(df, head(across(), 2))` | `combine(d -> first(d, 2), df)` |
| DataFrame as output | `summarize(df, tibble(value = c(min(x), max(x))))` | `combine(:x => x -> (value = [minimum(x), maximum(x)],), df)` |
| Operation | dplyr | DataFrames.jl |
|:--------------------------|:----------------------------------------------------------|:---------------------------------------------------------------------------|
| Complex Function | `summarize(df, mean(x, na.rm = T))` | `combine(df, :x => x -> mean(skipmissing(x)))` |
| Transform several columns | `summarize(df, max(x), min(y))` | `combine(df, :x => maximum, :y => minimum)` |
| | `summarize(df, across(c(x, y), mean))` | `combine(df, [:x, :y] .=> mean)` |
| | `summarize(df, across(starts_with("x"), mean))` | `combine(df, names(df, r"^x") .=> mean)` |
| | `summarize(df, across(c(x, y), list(max, min)))` | `combine(df, ([:x, :y] .=> [maximum minimum])...)` |
| Multivariate function | `mutate(df, cor(x, y))` | `transform(df, [:x, :y] => cor)` |
| Row-wise | `mutate(rowwise(df), min(x, y))` | `transform(df, [:x, :y] => ByRow(min))` |
| | `mutate(rowwise(df), which.max(c_across(matches("^x"))))` | `transform(df, AsTable(r"^x") => ByRow(argmax))` |
| DataFrame as input | `summarize(df, head(across(), 2))` | `combine(d -> first(d, 2), df)` |
| DataFrame as output | `summarize(df, tibble(value = c(min(x), max(x))))` | `combine(df, :x => (x -> (value = [minimum(x), maximum(x)],)) => AsTable)` |

## Comparison with Stata (version 8 and above)

@@ -245,7 +250,7 @@ The following table compares the main functions of DataFrames.jl with Stata:
| Add new columns | `egen x_mean = mean(x)` | `transform!(df, :x => mean => :x_mean)` |
| Rename columns | `rename x x_new` | `rename!(df, :x => :x_new)` |
| Pick columns | `keep x y` | `select!(df, :x, :y)` |
| Pick rows | `keep if x >= 1` | `subset!(df, :x => ByRow(x -> x >= 1)` |
| Pick rows | `keep if x >= 1` | `subset!(df, :x => ByRow(x -> x >= 1))` |
| Sort rows | `sort x` | `sort!(df, :x)` |

Note that the suffix `!` (e.g. `transform!`, `select!`) ensures that the operation modifies the data frame in place, as in Stata.
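
The difference between the copying and in-place variants can be sketched as follows. This assumes DataFrames.jl 1.x and a small stand-in data frame rather than the full example `df`; `df_copy` and `n_before` are illustrative names.

```julia
using DataFrames, Statistics

# Small stand-in data frame with two columns.
df = DataFrame(x = 6:-1:1, y = 4:9)

# Without `!`: a new data frame is returned and `df` keeps its two columns.
df_copy = transform(df, :x => mean => :x_mean)
n_before = ncol(df)

# With `!`: `df` itself gains the `x_mean` column.
transform!(df, :x => mean => :x_mean)

# `>=(3)` is the curried form of `x -> x >= 3`; rows failing the test
# are removed from `df` in place.
subset!(df, :x => ByRow(>=(3)))

println(df)
```

The in-place forms mirror Stata's behavior of modifying the data set in memory, while the forms without `!` leave the original available for further comparisons.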