Deprecation warning when using insert! with duplicate column name #1308

Merged
merged 9 commits into from
Dec 29, 2017
7 changes: 4 additions & 3 deletions src/abstractdataframe/join.jl
@@ -241,10 +241,11 @@ Join two `DataFrame` objects
- `:cross` : a full Cartesian product of the key combinations; every
row of `df1` is matched with every row of `df2`

* `makeunique` : how to handle columns with duplicate names other than `on` in joined tables:

- `false` : throw an error if duplicate column names are present
- `true` : duplicate column names in `df2` will be deduplicated by adding a suffix
* `makeunique` : if `false` (the default), an error will be raised
if duplicate names are found in columns not joined on;
if `true`, duplicate names will be suffixed with `_i`
Member

Am I right that `i` can only be 1 in this case, since we don't allow duplicated column names in either of the sources?

Member Author

No - it can be a bigger value. See:

julia> x = DataFrame(rand(3,3))
3×3 DataFrames.DataFrame
│ Row │ x1       │ x2       │ x3       │
├─────┼──────────┼──────────┼──────────┤
│ 1   │ 0.791145 │ 0.193656 │ 0.238548 │
│ 2   │ 0.756677 │ 0.195111 │ 0.929581 │
│ 3   │ 0.736347 │ 0.234449 │ 0.499932 │

julia> y = DataFrame(rand(3,3))
3×3 DataFrames.DataFrame
│ Row │ x1       │ x2        │ x3       │
├─────┼──────────┼───────────┼──────────┤
│ 1   │ 0.220774 │ 0.0541193 │ 0.735224 │
│ 2   │ 0.147731 │ 0.104057  │ 0.144468 │
│ 3   │ 0.282832 │ 0.0247473 │ 0.188086 │

julia> x[:id] = 1:3
1:3

julia> y[:id] = 1:3
1:3

julia> names!(y, [:x1, :x1, :x1, :id], allow_duplicates=true)
3×4 DataFrames.DataFrame
│ Row │ x1       │ x1_1      │ x1_2     │ id │
├─────┼──────────┼───────────┼──────────┼────┤
│ 1   │ 0.220774 │ 0.0541193 │ 0.735224 │ 1  │
│ 2   │ 0.147731 │ 0.104057  │ 0.144468 │ 2  │
│ 3   │ 0.282832 │ 0.0247473 │ 0.188086 │ 3  │

julia> join(x, y, on=:id)
3×7 DataFrames.DataFrame
│ Row │ x1       │ x2       │ x3       │ id │ x1_3     │ x1_1      │ x1_2     │
├─────┼──────────┼──────────┼──────────┼────┼──────────┼───────────┼──────────┤
│ 1   │ 0.791145 │ 0.193656 │ 0.238548 │ 1  │ 0.220774 │ 0.0541193 │ 0.735224 │
│ 2   │ 0.756677 │ 0.195111 │ 0.929581 │ 2  │ 0.147731 │ 0.104057  │ 0.144468 │
│ 3   │ 0.736347 │ 0.234449 │ 0.499932 │ 3  │ 0.282832 │ 0.0247473 │ 0.188086 │
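
The `x1_3` in the output above can be reproduced with a small standalone sketch. It assumes a two-pass rule that is consistent with the results shown here: first occurrences keep their names, and each repeat then takes the smallest unused `_k` suffix. `dedup_names` is a hypothetical helper written for illustration, not the DataFrames implementation.

```julia
# Hypothetical helper (an assumption for illustration, not DataFrames code).
function dedup_names(names::Vector{Symbol})
    seen = Set{Symbol}()
    dups = Int[]
    # First pass: record the first occurrence of each name and note the
    # positions of repeats.
    for (i, nm) in enumerate(names)
        nm in seen ? push!(dups, i) : push!(seen, nm)
    end
    out = copy(names)
    # Second pass: give each repeat the smallest unused suffix `_k`, k >= 1.
    for i in dups
        k = 1
        while Symbol(names[i], "_", k) in seen
            k += 1
        end
        out[i] = Symbol(names[i], "_", k)
        push!(seen, out[i])
    end
    return out
end

# Column names of `y` as passed to `names!` above.
renamed = dedup_names([:x1, :x1, :x1, :id])

# Column names of `x` followed by the non-key columns of the renamed `y`.
joined = dedup_names([:x1, :x2, :x3, :id, :x1, :x1_1, :x1_2])
```

Under this rule `renamed` is `[:x1, :x1_1, :x1_2, :id]`, and in `joined` the second `:x1` becomes `:x1_3` because `:x1_1` and `:x1_2` are already taken, so `i` is not limited to 1.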

Member

Ah, right. It would make sense to have `left_suffix`/`right_suffix` arguments, as in R, to choose the deduplication suffix added to columns from the LHS and RHS. We would need to look at what other software does first.

If we improve this before the next release, we could remove `makeunique` without a deprecation.
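
To make the suggestion concrete, here is a rough sketch of how per-side suffixes could resolve name clashes, loosely modelled on the `suffixes` argument of R's `merge`. Everything in it is an assumption for discussion: `suffix_collisions` is a hypothetical helper, the keyword names are only the ones proposed above, and the default suffix values are made up.

```julia
# Hypothetical helper; `left_suffix` and `right_suffix` are the names proposed
# in this thread, not an existing DataFrames API. Defaults are illustrative.
function suffix_collisions(left::Vector{Symbol}, right::Vector{Symbol}, on::Symbol;
                           left_suffix::String="_left",
                           right_suffix::String="_right")
    # Non-key names that appear on both sides.
    clash = setdiff(intersect(left, right), [on])
    rename(nm, sfx) = nm in clash ? Symbol(nm, sfx) : nm
    new_left = [rename(nm, left_suffix) for nm in left]
    # The key column is kept once, from the left side.
    new_right = [rename(nm, right_suffix) for nm in right if nm != on]
    return vcat(new_left, new_right)
end

cols = suffix_collisions([:id, :x1, :x2], [:id, :x1, :x3], :id)
```

Here only `:x1` clashes, so `cols` is `[:id, :x1_left, :x2, :x1_right, :x3]`. Unlike the `_i` scheme, the resulting name tells you which table a column came from, but the sketch does not handle a source that already contains a suffixed name.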

(`i` starting at 1 for the first duplicate).

For the three join operations that may introduce missing values (`:outer`, `:left`,
and `:right`), all columns of the returned data table will support missing values.
10 changes: 4 additions & 6 deletions src/dataframe/dataframe.jl
@@ -22,11 +22,9 @@ DataFrame(ds::Vector{Associative})

* `columns` : a Vector with each column as contents
* `names` : the column names
* `makeunique` : keyword argument indicating how to handle duplicates in `names`

- `false` : throw an error if duplicates are present
- `true` : deduplicate column names by adding a suffix

* `makeunique` : if `false` (the default), an error will be raised
if duplicates in `names` are found; if `true`, duplicate names will be suffixed
with `_i` (`i` starting at 1 for the first duplicate).
* `kwargs` : the key gives the column names, and the value is the
column contents
* `t` : elemental type of all columns
@@ -702,7 +700,7 @@ merge!(df::DataFrame, others::AbstractDataFrame...)

For every column `c` with name `n` in `others` sequentially perform `df[n] = c`.
In particular, if there are duplicate column names present in `df` and `others`
the last encountered columnwill be retained.
the last encountered column will be retained.
This behavior is identical with how `merge!` works for any `Associative` type.
Use `join` if you want to join two `DataFrame`s.
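
The "last encountered wins" rule that this docstring refers to can be seen with plain `Dict`s from Base, with no DataFrames involved. This is only a minimal illustration of the dictionary behavior being compared against; the keys and values are made up.

```julia
# `merge!` on a Dict: later arguments overwrite earlier values for the same key.
d = Dict(:a => 1, :b => 2)
merge!(d, Dict(:b => 20, :c => 3), Dict(:c => 30))
```

After the call, `d` holds `:a => 1`, `:b => 20` and `:c => 30`: each repeated key keeps the value from the last dictionary that contained it, which is the behavior described for columns with the same name.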
