-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deprecation warning when using insert! with duplicate column name #1308
Conversation
6978d26
to
05a8a5a
Compare
src/dataframe/dataframe.jl
Outdated
insert!(index(df), col_ind, name) | ||
insert!(df.columns, col_ind, item) | ||
df | ||
end | ||
|
||
function Base.insert!(df::DataFrame, col_ind::Int, item, name::Symbol) | ||
insert!(df, col_ind, upgrade_scalar(df, item), name) | ||
function Base.insert!(df::DataFrame, col_ind::Int, item, name::Symbol; allow_duplicates=true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
allow_duplicates=true
-> allow_duplicates=false
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
of course - I have changed my mind to propose a breaking change and forgotten to update it here. Fixed.
05a8a5a
to
931260d
Compare
Makes sense, but I'd rather keep the current behavior and print a warning using |
@nalimilan I have never used |
Just something like |
src/dataframe/dataframe.jl
Outdated
function Base.insert!(df::DataFrame, col_ind::Int, item::AbstractVector, name::Symbol) | ||
function Base.insert!(df::DataFrame, col_ind::Int, item::AbstractVector, | ||
name::Symbol; allow_duplicates=true) | ||
Base.depwarn(string("Currently insert! allows duplicate column names.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea with deprecations is that they should only be printed when people use the old feature, not all the time. Here, it should only be printed when allow_duplicates=true
and that a duplicate column name is encountered.
Actually, it's not clear we need to add the allow_duplicates
argument: people should stop creating columns with duplicate names anyway so that their code continues to work in the next major release. And passing allow_duplicates=false
isn't really essential, since it will just turn a deprecation warning into an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
people should stop creating columns with duplicate names anyway
Agreed, but then I would remove allow_duplicates
argument from names!
also. If this is OK I will change this PR to only throw depwarn
in names!
and insert!
now when duplicates are encountered (I understand this is the proper way to implement it) - this PR would be merged now. And then have a second PR removing allow_duplicates
argument from names!
and making insert!
disallow duplicates that would be merged after the next DataFrames release.
Is this the way you see we should go?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. However, just to be sure, there's no major use case which justifies having an allow_duplicates
argument in names!
? I've just checked, and indeed while R's data.frame
automatically adds a suffix to duplicate column names, tibble
throws an error, so I guess that's reasonable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I guess there is no real use case. The only situation I have encountered is that when source data has duplicate columns - but it was always a bug in source data, and actually catching it was what was the proper way to go.
I guess we should treat DataFrame
as Dict
if indexed by column name. This is an important invariant. So duplicates should not be allowed.
I will adjust this PR accordingly and make this next PR on top of it to be merged later.
e34ee67
to
fb7145d
Compare
Changed according to the discussion. Actually in the end I came to the conclusion that
For |
OK. Though I've realized tibbles still allow columns with duplicate column names on concatenation. Not sure why. What happens after this PR when calling e.g. |
This PR will not change anything. We have two functionalities in Julia now:
Both are reasonable given we document it. If we wanted we could (this is something a minimal change):
A bigger change would be to add But all this is a separate PR from |
Hmm... I would find it kind of inconsistent to print a deprecation warning from I'd say that if we continue to automatically add suffixes in some cases, all functions should support an OTOH, See also some discussions at tidyverse/dplyr#2401 and related issues. |
OK - let me write down a blueprint. If it is accepted then we can go forward with:
Specific assumptions:
My recommendation of to-be functionality:
How
EDIT: the default for We never allow duplicate column names - no matter what R does I would disallow it. |
Thanks for writing this down, we should always have this kind of general plan before making changes. Sounds very reasonable to me. I've posted a comment on the dplyr issue to get more information about their general rule, to be sure we're not missing a use case where duplicates are useful (they are generally not afraid of breaking compatibility with classic R where it makes sense). |
👍 |
OK, Hadley replied that they would like to raise an error everytime duplicate column names are found. So let's go ahead with your plan. |
To second what @nalimilan said, thank you for the blueprint @bkamins. This clarifies a lot of confusion I had with these functions when working with DataTables. Once the functionality is implemented, we should add those descriptions along with code examples to the documentation. |
d34985a
to
efac31d
Compare
OK. Tests pass so this should be ready for a review. What is the consequence:
goes through without an error. If we made I see the following base options (some variants of them are possible - any opinion on what would be best?):
|
482df4e
to
d46554f
Compare
src/dataframe/dataframe.jl
Outdated
@@ -596,19 +605,118 @@ Base.setindex!(df::DataFrame, x::Void, col_ind::Int) = delete!(df, col_ind) | |||
|
|||
Base.empty!(df::DataFrame) = (empty!(df.columns); empty!(index(df)); df) | |||
|
|||
function Base.insert!(df::DataFrame, col_ind::Int, item::AbstractVector, name::Symbol) | |||
""" | |||
Inserts a column into a `DataFrame` in place. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be "insert".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
src/other/index.jl
Outdated
@@ -131,6 +142,15 @@ function add_names(ind::Index, add_ind::Index) | |||
name = u[i] | |||
in(name, seen) ? push!(dups, i) : push!(seen, name) | |||
end | |||
if length(dups) > 0 | |||
if makeunique | |||
Base.depwarn("Keyword makeunique will be false by default.", :add_names) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, but the message should be changed to something like Duplicate variable names are deprecated: pass makeunique=true to add a suffix automatically
.
src/dataframe/dataframe.jl
Outdated
function hcat!(df1::DataFrame, df2::AbstractDataFrame) | ||
u = add_names(index(df1), index(df2)) | ||
# TODO: after deprecation period change all to makeunique::Bool=false | ||
function hcat!(df1::DataFrame, df2::AbstractDataFrame; makeunique::Bool=true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we set makeunique=false
immediately? Else, now deprecation warning will ever be printed, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now I get what you are getting at :).
I will set makeunique
to false
everywhere but it will change the code so that it has no consequences for now except printing deprecation. Right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right!
src/other/utils.jl
Outdated
throw(ArgumentError(msg)) | ||
if length(dups) > 0 | ||
if makeunique | ||
Base.depwarn("Keyword makeunique will be false by default.", :_makeunique) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"keyword argument"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
I wanted to add one more thing for consideration.
will fail after deprecation period (now it prints a warning). |
I don't like printing warnings, especially when there's no way to silence them. I find it better to throw an error and provide a way to avoid it. In the present case, I admit it would be a bit weird to require using |
Thank you for the fixes. As for
|
src/abstractdataframe/io.jl
Outdated
@@ -307,5 +307,5 @@ DataFrame(sink, sch::Data.Schema, ::Type{S}, append::Bool; | |||
append!(sink.columns[col], column) | |||
end | |||
|
|||
|
|||
Data.close!(df::DataFrameStream) = DataFrame(collect(Any, df.columns), Symbol.(df.header)) | |||
Data.close!(df::DataFrameStream, makeunique::Bool=false) = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need this? This implements a particular DataStreams interface which is supposed to be completely generic, so the caller should not have to adapt to DataFrame
specificities. We should follow the rule adopted by DataStreams in general with regard to duplicate column names.
@quinnj Should we automatically deduplicate column names, or throw an error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to be explicit rather than implicitly assume one type of behavior without thinking about it. Of course if @quinnj has a clear opinion let us implement it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I've just tested, and it turns out CSV.jl will happily return duplicated column names. So let's always pass makeunique=true
to DataFrames
, and remove the keyword argument to close!
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
**Arguments** | ||
|
||
* `df` : the DataFrame to merge into | ||
* `others` : `AbstractDataFrame`s to be merged into `df` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makeunique
should be mentioned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry - merge
does not have makeunique
argument. It follows Associative
interface, i.e. it overwrites duplicate columns. I have added an additional explanation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I have added information about makeunique
to DataFrame
docstring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Funny, not sure how I got confused.
@@ -272,10 +274,10 @@ join(name, job2, on = :ID => :identifier) | |||
function Base.join(df1::AbstractDataFrame, | |||
df2::AbstractDataFrame; | |||
on::Union{<:OnType, AbstractVector{<:OnType}} = Symbol[], | |||
kind::Symbol = :inner) | |||
kind::Symbol = :inner, makeunique::Bool=false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this needed? By definition, joins match columns with identical names, so that shouldn't be a problem? Anyway it should be in the docstring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is needed. Join matches on on
keyword argument not on columns having identical names. And we can have columns with identical names on which we do not join.
See the example as it works now:
julia> x = DataFrame(rand(3,3))
3×3 DataFrames.DataFrame
│ Row │ x1 │ x2 │ x3 │
├─────┼──────────┼──────────┼──────────┤
│ 1 │ 0.97283 │ 0.200961 │ 0.591387 │
│ 2 │ 0.391169 │ 0.532865 │ 0.630848 │
│ 3 │ 0.090204 │ 0.320481 │ 0.775537 │
julia> y = DataFrame(rand(3,3))
3×3 DataFrames.DataFrame
│ Row │ x1 │ x2 │ x3 │
├─────┼──────────┼──────────┼──────────┤
│ 1 │ 0.470765 │ 0.468833 │ 0.199083 │
│ 2 │ 0.303053 │ 0.992384 │ 0.140139 │
│ 3 │ 0.284372 │ 0.816792 │ 0.675941 │
julia> x[:id] = 1:3
1:3
julia> y[:id] = 1:3
1:3
julia> join(x, y, on=:id)
3×7 DataFrames.DataFrame
│ Row │ x1 │ x2 │ x3 │ id │ x1_1 │ x2_1 │ x3_1 │
├─────┼──────────┼──────────┼──────────┼────┼──────────┼──────────┼──────────┤
│ 1 │ 0.97283 │ 0.200961 │ 0.591387 │ 1 │ 0.470765 │ 0.468833 │ 0.199083 │
│ 2 │ 0.391169 │ 0.532865 │ 0.630848 │ 2 │ 0.303053 │ 0.992384 │ 0.140139 │
│ 3 │ 0.090204 │ 0.320481 │ 0.775537 │ 3 │ 0.284372 │ 0.816792 │ 0.675941 │
I will improve the docstring.
…Frames.jl into add_duplicate_check
src/dataframe/dataframe.jl
Outdated
@@ -692,6 +701,8 @@ merge!(df::DataFrame, others::AbstractDataFrame...) | |||
``` | |||
|
|||
For every column `c` with name `n` in `others` sequentially perform `df[n] = c`. | |||
In particular, if there are duplicate column names present in `df` and `others` | |||
the last encountered columnwill be retained. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"columnwill"? :-)
src/abstractdataframe/join.jl
Outdated
@@ -241,6 +241,11 @@ Join two `DataFrame` objects | |||
- `:cross` : a full Cartesian product of the key combinations; every | |||
row of `df1` is matched with every row of `df2` | |||
|
|||
* `makeunique` : how to handle columns with duplicate names other than `on` in joined tables: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you make this more concise? A bullet list sounds too much for a mere boolean. Why not use a description similar to that used for names!
?
Also, I know I used it, but I don't think "deduplicate" is a great term for an official documentation. "Make unique" sounds better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK - fixing
src/abstractdataframe/join.jl
Outdated
- `true` : duplicate column names in `df2` will be deduplicated by adding a suffix | ||
* `makeunique` : if `false` (the default), an error will be raised | ||
if duplicate names are found in columns not joined on; | ||
if `true`, duplicate names will be suffixed with `_i` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am I right that i
can only be 1
in this case, since we don't allow duplicated column names in either of the sources?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No - it can be a bigger value. See:
julia> x = DataFrame(rand(3,3))
3×3 DataFrames.DataFrame
│ Row │ x1 │ x2 │ x3 │
├─────┼──────────┼──────────┼──────────┤
│ 1 │ 0.791145 │ 0.193656 │ 0.238548 │
│ 2 │ 0.756677 │ 0.195111 │ 0.929581 │
│ 3 │ 0.736347 │ 0.234449 │ 0.499932 │
julia> y = DataFrame(rand(3,3))
3×3 DataFrames.DataFrame
│ Row │ x1 │ x2 │ x3 │
├─────┼──────────┼───────────┼──────────┤
│ 1 │ 0.220774 │ 0.0541193 │ 0.735224 │
│ 2 │ 0.147731 │ 0.104057 │ 0.144468 │
│ 3 │ 0.282832 │ 0.0247473 │ 0.188086 │
julia> x[:id] = 1:3
1:3
julia> y[:id] = 1:3
1:3
julia> names!(y, [:x1, :x1, :x1, :id], allow_duplicates=true)
3×4 DataFrames.DataFrame
│ Row │ x1 │ x1_1 │ x1_2 │ id │
├─────┼──────────┼───────────┼──────────┼────┤
│ 1 │ 0.220774 │ 0.0541193 │ 0.735224 │ 1 │
│ 2 │ 0.147731 │ 0.104057 │ 0.144468 │ 2 │
│ 3 │ 0.282832 │ 0.0247473 │ 0.188086 │ 3 │
julia> join(x, y, on=:id)
3×7 DataFrames.DataFrame
│ Row │ x1 │ x2 │ x3 │ id │ x1_3 │ x1_1 │ x1_2 │
├─────┼──────────┼──────────┼──────────┼────┼──────────┼───────────┼──────────┤
│ 1 │ 0.791145 │ 0.193656 │ 0.238548 │ 1 │ 0.220774 │ 0.0541193 │ 0.735224 │
│ 2 │ 0.756677 │ 0.195111 │ 0.929581 │ 2 │ 0.147731 │ 0.104057 │ 0.144468 │
│ 3 │ 0.736347 │ 0.234449 │ 0.499932 │ 3 │ 0.282832 │ 0.0247473 │ 0.188086 │
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, right. It would make sense to have a left_suffix
/right_suffix
argument as in R to choose a deduplication suffix to add to columns from the LHS and RHS. We would need to look at what other software does first.
If we improve this before the next release, we could remove makeunique
without a deprecation.
Thanks! |
@quinnj, @nalimilan following #1495 (comment) the reasoning is that I want to use DataFrames.jl in production, not only in interactive mode. In interactive mode probably the default of Anyway there are two points:
|
I lean towards having |
I can see the benefits of this decision and I will not oppose going this way. If we decide to go this way then after #1495 is merged I can:
|
@nalimilan - any opinion on the default for |
I still like the @quinnj Do you have a precise scenario in mind where it would be annoying? |
This is a small proposal of improvement and breaking but probably
insert!
was not used much as it is not documented. We can changeallow_duplicates
totrue
by default if we want a non-breaking change.In general the idea is to make sure that no functions in DataFrames would allow to make duplicate column names by default.