
Deprecation warning when using insert! with duplicate column name #1308

Merged
merged 9 commits into JuliaData:master from bkamins:add_duplicate_check on Dec 29, 2017

Conversation

bkamins
Member

@bkamins bkamins commented Dec 5, 2017

This is a small proposal for an improvement. It is breaking, but insert! was probably not used much since it is not documented. We can change allow_duplicates to true by default if we want a non-breaking change.

In general the idea is to make sure that no function in DataFrames allows creating duplicate column names by default.

@bkamins bkamins force-pushed the add_duplicate_check branch from 6978d26 to 05a8a5a on December 5, 2017 21:36
insert!(index(df), col_ind, name)
insert!(df.columns, col_ind, item)
df
end

function Base.insert!(df::DataFrame, col_ind::Int, item, name::Symbol)
insert!(df, col_ind, upgrade_scalar(df, item), name)
function Base.insert!(df::DataFrame, col_ind::Int, item, name::Symbol; allow_duplicates=true)
Contributor

allow_duplicates=true -> allow_duplicates=false?

Member Author

Of course - I changed my mind to propose a breaking change and forgot to update it here. Fixed.

@bkamins bkamins force-pushed the add_duplicate_check branch from 05a8a5a to 931260d on December 5, 2017 21:45
@nalimilan
Member

Makes sense, but I'd rather keep the current behavior and print a warning using Base.depwarn. Then we can switch to the new behavior in the next breaking release.

@bkamins
Member Author

bkamins commented Dec 5, 2017

@nalimilan I have never used Base.depwarn before. What would be a recommended way to do it here? Thanks for explaining.

@nalimilan
Member

Just something like Base.depwarn("your message", :insert!). There are examples in base/deprecations.jl. The point of that method is that warnings are printed with a stack trace, and shown only once for each call site.
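For illustration, a minimal self-contained sketch of how Base.depwarn behaves (the function olddouble is made up for this example; on Julia of this era the warning prints by default, while Julia 1.x only prints it when started with --depwarn=yes):

function olddouble(x)
    # tie the warning to this function; it is printed with a stack trace, once per call site
    Base.depwarn("olddouble is deprecated, use 2x instead.", :olddouble)
    2x
end

olddouble(3)  # warns on the first call from a given site, then simply returns 6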

function Base.insert!(df::DataFrame, col_ind::Int, item::AbstractVector, name::Symbol)
function Base.insert!(df::DataFrame, col_ind::Int, item::AbstractVector,
name::Symbol; allow_duplicates=true)
Base.depwarn(string("Currently insert! allows duplicate column names.",
Member

The idea with deprecations is that they should only be printed when people use the old feature, not all the time. Here, it should only be printed when allow_duplicates=true and a duplicate column name is encountered.

Actually, it's not clear we need to add the allow_duplicates argument: people should stop creating columns with duplicate names anyway so that their code continues to work in the next major release. And passing allow_duplicates=false isn't really essential, since it will just turn a deprecation warning into an error.

Member Author

people should stop creating columns with duplicate names anyway

Agreed, but then I would remove the allow_duplicates argument from names! as well. If this is OK I will change this PR to only emit a depwarn in names! and insert! when duplicates are encountered (I understand this is the proper way to implement it) - this PR would be merged now. Then a second PR, removing the allow_duplicates argument from names! and making insert! disallow duplicates, would be merged after the next DataFrames release.

Is this the way you see we should go?

Member

Yes. However, just to be sure, there's no major use case which justifies having an allow_duplicates argument in names!? I've just checked, and indeed while R's data.frame automatically adds a suffix to duplicate column names, tibble throws an error, so I guess that's reasonable.

Member Author

Yes, I guess there is no real use case. The only situation I have encountered is when the source data has duplicate columns - but that was always a bug in the source data, and catching it early was actually the proper way to go.

I guess we should treat a DataFrame like a Dict when it is indexed by column name. This is an important invariant, so duplicates should not be allowed.

I will adjust this PR accordingly and make this next PR on top of it to be merged later.

@bkamins bkamins changed the title from "Add allow_duplicates keyword argument to insert! with default to false" to "Deprecation warning when using insert! with duplicate column name" on Dec 6, 2017
@bkamins bkamins force-pushed the add_duplicate_check branch 2 times, most recently from e34ee67 to fb7145d on December 6, 2017 16:26
@bkamins
Member Author

bkamins commented Dec 6, 2017

Changed according to the discussion. Actually in the end I came to the conclusion that names! can work as it works now:

  • throw an error by default;
  • generate new unique names when allow_duplicates is true.

For insert! I think that allow_duplicates will not be needed. Here we insert a single column (not a mass rename as in names!), so it is enough to throw an error (this will be done in another PR once this one is merged).

@nalimilan
Member

OK. Though I've realized tibbles still allow columns with duplicate column names on concatenation. Not sure why. What happens after this PR when calling e.g. hcat(df, df)?

@bkamins
Member Author

bkamins commented Dec 6, 2017

This PR will not change anything.

We currently have two behaviors in Julia:

  • hcat: generates a new unique name (and does not warn);
  • merge!: overwrites one column with the other (and also does not warn).

Both are reasonable provided we document them.

If we wanted, we could (as a minimal change):

  • make hcat print a warning when duplicates are encountered (this is not a problem as hcat is not an in-place operation);
  • make merge! also print a warning and go on overwriting the columns.

A bigger change would be to add allow_duplicates to hcat and merge! so that it works exactly the same as it does now for names! (which is a good solution: if false, a duplicate is an error; if true, a new unique name is generated).

But all this is a separate PR from the insert! functionality.
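For illustration, a rough sketch of the two current behaviors described above (the :x_1 suffix is an assumption based on the naming scheme shown later in this thread):

julia> df = DataFrame(x = [1, 2])

julia> hcat(df, df)                      # presumably yields columns :x and :x_1, without a warning

julia> merge!(df, DataFrame(x = [3, 4])) # presumably overwrites :x in df, also without a warning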

@nalimilan
Member

Hmm... I would find it kind of inconsistent to print a deprecation warning from insert!, but still allow duplicates in hcat. So we should find a global strategy and change everything in the same PR.

I'd say that if we continue to automatically add suffixes in some cases, all functions should support an allow_duplicates argument (which should arguably use a more explicit name like make_unique), and they should throw an error by default.

OTOH, merge! is OK as-is since it's part of the Associative interface. We could actually do the same for insert!: that would be consistent with DataStructures. Then we wouldn't have to bother about duplicate names. :-)

See also some discussions at tidyverse/dplyr#2401 and related issues.

@bkamins
Member Author

bkamins commented Dec 6, 2017

OK - let me write down a blueprint. If it is accepted then we can go forward with:

  • this PR: make appropriate deprecations where required;
  • the following PR: new functionality as described below;

Specific assumptions:

  1. insert! does not follow the Associative interface; it actually follows the Vector interface. The invariant is that on success the number of columns in the data frame must increase by 1.
  2. hcat intuitively should also follow the Vector interface (i.e. if the two DataFrames have n and m columns, after hcat we must have n+m columns - this is the invariant).
  3. merge! follows Associative.

My recommendation for the target functionality:

  1. insert!: should have a make_unique argument that works identically to allow_duplicates now;
  2. hcat: should have a make_unique argument that works identically to allow_duplicates now;
  3. names!: should have a make_unique argument that works identically to allow_duplicates now (so it is a rename of the keyword argument);
  4. merge!: leave the functionality as is (i.e. duplicates should be overwritten) and clearly document it;
  5. internal (not exported) functions that currently use allow_duplicates should be adjusted accordingly.
  6. in general add complete docstrings to all those functions.

How make_unique works:

  • if true we generate new column names for duplicate column names;
  • if false we throw an error when a duplicate is detected.

EDIT: the default for make_unique should be false everywhere.

We should never allow duplicate column names - no matter what R does, I would disallow them.
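To make the proposal concrete, a hypothetical usage sketch (the make_unique keyword and the error/renaming behavior are the proposal above, not something implemented at this point):

julia> df = DataFrame(x = 1:2)

julia> insert!(df, 2, [3, 4], :x)                     # proposed default make_unique=false: throws an error

julia> insert!(df, 2, [3, 4], :x, make_unique=true)   # proposed: inserts the column under a new unique name such as :x_1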

@nalimilan
Member

Thanks for writing this down, we should always have this kind of general plan before making changes. Sounds very reasonable to me. I've posted a comment on the dplyr issue to get more information about their general rule, to be sure we're not missing a use case where duplicates are useful (they are generally not afraid of breaking compatibility with classic R where it makes sense).

@bkamins
Member Author

bkamins commented Dec 7, 2017

👍
So let us wait for the response and then I can go on with an appropriate PR.

@nalimilan
Member

OK, Hadley replied that they would like to raise an error every time duplicate column names are found. So let's go ahead with your plan.

@coveralls

Coverage Status

Coverage decreased (-0.01%) to 73.21% when pulling fb7145d on bkamins:add_duplicate_check into 20ab5c5 on JuliaData:master.

@JuliaData JuliaData deleted a comment from coveralls Dec 8, 2017
@JuliaData JuliaData deleted a comment from coveralls Dec 8, 2017
@JuliaData JuliaData deleted a comment from coveralls Dec 8, 2017
@cjprybol
Contributor

cjprybol commented Dec 8, 2017

To second what @nalimilan said, thank you for the blueprint @bkamins. This clarifies a lot of confusion I had with these functions when working with DataTables. Once the functionality is implemented, we should add those descriptions along with code examples to the documentation.

@coveralls

coveralls commented Dec 8, 2017

Coverage Status

Coverage increased (+1.6%) to 74.804% when pulling fb7145d on bkamins:add_duplicate_check into 20ab5c5 on JuliaData:master.

@coveralls

Coverage Status

Coverage decreased (-0.03%) to 73.195% when pulling fb7145d on bkamins:add_duplicate_check into 20ab5c5 on JuliaData:master.

@bkamins bkamins force-pushed the add_duplicate_check branch 3 times, most recently from d34985a to efac31d on December 14, 2017 22:03
@bkamins
Member Author

bkamins commented Dec 14, 2017

OK. Tests pass so this should be ready for a review.
One thing I want to raise is what should be the default value for make_unique.
I have made it false everywhere except Index, where I kept it true as it was until now.

The consequence is that:

DataFrame(Any[[1,2], [1,2]], [:x,:x])

goes through without an error. If we made make_unique false by default there, this would be an error. The consequence is that we would then probably have to add a make_unique keyword argument to DataFrame, join, etc., or otherwise they would throw errors on duplicates.
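For reference, a sketch of what that call currently produces (the :x_1 name is an assumption based on the suffixing scheme shown elsewhere in this thread):

julia> DataFrame(Any[[1, 2], [1, 2]], [:x, :x])   # goes through, presumably yielding columns :x and :x_1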

I see the following base options (some variants of them are possible - any opinion on what would be best?):

  1. make make_unique true by default in every function (after seeing what happens I would opt for this; users should then be warned that by default DataFrames adds columns by generating new names, but in some functions - insert!, hcat and names! - this default behavior can be switched to a safer approach in which an error is thrown on duplicates);
  2. remove the make_unique keyword argument (old allow_duplicates) altogether and always dynamically generate new names for variables (a user would have to check independently before an operation whether there are duplicates in the names);
  3. keep the solution as it is implemented now (false by default, except Index where it is true by default) - then insert!, hcat and names! are strict by default and throw an error, and all other functions generate unique names for duplicates by default (a bit inconsistent but maybe a good compromise);
  4. make make_unique false by default everywhere (the safest approach, but very breaking, and it should probably be followed by adding a make_unique keyword argument to some more functions).

@bkamins bkamins force-pushed the add_duplicate_check branch from 482df4e to d46554f on December 24, 2017 23:18
@@ -596,19 +605,118 @@ Base.setindex!(df::DataFrame, x::Void, col_ind::Int) = delete!(df, col_ind)

Base.empty!(df::DataFrame) = (empty!(df.columns); empty!(index(df)); df)

function Base.insert!(df::DataFrame, col_ind::Int, item::AbstractVector, name::Symbol)
"""
Inserts a column into a `DataFrame` in place.
Member

Should be "insert".

Member Author

fixed

@@ -131,6 +142,15 @@ function add_names(ind::Index, add_ind::Index)
name = u[i]
in(name, seen) ? push!(dups, i) : push!(seen, name)
end
if length(dups) > 0
if makeunique
Base.depwarn("Keyword makeunique will be false by default.", :add_names)
Member

Yeah, but the message should be changed to something like Duplicate variable names are deprecated: pass makeunique=true to add a suffix automatically.

function hcat!(df1::DataFrame, df2::AbstractDataFrame)
u = add_names(index(df1), index(df2))
# TODO: after deprecation period change all to makeunique::Bool=false
function hcat!(df1::DataFrame, df2::AbstractDataFrame; makeunique::Bool=true)
Member

Shouldn't we set makeunique=false immediately? Else, no deprecation warning will ever be printed, right?

Member Author

Now I get what you are getting at :).
I will set makeunique to false everywhere, but change the code so that it has no consequences for now except printing the deprecation warning. Right?

Member

Right!

throw(ArgumentError(msg))
if length(dups) > 0
if makeunique
Base.depwarn("Keyword makeunique will be false by default.", :_makeunique)
Member

"keyword argument"

Member Author

fixed

@bkamins
Member Author

bkamins commented Dec 27, 2017

I wanted to add one more thing for consideration.
Given code

x = DataFrame(a=1:3)
y = x[[1,1]]

will fail after the deprecation period (now it prints a warning).
Maybe in getindex we should allow duplicates by default (possibly printing a warning that column aliases are being created)?

@nalimilan
Member

Maybe in getindex we should allow duplicates by default (possibly printing a warning that column aliases are being created)?

I don't like printing warnings, especially when there's no way to silence them. I find it better to throw an error and provide a way to avoid it. In the present case, I admit it would be a bit weird to require using getindex with a keyword argument. But maybe it's better to wait and see whether somebody has a use case for this.

@bkamins
Member Author

bkamins commented Dec 29, 2017

Thank you for the fixes. As for getindex I agree and that is why I have implemented it as is, but I wanted to be explicit. In summary I understand the reasons are:

  • if we do not allow it now and start allowing it later it is not breaking;
  • as such an operation creates an alias, I do not see much use for it (and I would rather discourage it as error-prone - in base R it creates a copy, so it is less risky); if someone wants an alias it is always possible with something like df[:y] = df[:x], but this is explicit (see the sketch below).
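A minimal sketch of the aliasing behavior mentioned above (the no-copy column assignment is an assumption about DataFrames of this era):

julia> df = DataFrame(x = [1, 2, 3]);

julia> df[:y] = df[:x];        # stores the same vector under a second name (an alias, not a copy)

julia> df[:y][1] = 100;

julia> df[:x][1]               # also 100, since both columns reference the same underlying vector
100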

@@ -307,5 +307,5 @@ DataFrame(sink, sch::Data.Schema, ::Type{S}, append::Bool;
append!(sink.columns[col], column)
end


Data.close!(df::DataFrameStream) = DataFrame(collect(Any, df.columns), Symbol.(df.header))
Data.close!(df::DataFrameStream, makeunique::Bool=false) =
Member

Do we really need this? This implements a particular DataStreams interface which is supposed to be completely generic, so the caller should not have to adapt to DataFrame specificities. We should follow the rule adopted by DataStreams in general with regard to duplicate column names.

@quinnj Should we automatically deduplicate column names, or throw an error?

Member Author

I wanted to be explicit rather than implicitly assume one type of behavior without thinking about it. Of course if @quinnj has a clear opinion let us implement it.

Member

OK, I've just tested, and it turns out CSV.jl will happily return duplicated column names. So let's always pass makeunique=true to DataFrames, and remove the keyword argument to close!.

Member Author

OK

**Arguments**

* `df` : the DataFrame to merge into
* `others` : `AbstractDataFrame`s to be merged into `df`
Member

makeunique should be mentioned.

Member Author

OK

Member Author

Sorry - merge! does not have a makeunique argument. It follows the Associative interface, i.e. it overwrites duplicate columns. I have added an additional explanation.

Member Author

But I have added information about makeunique to DataFrame docstring.

Member

Funny, not sure how I got confused.

@@ -272,10 +274,10 @@ join(name, job2, on = :ID => :identifier)
function Base.join(df1::AbstractDataFrame,
df2::AbstractDataFrame;
on::Union{<:OnType, AbstractVector{<:OnType}} = Symbol[],
kind::Symbol = :inner)
kind::Symbol = :inner, makeunique::Bool=false)
Member

Why is this needed? By definition, joins match columns with identical names, so that shouldn't be a problem? Anyway it should be in the docstring.

Member Author

It is needed. join matches on the on keyword argument, not on columns having identical names. And we can have columns with identical names on which we do not join.
See the example as it works now:

julia> x = DataFrame(rand(3,3))
3×3 DataFrames.DataFrame
│ Row │ x1       │ x2       │ x3       │
├─────┼──────────┼──────────┼──────────┤
│ 1   │ 0.97283  │ 0.200961 │ 0.591387 │
│ 2   │ 0.391169 │ 0.532865 │ 0.630848 │
│ 3   │ 0.090204 │ 0.320481 │ 0.775537 │

julia> y = DataFrame(rand(3,3))
3×3 DataFrames.DataFrame
│ Row │ x1       │ x2       │ x3       │
├─────┼──────────┼──────────┼──────────┤
│ 1   │ 0.470765 │ 0.468833 │ 0.199083 │
│ 2   │ 0.303053 │ 0.992384 │ 0.140139 │
│ 3   │ 0.284372 │ 0.816792 │ 0.675941 │

julia> x[:id] = 1:3
1:3

julia> y[:id] = 1:3
1:3

julia> join(x, y, on=:id)
3×7 DataFrames.DataFrame
│ Row │ x1       │ x2       │ x3       │ id │ x1_1     │ x2_1     │ x3_1     │
├─────┼──────────┼──────────┼──────────┼────┼──────────┼──────────┼──────────┤
│ 1   │ 0.97283  │ 0.200961 │ 0.591387 │ 1  │ 0.470765 │ 0.468833 │ 0.199083 │
│ 2   │ 0.391169 │ 0.532865 │ 0.630848 │ 2  │ 0.303053 │ 0.992384 │ 0.140139 │
│ 3   │ 0.090204 │ 0.320481 │ 0.775537 │ 3  │ 0.284372 │ 0.816792 │ 0.675941 │

I will improve the docstring.

@@ -692,6 +701,8 @@ merge!(df::DataFrame, others::AbstractDataFrame...)
```

For every column `c` with name `n` in `others` sequentially perform `df[n] = c`.
In particular, if there are duplicate column names present in `df` and `others`
the last encountered columnwill be retained.
Member

"columnwill"? :-)

@@ -241,6 +241,11 @@ Join two `DataFrame` objects
- `:cross` : a full Cartesian product of the key combinations; every
row of `df1` is matched with every row of `df2`

* `makeunique` : how to handle columns with duplicate names other than `on` in joined tables:
Member

Can you make this more concise? A bullet list sounds like too much for a mere boolean. Why not use a description similar to that used for names!?

Also, I know I used it, but I don't think "deduplicate" is a great term for official documentation. "Make unique" sounds better.

Member Author

OK - fixing

- `true` : duplicate column names in `df2` will be deduplicated by adding a suffix
* `makeunique` : if `false` (the default), an error will be raised
if duplicate names are found in columns not joined on;
if `true`, duplicate names will be suffixed with `_i`
Member

Am I right that i can only be 1 in this case, since we don't allow duplicated column names in either of the sources?

Member Author

No - it can be a bigger value. See:

julia> x = DataFrame(rand(3,3))
3×3 DataFrames.DataFrame
│ Row │ x1       │ x2       │ x3       │
├─────┼──────────┼──────────┼──────────┤
│ 1   │ 0.791145 │ 0.193656 │ 0.238548 │
│ 2   │ 0.756677 │ 0.195111 │ 0.929581 │
│ 3   │ 0.736347 │ 0.234449 │ 0.499932 │

julia> y = DataFrame(rand(3,3))
3×3 DataFrames.DataFrame
│ Row │ x1       │ x2        │ x3       │
├─────┼──────────┼───────────┼──────────┤
│ 1   │ 0.220774 │ 0.0541193 │ 0.735224 │
│ 2   │ 0.147731 │ 0.104057  │ 0.144468 │
│ 3   │ 0.282832 │ 0.0247473 │ 0.188086 │

julia> x[:id] = 1:3
1:3

julia> y[:id] = 1:3
1:3
julia> names!(y, [:x1, :x1, :x1, :id], allow_duplicates=true)
3×4 DataFrames.DataFrame
│ Row │ x1       │ x1_1      │ x1_2     │ id │
├─────┼──────────┼───────────┼──────────┼────┤
│ 1   │ 0.220774 │ 0.0541193 │ 0.735224 │ 1  │
│ 2   │ 0.147731 │ 0.104057  │ 0.144468 │ 2  │
│ 3   │ 0.282832 │ 0.0247473 │ 0.188086 │ 3  │

julia> join(x, y, on=:id)
3×7 DataFrames.DataFrame
│ Row │ x1       │ x2       │ x3       │ id │ x1_3     │ x1_1      │ x1_2     │
├─────┼──────────┼──────────┼──────────┼────┼──────────┼───────────┼──────────┤
│ 1   │ 0.791145 │ 0.193656 │ 0.238548 │ 1  │ 0.220774 │ 0.0541193 │ 0.735224 │
│ 2   │ 0.756677 │ 0.195111 │ 0.929581 │ 2  │ 0.147731 │ 0.104057  │ 0.144468 │
│ 3   │ 0.736347 │ 0.234449 │ 0.499932 │ 3  │ 0.282832 │ 0.0247473 │ 0.188086 │

Member

Ah, right. It would make sense to have a left_suffix/right_suffix argument as in R to choose a deduplication suffix to add to columns from the LHS and RHS. We would need to look at what other software does first.

If we improve this before the next release, we could remove makeunique without a deprecation.

@nalimilan nalimilan merged commit f8406ce into JuliaData:master Dec 29, 2017
@nalimilan
Member

Thanks!

@bkamins bkamins deleted the add_duplicate_check branch December 29, 2017 13:15
@nalimilan nalimilan mentioned this pull request Sep 6, 2018
@bkamins
Member Author

bkamins commented Sep 6, 2018

@quinnj, @nalimilan following #1495 (comment) the reasoning is that I want to use DataFrames.jl in production, not only in interactive mode.

In interactive mode a default of makeunique=true is probably OK. In production I would hate this and would prefer an error to silently renaming the columns, as downstream code might rely on column names.

Anyway there are two points:

  1. We have a makeunique parameter that allows both modes, and we make sure it is handled consistently across the whole DataFrames.jl ecosystem;
  2. The question is what the default value of makeunique should be. Right now we think it should be false (so duplicates will throw an error); this is identical to NamedTuple behavior, which throws an error on duplicates in its constructor (see the sketch below). But I am open to discussing changing the default to true. Personally, even in the REPL I prefer a makeunique default of false, but maybe for other people it is too annoying.
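For reference, a quick sketch of the NamedTuple behavior referred to in point 2 (the exact error text may vary across Julia versions):

julia> (a = 1, a = 2)
ERROR: syntax: field name "a" repeated in named tuple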

@quinnj
Member

quinnj commented Sep 6, 2018

I lean towards having makeunique=true be the default; my basis for that would be the assumption that most users are using Julia interactively, from the REPL or scripts, and it's much more convenient to not have things throw errors all over the place. When running a production app, it's also easier to add in the single keyword argument makeunique=false and add try-catch blocks as necessary.

@bkamins
Member Author

bkamins commented Sep 6, 2018

I can see the benefits of this decision and I will not oppose going this way.

If we decide to go this way then after #1495 is merged I can:

  • clean up the whole code base to support it;
  • remove the deprecation warnings (as they are not needed);
  • make the consequences clear in the documentation and my tutorial (so that people know to use makeunique in scripts).

@bkamins
Member Author

bkamins commented Sep 10, 2018

@nalimilan - any opinion on the default for makeunique?

@nalimilan
Member

I still like the makeunique=false default. As I've noted above, even dplyr is leaning towards throwing errors in these cases (tidyverse/dplyr#2401 (comment)). Even when working interactively, if you combine two datasets which both have a column :x and then do df[:x], you can get totally incorrect results if you expected to access the column from the second data frame. For scientific work in particular, preventing this kind of mistake is essential. And in general throwing an error early helps debugging.

@quinnj Do you have a precise scenario in mind where it would be annoying?
