Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add duplicated/drop_duplicates to Index #7979

Merged
merged 1 commit into from
Aug 15, 2014

Conversation

sinhrks
Copy link
Member

@sinhrks sinhrks commented Aug 10, 2014

Closes #4060.

idx = pd.Index([1, 2, 3, 4, 1, 2])
idx.duplicated()
# Index([False, False, False, False, True, True], dtype='bool')
idx.drop_duplicates()
# Int64Index([1, 2, 3, 4], dtype='int64')

idx.duplicated(take_last=True)
# Index([True, True, False, False, False, False], dtype='bool')
idx.drop_duplicates(take_last=True)
# Int64Index([3, 4, 1, 2], dtype='int64')

@@ -443,6 +444,53 @@ def searchsorted(self, key, side='left'):
#### needs tests/doc-string
return self.values.searchsorted(key, side=side)

def drop_duplicates(self, take_last=False, inplace=False):
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these need to raise if inplace and it's anindex (as they are immutable)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

side note - can u audit existing methods in indexOps for using inplace

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seems to be no func accepts inplace other than this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, gr8. still I think putting the check on update_inplace might be good

@jreback jreback added this to the 0.15.0 milestone Aug 11, 2014

if inplace:
from pandas.core.index import Index
if isinstance(self, Index):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think better is to have update_inplace in core/base.py that simply raises if its an Index (I think this would be overriden by the update_inplace in core/generic.py and so other sub-classes won't see it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I think adding update_inplace to Index is clearer?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh yes, that would be better (though maybe add as a NotIMplemented to OpsMixIn just as a place holder for the abstract methdos)

@@ -469,6 +470,54 @@ def searchsorted(self, key, side='left'):
#### needs tests/doc-string
return self.values.searchsorted(key, side=side)

def drop_duplicates(self, take_last=False, inplace=False):
"""
Return Series or Index with duplicate values removed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About the Series or Index, could you do something like in generic.py with substitution of klass name so that only Series or Index shows up in the respective docstring?

@sinhrks
Copy link
Member Author

sinhrks commented Aug 12, 2014

@jreback, @jorisvandenbossche Considering both comments and fixed.

Defining update_inplace in IndexOpsMixin results in Series to refer it because of inheritance order. Thus, functions which uses update_inplace must be overridden eventually.
https://github.com/pydata/pandas/blob/master/pandas/core/series.py#L77

So defined common logic in IndexOpsMixin and override in Index and Series each. As a result, Index no longer need update_inplace because it will not accept inplace keyword.

try:
return self._constructor(duplicated,
index=self.index).__finalize__(self)
except AttributeError:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is very awkward to do. Maybe just put the immutable definition in base and override the definition in series. prob simpler?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, fixed to centralize the logic to IndexOpsMixin. Even though update_inplace is defined in both IndexOpsMixin and Index, it will never called in drop_duplicates case (Index.drop_duplicates blocks inplace kw, and it is better for proper docstring)

@jorisvandenbossche
Copy link
Member

Just as a usage question, what do we envisage as the 'recommended' way to drop duplicate indices from a DataFrame (where you now had to say the somewhat unintuitive df.groupby(level=0).first())?:

df[~df.index.duplicated()]

or

df.reindex(df.index.drop_duplicates())

although these are even longer than the groupby ..

@jreback
Copy link
Contributor

jreback commented Aug 12, 2014

unchanged, the first is best (this is for the Index to be compatible)

@sinhrks
Copy link
Member Author

sinhrks commented Aug 13, 2014

@jorisvandenbossche 's point is #2825. Though I feel df[~df.index.duplicated()] is simple enough.

@jreback
Copy link
Contributor

jreback commented Aug 14, 2014

the doc string for Series.duplicated/drop_duplicated does not have inplace?

I would change this around. Why don't you just have _duplicated/_drop_duplicates in Base (w/o the inplace argument, and are private (no doc strings)).

Then in Index/Series put in the doc-strings (and inplace for Series)?

@sinhrks
Copy link
Member Author

sinhrks commented Aug 15, 2014

No, Series has. I expect docstrings are rendered as expected.
pandas index drop_duplicates pandas 0 14 1-201-g54d3e4d documentation 2014-08-15 21-40-19
pandas series drop_duplicates pandas 0 14 1-201-g54d3e4d documentation 2014-08-15 21-39-58

Agreed to remove docstring from IndexOpsMixin, and I feel no need to make them private (little confusing).

@jreback
Copy link
Contributor

jreback commented Aug 15, 2014

ok, that's fine then. ping hwne ready

@sinhrks
Copy link
Member Author

sinhrks commented Aug 15, 2014

Thanks to confirm. Now green.

jreback added a commit that referenced this pull request Aug 15, 2014
ENH: Add duplicated/drop_duplicates to Index
@jreback jreback merged commit b2d5a33 into pandas-dev:master Aug 15, 2014
@sinhrks sinhrks deleted the dup_idx branch August 15, 2014 13:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Enhancement Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add duplicated method to Index classes
3 participants