ENH: Add duplicated/drop_duplicates to Index #7979

sinhrks · 2014-08-10T15:25:48Z

Closes #4060.

idx = pd.Index([1, 2, 3, 4, 1, 2])
idx.duplicated()
# Index([False, False, False, False, True, True], dtype='bool')
idx.drop_duplicates()
# Int64Index([1, 2, 3, 4], dtype='int64')

idx.duplicated(take_last=True)
# Index([True, True, False, False, False, False], dtype='bool')
idx.drop_duplicates(take_last=True)
# Int64Index([3, 4, 1, 2], dtype='int64')
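
The first/last semantics shown above can be sketched in plain Python. This is only an illustration of which positions get flagged, not the pandas implementation; the functions `duplicated` and `drop_duplicates` below are hypothetical stand-ins operating on lists.

```python
def duplicated(values, take_last=False):
    """Return a list of booleans marking repeated values.

    With take_last=False the first occurrence is kept (marked False);
    with take_last=True the last occurrence is kept instead.
    """
    seen = set()
    n = len(values)
    flags = [False] * n
    # Walk backwards when the last occurrence is the one to keep.
    order = range(n - 1, -1, -1) if take_last else range(n)
    for i in order:
        if values[i] in seen:
            flags[i] = True
        else:
            seen.add(values[i])
    return flags


def drop_duplicates(values, take_last=False):
    """Keep only the entries that are not flagged as duplicates."""
    flags = duplicated(values, take_last=take_last)
    return [v for v, dup in zip(values, flags) if not dup]


values = [1, 2, 3, 4, 1, 2]
print(duplicated(values))                       # [False, False, False, False, True, True]
print(drop_duplicates(values))                  # [1, 2, 3, 4]
print(duplicated(values, take_last=True))       # [True, True, False, False, False, False]
print(drop_duplicates(values, take_last=True))  # [3, 4, 1, 2]
```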

jreback · 2014-08-10T16:00:11Z

pandas/core/base.py

@@ -443,6 +444,53 @@ def searchsorted(self, key, side='left'):
        #### needs tests/doc-string
        return self.values.searchsorted(key, side=side)

+    def drop_duplicates(self, take_last=False, inplace=False):
+        """


These need to raise if inplace is passed and it's an Index (as they are immutable).

Side note: can you audit existing methods in IndexOps for use of inplace?

There seems to be no function that accepts inplace other than this one.

OK, great. Still, I think putting the check in update_inplace might be good.

jreback · 2014-08-11T12:47:15Z

pandas/core/base.py

+
+        if inplace:
+            from pandas.core.index import Index
+            if isinstance(self, Index):


I think it's better to have update_inplace in core/base.py that simply raises if it's an Index. (I think this would be overridden by the update_inplace in core/generic.py, so other subclasses won't see it.)

Sure, though I think adding update_inplace to Index would be clearer?

Ah yes, that would be better (though maybe add it as a NotImplemented to OpsMixin, just as a placeholder for the abstract methods).
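
The arrangement discussed in this thread could look roughly like the sketch below. All class names (`OpsMixin`, `ImmutableIndex`, `MutableSeries`), the `_values` attribute, and the choice of `TypeError` are assumptions made for illustration; they are not the code from this PR.

```python
class OpsMixin:
    """Hypothetical stand-in for the mixin shared by Series and Index."""

    def _update_inplace(self, result):
        # Placeholder for the abstract method; concrete classes override it.
        raise NotImplementedError

    def drop_duplicates(self, inplace=False):
        # dict.fromkeys keeps the first occurrence of each value, in order.
        result = list(dict.fromkeys(self._values))
        if inplace:
            return self._update_inplace(result)
        return type(self)(result)


class ImmutableIndex(OpsMixin):
    def __init__(self, values):
        self._values = tuple(values)

    def _update_inplace(self, result):
        # An immutable object refuses any in-place update.
        raise TypeError("Index cannot be updated inplace")


class MutableSeries(OpsMixin):
    def __init__(self, values):
        self._values = list(values)

    def _update_inplace(self, result):
        self._values = list(result)


s = MutableSeries([1, 2, 1])
s.drop_duplicates(inplace=True)
print(s._values)  # [1, 2]

idx = ImmutableIndex([1, 2, 1])
print(idx.drop_duplicates()._values)  # (1, 2)
```

Calling `idx.drop_duplicates(inplace=True)` in this sketch raises `TypeError`, which is the behavior the first review comment asks for.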

jorisvandenbossche · 2014-08-11T18:39:27Z

pandas/core/base.py

@@ -469,6 +470,54 @@ def searchsorted(self, key, side='left'):
        #### needs tests/doc-string
        return self.values.searchsorted(key, side=side)

+    def drop_duplicates(self, take_last=False, inplace=False):
+        """
+        Return Series or Index with duplicate values removed


About "Series or Index": could you do something like generic.py does, substituting the klass name so that only Series or Index shows up in the respective docstring?
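
A minimal sketch of the klass-name substitution being requested, assuming a shared template with a `%(klass)s` placeholder. The `substitute_doc` decorator and the `_shared_docs` dictionary here are hypothetical helpers written for this example, not the pandas utilities themselves.

```python
_shared_docs = {
    "drop_duplicates": """
        Return %(klass)s with duplicate values removed
        """
}


def substitute_doc(template, **params):
    """Hypothetical decorator: fill a docstring template and attach it."""
    def decorator(func):
        func.__doc__ = template % params
        return func
    return decorator


class Index:
    @substitute_doc(_shared_docs["drop_duplicates"], klass="Index")
    def drop_duplicates(self, take_last=False):
        raise NotImplementedError  # body omitted in this sketch


class Series:
    @substitute_doc(_shared_docs["drop_duplicates"], klass="Series")
    def drop_duplicates(self, take_last=False, inplace=False):
        raise NotImplementedError  # body omitted in this sketch


print(Index.drop_duplicates.__doc__.strip())   # Return Index with duplicate values removed
print(Series.drop_duplicates.__doc__.strip())  # Return Series with duplicate values removed
```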

sinhrks · 2014-08-12T13:23:41Z

@jreback, @jorisvandenbossche I've considered both comments and fixed this.

Defining update_inplace in IndexOpsMixin makes Series pick up that definition because of the inheritance order. Thus, functions which use update_inplace would have to be overridden anyway.
https://github.com/pydata/pandas/blob/master/pandas/core/series.py#L77

So I defined the common logic in IndexOpsMixin and overrode it in Index and Series separately. As a result, Index no longer needs update_inplace because it does not accept the inplace keyword.
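
The inheritance-order issue described above can be reproduced with stand-in classes. The sketch assumes Series lists the mixin before NDFrame in its bases, as the comment implies; the classes are simplified placeholders rather than the pandas ones, and `SeriesWithOverride` shows just one possible workaround.

```python
class NDFrame:
    def _update_inplace(self, result):
        self.values = list(result)


class IndexOpsMixin:
    def _update_inplace(self, result):
        # Placeholder defined on the mixin.
        raise NotImplementedError


class Series(IndexOpsMixin, NDFrame):
    """The mixin comes first, so its placeholder shadows NDFrame's method."""


class SeriesWithOverride(IndexOpsMixin, NDFrame):
    def _update_inplace(self, result):
        # Explicitly delegate to the NDFrame implementation.
        return NDFrame._update_inplace(self, result)


print([cls.__name__ for cls in Series.__mro__])
# ['Series', 'IndexOpsMixin', 'NDFrame', 'object']

fixed = SeriesWithOverride()
fixed._update_inplace([1, 2])
print(fixed.values)  # [1, 2]
```

Calling `Series()._update_inplace([1, 2])` in this sketch raises `NotImplementedError`, because method lookup stops at the mixin before reaching `NDFrame`.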

jreback · 2014-08-12T13:30:34Z

pandas/core/base.py

+        try:
+            return self._constructor(duplicated,
+                                     index=self.index).__finalize__(self)
+        except AttributeError:


This is very awkward to do. Maybe just put the immutable definition in base and override the definition in Series. Probably simpler?

OK, fixed to centralize the logic in IndexOpsMixin. Even though update_inplace is defined in both IndexOpsMixin and Index, it will never be called in the drop_duplicates case (Index.drop_duplicates blocks the inplace keyword, which also makes for a proper docstring).

jorisvandenbossche · 2014-08-12T20:07:51Z

Just as a usage question: what do we envisage as the 'recommended' way to drop duplicate index entries from a DataFrame (where until now you had to write the somewhat unintuitive df.groupby(level=0).first())?

df[~df.index.duplicated()]

or

df.reindex(df.index.drop_duplicates())

although these are even longer than the groupby version.

jreback · 2014-08-12T20:08:40Z

unchanged, the first is best (this is for the Index to be compatible)

sinhrks · 2014-08-13T02:27:57Z

@jorisvandenbossche's point is #2825. Though I feel df[~df.index.duplicated()] is simple enough.
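
For reference, here is a small sketch of the boolean-mask idiom next to the older groupby workaround. The DataFrame contents are made up for the example, and it relies on the default behavior of keeping the first occurrence. The two approaches agree on this data but are not equivalent in general: groupby sorts the result by index and `first()` skips missing values.

```python
import pandas as pd

# Hypothetical data with a repeated index label "a".
df = pd.DataFrame({"value": [10, 20, 30, 40]}, index=["a", "b", "a", "c"])

# Keep the first row for each index label via a boolean mask.
mask_result = df[~df.index.duplicated()]

# The older workaround mentioned in the thread.
groupby_result = df.groupby(level=0).first()

print(mask_result["value"].tolist())     # [10, 20, 40]
print(groupby_result["value"].tolist())  # [10, 20, 40]
```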

jreback · 2014-08-14T17:31:35Z

The docstring for Series.duplicated/drop_duplicates does not have inplace?

I would change this around. Why don't you just have _duplicated/_drop_duplicates in Base (without the inplace argument, private, and with no docstrings)?

Then put the docstrings in Index/Series (and inplace for Series)?

sinhrks · 2014-08-15T12:45:59Z

No, Series has it. I expect the docstrings are rendered as expected.

Agreed to remove the docstring from IndexOpsMixin, but I feel there's no need to make them private (a little confusing).

jreback · 2014-08-15T12:49:36Z

OK, that's fine then. Ping when ready.

sinhrks · 2014-08-15T12:54:16Z

Thanks for confirming. Now green.

jreback reviewed Aug 10, 2014

jreback added API Design labels Aug 11, 2014

jreback added this to the 0.15.0 milestone Aug 11, 2014

jreback reviewed Aug 11, 2014

jreback mentioned this pull request Aug 11, 2014

API/CLN: more common ops to integrate with Series/index OpsMixin #6382

Closed

17 tasks

jorisvandenbossche reviewed Aug 11, 2014

jreback reviewed Aug 12, 2014

ENH: Add duplicated/drop_duplicates to Index

54d3e4d

jreback added a commit that referenced this pull request Aug 15, 2014

Merge pull request #7979 from sinhrks/dup_idx

b2d5a33

jreback merged commit b2d5a33 into pandas-dev:master Aug 15, 2014

sinhrks deleted the dup_idx branch August 15, 2014 13:01

sinhrks mentioned this pull request Nov 22, 2014

API: Index should support __inverse__ ops #8875

Closed

2 tasks
