ENH: ExtensionArray.unique #19869

TomAugspurger · 2018-02-23T17:26:26Z

No description provided.

TomAugspurger · 2018-02-23T17:27:40Z

This doesn't conflict with #19863 since unique returns an ExtensionArray.

jreback · 2018-02-23T17:30:12Z

pandas/core/arrays/base.py

+        from pandas import unique
+
+        uniques = unique(self.astype(object))
+        return type(self)(uniques)


I think you should add a self._shallow_copy or something instead of doing this.

To avoid the copy or to avoid the type stuff?

In this specific case, I think a copy is necessary since uniques is an ndarray of objects.

As for alternative constructors to avoid the type stuff, sure, just need to come up with names for them.

For my IPAddress stuff I want to be able to do zero-copy construction from:

- an ExtensionArray
- a numpy structured array with the correct fields.

Putting those in separate constructors makes sense.

No, it's the idiom (see below): we should have a _constructor in EA which just returns type(self), which is what we do in Index; this handled the sub-classing things.

@jreback What do you mean with "this handled the sub-classing things"?

Because everywhere else we use _constructor as the constructor class and do not use type(self) directly (yes, that is the implementation, but the code all uses _constructor).

I understand that, but the fact that we do it everywhere else internally doesn't mean we need to do it here as well. Here we have something that is part of an external interface, and if we add something like that we should have a reason for it, and a clear message to the implementer about what we expect from it.
E.g., in Series/DataFrame you have interactions between different subclasses (a slice of a DataFrame can give another Series subclass). That's not something we have to deal with here.

This just becomes confusing. I don't see any reason not to have a _constructor; it's obvious and consistent with the rest of the code. These ad-hoc breaks with internal consistency are just creating technical debt. Please let's have consistency.

You want me to change it to _constructor even though it'll be irrelevant after #19913 is resolved?

see my comments on that PR

So this should now change to _construct_from_sequence; in fact, we should try to be consistent and always use this.
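
To make the two spellings debated in this thread concrete, here is a minimal sketch. `ToyArray` is a hypothetical stand-in and not pandas' ExtensionArray, and `_from_sequence` is only an assumed name for the sequence constructor being discussed in #19913; the real interface may differ.

```python
class ToyArray:
    """Hypothetical stand-in for an ExtensionArray subclass (not pandas code)."""

    def __init__(self, values):
        # The main constructor is free to accept whatever the author wants;
        # here it simply stores a list.
        self._data = list(values)

    @classmethod
    def _from_sequence(cls, scalars):
        # Assumed name: a dedicated constructor from a sequence of scalars.
        return cls(scalars)

    def unique_via_type(self):
        # Spelling used in this PR: call the class constructor directly.
        # dict.fromkeys drops duplicates while keeping first-seen order.
        return type(self)(dict.fromkeys(self._data))

    def unique_via_from_sequence(self):
        # Spelling requested in review: go through the sequence constructor.
        return type(self)._from_sequence(dict.fromkeys(self._data))


class ToySubclass(ToyArray):
    pass


arr = ToySubclass([1, 2, 2, 3, 1])
a = arr.unique_via_type()
b = arr.unique_via_from_sequence()
```

Both spellings return an instance of the subclass here. The difference only shows up once an author wants the main constructor to take something other than a sequence of scalars, which is the case raised in #19906.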

codecov · 2018-02-23T20:17:41Z

Codecov Report

Merging #19869 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #19869      +/-   ##
==========================================
+ Coverage    91.7%    91.7%   +<.01%     
==========================================
  Files         150      150              
  Lines       49165    49168       +3     
==========================================
+ Hits        45087    45090       +3     
  Misses       4078     4078

Flag	Coverage Δ
#multiple	`90.09% <100%> (ø)`	⬆️
#single	`41.86% <50%> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/core/algorithms.py	`94.17% <100%> (-0.01%)`	⬇️
pandas/core/arrays/base.py	`76.74% <100%> (+2.38%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update bfe6ebc...41dd128.

TomAugspurger · 2018-02-23T20:37:04Z

pandas/core/algorithms.py

@@ -356,7 +356,7 @@ def unique(values):
    # categorical is a fast-path
    # this will coerce Categorical, CategoricalIndex,
    # and category dtypes Series to same return of Category
-    if is_categorical_dtype(values):
+    if is_extension_array_dtype(values):
        values = getattr(values, '.values', values)


Ha, this getattr is doing nothing since it's '.values' and not 'values' :)

Ah, and we have a test that depends on it! I'm going to split this off to a new issue then.

Ah, nevermind. I think we can just remove it instead of correcting it.
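
For readers unfamiliar with why the line is a no-op, a small sketch follows. `Wrapper` is a hypothetical class used only for illustration; it is not part of pandas.

```python
class Wrapper:
    """Hypothetical object exposing a `values` attribute."""

    def __init__(self, values):
        self.values = values


obj = Wrapper([1, 2, 3])

# '.values' (with the leading dot) is never a real attribute name, so
# getattr always falls back to the default, which is the object itself.
with_dot = getattr(obj, '.values', obj)

# 'values' is the real attribute name, so this unwraps the object.
without_dot = getattr(obj, 'values', obj)
```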

jreback · 2018-02-24T14:31:48Z

@TomAugspurger before we move on, some 32-bit issues: https://travis-ci.org/MacPython/pandas-wheels/jobs/345517480

    def test_loc_frame(self, data):
        df = pd.DataFrame({"A": data, 'B': np.arange(len(data))})
        expected = pd.DataFrame({"A": data[:4]})

in pandas/tests/extension/base/getitem.py

you need to pass dtype='int64' when using np.arange
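
A short sketch of the suggestion above, assuming NumPy is installed. It only illustrates that an explicit dtype removes the platform dependence; it is not the fix that was applied to the test.

```python
import numpy as np

# Without an explicit dtype, np.arange uses the platform default integer,
# which may be 32-bit on some systems.
default_idx = np.arange(4)

# With an explicit dtype the result is the same on every platform.
fixed_idx = np.arange(4, dtype='int64')

# np.intp is the integer type NumPy uses for indexing; its width is
# platform dependent (4 bytes on 32-bit builds, 8 bytes on 64-bit builds).
intp_bytes = np.dtype(np.intp).itemsize
```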

TomAugspurger · 2018-02-24T15:31:50Z

Thanks.

Do you get email alerts for the build in MacPython/pandas-wheels? I don't see anywhere to sign up for them.

Can you take a look at my maybe-fix for the test failures? IIUC the issue was that the DecimalArray does an np.ndarray.take(indexer). NumPy expected indexer.dtype to be np.intp on that system, but from pandas it'll always be np.int64?

jreback · 2018-02-24T15:32:49Z

Yes, I'll put up a fix. To get alerts you have to go to your Travis CI profile, sync GitHub, then check the box on that repo, then wait a while :>

jreback · 2018-02-24T15:53:40Z

pandas/tests/extension/json/array.py

+    def unique(self):
+        # Parent method doesn't work since np.array will try to infer
+        # a 2-dim object.
+        return type(self)([


use self._constructor rather than type(self) generally

would change this too

See the above comment: define _construct_from_sequence.
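
The diff above is truncated, so the sketch below is not the PR's implementation. It shows one way to deduplicate a list of dicts without routing through np.array, which the comment in the diff says the parent method cannot do. `unique_dicts` is a hypothetical helper, and it assumes every value in every dict is hashable.

```python
def unique_dicts(records):
    """Hypothetical helper: drop duplicate dicts, keeping first-seen order.

    Assumes every value in every dict is hashable.
    """
    seen = set()
    result = []
    for record in records:
        # dicts are unhashable, so use a frozenset of their items as the key;
        # like dict equality, this ignores key insertion order.
        key = frozenset(record.items())
        if key not in seen:
            seen.add(key)
            result.append(dict(record))
    return result


data = [{"a": 1}, {"b": 2}, {"a": 1}]
uniques = unique_dicts(data)
```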

jreback

small changes, otherwise lgtm.

jreback · 2018-02-26T14:16:19Z

pandas/core/arrays/base.py

+        Returns
+        -------
+        ExtensionArray
+        """


There are probably other locations that can be changed to use this.

The only other type(self) in this file is to get the class name.

TomAugspurger · 2018-02-26T14:16:37Z

I added _constructor, but I think it and the old way of doing type(self)(arg) were a mistake.
#19906

jorisvandenbossche · 2018-02-26T14:29:38Z

I added a comment saying we should have a better reason for adding something like _constructor (and should specify it in the docs, which currently don't say this), but now I see the new issue you opened, and this is actually a good reason :)
I was previously complaining, for geopandas, that the requirement that GeometryArray() should be able to take a sequence of scalars was a bit annoying (but certainly hackable). This way we could indeed leave the main constructor more free for the extension author.

TomAugspurger · 2018-02-26T14:31:48Z

👍 FWIW, this PR is blocking factorize (and thus groupby) so I just want to get this one merged 😆

A concrete example in #19906 would be very helpful.

jorisvandenbossche · 2018-02-26T14:35:28Z

To move forward quickly with this PR, I would personally just leave it as type(self)(..) for now, as that is perfectly valid code and more conservative with regard to extending the interface.
And keep potentially adding _constructor (or variants proposed in #19906) for another PR, where we could discuss in more detail what is exactly expected from such constructor.

TomAugspurger · 2018-02-26T14:37:14Z

OK, reverted.

If we get #19906 sketched out I can implement it later today, so we'll only have the type(self)(...) stuff in master for a day or two.

jorisvandenbossche

Looks good, apart from the one comment

jorisvandenbossche · 2018-03-06T15:35:25Z

pandas/tests/extension/base/methods.py

+    @pytest.mark.parametrize('box', [pd.Series, lambda x: x])
+    @pytest.mark.parametrize('method', [lambda x: x.unique(), pd.unique])
+    def test_unique(self, data, box, method):
+        duplicated = box(type(data)([data[0], data[0]]))


Should this one also be _constructor_from_sequence? (It is a generic test that is meant to be subclassed.)

jreback · 2018-03-13T10:17:06Z

pandas/core/arrays/base.py

+        """Compute the ExtensionArray of unique values.
+
+        Returns
+        -------


A future PR should probably add some examples here :> (and to other docstrings).

True :) The only problem is that for ExtensionArray we don't have a direct working example, as you first need to subclass it (unless we use one of the existing ones like Categorical, but that also seems a bit strange)

jreback · 2018-03-13T10:17:22Z

thanks!

ENH: ExtensionArray.unique

7267544

TomAugspurger mentioned this pull request Feb 23, 2018

ExtensionArray meta-issue #19696

Closed

15 tasks

TomAugspurger added the Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff label Feb 23, 2018

TomAugspurger added this to the 0.23.0 milestone Feb 23, 2018

TomAugspurger added the Dtype Conversions Unexpected or buggy dtype conversions label Feb 23, 2018

jreback reviewed Feb 23, 2018

Linting

07148db

TomAugspurger mentioned this pull request Feb 23, 2018

ENH: Factorize and unique ContinuumIO/cyberpandas#9

Merged

TomAugspurger commented Feb 23, 2018

Update comment, remove buggy line

c8b5852

Fixed 32-bit test failures

5099573

Revert "Fixed 32-bit test failures"

a5d6b67

This reverts commit 5099573.

jreback requested changes Feb 24, 2018

jorisvandenbossche approved these changes Feb 25, 2018

jreback requested changes Feb 25, 2018

Added _constructor

011d02e

jreback requested changes Feb 26, 2018

Revert "Added _constructor"

b8711d3

This reverts commit 011d02e.

jreback requested changes Feb 27, 2018

TomAugspurger mentioned this pull request Feb 27, 2018

API: Added ExtensionArray constructors #19913

Merged

jreback requested changes Mar 4, 2018

TomAugspurger added 2 commits March 5, 2018 15:25

Merge remote-tracking branch 'upstream/master' into fu1+unique

c15d42d

Updated

a260d35

jorisvandenbossche approved these changes Mar 6, 2018

TomAugspurger added 3 commits March 12, 2018 09:16

Merge remote-tracking branch 'upstream/master' into fu1+unique

51f8a27

Use from_sequence

fc04612

Merge remote-tracking branch 'upstream/master' into fu1+unique

41dd128

jreback approved these changes Mar 13, 2018

jreback reviewed Mar 13, 2018

jreback merged commit c748de0 into pandas-dev:master Mar 13, 2018

TomAugspurger deleted the fu1+unique branch May 2, 2018 13:10
