ENH: ExtensionArray.unique #19869
This doesn't conflict with #19863 since unique returns an ExtensionArray.
pandas/core/arrays/base.py
Outdated
    from pandas import unique

    uniques = unique(self.astype(object))
    return type(self)(uniques)
I think you should add a `self._shallow_copy` or something instead of doing this.
To avoid the copy or to avoid the `type` stuff?
In this specific case, I think a copy is necessary since `uniques` is an ndarray of objects.
As for alternative constructors to avoid the `type` stuff, sure, we just need to come up with names for them.
For my IPAddress stuff I want to be able to do zero-copy construction from
- ExtensionArray
- numpy structured array with the correct fields.
Putting those in separate constructors makes sense.
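As a rough illustration of the "separate constructors" idea, here is a minimal pure-Python sketch. The class `ToyIPArray` and the method name `from_storage` are hypothetical and are not part of the pandas interface; a plain list stands in for the underlying storage.

```python
class ToyIPArray:
    """Toy container; a hypothetical stand-in, not the pandas interface."""

    def __init__(self, values):
        # General-purpose path: always copies, so the caller's data is safe.
        self._data = list(values)

    @classmethod
    def from_storage(cls, storage):
        """Wrap an existing list without copying it (hypothetical name)."""
        obj = cls.__new__(cls)  # skip __init__, and with it the copy
        obj._data = storage
        return obj


storage = [167772161, 167772162]
copied = ToyIPArray(storage)                 # owns a fresh list
wrapped = ToyIPArray.from_storage(storage)   # shares the caller's list
```

Keeping the copying path in `__init__` and the zero-copy path in a named classmethod makes the ownership of the data explicit at the call site.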
No, it's the idiom; see below. There should be a `_constructor` in EA which just returns `type(self)`, which is what we do in Index. This handled the sub-classing things.
@jreback What do you mean with "this handled the sub-classing things"?
Because everywhere else we use `_constructor` as the constructor class and do not use `type(self)` directly (yes, that is the implementation, but the code all uses `_constructor`).
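For readers unfamiliar with the idiom being argued for, a minimal sketch follows. `ToyArray` and `ToySubArray` are hypothetical toy classes, not pandas code; the only assumption taken from the thread is that `_constructor` simply returns `type(self)`.

```python
class ToyArray:
    """Toy array used only to illustrate the `_constructor` idiom."""

    def __init__(self, values):
        self._values = list(values)

    @property
    def _constructor(self):
        # Default: rebuild results with the same class as the instance.
        return type(self)

    def unique(self):
        # dict.fromkeys keeps first-seen order and drops duplicates.
        return self._constructor(dict.fromkeys(self._values))


class ToySubArray(ToyArray):
    """Subclass that inherits unique() and still gets its own type back."""


result = ToySubArray([3, 1, 3]).unique()
```

Methods call `self._constructor(...)` rather than `type(self)(...)`, so a subclass that needs a different result type has a single place to override.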
I understand that, but it's not because we do it everywhere else internally that we need to do it here as well. Here we have something that is part of an external interface, and if adding something like that we should have a reason for it. And a clear message to the implementer what we expect from it.
Eg in Series/DataFrame you have interactions between different subclasses (a slice of DataFrame can give another Series subclass). That's not something we have to deal with here.
This just becomes confusing. I don't see any reason not to have a `_constructor`; it's obvious and consistent in the code. These ad-hoc breaks with internal consistency are just creating technical debt. Please let's have consistency.
You want me to change it to `_constructor` even though it'll be irrelevant after #19913 is resolved?
see my comments on that PR
So this should now change to `_construct_from_sequence`; in fact, we should try to be consistent and always use this.
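One reason a from-sequence hook can matter is that `type(self)(uniques)` assumes `__init__` accepts a flat sequence of scalars. The sketch below is a toy illustration under that assumption: `ToySplitArray` is hypothetical, its storage layout (two parallel lists of 64-bit halves) is invented for the example, and the hook name is borrowed from the comment above rather than from any confirmed pandas API.

```python
class ToySplitArray:
    """Toy array whose __init__ takes storage, not a sequence of scalars."""

    def __init__(self, hi, lo):
        # Two parallel lists: high and low 64-bit halves of each value.
        self._hi = list(hi)
        self._lo = list(lo)

    @classmethod
    def _construct_from_sequence(cls, scalars):
        # Hypothetical hook: build the array from plain Python ints.
        scalars = list(scalars)
        mask = (1 << 64) - 1
        return cls([s >> 64 for s in scalars], [s & mask for s in scalars])

    def _to_scalars(self):
        # Recombine the halves into one Python int per element.
        return [(h << 64) | l for h, l in zip(self._hi, self._lo)]

    def unique(self):
        # Deduplicate at the scalar level, keeping first-seen order.
        uniques = dict.fromkeys(self._to_scalars())
        # type(self)(uniques) would raise TypeError here: __init__ needs hi and lo.
        return self._construct_from_sequence(uniques)


arr = ToySplitArray._construct_from_sequence([1, (1 << 64) + 5, 1])
deduped = arr.unique()
```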
Codecov Report
@@            Coverage Diff            @@
##           master   #19869    +/-   ##
========================================
+ Coverage    91.7%    91.7%   +<.01%
========================================
  Files         150      150
  Lines       49165    49168       +3
========================================
+ Hits        45087    45090       +3
  Misses       4078     4078
pandas/core/algorithms.py
Outdated
@@ -356,7 +356,7 @@ def unique(values):
     # categorical is a fast-path
     # this will coerce Categorical, CategoricalIndex,
     # and category dtypes Series to same return of Category
-    if is_categorical_dtype(values):
+    if is_extension_array_dtype(values):
         values = getattr(values, '.values', values)
Ha, this `getattr` is doing nothing since it's `'.values'` and not `'values'` :)
Ah, and we have a test that depends on it! I'm going to split this off to a new issue then.
Ah, never mind. I think we can just remove it instead of correcting it.
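The no-op described in this thread is easy to reproduce in isolation. In the sketch below, `Box` is a hypothetical stand-in for any object with a `values` attribute; only the built-in `getattr` is used.

```python
class Box:
    """Hypothetical wrapper exposing a `values` attribute."""

    def __init__(self, values):
        self.values = values


box = Box([1, 2, 3])

# No attribute is literally named '.values', so the default is returned.
with_dot = getattr(box, '.values', box)

# Without the leading dot the attribute lookup succeeds.
without_dot = getattr(box, 'values', box)
```

Because the three-argument form of `getattr` swallows the failed lookup, the mistake is silent: the object passes through unchanged.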
@TomAugspurger before we move on, some 32-bit issues: https://travis-ci.org/MacPython/pandas-wheels/jobs/345517480
in pandas/tests/extension/base/getitem.py you need to pass |
Thanks. Do you get email alerts for the build in MacPython/pandas-wheels? I don't see anywhere to sign up for them. Can you take a look at my maybe-fix for the test failures? IIUC the issue was that the |
Yes, I'll put up a fix. To get alerts you have to go to your Travis CI profile, sync GitHub, then check the box on that repo, then wait a while :>
This reverts commit 5099573.
    def unique(self):
        # Parent method doesn't work since np.array will try to infer
        # a 2-dim object.
        return type(self)([
Use `self._constructor` rather than `type(self)` generally.
would change this too
See above comment; define `_construct_from_sequence`.
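The `unique` override quoted above is cut off in the diff, so its real body is not shown here. As a sketch of how such an override might deduplicate dict elements without going through `np.array`, the toy class below keys each dict by a sorted tuple of its items. `ToyJSONArray` is hypothetical, and the approach assumes flat dicts with string keys and hashable values.

```python
class ToyJSONArray:
    """Toy array of dicts; a hypothetical stand-in for the test array."""

    def __init__(self, values):
        self._data = list(values)

    def unique(self):
        seen = {}
        for d in self._data:
            # Dicts are unhashable, so use a sorted tuple of items as the key.
            key = tuple(sorted(d.items()))
            # Keep the first dict seen for each distinct key.
            seen.setdefault(key, d)
        return type(self)(seen.values())


arr = ToyJSONArray([{"a": 1}, {"a": 1}, {"b": 2, "a": 1}, {"a": 1, "b": 2}])
deduped = arr.unique()
```

Sorting the items makes the key independent of insertion order, so `{"b": 2, "a": 1}` and `{"a": 1, "b": 2}` count as the same element.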
small changes, otherwise lgtm.
pandas/core/arrays/base.py
Outdated
    Returns
    -------
    ExtensionArray
    """
There are probably other locations that can be changed to use this.
The only other `type(self)` in this file is to get the class name.
I added |
I added a comment saying we should have a better reason (and specify this in the docs, as now it doesn't say this) for adding something like |
👍 FWIW, this PR is blocking factorize (and thus groupby), so I just want to get this one merged 😆 A concrete example in #19906 would be very helpful.
To move forward quickly with this PR, I would personally just leave it as |
This reverts commit 011d02e.
OK, reverted. If we get #19906 sketched out I can implement it later today, so we'll only have the |
Looks good, apart from the one comment
    @pytest.mark.parametrize('box', [pd.Series, lambda x: x])
    @pytest.mark.parametrize('method', [lambda x: x.unique(), pd.unique])
    def test_unique(self, data, box, method):
        duplicated = box(type(data)([data[0], data[0]]))
Should this one also be `_constructor_from_sequence`? (As it is a generic test to be subclassed.)
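To make the question concrete, here is a sketch of a generic check that builds its input through a from-sequence hook instead of calling the class directly. Everything here is hypothetical: `ToyArray`, `BaseUniqueChecks`, and the assertions are illustrative, and the hook name is taken from the comment above rather than from a confirmed interface.

```python
class ToyArray:
    """Toy array exposing a from-sequence hook (hypothetical name)."""

    def __init__(self, values):
        self._values = list(values)

    @classmethod
    def _constructor_from_sequence(cls, scalars):
        return cls(scalars)

    def __getitem__(self, index):
        return self._values[index]

    def __len__(self):
        return len(self._values)

    def unique(self):
        # dict.fromkeys drops duplicates while keeping first-seen order.
        return self._constructor_from_sequence(dict.fromkeys(self._values))


class BaseUniqueChecks:
    """Generic check meant to be reused for any array type."""

    def check_unique(self, data):
        # Build the input via the hook, so arrays whose __init__ does not
        # accept a sequence of scalars can still reuse this check.
        duplicated = data._constructor_from_sequence([data[0], data[0]])
        result = duplicated.unique()
        assert len(result) == 1
        assert isinstance(result, type(data))
        assert result[0] == duplicated[0]


BaseUniqueChecks().check_unique(ToyArray([7, 8]))
```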
    """Compute the ExtensionArray of unique values.

    Returns
    -------
A future PR should probably add some examples here :> (and to other docstrings).
True :) The only problem is that for ExtensionArray we don't have a direct working example, as you first need to subclass it (unless we use one of the existing ones like Categorical, but that also seems a bit strange)
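One possible shape for such a docstring example, sketched with a toy subclass since the base class cannot be instantiated directly. `ToyArray` is hypothetical; the docstring layout mirrors the numpydoc sections quoted above, and the standard-library `doctest` module is used to check that the example actually runs.

```python
import doctest


class ToyArray:
    """Toy array used only to host a runnable docstring example."""

    def __init__(self, values):
        self._values = list(values)

    def __repr__(self):
        return f"ToyArray({self._values!r})"

    def unique(self):
        """Compute the ToyArray of unique values.

        Returns
        -------
        ToyArray

        Examples
        --------
        >>> ToyArray([1, 1, 2]).unique()
        ToyArray([1, 2])
        """
        return type(self)(dict.fromkeys(self._values))


# Parse and run the example embedded in the docstring.
example = doctest.DocTestParser().get_doctest(
    ToyArray.unique.__doc__, {"ToyArray": ToyArray}, "ToyArray.unique", None, 0
)
outcome = doctest.DocTestRunner(verbose=False).run(example)
```

Using a small, clearly labelled toy class in the example avoids borrowing an existing array such as Categorical, at the cost of a few extra lines of setup.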
Thanks!