Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: ExtensionArray.unique #19869

Merged
merged 12 commits into from
Mar 13, 2018
Merged

ENH: ExtensionArray.unique #19869

merged 12 commits into from
Mar 13, 2018

Conversation

TomAugspurger
Copy link
Contributor

No description provided.

@TomAugspurger TomAugspurger mentioned this pull request Feb 23, 2018
15 tasks
@TomAugspurger
Copy link
Contributor Author

This doesn't conflict with #19863 since unique returns an ExtensionArray.

@TomAugspurger TomAugspurger added the Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff label Feb 23, 2018
@TomAugspurger TomAugspurger added this to the 0.23.0 milestone Feb 23, 2018
@TomAugspurger TomAugspurger added the Dtype Conversions Unexpected or buggy dtype conversions label Feb 23, 2018
from pandas import unique

uniques = unique(self.astype(object))
return type(self)(uniques)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think you should add a
self._shallow_copy or something instead of doing this)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid the copy or to avoid the type stuff?

In this specific case, I think a copy is necessary since uniques is an ndarray objects.

As for alternative constructors to avoid the type stuff, sure, just need to come up with names for them.

For my IPAddress stuff I want to be able to do zero-copy construction from

  • ExtensionArray
  • numpy structured array with the correct fields.

Putting those in separate constructors makes sense.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no its the idiom , see below, should have a _constructor in EA which just returns type(self), which is what we do in Index. this handled the sub-classing things.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback What do you mean with "this handled the sub-classing things"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because everywhere else where use _constructor as the constructor class and do not use type(self) directly (yes that is the implementation, but the code all uses _constructor

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that, but it's not because we do it everywhere else internally that we need to do it here as well. Here we have something that is part of an external interface, and if adding something like that we should have a reason for it. And a clear message to the implementer what we expect from it.
Eg in Series/DataFrame you have interactions between different subclasses (a slice of DataFrame can give another Series subclass). That's not something we have to deal with here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This just becomes confusing. I don't any reason not to have a _constructor, its obvious, consistent in the code. These ad-hoc breaks with internal consistency are just creating technical debt. Please let's have consistentcy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You want me to change it to _constructor even though it'll be irrelevant after #19913 is resolved?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my comments on that PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this now should change to _construct_from_sequence, in fact should be try to be concistent and always use this.

@codecov
Copy link

codecov bot commented Feb 23, 2018

Codecov Report

Merging #19869 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #19869      +/-   ##
==========================================
+ Coverage    91.7%    91.7%   +<.01%     
==========================================
  Files         150      150              
  Lines       49165    49168       +3     
==========================================
+ Hits        45087    45090       +3     
  Misses       4078     4078
Flag Coverage Δ
#multiple 90.09% <100%> (ø) ⬆️
#single 41.86% <50%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/algorithms.py 94.17% <100%> (-0.01%) ⬇️
pandas/core/arrays/base.py 76.74% <100%> (+2.38%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bfe6ebc...41dd128. Read the comment docs.

@@ -356,7 +356,7 @@ def unique(values):
# categorical is a fast-path
# this will coerce Categorical, CategoricalIndex,
# and category dtypes Series to same return of Category
if is_categorical_dtype(values):
if is_extension_array_dtype(values):
values = getattr(values, '.values', values)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha, this getattr is doing nothing since it's '.values' and not 'values' :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, and we have a test that depends on it! I'm going to split this off to a new issue then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, nevermind. I think we can just remove it instead of correcting it.

@jreback
Copy link
Contributor

jreback commented Feb 24, 2018

@TomAugspurger before we move on, some 32-bit issues: https://travis-ci.org/MacPython/pandas-wheels/jobs/345517480

    def test_loc_frame(self, data):
        df = pd.DataFrame({"A": data, 'B': np.arange(len(data))})
        expected = pd.DataFrame({"A": data[:4]})

in pandas/tests/extension/base/getitem.py

you need to pass dtype='int64' when using np.arange

@TomAugspurger
Copy link
Contributor Author

Thanks.

Do you get email alerts for the build in MacPython/pandas-wheels? I don't see anywhere to sign up for them.

Can you take a look at my maybe-fix for the test failures? IIUC the issue was that the DecimalArray does an np.ndarray.take(indexer). NumPy expected indexer.dtype to be np.intp on that system, but from pandas it'll always be np.int64?

@jreback
Copy link
Contributor

jreback commented Feb 24, 2018

yes i'll put up a fix. to get alerts you have to go to travis ci profile (yours), sync github then check the box on that repo, then wait a while :>

def unique(self):
# Parent method doesn't work since np.array will try to infer
# a 2-dim object.
return type(self)([
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use self._constructor rather than type(self) generally

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would change this too

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above comment, define _consturct_from_sequence

from pandas import unique

uniques = unique(self.astype(object))
return type(self)(uniques)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no its the idiom , see below, should have a _constructor in EA which just returns type(self), which is what we do in Index. this handled the sub-classing things.

from pandas import unique

uniques = unique(self.astype(object))
return type(self)(uniques)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because everywhere else where use _constructor as the constructor class and do not use type(self) directly (yes that is the implementation, but the code all uses _constructor

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small changes, otherwise lgtm.

def unique(self):
# Parent method doesn't work since np.array will try to infer
# a 2-dim object.
return type(self)([
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would change this too

Returns
-------
ExtensionArray
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prob other locations that can be changed to use this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only other type(self) in this file is to get the class name.

@TomAugspurger
Copy link
Contributor Author

I added _constructor, but I think it and the old way of doing type(self)(arg) were a mistake.
#19906

@jorisvandenbossche
Copy link
Member

I added a comment saying we should have a better reason (and specify this in the docs, as now it doesn't say this) for adding something like _constructor, but now I see the new issue you opened, and this is actually good reason :)
As I was previously complaining for geopandas that the requirement that GeometryArray() should be able to take a sequence of scalars was a bit annoying (but certainly hackable). But this way we could indeed keep the main constructor more free to the extension author.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Feb 26, 2018

👍 FWIW, this PR is blocking factorize (and this groupby) so I just want to get this one merged 😆

A concrete example in #19906 would be very helpful.

@jorisvandenbossche
Copy link
Member

To move forward quickly with this PR, I would personally just leave it as type(self)(..) for now, as that is perfectly valid code and more conservative with regard to extending the interface.
And keep potentially adding _constructor (or variants proposed in #19906) for another PR, where we could discuss in more detail what is exactly expected from such constructor.

This reverts commit 011d02e.
@TomAugspurger
Copy link
Contributor Author

OK, reverted.

If we get #19906 sketched out I can implement it later today, so we'll only have the type(self)(...) stuff in master for a day or two.

from pandas import unique

uniques = unique(self.astype(object))
return type(self)(uniques)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This just becomes confusing. I don't any reason not to have a _constructor, its obvious, consistent in the code. These ad-hoc breaks with internal consistency are just creating technical debt. Please let's have consistentcy.

from pandas import unique

uniques = unique(self.astype(object))
return type(self)(uniques)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this now should change to _construct_from_sequence, in fact should be try to be concistent and always use this.

def unique(self):
# Parent method doesn't work since np.array will try to infer
# a 2-dim object.
return type(self)([
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above comment, define _consturct_from_sequence

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, apart from the one comment

@pytest.mark.parametrize('box', [pd.Series, lambda x: x])
@pytest.mark.parametrize('method', [lambda x: x.unique(), pd.unique])
def test_unique(self, data, box, method):
duplicated = box(type(data)([data[0], data[0]]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this one also be _constructor_from_sequence? (as it is a generic test to be subclassed)

"""Compute the ExtensionArray of unique values.

Returns
-------
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

future PR should prob add some examples here :> (and other doc-strings).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True :) The only problem is that for ExtensionArray we don't have a direct working example, as you first need to subclass it (unless we use one of the existing ones like Categorical, but that also seems a bit strange)

@jreback jreback merged commit c748de0 into pandas-dev:master Mar 13, 2018
@jreback
Copy link
Contributor

jreback commented Mar 13, 2018

thanks!

@TomAugspurger TomAugspurger deleted the fu1+unique branch May 2, 2018 13:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Dtype Conversions Unexpected or buggy dtype conversions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants