ENH: support 'duplicated' functionality for ExtensionArrays #27264

Closed
jorisvandenbossche opened this issue Jul 6, 2019 · 3 comments · Fixed by #55255
Labels
duplicated duplicated, drop_duplicates Enhancement ExtensionArray Extending pandas with custom dtypes or arrays.

Comments

@jorisvandenbossche
Member

For the hashtable-based functionality behind factorize, unique, and groupby, we added the _values_for_factorize() / factorize() methods to the ExtensionArray interface, so those operations work nicely. However, some of the other hashtable-based methods, such as duplicated() and drop_duplicates(), do not use this machinery: the EA is still coerced to a numpy array before being passed to the algos code.
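To make the first half of that concrete, here is a minimal sketch of the operations that already go through the ExtensionArray machinery. It assumes a reasonably recent pandas and uses only the public `pd.factorize` and `Series.unique` calls; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 1, 2, None], dtype="Int64")

# factorize on the underlying ExtensionArray: the uniques come back as an
# IntegerArray, not as an object-dtype numpy array.
codes, uniques = pd.factorize(s.array)

# unique() likewise returns an IntegerArray and keeps the missing value.
unique_values = s.unique()

print(codes)
print(uniques.dtype, unique_values.dtype)
```

Missing values get the sentinel code -1 from `pd.factorize` by default, while `unique()` keeps the missing value as its own entry.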

As a small illustration, patch IntegerArray to print a message whenever it is coerced to a numpy array:

--- a/pandas/core/arrays/integer.py
+++ b/pandas/core/arrays/integer.py
@@ -364,6 +364,7 @@ class IntegerArray(ExtensionArray, ExtensionOpsMixin):
         the array interface, return my values
         We return an object array here to preserve our scalar values
         """
+        print("getting coerced to an array")
         return self._coerce_to_ndarray()

In [2]: s = pd.Series([1, 2, 1, 2, None], dtype='Int64')

In [3]: s
Out[3]: getting coerced to an array

0      1
1      2
2      1
3      2
4    NaN
dtype: Int64

In [4]: s.duplicated()
getting coerced to an array
getting coerced to an array
Out[4]: 
0    False
1    False
2     True
3     True
4    False
dtype: bool

In [5]: s.unique()
Out[5]: 
<IntegerArray>
[1, 2, NaN]
Length: 3, dtype: Int64
@jorisvandenbossche jorisvandenbossche added Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. labels Jul 6, 2019
@jbrockmendel
Member

So the idea is duplicated would use values for factorize?

@jorisvandenbossche
Member Author

Yes, I think that is the general idea. There might be some complications in practice, didn't look enough in detail to be sure.

@mroeschke mroeschke added the duplicated duplicated, drop_duplicates label Jul 10, 2021
@jbrockmendel
Member

> So the idea is duplicated would use values for factorize?
>
> Yes, I think that is the general idea. There might be some complications in practice, didn't look enough in detail to be sure.

Revisiting this, I think we definitely don't want to implement this in terms of _values_for_factorize, but we could implement it in terms of factorize.
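One way to read that suggestion, as a rough sketch only: factorize the array to integer codes, then run the duplicate check on the codes, so the extension values themselves never need to be converted to an object array by the caller. The helper name `duplicated_via_factorize` is hypothetical and this is not necessarily how the fix in #55255 is written; it only assumes the public `pd.factorize` and `Series.duplicated` APIs.

```python
import numpy as np
import pandas as pd


def duplicated_via_factorize(values, keep="first"):
    """Hypothetical helper: boolean mask of duplicates, computed on factorize codes."""
    # Equal values share a code; missing values all get the sentinel -1,
    # so they are treated as duplicates of each other.
    codes, _ = pd.factorize(values)
    # The codes are a plain integer ndarray, so the existing integer
    # hashtable path can decide which positions are repeats.
    return pd.Series(codes).duplicated(keep=keep).to_numpy()


arr = pd.array([1, 2, 1, 2, None], dtype="Int64")
mask_first = duplicated_via_factorize(arr)
mask_last = duplicated_via_factorize(arr, keep="last")
print(mask_first)
print(mask_last)
```

The cost of this approach is an extra pass: the array is hashed once to factorize it and the codes are hashed again to find duplicates, which may matter for large arrays.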
