ENH: support 'duplicated' functionality for ExtensionArrays #27264

jorisvandenbossche · 2019-07-06T17:31:09Z

For the hashtable-based functionality behind factorize, unique, and groupby, we added a _values_for_factorize() / factorize() method to ExtensionArray, so those methods work nicely. However, other hashtable-based methods such as duplicated() and drop_duplicates() do not use this machinery, and the EA is still coerced to a numpy array before being passed to the algos code.

A small illustration that this is the case, patching IntegerArray to print whenever it is coerced to a numpy array:

--- a/pandas/core/arrays/integer.py
+++ b/pandas/core/arrays/integer.py
@@ -364,6 +364,7 @@ class IntegerArray(ExtensionArray, ExtensionOpsMixin):
         the array interface, return my values
         We return an object array here to preserve our scalar values
         """
+        print("getting coerced to an array")
         return self._coerce_to_ndarray()

In [2]: s = pd.Series([1, 2, 1, 2, None], dtype='Int64') 

In [3]: s
Out[3]: getting coerced to an array

0      1
1      2
2      1
3      2
4    NaN
dtype: Int64

In [4]: s.duplicated()
getting coerced to an array
getting coerced to an array
Out[4]: 
0    False
1    False
2     True
3     True
4    False
dtype: bool

In [5]: s.unique()
Out[5]: 
<IntegerArray>
[1, 2, NaN]
Length: 3, dtype: Int64

jbrockmendel · 2019-07-06T21:23:40Z

So the idea is duplicated would use values for factorize?

jorisvandenbossche · 2019-07-06T22:37:50Z

Yes, I think that is the general idea. There might be some complications in practice, didn't look enough in detail to be sure.

jbrockmendel · 2021-12-27T22:50:38Z

> So the idea is duplicated would use values for factorize?
>
> Yes, I think that is the general idea. There might be some complications in practice, didn't look enough in detail to be sure.

Revisiting this, I think we definitely don't want to implement this in terms of _values_for_factorize, but we could implement it in terms of factorize.
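To make the "in terms of factorize" idea concrete, here is a minimal sketch. It assumes that the integer codes returned by the public pd.factorize() are enough to decide which positions are duplicates, and that missing values (code -1) count as duplicates of each other. The helper name duplicated_via_factorize is hypothetical and is not part of pandas; this is not necessarily how pandas itself ended up implementing ExtensionArray.duplicated.

```python
import numpy as np
import pandas as pd


def duplicated_via_factorize(values, keep="first"):
    """Hypothetical helper: boolean duplicate mask computed from factorize codes."""
    # factorize maps each distinct value to an integer code; missing values
    # get the sentinel code -1, so all NAs share one code.
    codes, _ = pd.factorize(values)
    codes = np.asarray(codes)
    n = len(codes)

    if keep == "first":
        # np.unique reports the first position of each distinct code.
        _, first = np.unique(codes, return_index=True)
        mask = np.ones(n, dtype=bool)
        mask[first] = False
    elif keep == "last":
        # First position in the reversed array is the last position overall.
        _, first_rev = np.unique(codes[::-1], return_index=True)
        mask = np.ones(n, dtype=bool)
        mask[n - 1 - first_rev] = False
    elif keep is False:
        # Mark every occurrence of any code that appears more than once.
        _, inverse, counts = np.unique(
            codes, return_inverse=True, return_counts=True
        )
        mask = counts[inverse] > 1
    else:
        raise ValueError("keep must be 'first', 'last' or False")
    return mask


arr = pd.array([1, 2, 1, 2, None], dtype="Int64")
print(duplicated_via_factorize(arr))
```

Because only the codes are inspected, the array never has to go through an object-dtype numpy conversion in this helper; whether factorize itself avoids that conversion depends on the ExtensionArray's own implementation.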

jorisvandenbossche added the Enhancement and ExtensionArray labels Jul 6, 2019

jbrockmendel mentioned this issue Mar 10, 2020

EA: revisit interface #32586

Closed

mroeschke added the duplicated label Jul 10, 2021

jbrockmendel mentioned this issue Jul 27, 2023

ENH: Add duplicated to MaskedArray/ExtensionArray interface #48424

Closed

jbrockmendel mentioned this issue Sep 24, 2023

ENH/PERF: add ExtensionArray.duplicated #55255

Merged

mroeschke closed this as completed in #55255 Oct 3, 2023
