API: argsort behaviour for ExtensionArray with missing values #21801

jorisvandenbossche · 2018-07-07T22:07:01Z

Currently we don't specify what the behaviour of ExtensionArray.argsort should be when there are missing values.
This is not a huge problem because Series.sort_values deals with the missing values itself (it only argsorts the non-missing data), but we should still pin down and test the behaviour.

I suppose we should follow numpy's example here and put them last (which is also consistent with the default behaviour of sort_values):

In [114]: a = np.array([1, 3, 2, np.nan, 4, np.nan])

In [115]: a.argsort()
Out[115]: array([0, 2, 1, 4, 3, 5])
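
To illustrate what "only argsorts the non-missing data" could look like, here is a minimal sketch in plain NumPy. The helper name `argsort_na_last` and its `mask` argument are hypothetical, not the actual pandas implementation; it argsorts the valid entries and appends the missing positions at the end.

```python
import numpy as np

def argsort_na_last(values, mask):
    """Argsort only the non-missing entries, then append the missing positions.

    `values` is a NumPy array and `mask` is a boolean array that is True
    where a value is missing (both are assumptions made for this sketch).
    """
    values = np.asarray(values)
    mask = np.asarray(mask, dtype=bool)
    positions = np.arange(len(values))
    valid = positions[~mask]
    missing = positions[mask]
    # Stable sort so that ties keep their original relative order.
    order = valid[np.argsort(values[~mask], kind="mergesort")]
    return np.concatenate([order, missing])

a = np.array([1, 3, 2, np.nan, 4, np.nan])
result = argsort_na_last(a, np.isnan(a))
print(result)
```

For this input the result agrees with `a.argsort()` above, since NumPy also sorts NaN last.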


jorisvandenbossche · 2019-04-05T09:03:08Z

Some examples of this:

Numpy sorts NaNs last:

In [5]: a = np.array([2, 1, np.nan, 2])

In [6]: a.argsort()
Out[6]: array([1, 0, 3, 2])

In [7]: a[a.argsort()] 
Out[7]: array([ 1.,  2.,  2., nan])

Categorical has existed for a long time, and has "NaN first" behaviour:

In [8]: a = pd.Categorical(['b', 'a', np.nan, 'b'])   

In [9]: a.argsort()       
Out[9]: array([2, 1, 0, 3])

In [10]: a[a.argsort()]      
Out[10]: 
[NaN, a, b, b]
Categories (2, object): [a, b]

What is inconsistent here is that Categorical also has a sort_values method (which other EAs don't have), and it puts NaNs last by default:

In [11]: a.sort_values()     
Out[11]: 
[a, b, b, NaN]
Categories (2, object): [a, b]
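
The "NaN first" result of argsort is what one would expect if the sort operated directly on the integer codes, where a missing value is stored as -1. The sketch below uses plain NumPy and a hand-written `codes` array as a stand-in; that argsort works on the codes is an assumption for illustration.

```python
import numpy as np

# Hypothetical stand-in for the codes of Categorical(['b', 'a', nan, 'b'])
# with categories ['a', 'b']: 'a' -> 0, 'b' -> 1, missing -> -1.
codes = np.array([1, 0, -1, 1])

# -1 is smaller than every valid code, so the missing value sorts first.
order = np.argsort(codes, kind="mergesort")
print(order)
print(codes[order])
```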

The new IntegerArray (and also the datetimelike arrays) copied this behaviour from Categorical:

In [12]: a = pd.array([2, 1, np.nan, 2], dtype='Int64')

In [13]: a.argsort()  
Out[13]: array([2, 1, 0, 3])

In [14]: a[a.argsort()]       
Out[14]: 
<IntegerArray>
[NaN, 1, 2, 2]
Length: 4, dtype: Int64

On the other hand, IntervalArray currently errors on it:

In [20]: a = pd.arrays.IntervalArray.from_tuples([(1, 2), (0, 1), np.nan, (1, 2)])

In [21]: a.argsort()  
...
TypeError: unorderable types: Interval() > float()

jorisvandenbossche · 2019-04-05T09:19:25Z

I would say ideally, we would follow the "NaN last" behaviour (consistent with numpy's argsort + consistent with the default of sort_values in pandas).

The main question is how to get there: can we just change this? (That would be a hard breaking change, but maybe we can do it for the recent EAs?) How could we deprecate it? When deprecating, we want a way for users to already opt in to the future behaviour, but adding a keyword only for the deprecation (which later needs to be deprecated itself) is also ugly.

One option could be to add a na_position keyword (None by default, meaning the current behaviour; this would change to 'last' in the future).
One disadvantage is that this makes the EA interface (what EA authors need to implement) more complex, as the argsort method needs to handle this additional case. And another disadvantage is that this makes the _values_for_argsort option of the interface quite impossible to do (as the values might need to be different depending on whether the NAs sort first or last).
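
To make the second disadvantage concrete, here is a rough sketch with a hypothetical helper `values_for_argsort` that takes a `na_position` argument. It is not the pandas interface, only an illustration of why the sortable values have to differ between the two cases.

```python
import numpy as np

def values_for_argsort(data, mask, na_position="last"):
    """Hypothetical helper: return sortable values with missing slots filled.

    Missing slots get +inf to sort last or -inf to sort first, so the
    returned values depend on `na_position`. (The sentinel would tie with
    real infinities in the data; this sketch ignores that case.)
    """
    if na_position not in ("first", "last"):
        raise ValueError("na_position must be 'first' or 'last'")
    out = np.asarray(data, dtype="float64").copy()
    fill = np.inf if na_position == "last" else -np.inf
    out[np.asarray(mask, dtype=bool)] = fill
    return out

data = np.array([2, 1, 0, 2])  # the value stored under the mask is arbitrary
mask = np.array([False, False, True, False])

last = np.argsort(values_for_argsort(data, mask, "last"), kind="mergesort")
first = np.argsort(values_for_argsort(data, mask, "first"), kind="mergesort")
print(last)
print(first)
```

A parameter-free _values_for_argsort can only return one of these two arrays, so it can only express one of the two orderings.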

cc @jreback @TomAugspurger

makbigc · 2019-05-12T08:06:35Z

When nan is involved, argsort behaves in two ways.

The element at index 1 is always nan in the following examples.

Putting the nan at the beginning:

In [27]: arr = integer_array([1, np.nan, 2])

In [28]: arr
Out[28]:
<IntegerArray>
[1, NaN, 2]
Length: 3, dtype: Int64

In [29]: arr.argsort()
Out[29]: array([1, 0, 2])

Putting the nan at the end:

In [20]: arr1 = integer_array([1, np.nan, 0], dtype='uint8')

In [21]: arr1
Out[21]:
<IntegerArray>
[1, NaN, 0]
Length: 3, dtype: UInt8

In [22]: arr1.argsort()
Out[22]: array([2, 0, 1])

In [23]: idx = pd.Index([1, np.nan, 2])

In [24]: arr = idx.array

In [25]: arr
Out[25]:
<PandasArray>
[1.0, nan, 2.0]
Length: 3, dtype: float64

In [26]: arr.argsort()
Out[26]: array([0, 2, 1])

Should we standardize where nan is placed? _values_for_argsort returns a different value for nan in different EAs.
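
One possible explanation for the Int64 versus UInt8 difference, offered as an assumption rather than something confirmed in this thread: if the missing slots are filled with "smallest value minus one" before sorting, the fill value wraps around to the maximum for unsigned dtypes. The helper `fill_min_minus_one` below is hypothetical and uses plain NumPy.

```python
import numpy as np

def fill_min_minus_one(data, mask):
    """Assumed fill rule: put (smallest valid value - 1) into the missing slots."""
    data = np.asarray(data).copy()
    mask = np.asarray(mask, dtype=bool)
    smallest = data[~mask].min(keepdims=True)
    one = np.ones(1, dtype=data.dtype)
    # Array arithmetic keeps the dtype, so 0 - 1 wraps to 255 for uint8.
    data[mask] = (smallest - one)[0]
    return data

mask = np.array([False, True, False])
signed = fill_min_minus_one(np.array([1, 1, 2], dtype="int64"), mask)
unsigned = fill_min_minus_one(np.array([1, 1, 0], dtype="uint8"), mask)

signed_order = np.argsort(signed, kind="mergesort")
unsigned_order = np.argsort(unsigned, kind="mergesort")
print(signed, signed_order)
print(unsigned, unsigned_order)
```

Under that assumption the signed case puts the missing position first and the unsigned case puts it last, which matches the two outputs shown above.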

TomAugspurger · 2019-05-13T14:21:00Z

Still thinking through this... Right now, I'd prefer to not just break things by moving NaN to the end. So I think my preferred route is a keyword to argsort, but I could be convinced otherwise.

Specifically on

"And another disadvantage is that this makes the _values_for_argsort option of the interface quite impossible to do (as the values might need to be different depending on whether the NAs sort first or last)"

We could document that _values_for_argsort is supposed to follow the convention that NaNs are sorted last. This may cause silent failures for 3rd party EA authors who currently behave differently (though we can ping the ones we know about on this issue).
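
A small check like the following might help EA authors notice whether their argsort already follows a "NaNs sorted last" convention. The function name `missing_sorted_last` and its arguments are hypothetical; it only needs the argsort result and a boolean missing-value mask.

```python
import numpy as np

def missing_sorted_last(order, mask):
    """Return True if all missing positions come after all valid ones in `order`.

    `order` is an argsort result and `mask` is True where the original
    array has a missing value (both are assumptions for this sketch).
    """
    mask = np.asarray(mask, dtype=bool)
    n_missing = int(mask.sum())
    if n_missing == 0:
        return True
    # Reorder the mask by `order`; the tail must then be entirely missing.
    sorted_mask = mask[np.asarray(order, dtype=np.intp)]
    return bool(sorted_mask[-n_missing:].all())

mask = np.array([False, False, True, False])
nan_last = missing_sorted_last([1, 0, 3, 2], mask)   # NumPy-style ordering
nan_first = missing_sorted_last([2, 1, 0, 3], mask)  # Categorical-style ordering
print(nan_last, nan_first)
```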

jorisvandenbossche · 2019-05-13T15:43:28Z

My main hesitation to a keyword long-term is that it forces you to actually implement both behaviours, making the required argsort more complex.

If we don't want to keep the keyword long term, another option would be to add it only to Categorical (where we probably need a deprecation), and make it a breaking change for the other recent EAs.

TomAugspurger · 2019-07-02T20:32:52Z

@jorisvandenbossche this is probably a blocker for the RC, since we want to make a breaking change? I can maybe put up a PR tonight.

jorisvandenbossche · 2019-07-02T21:07:58Z

Do you mean the Categorical part? (As the EA part is covered by #27137?)

TomAugspurger · 2019-07-02T21:09:53Z

Yep.

jorisvandenbossche · 2019-07-02T21:14:51Z

Did we already decide how to go about it?
Simply break, or add a keyword?

TomAugspurger · 2019-07-02T21:16:01Z

Simple break IMO.

jreback · 2019-07-02T21:16:52Z

agree

jorisvandenbossche · 2019-07-02T21:17:49Z

OK

TomAugspurger · 2019-12-30T15:49:22Z

I think that everything here has been resolved by #27137

jorisvandenbossche added the ExtensionArray Extending pandas with custom dtypes or arrays. label Jul 7, 2018

jorisvandenbossche mentioned this issue Jul 7, 2018

ENH: Integer NA Extension Array #21160

Merged

jorisvandenbossche mentioned this issue Mar 11, 2019

Suppress incorrect warning in nargsort for timezone-aware DatetimeIndex #25629

Merged

jorisvandenbossche added this to the 0.25.0 milestone Apr 5, 2019

jorisvandenbossche added the API Design label Apr 5, 2019

jorisvandenbossche mentioned this issue May 11, 2019

Implement min, argmin, max, argmax on ExtensionArrays? #24382

Closed

makbigc mentioned this issue May 12, 2019

API: ExtensionArray.argsort places the missing value at the end #26354

Closed

makbigc mentioned this issue Jul 2, 2019

API: ExtensionArray.argsort places the missing value at the end #27137

Merged

jreback modified the milestones: 0.25.0, 1.0 Jul 17, 2019

TomAugspurger closed this as completed Dec 30, 2019

TomAugspurger modified the milestones: 1.0, 0.25.0 Dec 30, 2019
