Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: argsort behaviour for ExtensionArray with missing values #21801

Closed
jorisvandenbossche opened this issue Jul 7, 2018 · 13 comments
Closed
Labels
API Design ExtensionArray Extending pandas with custom dtypes or arrays.
Milestone

Comments

@jorisvandenbossche
Copy link
Member

Currently we don't specify what the behaviour should be for ExtensionArray.argsort when there are missing values.
This is not a huge problem because the Series.sort_values deals with the missing values itself (only argsorts the non-missing data), but still we should pin down and test the behaviour.

I suppose we should follow numpy's example here and put them last (which is also consistent with the default behaviour of sort_values):

In [114]: a = np.array([1, 3, 2, np.nan, 4, np.nan])

In [115]: a.argsort()
Out[115]: array([0, 2, 1, 4, 3, 5])

@jorisvandenbossche
Copy link
Member Author

Some examples on this

Numpy sorts NaNs last:

In [5]: a = np.array([2, 1, np.nan, 2])

In [6]: a.argsort()
Out[6]: array([1, 0, 3, 2])

In [7]: a[a.argsort()] 
Out[7]: array([ 1.,  2.,  2., nan])

Categorical already existed for a long time, and has this "NaN first" behaviour

In [8]: a = pd.Categorical(['b', 'a', np.nan, 'b'])   

In [9]: a.argsort()       
Out[9]: array([2, 1, 0, 3])

In [10]: a[a.argsort()]      
Out[10]: 
[NaN, a, b, b]
Categories (2, object): [a, b]

Inconsistent here is that Categorical also has a sort_values method (which other EAs don't have) which puts NaNs last by default:

In [11]: a.sort_values()     
Out[11]: 
[a, b, b, NaN]
Categories (2, object): [a, b]

The new IntegerArray (and also the datetimelike arrays) copied this behaviour from Categorical:

In [12]: a = pd.array([2, 1, np.nan, 2], dtype='Int64')

In [13]: a.argsort()  
Out[13]: array([2, 1, 0, 3])

In [14]: a[a.argsort()]       
Out[14]: 
<IntegerArray>
[NaN, 1, 2, 2]
Length: 4, dtype: Int64

On the other hand, IntervalArray currently errors on it:

In [20]: a= pd.arrays.IntervalArray.from_tuples([(1, 2), (0, 1), np.nan, (1, 2)]) 

In [21]: a.argsort()  
...
TypeError: unorderable types: Interval() > float()

@jorisvandenbossche
Copy link
Member Author

I would say ideally, we would follow the "NaN last" behaviour (consistent with numpy's argsort + consistent with the default of sort_values in pandas).

The main question would be how to get there: can we just change this? (that would be a hard breaking change, but maybe we can do it for the recent EAs?) How could we deprecate it? When deprecating, we want a way for users to already get the future behaviour (but adding a keyword only for the deprecation (which later needs to be deprecated itself) is also ugly ..).

One option could be to add a na_position keyword (-> None by default (current default), will change to 'last' in the future).
One disadvantage is that this makes the EA interface (what EA authors need to implement) more complex, as the argsort method needs to handle this additional case. And another disadvantage, is that this makes the _values_for_argsort option of the interface quite impossible to do (as the values might need to be different depending on whether the NAs sort first or last)

cc @jreback @TomAugspurger

@makbigc
Copy link
Contributor

makbigc commented May 12, 2019

When nan is invloved, argsort behaves in two ways.

The element of index 1 is always nan in the following.

Putting the nan in the beginning

In [27]: arr = integer_array([1, np.nan, 2])

In [28]: arr
Out[28]:

[1, NaN, 2]
Length: 3, dtype: Int64

In [29]: arr.argsort()
Out[29]: array([1, 0, 2])

Putting the nan in the end

In [20]: arr1 = integer_array([1, np.nan, 0], dtype='uint8')

In [21]: arr1
Out[21]:

[1, NaN, 0]
Length: 3, dtype: UInt8

In [22]: arr1.argsort()
Out[22]: array([2, 0, 1])

In [23]: idx = pd.Index([1, np.nan, 2])

In [24]: arr = idx.array

In [25]: arr
Out[25]:

[1.0, nan, 2.0]
Length: 3, dtype: float64

In [26]: arr.argsort()
Out[26]: array([0, 2, 1])

Should we standardize where nan to be placed? _values_for_argsort returns different value for nan in different EA.

@TomAugspurger
Copy link
Contributor

Still thinking through this... Right now, I'd prefer to not just break things by moving NaN to the end. So I think my preferred route is a keyword to argsort, but I could be convinced otherwise.

Specifically on

And another disadvantage, is that this makes the _values_for_argsort option of the interface quite impossible to do (as the values might need to be different depending on whether the NAs sort first or last)

We could document that _values_for_argsort is supposed to follow the convention that NaNs are sorted last. This may cause silent failures for 3rd party EA authors who currently behave differently (though we can ping the ones we know about on this issue).

@jorisvandenbossche
Copy link
Member Author

My main hesitation to a keyword long-term is that it forces you to actually implement both behaviours, making the required argsort more complex.

If we would not want to keep the keyword long term, it could also be an option to only add it to Categorical (where we probably need to deprecate it), but do it as a breaking change to the other recent EAs

@TomAugspurger
Copy link
Contributor

@jorisvandenbossche this is probably a blocker for the RC, since we want to make a breaking change? I can maybe put up a PR tonight.

@jorisvandenbossche
Copy link
Member Author

Do you mean the Categorical part? (as the EA part is covered by #27137 ?)

@TomAugspurger
Copy link
Contributor

Yep.

@jorisvandenbossche
Copy link
Member Author

Did we already decide how to go about it?
Simply break, or add a keyword?

@TomAugspurger
Copy link
Contributor

Simple break IMO.

@jreback
Copy link
Contributor

jreback commented Jul 2, 2019

agree

@jorisvandenbossche
Copy link
Member Author

OK

@jreback jreback modified the milestones: 0.25.0, 1.0 Jul 17, 2019
@TomAugspurger
Copy link
Contributor

I think that everything here has been resolved by #27137

@TomAugspurger TomAugspurger modified the milestones: 1.0, 0.25.0 Dec 30, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants