API: argsort behaviour for ExtensionArray with missing values #21801
Comments
Some examples on this. Numpy sorts NaNs last:
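The original snippet for this point did not survive on this page, so here is a minimal stand-in using only `np.argsort` on a float array. The array values are made up for illustration.

```python
import numpy as np

# Illustrative values only; the NaN sits at index 1.
values = np.array([1.0, np.nan, 0.0])

# np.argsort orders NaN after every other float, so index 1 comes last.
order = np.argsort(values)
print(order.tolist())  # [2, 0, 1]
```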
Categorical already existed for a long time, and has this "NaN first" behaviour
Inconsistent here is that Categorical also has a
The new IntegerArray (and also the datetimelike arrays) copied this behaviour from Categorical:
On the other hand, IntervalArray currently errors on it:
|
I would say ideally, we would follow the "NaN last" behaviour (consistent with numpy's argsort + consistent with the default of

The main question would be how to get there: can we just change this? (That would be a hard breaking change, but maybe we can do it for the recent EAs?) How could we deprecate it? When deprecating, we want a way for users to already get the future behaviour, but adding a keyword only for the deprecation (which later needs to be deprecated itself) is also ugly. One option could be to add a |
When NaN is involved, argsort behaves in two ways. The element at index 1 is NaN in each of the following examples.
In [27]: arr = integer_array([1, np.nan, 2])
In [28]: arr
In [29]: arr.argsort()
In [20]: arr1 = integer_array([1, np.nan, 0], dtype='uint8')
In [21]: arr1
In [22]: arr1.argsort()
In [23]: idx = pd.Index([1, np.nan, 2])
In [24]: arr = idx.array
In [25]: arr
In [26]: arr.argsort()
Should we standardize where NaN is placed? _values_for_argsort returns a different value for NaN in different EAs. |
Still thinking through this... Right now, I'd prefer to not just break things by moving NaN to the end. So I think my preferred route is a keyword to Specifically on
We could document that |
My main hesitation to a keyword long-term is that it forces you to actually implement both behaviours, making the required If we would not want to keep the keyword long term, it could also be an option to only add it to Categorical (where we probably need to deprecate it), but do it as a breaking change to the other recent EAs |
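To make the trade-off concrete, here is a rough sketch of what supporting both placements behind a keyword could look like for a masked array. The helper name `argsort_with_na`, its `mask` argument and the `na_position` keyword are hypothetical and are not part of the ExtensionArray interface; this only illustrates the extra branch an implementation would need.

```python
import numpy as np


def argsort_with_na(values, mask, na_position="last"):
    """Hypothetical helper: argsort `values`, putting masked slots first or last."""
    if na_position not in ("first", "last"):
        raise ValueError("na_position must be 'first' or 'last'")
    values = np.asarray(values)
    mask = np.asarray(mask, dtype=bool)

    positions = np.arange(len(values))
    valid = positions[~mask]     # original positions of non-missing values
    missing = positions[mask]    # original positions of missing values

    # Sort only the non-missing values, then map back to original positions.
    sorted_valid = valid[np.argsort(values[~mask], kind="stable")]

    if na_position == "last":
        return np.concatenate([sorted_valid, missing])
    return np.concatenate([missing, sorted_valid])


# Integer data with a placeholder 0 in the masked slot (index 1).
data = np.array([1, 0, 2])
mask = np.array([False, True, False])

last = argsort_with_na(data, mask)                         # [0, 2, 1]
first = argsort_with_na(data, mask, na_position="first")   # [1, 0, 2]
```

Sorting only the non-missing values means the placeholder stored in the masked slot never influences the result, which sidesteps the question of what sentinel each EA would return for NaN.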
@jorisvandenbossche this is probably a blocker for the RC, since we want to make a breaking change? I can maybe put up a PR tonight. |
Do you mean the Categorical part? (as the EA part is covered by #27137 ?) |
Yep. |
Did we already decide how to go about it? |
Simple break IMO. |
agree |
OK |
I think that everything here has been resolved by #27137 |
Currently we don't specify what the behaviour should be for ExtensionArray.argsort when there are missing values.

This is not a huge problem because Series.sort_values deals with the missing values itself (only argsorts the non-missing data), but still we should pin down and test the behaviour.

I suppose we should follow numpy's example here and put them last (which is also consistent with the default behaviour of sort_values).
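The example that accompanied this post is not preserved here. As a stand-in, the following small check shows the NaN-last default of `Series.sort_values` and its `na_position` option on a plain float Series; the values are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative values only; the NaN sits at index label 1.
s = pd.Series([1.0, np.nan, 0.0])

# Default is na_position="last": the NaN row ends up at the bottom.
default_order = s.sort_values().index.tolist()

# na_position="first" moves the NaN row to the top instead.
first_order = s.sort_values(na_position="first").index.tolist()

print(default_order)  # [2, 0, 1]
print(first_order)    # [1, 2, 0]
```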