ENH: Sorting of ExtensionArrays #19957
This enables {Series,DataFrame}.sort_values and {Series,DataFrame}.argsort
pandas/core/arrays/base.py
Outdated
kind : {'quicksort', 'mergesort', 'heapsort'}, optional
    Sorting algorithm.
order : str or list of str, optional
    Included for NumPy compatibility.
Is this compatibility needed because in the code we use np.argsort(values), which passes those keywords to the method?
(It is a bit unfortunate...)
It's not necessary, and I can remove it.
I see now that Categorical.argsort has a different signature. I suppose we should match that.
Codecov Report
@@            Coverage Diff             @@
##           master   #19957      +/-   ##
==========================================
+ Coverage   91.77%   91.78%   +<.01%
==========================================
  Files         152      152
  Lines       49205    49223      +18
==========================================
+ Hits        45159    45177      +18
  Misses       4046     4046
Changed how this is organized a bit, to reflect a pattern I noticed here and elsewhere. In several places (here, factorize, unique), a method like 1.) Data coercion / prep In the case of argsort, it's 1.) Just the array for most types, the codes for So I split the method in two I don't know how useful this will prove to be, but wanted to hear others' thoughts.
Do you foresee similar patterns for other algos? Like |
Yes factorize would be another. Though it would be a bit more complicated.
I'll probably remove it for now. That means a bit more duplication, but
fewer levels of indirection.
…On Fri, Mar 2, 2018 at 10:13 AM, Joris Van den Bossche < ***@***.***> wrote:
Do you foresee similar patterns for other algos? Like
_values_for_factorize (not sure if that makes sense). Just to think about
if we would get a proliferation of such methods
|
What is the use-case for writing your own sorting algorithm? Maybe radix sort when your data falls into pre-known categories? My inclination would be to only include
This reverts commit 44b6d72.
Do you mean not having an |
        based on matching category values. Thus, this function can be
        called on an unordered Categorical instance unlike the functions
        'Categorical.min' and 'Categorical.max'.
    def argsort(self, *args, **kwargs):
Not sure if this changes our opinion on _values_for_argsort, but apparently Python 2 has issues passing the arguments through correctly to the super() call.
____________________ TestCategoricalSort.test_numpy_argsort ____________________

self = <pandas.tests.categorical.test_sorting.TestCategoricalSort object at 0x7efcb391f950>

    def test_numpy_argsort(self):
        c = Categorical([5, 3, 1, 4, 2], ordered=True)

        expected = np.array([2, 4, 1, 3, 0])
>       tm.assert_numpy_array_equal(np.argsort(c), expected,
                                    check_dtype=False)

pandas/tests/categorical/test_sorting.py:26:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../miniconda3/envs/pandas/lib/python2.7/site-packages/numpy/core/fromnumeric.py:886: in argsort
    return argsort(axis, kind, order)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = [5, 3, 1, 4, 2]
Categories (5, int64): [1 < 2 < 3 < 4 < 5]
ascending = -1, kind = 'quicksort', args = (None,), kwargs = {}

    def argsort(self, ascending=True, kind='quicksort', *args, **kwargs):
        """
        Returns the indices that would sort the Categorical instance if
        'sort_values' was called. This function is implemented to provide
        compatibility with numpy ndarray objects.

        While an ordering is applied to the category values, arg-sorting
        in this context refers more to organizing and grouping together
        based on matching category values. Thus, this function can be
        called on an unordered Categorical instance unlike the functions
        'Categorical.min' and 'Categorical.max'.

        Returns
        -------
        argsorted : numpy array

        See also
        --------
        numpy.ndarray.argsort
        """
        # Keep the implementation here just for the docstring.
        return super(Categorical, self).argsort(ascending=ascending, kind=kind,
>                                               *args, **kwargs)
E       TypeError: argsort() got multiple values for keyword argument 'ascending'
Changing Categorical.argsort to accept just *args, **kwargs fixes things, since ExtensionArray does the argument validation, but it's a bit unfortunate.
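The argument collision in the traceback can be reproduced without pandas or NumPy. The following is only a minimal sketch: `Base`, `Broken`, and `Fixed` are hypothetical stand-ins (not the real ExtensionArray or Categorical classes), and the three positional arguments imitate the `argsort(axis, kind, order)` call shown in the traceback. It illustrates the mechanism only and does not explain why the CI failure showed up on the Python 2 build specifically.

```python
class Base:
    """Hypothetical stand-in for the base class that validates arguments."""

    def argsort(self, ascending=True, kind="quicksort", *args, **kwargs):
        return ascending, kind, args, kwargs


class Broken(Base):
    # Re-declares the named parameters, then forwards them as keywords.
    # Any extra positional argument is re-sent first and lands on `ascending`.
    def argsort(self, ascending=True, kind="quicksort", *args, **kwargs):
        return super().argsort(ascending=ascending, kind=kind, *args, **kwargs)


class Fixed(Base):
    # Forwards everything untouched and leaves validation to the base class.
    def argsort(self, *args, **kwargs):
        return super().argsort(*args, **kwargs)


# Imitate a caller that passes three positional arguments (axis, kind, order).
try:
    Broken().argsort(-1, "quicksort", None)
    broken_error = None
except TypeError as exc:
    broken_error = exc  # "got multiple values for ... 'ascending'"

fixed_result = Fixed().argsort(-1, "quicksort", None)
```

With the pass-through signature the base class receives the arguments exactly as the caller sent them, which is why accepting only `*args, **kwargs` avoids the error at the cost of a less descriptive signature.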
pandas/core/arrays/base.py
Outdated
@@ -236,6 +237,52 @@ def isna(self):
        """
        raise AbstractMethodError(self)

    def _values_for_argsort(self):
        # type: () -> ndarray
        """Get the ndarray to be passed to np.argsort.
Why is this needed? Shouldn't this be one of our myriad _values methods/properties here?
_ndarray_values seems like more of an internal thing, no? I don't know enough to say whether the _ndarray_values appropriate for our current uses (mostly indexing, IIRC) are also appropriate for argsort.
My point is: what is the point of overriding this specific one? Why is a general-purpose EA method/property not used here? The proliferation of methods/properties is really troublesome.
If someone wants to override argsort, great. But providing an indirect mechanism is really kludgey.
> My point is: what is the point of overriding this specific one?
This PR is implementing argsort. I could see a similar pattern for factorize.
> The proliferation of methods/properties is really troublesome.
How so?
> But providing an indirect mechanism is really kludgey.
What's kludgey about it? I think the common case will be overriding the values provided to np.argsort, not overriding the sorting algorithm itself. This is true for Categorical and IPArray, and will be true for Period and probably others.
Without _values_for_argsort, we and third-party libraries will have duplicate code for validating keyword arguments and handling the ascending kwarg.
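The split being argued for here can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `SketchArray` and `SketchCategorical` are hypothetical classes, the keyword validation is omitted, and the standard library's `sorted` stands in for np.argsort so the example has no dependencies.

```python
class SketchArray:
    """Hypothetical base class: owns the sorting logic and `ascending`."""

    def __init__(self, data):
        self._data = list(data)

    def _values_for_argsort(self):
        # Default: sort on the stored values themselves.
        return self._data

    def argsort(self, ascending=True):
        values = self._values_for_argsort()
        # Stable argsort via the standard library (stand-in for np.argsort).
        order = sorted(range(len(values)), key=values.__getitem__)
        if not ascending:
            order = order[::-1]
        return order


class SketchCategorical(SketchArray):
    """Hypothetical subclass: overrides only the values handed to the sort."""

    def __init__(self, data, categories):
        super().__init__(data)
        self._categories = list(categories)

    def _values_for_argsort(self):
        # Sort by each value's position in the category list (its "code").
        return [self._categories.index(v) for v in self._data]


cat = SketchCategorical(["b", "c", "a"], categories=["c", "b", "a"])
ascending_order = cat.argsort()
descending_order = cat.argsort(ascending=False)
```

The subclass never touches the `ascending` handling, which is the duplication the hook is meant to avoid.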
We don't necessarily want to convert extension arrays into the same NumPy array used for np.asarray().
For example, the IP Address extension array probably wants to convert into a NumPy object array of IPAddress objects. But for sorting, it could just return a NumPy structured array with a few integer fields, which will be much faster for comparisons than Python IPAddress objects.
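A rough illustration of sorting on integer fields rather than on address objects, using only the standard library. The helper `values_for_argsort` is hypothetical, and plain `(high, low)` integer tuples stand in for the structured NumPy array mentioned above; no performance claim is made for this sketch.

```python
import ipaddress

addresses = [
    ipaddress.ip_address(text)
    for text in ["192.168.0.10", "10.0.0.1", "172.16.5.4"]
]


def values_for_argsort(addrs):
    """Hypothetical helper: split each address into (high, low) 64-bit ints."""
    mask = 2**64 - 1
    keys = []
    for addr in addrs:
        number = int(addr)  # works for both IPv4 and IPv6 address objects
        keys.append((number >> 64, number & mask))
    return keys


keys = values_for_argsort(addresses)
# Argsort on the integer keys instead of on the address objects.
order = sorted(range(len(keys)), key=keys.__getitem__)
```

Here the object representation (what np.asarray would expose) and the sort keys are deliberately different arrays.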
Indeed. In the IPArray case, there's a numpy array, so _values_for_argsort is just that array.
> OR simply call np.asarray() if you actually need an ndarray.
Then we're back in the duplicate code situation. That's OK for the base class, but Categorical, Period, and Interval will end up re-implementing argsort from scratch. With _values_for_argsort, it's just a matter of constructing that array (codes, ordinals, concat left & right).
On the name _values_for_argsort, it's possible that we'll find other uses for it, in which case we just change the name.
Do you know if a simple array appropriate for arg-sorting is also appropriate for factorize, joins, indexing, etc? I'm not sure ahead of time.
> Do you know if a simple array appropriate for arg-sorting is also appropriate for factorize, joins, indexing, etc?
Well, we can scratch factorize / groupby off. Categorical defines _codes_for_groupby separately from its ndarray_values (codes):
pandas/pandas/core/arrays/categorical.py
Line 638 in 3783ccc
    def _codes_for_groupby(self, sort):
So we can't just use one array for everything.
OK, so then let's pick a name and change _codes_for_groupby. In this refactor we want to find other use cases and fix our code now rather than later.
Something like: _int_mapping_for_values
pandas/core/arrays/base.py
Outdated
@@ -236,6 +237,52 @@ def isna(self):
        """
        raise AbstractMethodError(self)

    def _values_for_argsort(self):
        # type: () -> ndarray
        """Get the ndarray to be passed to np.argsort.
Why is this different from _ndarray_values, which is implemented on EA arrays? Having two is just plain confusing, not to mention you also have _formatting_values.
The linting failure is fixed in #20330 |
Many of the things we are discussing here, we will only encounter in practice (like the question if we will get many of such I am fine with the current PR (both Although having both
👍 |
I think (hope?) that we're OK proceeding with this as is, then. c776133 unskips the sorting tests for JSONArray. The strategy is to sort an array of the dictionary items converted to tuples, which is maybe the most sensible way to sort a dictionary. Anyway, we have an example of sorting something that isn't regularly sortable. The remaining skip is for Then we'll have groupby, which is shaping up to be a surprisingly small change. Update: #20361 won't quite enable
Require stable dictionary order
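The tuple-based strategy for sorting dictionaries can be sketched as below. This is an assumption-laden illustration rather than the JSONArray test code: the sample data is made up, the standard library's `sorted` stands in for np.argsort, and the result depends on each dictionary preserving insertion order, which is presumably why a stable dictionary order is required.

```python
# Made-up sample data: dictionaries are not orderable with "<" directly.
data = [{"b": 1}, {"a": 2}, {"a": 1, "c": 3}]

# Convert each dictionary to a tuple of its (key, value) items.
# Tuples compare element by element, so they can be sorted.
keys = [tuple(d.items()) for d in data]

# Indices that would sort the original list by those tuples.
order = sorted(range(len(keys)), key=keys.__getitem__)
sorted_data = [data[i] for i in order]
```

Two dictionaries with the same items inserted in a different order would produce different tuples, so the comparison is only meaningful when item order is deterministic.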
CI all passed. |
@jreback are you +1 here, given the discussion above? Since the last time around we discovered that |
@jreback thoughts? |
will look a bit later |
thanks @TomAugspurger I am still greatly concerned about the expansion of the private API. Since it's completely private I guess that's ok, but we should definitely try to be as minimal as possible and remove things that are duplicative / unnecessary. More to the point, there is a fair amount of divergence between naming inside pandas internals and EA (and numpy for that matter). This needs to be addressed in the short term. Otherwise we end up with a complex & confusing API that no-one will be able to contribute to in the future. Let's make a pro-active effort to minimize these frictions. (IOW create a new issue).
Note that those APIs are not private. They are part of the extension array interface, so 'public' for extension authors.
First, I am not sure this would be a bad thing. EAs are new. If we go the route of Series and Index being composed of an ExtensionArray (and not subclassing them), then they are something different, and I think it is good to use different names for things that do something differently. But, I currently also don't see any divergence.
It is of course true that there is divergence with numpy arrays, but if that was not needed, we wouldn't have needed to make such an ExtensionArray in the first place. |
https://circleci.com/gh/pandas-dev/pandas/12793#tests/containers/3 is breaking. I think some json EA tests are depending on a certain ordering. |
My point is there are now 2 ways / names of doing things (in some cases). This is not great and should be addressed by re-aligning naming in the Series. A new reader to the codebase will be very confused on what to use when. Sure EA's are not exactly like Series/Index. But over the years we have gone to great lengths to make these look and feel and be implemented in a very similar way. So the divergence between Index / EA's is disturbing. My point is that before we move more on EA, Index's need to be subclassed and become real EA's. |
Which cases? Please be specific. I gave a list above, and I don't see any with 2 ways of naming. |
Index is not an EA subclass. |
Sorry, what is the relation with my previous comment? |
my point is that when Index becomes a subclass that all of the implementation details of it should follow a single convention (for construction / indexing) etc. These would naturally use the EA conventions. The problem now is they are different. This is just technical debt. |
@jreback I am not sure that Index will become a subclass of ExtensionArray, but let's not discuss that here, but rather in #19696 (comment) |
* Fixed factorize for MACArray Relies on pandas-dev/pandas#19957 * Build on na_value * Include groupby patch
This enables {Series,DataFrame}.sort_values and {Series,DataFrame}.argsort. These are required for factorize, which is required for groupby (my end goal).
I haven't implemented ExtensionArray.sort_values yet because it hasn't become necessary. But I can if needed / desired.