API: unified sorting #8239

patricktokeeffe · 2014-09-11T00:49:44Z

originally #5190
xref #9816
xref #3942

This issue is for creating a unified API to Series & DataFrame sorting methods. Panels are not addressed (yet) but a unified API should be easy to extend to them. Related are #2094, #5190, #6847, #7121, #2615. As discussion proceeds, this post will be edited.

For reference, the 0.14.1 signatures are:

Series.sort(axis=0, ascending=True, kind='quicksort', na_position='last', inplace=True)
Series.sort_index(ascending=True)
Series.sortlevel(level=0, ascending=True, sort_remaining=True)

DataFrame.sort(columns=None, axis=0, ascending=True, inplace=False, kind='quicksort', 
               na_position='last')
DataFrame.sort_index(axis=0, by=None, ascending=True, inplace=False, kind='quicksort', 
                     na_position='last')
DataFrame.sortlevel(level=0, axis=0, ascending=True, inplace=False, sort_remaining=True)

Proposed unified signature for Series.sort and DataFrame.sort (except Series version retains current inplace=True):

def sort(self, by=None, axis=0, level=None, ascending=True, inplace=False, 
         kind='quicksort', na_position='last', sort_remaining=True):
         """Sort by labels (along either axis), by the values in column(s) or both.

         If both, labels take precedence over columns. If neither is specified,
         behavior is object-dependent: series = on values, dataframe = on index.

         Parameters
         ----------
         by : column name or list of column names
             if not None, sort on values in specified column name; perform nested
             sort if list of column names specified. this argument ignored by series
         axis : {0, 1}
             sort index/rows (0) or columns (1); for Series, only 0 can be specified
         level : int or level name or list of ints or list of column names
             if not None, sort on values in specified index level(s)
         ascending : bool or list of bool
             Sort ascending vs. descending. Specify list for multiple sort orders.
         inplace : bool
             if True, perform operation in-place (without creating new instance)
         kind : {‘quicksort’, ‘mergesort’, ‘heapsort’}
             Choice of sorting algorithm. See np.sort for more information. 
             ‘mergesort’ is the only stable algorithm. For data frames, this option is 
             only applied when sorting on a single column or label.
         na_position : {'first', 'last'}
             ‘first’ puts NaNs at the beginning, ‘last’ puts NaNs at the end
         sort_remaining : bool
             if true and sorting by level and index is multilevel, sort by other levels
             too (in order) after sorting by specified level
         """

The sort_index signatures change too and sort_columns is created:

Series.sort_index(level=0, ascending=True, inplace=False, kind='quicksort', 
                  na_position='last', sort_remaining=True)
DataFrame.sort_index(level=0, axis=0, by=None, ascending=True, inplace=False, 
                     kind='quicksort', na_position='last', sort_remaining=True) 
                     # by is DEPRECATED, see change 7

DataFrame.sort_columns(by=None, level=0, ascending=True, inplace=False, 
                       kind='quicksort', na_position='last', sort_remaining=True)
                       # or maybe level=None

Proposed changes:

~~make inplace=False default (changes Series.sort)~~ maybe, possibly in 1.0
new by argument to accept column-name/list-of-column-names in first position
- deprecate columns keyword of DataFrame.sort, replaced with by (df.sort signature would need to retain columns keyword until finally removed but it's not shown in proposal)
- don't allow tuples to access levels of multi-index (columns arg of DataFrame.sort allows tuples); use new level argument instead
- don't swap order of by/axis in DataFrame.sort_index (see change 7)
- this argument is ignored by series but axis is too so for the sake of working with dataframes, it gets first position
new level argument to accept integer/level-name/list-of-ints/list-of-level-names for sorting (multi)index by particular level(s)
- replaces tuple behavior of columns arg of DataFrame.sort
- add level argument to sort_index in first position so level(s) of multilevel index can be specified; this makes sort_index==sortlevel (see change 8)
- also adds sort_remaining arg to handle multi-level indexes
new method DataFrame.sort_columns==sort(axis=1) (see syntax below)
deprecate Series.order since change 1 makes Series.sort equivalent (?)
add inplace, kind, and na_position arguments to Series.sort_index (to match DataFrame.sort_index); by and axis args are not added since they don't make sense for series
deprecate and eventually remove by argument from DataFrame.sort_index since it makes sort_index equivalent to sort
deprecate sortlevel since change 3b makes sort_index equivalent

Notes:

default behavior of sort is still object-dependent: for series, sorts by values and for data frames, sorts by index
new level arg makes sort_index and sortlevel equivalent. if sortlevel is retained:
- should rename sortlevel to sort_level for naming conventions
- Series.sortlevel should have inplace argument added
- maybe don't add level and sort_remaining args to sort_index so it's not equivalent to sort_level (intentionally limiting sort_index seems like a bad idea though)
it's unclear if default should be level=None for sort_columns. probably not since level=None falls back to level=0 anyway
both by and axis arguments should be ignored by Series.sort

Syntax:

dataframes
- sort() == sort(level=0) == sort_index() == sortlevel()
  - without columns or level specified, defaults to current behavior of sort on index
- sort(['A','B'])
  - since columns are specified, default index sort should not occur; sorting only happens using columns 'A' and 'B'
- sort(level='spam') == sort_index('spam') == sortlevel('spam')
  - sort occurs on row index named 'spam' or level of multi-index named 'spam'
- sort(['A','B'], level='spam')
  - level controls here even though columns are specified so sort happens along row index named 'spam' first, then nested sort occurs using columns 'A' and 'B'
- sort(axis=1) == sort(axis=1, level=0) == sort_columns()
  - since data frames default to sort on index, leaving level=None is the same as level=0
- sort(['A','B'], axis=1) == sort_columns(['A','B'])
  - as with preceding example, level=None becomes level=0 in sort_columns
- sort(['A','B'], axis=1, level='spam') == sort_columns(['A','B'], level='spam')
  - axis controls level so sort will be on columns named 'A' and 'B' in column index named 'spam'
series:
- sort() == order() -- sorts on values
- with level specified, sorts on index/named index/level of multi-index:
  - sort(level=0) == sort_index() == sortlevel()
  - sort(level='spam') == sort_index('spam') == sortlevel('spam')

Comments welcome.

The text was updated successfully, but these errors were encountered:

hayd · 2014-09-12T23:43:16Z

make inplace=False default (changes Series.sort)

-1. Although I'm not fan of this behaviour, IMO we're stuck with it*, this would break lots of code and there's no way to make a clean depreciation/migration path.

+1 to cleaning up sort API, the other changes seem reasonable.

*Perhaps we should name everything order ?? :S

patricktokeeffe · 2014-09-13T03:22:05Z

Regarding inplace=False as new default: could a scary warning be added to 0.15 without changing the current behavior and then in 0.16 or 0.17 make the switch?

I tried to come up with a way to detect when users might expect the original behavior... couldn't find anything clean. The best one was probably:

def sort( ..., inplace=None):
    ...
    inplace = True if inplace is None else False 
    ...

Unfortunately it's still a pretty bad idea. Backward-compatible, sure, but it's really just kicking the can down the road.

hayd · 2014-09-13T04:49:26Z

I just don't see the value proposition in this particular breakage, it will affect a lot of users, and you're not even "fixing" anything (i.e. fixing their buggy code) - you'd just be changing syntax. To quote @y-p:

if a 1000 people need to modify working code to accomodate this change, is it still worth doing?

I'd say I was on the liberal side of API breakages, but I don't see how this one can fly!

patricktokeeffe · 2014-09-13T05:49:07Z

The more unified the sort API becomes, the more glaring the inplace inconsistency will be. That said, I think the argument is stronger to have consistent behavior vs. consistent signatures. Such a change should wait for a major version bump. (wait, who said 1.0?)

So keeping inplace=false for Series.sort means:

DataFrame.sort and Series.sort have (slightly) different signatures -- no biggie: docstrings
Series.sort would not become equivalent to Series.order -- but we still deprecate order? (I lean yes but maybe it's quite popular)
should new inplace keyword to Series.sort_index be False (to match df sorts) or True (to match s.sort)? I think False is correct so s.sort is the isolated case
if it can be agreed to change s.sort inplace in 1.0, should a warning be given?

hayd · 2014-09-13T06:02:04Z

if it can be agreed to change s.sort inplace in 1.0

big if! Definitely such a change should be discussed in the ML, but I think it's a tough sell.

I agree the inconsistency sucks, but practicality beats purity.... and this will (fairly) annoy a lot of people. I think if you're changing the API there needs to be some carrot cake rather than just stick (with this change I just see stick).

I was being serious about using/preferring .order, that route has the benefit of not being an API change and not differing in behaviour from both numpy and python itself (lists). Better long-term?

Edit: To me "sort" sounds inplace, whereas "order" is temporary arrangement.

patricktokeeffe · 2014-09-13T06:35:11Z

OK, that edit-note makes sense to me. I'll have a look at order and come back with some revisions.

jorisvandenbossche · 2015-04-06T15:13:08Z

@patricktokeeffe see also #9816

jorisvandenbossche · 2015-05-13T12:56:17Z

don't allow tuples to access levels of multi-index (columns arg of DataFrame.sort allows tuples); use new level argument instead

I don't think this is equivalent. columns with a tuple is to be able to specify to sort on a specific column when having mult-indexed columns, while level is to specify a certain level of the multi-index index to sort by (when axis=0).

patricktokeeffe · 2015-08-02T23:01:29Z

There really isn't a good compromise here; at least, I didn't find one. But #9816 has a good solution, see #10726.

…andas-dev#8239 DEPR: remove of na_last from Series.order/Series.sort, xref pandas-dev#5231

jreback added API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Sep 11, 2014

jreback added this to the 0.16 milestone Sep 11, 2014

jreback mentioned this issue Sep 16, 2014

CLN/TST: move consoliate sort_index to core/generic.py #8283

Closed

jreback modified the milestones: 0.16, 0.15.1 Oct 7, 2014

jreback mentioned this issue Jan 26, 2015

Enable referring to index level in functions like DataFrame.sort_index #2615

Closed

jreback added the Master Tracker High level tracker for similar issues label Mar 6, 2015

jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

jreback mentioned this issue Apr 5, 2015

Feature request: sorted() methods on everything #9816

Closed

SethMMorton mentioned this issue Apr 11, 2015

Add key option to sort methods #9855

Closed

jorisvandenbossche mentioned this issue Jul 7, 2015

sort seems to sort inplace by default unlike documentation #10522

Closed

jreback mentioned this issue Aug 2, 2015

API/WIP: .sorted #10726

Merged

patricktokeeffe closed this as completed Aug 2, 2015

jreback added a commit to jreback/pandas that referenced this issue Aug 18, 2015

ENH/DEPR: add .sorted() method for API consistency, pandas-dev#9816, p…

13d2d71

…andas-dev#8239 DEPR: remove of na_last from Series.order/Series.sort, xref pandas-dev#5231

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API: unified sorting #8239

API: unified sorting #8239

patricktokeeffe commented Sep 11, 2014

hayd commented Sep 12, 2014

patricktokeeffe commented Sep 13, 2014

hayd commented Sep 13, 2014

patricktokeeffe commented Sep 13, 2014

hayd commented Sep 13, 2014

patricktokeeffe commented Sep 13, 2014

jorisvandenbossche commented Apr 6, 2015

jorisvandenbossche commented May 13, 2015

patricktokeeffe commented Aug 2, 2015