Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: unified sorting #8239

Closed
patricktokeeffe opened this issue Sep 11, 2014 · 9 comments · Fixed by #10726
Closed

API: unified sorting #8239

patricktokeeffe opened this issue Sep 11, 2014 · 9 comments · Fixed by #10726
Labels
API Design Master Tracker High level tracker for similar issues Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@patricktokeeffe
Copy link
Contributor

originally #5190
xref #9816
xref #3942

This issue is for creating a unified API to Series & DataFrame sorting methods. Panels are not addressed (yet) but a unified API should be easy to extend to them. Related are #2094, #5190, #6847, #7121, #2615. As discussion proceeds, this post will be edited.

For reference, the 0.14.1 signatures are:

Series.sort(axis=0, ascending=True, kind='quicksort', na_position='last', inplace=True)
Series.sort_index(ascending=True)
Series.sortlevel(level=0, ascending=True, sort_remaining=True)

DataFrame.sort(columns=None, axis=0, ascending=True, inplace=False, kind='quicksort', 
               na_position='last')
DataFrame.sort_index(axis=0, by=None, ascending=True, inplace=False, kind='quicksort', 
                     na_position='last')
DataFrame.sortlevel(level=0, axis=0, ascending=True, inplace=False, sort_remaining=True)

Proposed unified signature for Series.sort and DataFrame.sort (except Series version retains current inplace=True):

def sort(self, by=None, axis=0, level=None, ascending=True, inplace=False, 
         kind='quicksort', na_position='last', sort_remaining=True):
         """Sort by labels (along either axis), by the values in column(s) or both.

         If both, labels take precedence over columns. If neither is specified,
         behavior is object-dependent: series = on values, dataframe = on index.

         Parameters
         ----------
         by : column name or list of column names
             if not None, sort on values in specified column name; perform nested
             sort if list of column names specified. this argument ignored by series
         axis : {0, 1}
             sort index/rows (0) or columns (1); for Series, only 0 can be specified
         level : int or level name or list of ints or list of column names
             if not None, sort on values in specified index level(s)
         ascending : bool or list of bool
             Sort ascending vs. descending. Specify list for multiple sort orders.
         inplace : bool
             if True, perform operation in-place (without creating new instance)
         kind : {‘quicksort’, ‘mergesort’, ‘heapsort’}
             Choice of sorting algorithm. See np.sort for more information. 
             ‘mergesort’ is the only stable algorithm. For data frames, this option is 
             only applied when sorting on a single column or label.
         na_position : {'first', 'last'}
             ‘first’ puts NaNs at the beginning, ‘last’ puts NaNs at the end
         sort_remaining : bool
             if true and sorting by level and index is multilevel, sort by other levels
             too (in order) after sorting by specified level
         """

The sort_index signatures change too and sort_columns is created:

Series.sort_index(level=0, ascending=True, inplace=False, kind='quicksort', 
                  na_position='last', sort_remaining=True)
DataFrame.sort_index(level=0, axis=0, by=None, ascending=True, inplace=False, 
                     kind='quicksort', na_position='last', sort_remaining=True) 
                     # by is DEPRECATED, see change 7

DataFrame.sort_columns(by=None, level=0, ascending=True, inplace=False, 
                       kind='quicksort', na_position='last', sort_remaining=True)
                       # or maybe level=None

Proposed changes:

  1. make inplace=False default (changes Series.sort) maybe, possibly in 1.0
  2. new by argument to accept column-name/list-of-column-names in first position
    • deprecate columns keyword of DataFrame.sort, replaced with by (df.sort signature would need to retain columns keyword until finally removed but it's not shown in proposal)
    • don't allow tuples to access levels of multi-index (columns arg of DataFrame.sort allows tuples); use new level argument instead
    • don't swap order of by/axis in DataFrame.sort_index (see change 7)
    • this argument is ignored by series but axis is too so for the sake of working with dataframes, it gets first position
  3. new level argument to accept integer/level-name/list-of-ints/list-of-level-names for sorting (multi)index by particular level(s)
    • replaces tuple behavior of columns arg of DataFrame.sort
    • add level argument to sort_index in first position so level(s) of multilevel index can be specified; this makes sort_index==sortlevel (see change 8)
    • also adds sort_remaining arg to handle multi-level indexes
  4. new method DataFrame.sort_columns==sort(axis=1) (see syntax below)
  5. deprecate Series.order since change 1 makes Series.sort equivalent (?)
  6. add inplace, kind, and na_position arguments to Series.sort_index (to match DataFrame.sort_index); by and axis args are not added since they don't make sense for series
  7. deprecate and eventually remove by argument from DataFrame.sort_index since it makes sort_index equivalent to sort
  8. deprecate sortlevel since change 3b makes sort_index equivalent

Notes:

  • default behavior of sort is still object-dependent: for series, sorts by values and for data frames, sorts by index
  • new level arg makes sort_index and sortlevel equivalent. if sortlevel is retained:
    • should rename sortlevel to sort_level for naming conventions
    • Series.sortlevel should have inplace argument added
    • maybe don't add level and sort_remaining args to sort_index so it's not equivalent to sort_level (intentionally limiting sort_index seems like a bad idea though)
  • it's unclear if default should be level=None for sort_columns. probably not since level=None falls back to level=0 anyway
  • both by and axis arguments should be ignored by Series.sort

Syntax:

  • dataframes
    • sort() == sort(level=0) == sort_index() == sortlevel()
      • without columns or level specified, defaults to current behavior of sort on index
    • sort(['A','B'])
      • since columns are specified, default index sort should not occur; sorting only happens using columns 'A' and 'B'
    • sort(level='spam') == sort_index('spam') == sortlevel('spam')
      • sort occurs on row index named 'spam' or level of multi-index named 'spam'
    • sort(['A','B'], level='spam')
      • level controls here even though columns are specified so sort happens along row index named 'spam' first, then nested sort occurs using columns 'A' and 'B'
    • sort(axis=1) == sort(axis=1, level=0) == sort_columns()
      • since data frames default to sort on index, leaving level=None is the same as level=0
    • sort(['A','B'], axis=1) == sort_columns(['A','B'])
      • as with preceding example, level=None becomes level=0 in sort_columns
    • sort(['A','B'], axis=1, level='spam') == sort_columns(['A','B'], level='spam')
      • axis controls level so sort will be on columns named 'A' and 'B' in column index named 'spam'
  • series:
    • sort() == order() -- sorts on values
    • with level specified, sorts on index/named index/level of multi-index:
      • sort(level=0) == sort_index() == sortlevel()
      • sort(level='spam') == sort_index('spam') == sortlevel('spam')

Comments welcome.

@jreback jreback added API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Sep 11, 2014
@jreback jreback added this to the 0.16 milestone Sep 11, 2014
@hayd
Copy link
Contributor

hayd commented Sep 12, 2014

make inplace=False default (changes Series.sort)

-1. Although I'm not fan of this behaviour, IMO we're stuck with it*, this would break lots of code and there's no way to make a clean depreciation/migration path.

+1 to cleaning up sort API, the other changes seem reasonable.

*Perhaps we should name everything order ?? :S

@patricktokeeffe
Copy link
Contributor Author

Regarding inplace=False as new default: could a scary warning be added to 0.15 without changing the current behavior and then in 0.16 or 0.17 make the switch?

I tried to come up with a way to detect when users might expect the original behavior... couldn't find anything clean. The best one was probably:

def sort( ..., inplace=None):
    ...
    inplace = True if inplace is None else False 
    ...

Unfortunately it's still a pretty bad idea. Backward-compatible, sure, but it's really just kicking the can down the road.

@hayd
Copy link
Contributor

hayd commented Sep 13, 2014

I just don't see the value proposition in this particular breakage, it will affect a lot of users, and you're not even "fixing" anything (i.e. fixing their buggy code) - you'd just be changing syntax. To quote @y-p:

if a 1000 people need to modify working code to accomodate this change, is it still worth doing?

I'd say I was on the liberal side of API breakages, but I don't see how this one can fly! :rage1:

@patricktokeeffe
Copy link
Contributor Author

The more unified the sort API becomes, the more glaring the inplace inconsistency will be. That said, I think the argument is stronger to have consistent behavior vs. consistent signatures. Such a change should wait for a major version bump. (wait, who said 1.0?)

So keeping inplace=false for Series.sort means:

  • DataFrame.sort and Series.sort have (slightly) different signatures -- no biggie: docstrings
  • Series.sort would not become equivalent to Series.order -- but we still deprecate order? (I lean yes but maybe it's quite popular)
  • should new inplace keyword to Series.sort_index be False (to match df sorts) or True (to match s.sort)? I think False is correct so s.sort is the isolated case
  • if it can be agreed to change s.sort inplace in 1.0, should a warning be given?

@hayd
Copy link
Contributor

hayd commented Sep 13, 2014

if it can be agreed to change s.sort inplace in 1.0

big if! Definitely such a change should be discussed in the ML, but I think it's a tough sell.

I agree the inconsistency sucks, but practicality beats purity.... and this will (fairly) annoy a lot of people. I think if you're changing the API there needs to be some carrot cake rather than just stick (with this change I just see stick).

I was being serious about using/preferring .order, that route has the benefit of not being an API change and not differing in behaviour from both numpy and python itself (lists). Better long-term?

Edit: To me "sort" sounds inplace, whereas "order" is temporary arrangement.

@patricktokeeffe
Copy link
Contributor Author

OK, that edit-note makes sense to me. I'll have a look at order and come back with some revisions.

@jorisvandenbossche
Copy link
Member

@patricktokeeffe see also #9816

@jorisvandenbossche
Copy link
Member

don't allow tuples to access levels of multi-index (columns arg of DataFrame.sort allows tuples); use new level argument instead

I don't think this is equivalent. columns with a tuple is to be able to specify to sort on a specific column when having mult-indexed columns, while level is to specify a certain level of the multi-index index to sort by (when axis=0).

@patricktokeeffe
Copy link
Contributor Author

There really isn't a good compromise here; at least, I didn't find one. But #9816 has a good solution, see #10726.

jreback added a commit to jreback/pandas that referenced this issue Aug 18, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Master Tracker High level tracker for similar issues Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants