Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: fix Series.argsort #42090

Closed

Conversation

attack68
Copy link
Contributor

@attack68 attack68 commented Jun 17, 2021

@mzeitlin11
Copy link
Member

Would this handle #12694?

@attack68
Copy link
Contributor Author

Would this handle #12694?

Yes this fixes that.

It solves it by:

  • keeping the '-1' for null values (so works for all dtypes), but doing the indexing correctly.
  • sorting the index for the return series so that there is meaning to its presence.

@attack68 attack68 marked this pull request as ready for review June 19, 2021 14:07
@attack68 attack68 changed the title [WIP] BUG: fix Series.argsort BUG: fix Series.argsort Jun 19, 2021
@attack68 attack68 added Bug Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Jun 29, 2021
@attack68
Copy link
Contributor Author

attack68 commented Jul 7, 2021

@jorisvandenbossche since you commented on the underlying issue perhaps you would also like to review?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

haven't really looked, just some quick comments

res_ser = Series(res, index=self.index[res], dtype="int64", name=self.name)
return res_ser.__finalize__(self, method="argsort")
else:
# GH 42090
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be very clear that we are ordering nulls at the na_position='last' (equiv of sort_values). i think its fine not to expose this as an option (as numpy doesn't).

pandas/core/series.py Show resolved Hide resolved
@jorisvandenbossche
Copy link
Member

@attack68 thanks for looking into this topic!

Took a quick look, and some high-level comments:

  • I agree that the current behaviour of the index doesn't seem very useful (and is confusing), but we shouldn't regard this as a "bug fix" if we completely change the output's index (that's a breaking change)
    • But further thinking about it, is the new behaviour actually ever useful? When would you want to use this index? If we change the behaviour of this method, we could maybe also consider returning an array instead of a Series (and then we don't have this question about what the index should be)
  • For the NaN issue, you are now keeping the -1, but another option to fix this part could be to actually give this the proper index value that would sort the series correctly? (as numpy does)

@attack68
Copy link
Contributor Author

attack68 commented Jul 8, 2021

  • I agree that the current behaviour of the index doesn't seem very useful (and is confusing), but we shouldn't regard this as a "bug fix" if we completely change the output's index (that's a breaking change)

I interpret a "bug-fix" as a breaking change that corrects erroneous behaviour, and as pointed out by others this function is currently unusable because of its errors:

  • one, the index values returned are wrong if interpreted as being the integer indexes over a Series, which contains Nans.
  • two, since values are re-ordered and the index is not re-ordered any index-value relationship is broken and made completely meaningless - worse that if anyone interprets it as having meaning will lead to bugs in their program.

But further thinking about it, is the new behaviour actually ever useful? When would you want to use this index? If we change the behaviour of this method, we could maybe also consider returning an array instead of a Series (and then we don't have this question about what the index should be)

My use of numpy argsort has has been to sort an array by the values of another similar shape array, which is just a kind of key-sorting: array2[array1.argsort()], which was useful in a nearest neighbour algorithm as I recall.

Since numpy arrays don't have index objects per se this acts as an index sort. Since pandas data structures have index objects you can perform the same kind of method on objects with shared indexes with: ser2[ser1.sort_values().index].
So argsort is not really necessary.

But, there are often different ways of doing things. And someone coming from numpy used to using argsort for this type of thing might instinctively look for same functionality. In this PR the following is now possible:

ser2.iloc[ser1.argsort()] or ser2[ser.argsort().index]

I have added "first" and "last" to na_position so this does all now.

attack68 added 5 commits July 8, 2021 10:46
…_index_sorted

# Conflicts:
#	doc/source/whatsnew/v1.4.0.rst
…_index_sorted

# Conflicts:
#	doc/source/whatsnew/v1.4.0.rst
…_index_sorted

# Conflicts:
#	doc/source/whatsnew/v1.4.0.rst
…_index_sorted

# Conflicts:
#	doc/source/whatsnew/v1.4.0.rst
@jreback
Copy link
Contributor

jreback commented Sep 18, 2021

@attack68 my point is that even though this is a bug, it is likely relied upon. So in an ideal situation we have np.argsort(..) and .argsort do the same thing. This change might be too big for a simple change and requires deprecation / hard break.

OK, I understand, but by the same token keeping the function as is for the sake of reliability is a valid argument with existing users who have managed to work around it, but invalid with respect to new users of the function for whom it will be unreliable, e.g. the original poster of the underlying issue (not me).

Can I convert the PR to indicate a breaking change?

can you add a whatsnew note that shows the previous and new behavior; then it will be easier to assess

@attack68
Copy link
Contributor Author

rendered whatsnew:
Screen Shot 2021-09-18 at 13 56 52

@attack68
Copy link
Contributor Author

@jorisvandenbossche perhaps you care to review the whatsnew above?

@attack68 attack68 requested a review from jreback September 24, 2021 16:41
@attack68
Copy link
Contributor Author

attack68 commented Oct 2, 2021

noting #43840

…_index_sorted

# Conflicts:
#	doc/source/whatsnew/v1.4.0.rst
…_index_sorted

# Conflicts:
#	doc/source/whatsnew/v1.4.0.rst
…_index_sorted

# Conflicts:
#	pandas/core/series.py
#	pandas/tests/extension/base/methods.py
…_index_sorted

# Conflicts:
#	doc/source/whatsnew/v1.4.0.rst
…_index_sorted

# Conflicts:
#	doc/source/whatsnew/v1.4.0.rst
…_index_sorted

# Conflicts:
#	doc/source/whatsnew/v1.4.0.rst
…_index_sorted

# Conflicts:
#	doc/source/whatsnew/v1.4.0.rst
@attack68
Copy link
Contributor Author

linking #45475 and #45434 for any relevance if anyone takes this forward

@mroeschke
Copy link
Member

Appears that this PR has sufficiently stalled, so closing for now. Looks almost there is anyone wants to pick it up.

@attack68
Copy link
Contributor Author

Is it worth reviving this for pandas 2.0?
As shown above the currenct argsort behiour is useless in some circumstances.

@attack68 attack68 reopened this Nov 22, 2022
@jreback
Copy link
Contributor

jreback commented Nov 22, 2022

i think would either do

  • remove .argsort entirely
  • have it return a numpy array of indexers

it does the wrong thing now - while your PR does seems to fix it up - its essentially duplicating the sorting methods intent

@simonjayhawkins
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: pd.Series.argsort misaligned index BUG: Series.argsort() incorrect handling of NaNs
6 participants