Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: ExtensionArray.argsort places the missing value at the end #27137

Merged
merged 12 commits into from
Jul 3, 2019

Conversation

makbigc
Copy link
Contributor

@makbigc makbigc commented Jun 30, 2019

continue #26354

  1. Categorical will be handled in a separate PR.
  2. benchmark result:
       before           after         ratio
     [989f912e]       [dc0ef701]
     <master>         <enh-21801-2>
+     11.4±0.03μs       36.8±0.1μs     3.23  algorithms.EASorting.time_argsort(1000)
+       785±0.5μs      1.26±0.03ms     1.60  algorithms.EASorting.time_argsort(100000)

xref #21801 (doesn't close; Need to update existing EAs)

@jorisvandenbossche jorisvandenbossche added this to the 0.25.0 milestone Jun 30, 2019
@jreback jreback removed this from the 0.25.0 milestone Jul 2, 2019
@jreback jreback added ExtensionArray Extending pandas with custom dtypes or arrays. Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Jul 2, 2019
Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@makbigc can you add a test in the base extension array tests?

diff --git a/pandas/tests/extension/base/methods.py b/pandas/tests/extension/base/methods.py
index d9e61e6a2..17315a52e 100644
--- a/pandas/tests/extension/base/methods.py
+++ b/pandas/tests/extension/base/methods.py
@@ -52,6 +52,11 @@ class BaseMethodsTests(BaseExtensionTests):
         expected = pd.Series(np.array([1, -1, 0], dtype=np.int64))
         self.assert_series_equal(result, expected)
 
+    def test_argsort_missing_array(self, data_missing_for_sorting):
+        result = np.argsort(data_missing_for_sorting)
+        expected = data_missing_for_sorting.take([2, 0, 1])
+        self.assert_extension_array_equal(result, expected)
+
     @pytest.mark.parametrize('na_position, expected', [
         ('last', np.array([2, 0, 1], dtype=np.dtype('intp'))),
         ('first', np.array([1, 2, 0], dtype=np.dtype('intp')))

If you don't have time today, LMK. I may push some changes to get this in.

asv_bench/benchmarks/algorithms.py Outdated Show resolved Hide resolved
doc/source/whatsnew/v0.25.0.rst Outdated Show resolved Hide resolved
pandas/tests/extension/test_categorical.py Outdated Show resolved Hide resolved
* rework test
* xfail categorical
* rename asv
* update whatsnew
@TomAugspurger
Copy link
Contributor

FYI @jorisvandenbossche just doing the breaking Categorical change here.

@TomAugspurger
Copy link
Contributor

As expected, a few failures in our test suite. Going through them next.

I think https://travis-ci.org/pandas-dev/pandas/jobs/553767675#L2498 is buggy / defined strangely. (DataFrame.sort_index with a CategoricalIndex and ascending=False)

# 0.24
In [2]: df = pd.DataFrame({"A": list(range(6))}, index=pd.Categorical(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b']))

In [3]: df.sort_index(ascending=False)
Out[3]:
   A
b  3
b  2
a  5
a  1
a  0
c  4

In [4]: df.set_index(df.index.codes).sort_index(ascending=False)
Out[4]:
   A
2  2
2  3
1  0
1  1
1  5
0  4
# this PR

In [3]: df.sort_index()
Out[3]:
   A
c  4
a  0
a  1
a  5
b  2
b  3

In [4]: df.set_index(df.index.codes).sort_index()
Out[4]:
   A
0  4
1  0
1  1
1  5
2  2
2  3

I think that sorting by a CategoricalIndex should be the same as sorting by its codes. 0.24 (and master, before the PR) doesn't have that property.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jul 3, 2019

It's probably more a comment on the changes in #26854, but it seems that nargsort is now always calling _values_for_argsort for EAs. But, in the comments of ExtensionArray.argsort, we say that there are two ways to override argsort: have _values_for_argsort, or implement argsort yourself for full control.

So the current indirection seems a bit wrong. At least depending on where nargsort is called internally, it should call the argsort method of the EA, to let the EA have full control over it.

(but that is maybe not blocking for the RC, most important is to get the actual behaviour change in)

@TomAugspurger
Copy link
Contributor

Hmm, so we would want to tell users that they should always override _values_for_argsort, since just doing .argsort may not be enough. OK, I'll try to write something about that.

@jorisvandenbossche
Copy link
Member

Hmm, so we would want to tell users that they should always override _values_for_argsort, since just doing .argsort may not be enough.

Not fully sure. We could also pass _values_for_argsort to nargsort from withing EA.argsort, and let nargsort call EA.argsort (I think, might be missing something in this indirection web).

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jul 3, 2019

I'm also having trouble with the indirection.

AFAICT, with this PR we end up in a situation where ExtensionArray.argsort could differ from other places in pandas that do nargsort(ExtensionArray). Since nargsort Does pandas' usual sorting algo on the ExtensionArray._values_for_argsort...

@TomAugspurger TomAugspurger added this to the 0.25.0 milestone Jul 3, 2019
@TomAugspurger
Copy link
Contributor

All green. I had to slightly adjust the test in 47c6b57

I don't have a 32-bit machine to test on, but it seems like we may be inconsistent with whether _values_for_argsort returns an int32 or int64 dtype array. It doesn't seem to matter in the algorithm, so I cast to int64 before making the assert.

@jreback
Copy link
Contributor

jreback commented Jul 3, 2019

All green. I had to slightly adjust the test in 47c6b57

I don't have a 32-bit machine to test on, but it seems like we may be inconsistent with whether _values_for_argsort returns an int32 or int64 dtype array. It doesn't seem to matter in the algorithm, so I cast to int64 before making the assert.

I believe this will break on the 32-bit builds; argsort is supposed to return platform ints.

So I guess our current interface is kind wrong. But let's merge this and consider that.

@@ -409,7 +410,8 @@ def argsort(self, ascending=True, kind='quicksort', *args, **kwargs):
Returns
-------
index_array : ndarray
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we aren't specific that these are actually intp (and NOT int64's)

@jreback jreback merged commit 47ffcd6 into pandas-dev:master Jul 3, 2019
@jorisvandenbossche
Copy link
Member

AFAICT, with this PR we end up in a situation where ExtensionArray.argsort could differ from other places in pandas that do nargsort(ExtensionArray). Since nargsort Does pandas' usual sorting on the ExtensionArray._values_for_argsort...

Let's open an issue about that to fix after the RC before final 0.25.0 ?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jul 3, 2019 via email

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jul 3, 2019 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants