[REVIEW] `Dataframe.sort_index` optimizations #9238

galipremsagar · 2021-09-16T00:20:39Z

This PR optimizes `sort_index` for the case where the Index object is already sorted: it skips both the sort and the subsequent take operation. This relieves a lot of memory pressure and yields a 3x to 6x speedup.
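The fast path described above can be sketched in plain Python. This is a minimal illustration only, not cuDF's implementation: `is_monotonic` and `sort_rows_by_index` are hypothetical names, and lists stand in for GPU columns.

```python
from typing import List, Sequence, Tuple


def is_monotonic(values: Sequence[int], ascending: bool) -> bool:
    """True if values are already in the requested (non-strict) order."""
    pairs = zip(values, values[1:])
    if ascending:
        return all(a <= b for a, b in pairs)
    return all(a >= b for a, b in pairs)


def sort_rows_by_index(
    index: List[int], rows: List[str], ascending: bool = True
) -> Tuple[List[int], List[str]]:
    """Toy stand-in for sort_index on a frame with a single column of rows."""
    if is_monotonic(index, ascending):
        # Fast path: already in the requested order, so copy and skip
        # both the argsort and the take (gather).
        return list(index), list(rows)
    # Slow path: argsort the index, then gather index and rows in that order.
    order = sorted(range(len(index)), key=index.__getitem__, reverse=not ascending)
    return [index[i] for i in order], [rows[i] for i in order]
```

Note that an index sorted in the opposite direction still takes the slow path in this sketch, which is consistent with the timings reported below for `ascending=False` on a sorted frame.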

On a T4 GPU:

This PR:

In [1]: import cudf

In [2]: df = cudf.DataFrame({'a':[1, 2, 3]*100000000, 'b':['a', 'b', 'c']*100000000, 'c':[0.0, 0.12, 10.12]*100000000})

In [3]: %timeit df.sort_index()
174 ms ± 368 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

branch-21.10:

Does not fit into memory and errors out on a T4 :( because it tries to perform an argsort on an already sorted index.

This PR:

In [1]: import cudf

In [2]: df = cudf.DataFrame({'a':[1, 2, 3]*10000000, 'b':['a', 'b', 'c']*10000000, 'c':[0.0, 0.12, 10.12]*10000000})

In [3]: %timeit df.sort_index(ascending=False)
69.1 ms ± 221 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit df.sort_index()
15.2 ms ± 213 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: df_reversed = df[::-1]

In [6]: %timeit df_reversed.sort_index()
72.6 ms ± 433 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit df_reversed.sort_index(ascending=False)
24.1 ms ± 423 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

branch-21.10:

In [1]: import cudf

In [2]: df = cudf.DataFrame({'a':[1, 2, 3]*10000000, 'b':['a', 'b', 'c']*10000000, 'c':[0.0, 0.12, 10.12]*10000000})

In [3]: %timeit df.sort_index(ascending=False)
71.6 ms ± 141 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit df.sort_index()
71.7 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: df_reversed = df[::-1]

In [6]: %timeit df_reversed.sort_index()
69.1 ms ± 201 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit df_reversed.sort_index(ascending=False)
69 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
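As a quick check on the numbers, the per-case speedups can be computed directly from the mean timings reported in the two sessions above (30 million rows). The dictionary keys are labels chosen here for illustration.

```python
# Mean timings in milliseconds, copied from the two IPython sessions:
# (branch-21.10, this PR)
timings_ms = {
    "sorted, ascending=True": (71.7, 15.2),
    "sorted, ascending=False": (71.6, 69.1),
    "reversed, ascending=True": (69.1, 72.6),
    "reversed, ascending=False": (69.0, 24.1),
}

# Speedup = baseline time divided by time with this PR.
speedups = {case: base / pr for case, (base, pr) in timings_ms.items()}

for case, ratio in speedups.items():
    print(f"{case}: {ratio:.2f}x")
```

The two cases where the index is already in the requested order come out at roughly 4.7x and 2.9x; the other two are unchanged to within about 5%.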

Also expands the parameters of `Series.sort_index` and refactors the common implementation into `Frame._sort_index`.
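The shape of that refactor, with one shared helper on a base class and thin public wrappers on the subclasses, might look roughly like the following. All class names here (`ToyFrame`, `ToyDataFrame`, `ToySeries`) are hypothetical and the bodies are plain-Python stand-ins, not the code in `python/cudf/cudf/core/frame.py`.

```python
class ToyFrame:
    """Hypothetical base class holding an index and named columns."""

    def __init__(self, index, columns):
        self.index = list(index)
        self.columns = {name: list(col) for name, col in columns.items()}

    def _sort_index(self, ascending=True):
        # Shared logic: compute the row order once, apply it to every column.
        order = sorted(
            range(len(self.index)),
            key=self.index.__getitem__,
            reverse=not ascending,
        )
        new_index = [self.index[i] for i in order]
        new_columns = {n: [c[i] for i in order] for n, c in self.columns.items()}
        # type(self) keeps the result a ToySeries or ToyDataFrame as appropriate.
        return type(self)(new_index, new_columns)


class ToyDataFrame(ToyFrame):
    def sort_index(self, ascending=True):
        return self._sort_index(ascending=ascending)


class ToySeries(ToyFrame):
    """A frame restricted to one column, exposed through .values."""

    @property
    def values(self):
        (only_column,) = self.columns.values()
        return only_column

    def sort_index(self, ascending=True):
        return self._sort_index(ascending=ascending)
```

With this layout both public methods accept the same parameters because they forward to the same helper.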

codecov · 2021-09-16T02:57:14Z

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.10@40a3b03).
The diff coverage is n/a.

❗ Current head 62ce194 differs from pull request most recent head 2c8110b. Consider uploading reports for the commit 2c8110b to get more accurate results

@@               Coverage Diff               @@
##             branch-21.10    #9238   +/-   ##
===============================================
  Coverage                ?   10.84%           
===============================================
  Files                   ?      115           
  Lines                   ?    18768           
  Branches                ?        0           
===============================================
  Hits                    ?     2035           
  Misses                  ?    16733           
  Partials                ?        0

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 40a3b03...2c8110b.

isVoid

Besides the comments below: should we consolidate the DataFrame and Series interfaces altogether as part of #9038?

isVoid · 2021-09-16T19:01:18Z

python/cudf/cudf/core/frame.py

+            elif (ascending and self.index.is_monotonic_increasing) or (
+                not ascending and self.index.is_monotonic_decreasing
+            ):
+                outdf = self.copy()


Just wondering, is_monotonic_* is available for both Index and MultiIndex. Maybe this optimization can be applied regardless of object type?

We would first have to extract the level, which yields a DataFrame, and then round-trip that back into a MultiIndex object to run the is_monotonic_* check, which seems inefficient and memory-consuming.

Also, outside the scope of this PR: I can see that the reason we need to convert the index into a DataFrame is that it depends on argsort and take. Hopefully we can sink those into Frame so that there is no need to convert to DataFrames.

The difficulty with sinking argsort, I believe, is that Series depends on a single-column sort while DataFrame depends on a multi-column sort.
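To make the single-column versus multi-column distinction concrete, here is a small plain-Python sketch. The helper names `argsort_single` and `argsort_multi` are hypothetical and say nothing about how cuDF or libcudf implement these sorts; they only show that the two operations take different inputs and can produce different row orders.

```python
def argsort_single(column):
    """Row order that sorts one column (stable)."""
    return sorted(range(len(column)), key=column.__getitem__)


def argsort_multi(columns):
    """Row order that sorts several columns lexicographically (stable)."""
    n = len(columns[0])
    return sorted(range(n), key=lambda i: tuple(col[i] for col in columns))


level0 = [1, 0, 1, 0]
level1 = ["b", "b", "a", "a"]

# Sorting on the first key alone leaves ties in their original order.
single = argsort_single(level0)

# Sorting on both keys breaks those ties using the second column.
multi = argsort_multi([level0, level1])
```

In this sketch the single-column sort is just the one-column case of the multi-column sort, so `argsort_multi([level0])` gives the same order as `argsort_single(level0)`.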

isVoid · 2021-09-16T20:05:17Z

> consolidate df interface and series interface altogether as part of #9038 ?

This is a special case because we might want to avoid Index.sort_index.

galipremsagar · 2021-09-16T23:13:04Z

@gpucibot merge

galipremsagar added 7 commits September 15, 2021 10:46

optimize DataFrame.sort_index

2a1b83b

fix sort_index

8e74cd0

doc

0eb3302

add tests

509dd6c

add Series.sort_index tests

36fe12e

Merge remote-tracking branch 'upstream/branch-21.10' into dataframe_sort_index_optimizations

9c5dc43

update api

dc85959

galipremsagar added the bug, 3 - Ready for Review, Python, 4 - Needs cuDF (Python) Reviewer, and non-breaking labels Sep 16, 2021

galipremsagar requested review from a team as code owners September 16, 2021 00:20

galipremsagar self-assigned this Sep 16, 2021

galipremsagar requested review from rgsl888prabhu, marlenezw and isVoid September 16, 2021 00:20

galipremsagar added this to the Pandas API Alignment and Coverage milestone Sep 16, 2021

galipremsagar added the 4 - Needs Dask Reviewer label Sep 16, 2021

Merge branch 'branch-21.10' into dataframe_sort_index_optimizations

2042aa9

galipremsagar added 2 commits September 15, 2021 20:00

Update python/cudf/cudf/core/frame.py

b878e6b

Update python/cudf/cudf/core/frame.py

ffa07ea

Merge remote-tracking branch 'upstream/branch-21.10' into dataframe_sort_index_optimizations

e0e375f

isVoid reviewed Sep 16, 2021

Update python/cudf/cudf/core/frame.py

68f1e97

Co-authored-by: Michael Wang <[email protected]>

galipremsagar added 2 commits September 16, 2021 12:46

Merge branch 'dataframe_sort_index_optimizations' of https://github.com/galipremsagar/cudf into dataframe_sort_index_optimizations

13e3e14

style

2c8110b

isVoid approved these changes Sep 16, 2021

galipremsagar removed the 4 - Needs cuDF (Python) Reviewer label Sep 16, 2021

galipremsagar requested review from rjzamora and jakirkham September 16, 2021 20:33

quasiben approved these changes Sep 16, 2021

galipremsagar added the 5 - Ready to Merge label and removed the 3 - Ready for Review and 4 - Needs Dask Reviewer labels Sep 16, 2021

rapids-bot bot merged commit 5a82585 into rapidsai:branch-21.10 Sep 16, 2021
