Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Dataframe.sort_index optimizations #9238

Conversation

galipremsagar
Copy link
Contributor

@galipremsagar galipremsagar commented Sep 16, 2021

Fixes: #9234

  • This PR introduces optimizations to sort_index when there is an already sorted Index object and avoids sorting them and performing a take operation on them. This alleviates a lot of memory pressure and has a 3x to 6x speed up.

On a T4 GPU:

This PR:

In [1]: import cudf

In [2]: df = cudf.DataFrame({'a':[1, 2, 3]*100000000, 'b':['a', 'b', 'c']*100000000, 'c':[0.0, 0.12, 10.12]*100000000})

In [3]: %timeit df.sort_index()
174 ms ± 368 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

branch-21.10:

Won't fit into memory and will error :( on T4 as it tries to perform argsort on an already sorted index.

THIS PR:

In [1]: import cudf

In [2]: df = cudf.DataFrame({'a':[1, 2, 3]*10000000, 'b':['a', 'b', 'c']*10000000, 'c':[0.0, 0.12, 10.12]*10000000})

In [3]: %timeit df.sort_index(ascending=False)
69.1 ms ± 221 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit df.sort_index()
15.2 ms ± 213 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: df_reversed = df[::-1]

In [6]: %timeit df_reversed.sort_index()
72.6 ms ± 433 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit df_reversed.sort_index(ascending=False)
24.1 ms ± 423 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

branch-21.10:

In [1]: import cudf

In [2]: df = cudf.DataFrame({'a':[1, 2, 3]*10000000, 'b':['a', 'b', 'c']*10000000, 'c':[0.0, 0.12, 10.12]*10000000})

In [3]: %timeit df.sort_index(ascending=False)
71.6 ms ± 141 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit df.sort_index()
71.7 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: df_reversed = df[::-1]

In [6]: %timeit df_reversed.sort_index()
69.1 ms ± 201 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit df_reversed.sort_index(ascending=False)
69 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
  • Also expands params to Series.sort_index and refactored the common implementation to Frame._sort_index.

@galipremsagar galipremsagar added bug Something isn't working 3 - Ready for Review Ready for review by team Python Affects Python cuDF API. 4 - Needs cuDF (Python) Reviewer non-breaking Non-breaking change labels Sep 16, 2021
@galipremsagar galipremsagar requested review from a team as code owners September 16, 2021 00:20
@galipremsagar galipremsagar self-assigned this Sep 16, 2021
python/cudf/cudf/core/frame.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/frame.py Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Sep 16, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.10@40a3b03). Click here to learn what that means.
The diff coverage is n/a.

❗ Current head 62ce194 differs from pull request most recent head 2c8110b. Consider uploading reports for the commit 2c8110b to get more accurate results
Impacted file tree graph

@@               Coverage Diff               @@
##             branch-21.10    #9238   +/-   ##
===============================================
  Coverage                ?   10.84%           
===============================================
  Files                   ?      115           
  Lines                   ?    18768           
  Branches                ?        0           
===============================================
  Hits                    ?     2035           
  Misses                  ?    16733           
  Partials                ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 40a3b03...2c8110b. Read the comment docs.

Copy link
Contributor

@isVoid isVoid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides below, consolidate df interface and series interface altogether as part of #9038 ?

python/cudf/cudf/core/frame.py Outdated Show resolved Hide resolved
Comment on lines +502 to +505
elif (ascending and self.index.is_monotonic_increasing) or (
not ascending and self.index.is_monotonic_decreasing
):
outdf = self.copy()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering, is_monotonic_* is available for both Index and MultiIndex. Maybe this optimization can be applied regardless of object type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would have to adhere to extracting level, which will be a DataFrame and again round-trip that back to MultiIndex object to do an is_monotonic_* check which seems to be inefficient and memory consuming.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also out of the context of this PR.. I can see the reason why we need to convert the index into a dataframe is because it's depending on argsort and take. Hopefully we can sink them into Frame so that there's no such need to convert to dataframes.

The difficulty of sinking argsort is that I believe Series depends on a single column sort while DataFrame depends on a multi column sort.

@isVoid
Copy link
Contributor

isVoid commented Sep 16, 2021

consolidate df interface and series interface altogether as part of #9038 ?

This is a special case because we might want to avoid Index.sort_index.

@galipremsagar galipremsagar added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team 4 - Needs Dask Reviewer labels Sep 16, 2021
@galipremsagar
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 5a82585 into rapidsai:branch-21.10 Sep 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge bug Something isn't working non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Performance bottleneck in DataFrame.sort_index when there is a RangeIndex
3 participants