Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Performance bottleneck in DataFrame.sort_index when there is a RangeIndex #9234

Closed
galipremsagar opened this issue Sep 15, 2021 · 0 comments · Fixed by #9238
Closed
Assignees
Labels
bug Something isn't working Python Affects Python cuDF API.

Comments

@galipremsagar
Copy link
Contributor

Describe the bug
When a DataFrame is having it's index as RangeIndex and it's already sorted, there is no need to materialize and sort RangeIndex again. This also deviates from pandas behavior:

Steps/Code to reproduce bug

>>> import cudf
>>> df = cudf.DataFrame({'a':[1, 2, 3]})
>>> df
   a
0  1
1  2
2  3
>>> df.index
RangeIndex(start=0, stop=3, step=1)
>>> df.sort_index().index
Int64Index([0, 1, 2], dtype='int64')
>>> df.to_pandas().sort_index().index
RangeIndex(start=0, stop=3, step=1)

Expected behavior

>>> df.sort_index().index
RangeIndex(start=0, stop=3, step=1)
>>> df.to_pandas().sort_index().index
RangeIndex(start=0, stop=3, step=1)
@galipremsagar galipremsagar added bug Something isn't working Python Affects Python cuDF API. labels Sep 15, 2021
@galipremsagar galipremsagar self-assigned this Sep 15, 2021
rapids-bot bot pushed a commit that referenced this issue Sep 16, 2021
Fixes: #9234

- [x] This PR introduces optimizations to `sort_index` when there is an already sorted `Index` object and avoids sorting them and performing a `take` operation on them. This **alleviates** a lot of **memory pressure** and has **a 3x to 6x speed up.**

On a T4 GPU:

`This PR`:
```python
In [1]: import cudf

In [2]: df = cudf.DataFrame({'a':[1, 2, 3]*100000000, 'b':['a', 'b', 'c']*100000000, 'c':[0.0, 0.12, 10.12]*100000000})

In [3]: %timeit df.sort_index()
174 ms ± 368 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

`branch-21.10`:

Won't fit into memory and will error :( on T4 as it tries to perform argsort on an already sorted index.


`THIS PR`:

```python
In [1]: import cudf

In [2]: df = cudf.DataFrame({'a':[1, 2, 3]*10000000, 'b':['a', 'b', 'c']*10000000, 'c':[0.0, 0.12, 10.12]*10000000})

In [3]: %timeit df.sort_index(ascending=False)
69.1 ms ± 221 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit df.sort_index()
15.2 ms ± 213 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: df_reversed = df[::-1]

In [6]: %timeit df_reversed.sort_index()
72.6 ms ± 433 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit df_reversed.sort_index(ascending=False)
24.1 ms ± 423 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```


`branch-21.10`:

```python
In [1]: import cudf

In [2]: df = cudf.DataFrame({'a':[1, 2, 3]*10000000, 'b':['a', 'b', 'c']*10000000, 'c':[0.0, 0.12, 10.12]*10000000})

In [3]: %timeit df.sort_index(ascending=False)
71.6 ms ± 141 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit df.sort_index()
71.7 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: df_reversed = df[::-1]

In [6]: %timeit df_reversed.sort_index()
69.1 ms ± 201 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit df_reversed.sort_index(ascending=False)
69 ms ± 127 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

- [x] Also expands params to `Series.sort_index` and refactored the common implementation to `Frame._sort_index`.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Michael Wang (https://github.com/isVoid)
  - Benjamin Zaitlen (https://github.com/quasiben)

URL: #9238
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant