Refactor implementation of column setitem #9110

Merged (8 commits) on Aug 26, 2021

@vyasr vyasr commented Aug 24, 2021

This small PR reworks the behavior of ColumnBase.__setitem__ when it is provided something other than a slice as input, for instance an array. This code path requires scattering the new values into the column, which previously involved converting the Column to a Frame in order to call Frame._scatter. Since that method was only used for this one purpose, the underlying libcudf scatter implementation has been rewritten to accept and return Columns, allowing us to inline the call and also get rid of a round trip from Column to Frame and back.
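
To illustrate the scatter semantics described above, here is a minimal host-side sketch using NumPy. It is not cuDF's or libcudf's implementation; the function names `scatter` and `setitem_non_slice` are hypothetical and chosen only for this example, and it assumes a boolean mask key is first converted to integer positions.

```python
import numpy as np


def scatter(values, indices, target):
    """Return a copy of `target` with values[i] written at position indices[i]."""
    out = target.copy()
    out[indices] = values
    return out


def setitem_non_slice(column, key, value):
    """Hypothetical stand-in for the non-slice path of a column setitem."""
    key = np.asarray(key)
    if key.dtype == np.bool_:
        # A boolean mask selects the positions where it is True.
        key = np.flatnonzero(key)
    # Broadcast a scalar (or matching array) to one value per target position.
    values = np.broadcast_to(value, key.shape)
    return scatter(values, key, column)


col = np.arange(6)
result = setitem_non_slice(col, col > 3, 10)
print(result)  # [ 0  1  2  3 10 10]
```

The sketch operates on the array directly and returns an array, which mirrors the idea of scattering on a column without wrapping it in a larger container first.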

@vyasr vyasr added the 3 - Ready for Review, Python, tech debt, improvement, and non-breaking labels Aug 24, 2021
@vyasr vyasr added this to the CuDF Python Refactoring milestone Aug 24, 2021
@vyasr vyasr self-assigned this Aug 24, 2021
@vyasr vyasr requested a review from a team as a code owner August 24, 2021 23:40

vyasr commented Aug 24, 2021

Although performance is not the main purpose of this PR, it's worth noting that the change does improve it a bit, though nothing groundbreaking (a little under 10%).

Before:

(rapids) rapids@compose:~/cudf$ ipython -i -c "import cudf; import numpy as np; df = cudf.DataFrame({'a': np.arange(10000)});"
Python 3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 16:07:37)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.26.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: %timeit df[df > 5000] = 10
1.24 ms ± 5.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [2]: %timeit df[df > 5000] = 10
1.25 ms ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [3]: %timeit df[df > 5000] = 10
1.24 ms ± 7.92 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

After:

(rapids) rapids@compose:~/cudf$ ipython -i -c "import cudf; import numpy as np; df = cudf.DataFrame({'a': np.arange(10000)});"
Python 3.7.10 | packaged by conda-forge | (default, Feb 19 2021, 16:07:37)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.26.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: %timeit df[df > 5000] = 10
1.15 ms ± 15.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [2]: %timeit df[df > 5000] = 10
1.17 ms ± 7.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [3]: %timeit df[df > 5000] = 10
1.16 ms ± 3.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
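
As a rough check on the size of the improvement, the snippet below averages the three reported timings from each run and computes the relative reduction. The variable names are made up for this sketch, and the numbers are copied from the `%timeit` output above; with only three samples per side this is an estimate, not a benchmark.

```python
from statistics import mean

# Mean per-loop times in milliseconds, as reported by %timeit above.
before_ms = [1.24, 1.25, 1.24]
after_ms = [1.15, 1.17, 1.16]

# Fraction of runtime saved relative to the "before" timings.
reduction = 1 - mean(after_ms) / mean(before_ms)
print(f"{reduction:.1%}")  # 6.7%
```

This works out to a reduction of between 6 and 7 percent, consistent with the "a little under 10%" estimate.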

codecov bot commented Aug 25, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.10@a153493).
The diff coverage is n/a.


@@               Coverage Diff               @@
##             branch-21.10    #9110   +/-   ##
===============================================
  Coverage                ?   10.76%           
===============================================
  Files                   ?      114           
  Lines                   ?    19083           
  Branches                ?        0           
===============================================
  Hits                    ?     2054           
  Misses                  ?    17029           
  Partials                ?        0           

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a153493...4334941.

@marlenezw marlenezw left a comment


LGTM!!!

shwina commented Aug 26, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 40cad38 into rapidsai:branch-21.10 Aug 26, 2021
@vyasr vyasr deleted the refactor/scatter_impl branch January 23, 2024 21:24