Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge Index and Series binops #8166

Merged
merged 35 commits into from
May 6, 2021

Conversation

vyasr
Copy link
Contributor

@vyasr vyasr commented May 5, 2021

This PR builds on #8115, moving all binary operations from Index and Series into the SingleColumnFrame class. It really should be a negative LOC change, but it doesn't look like it for two reasons: 1) Index objects require some special handling due to the awkwardness of needing to return the right type of Index, which is frequently not the type that is being operated on (e.g. RangeIndex + RangeIndex results in an Int64Index), and that's something we'll want to refactor in a future PR, and 2) I've added a significant number of comments both in the form of docstrings and to give context for the issues arising from (1). This PR also significantly speeds up all binary operations for Index objects because it removes the round-tripping of data from Index->Series->Index that was previously being done to implement binary operations. The percent speedup depends on how expensive the operation itself is, but having tested for a number of data sizes it is >=15%, ranging up to 40% for simpler operations like __ne__. Benchmarks to follow.

vyasr added 30 commits May 3, 2021 16:02
…aFrame."

This reverts commit a49f3748c7d19fa39181df05549c7a7dcced4a9a.
@vyasr vyasr added Python Affects Python cuDF API. Performance Performance related issue tech debt improvement Improvement / enhancement to an existing function labels May 5, 2021
@vyasr vyasr self-assigned this May 5, 2021
@vyasr vyasr requested a review from a team as a code owner May 5, 2021 19:50
@vyasr vyasr requested review from galipremsagar and isVoid May 5, 2021 19:50
@vyasr vyasr added the non-breaking Non-breaking change label May 5, 2021
@vyasr
Copy link
Contributor Author

vyasr commented May 5, 2021

Here's a snapshot of the performance improvements for doing binary operations. Each line should be compared to the line directly below it so that the only difference is whether it's branch-0.20 or now (e.g. test_binops_operator[Index-__add__] (branch-0.20) vs test_binops_operator[Index-__add__] (now)). These benchmarks are for index objects with 1e5 elements, but performance improvements aren't too sensitive to the data size in my quick measurements.

---------------------------------------------------------------- benchmark: 20 tests --------------------------------------------------
Name (time in us)                                            Min                   Max                Mean              StdDev
---------------------------------------------------------------------------------------------------------------------------------------
test_binops_operator[Index-__add__] (branch-0.20)       572.6870 (3.75)     1,288.1430 (6.54)     585.8755 (3.72)      33.9046 (13.42)
test_binops_operator[Index-__add__] (now)               490.2440 (3.21)     4,904.6450 (24.92)    507.0175 (3.22)     114.8309 (45.46)
test_binops_operator[Index-__and__] (branch-0.20)       596.6690 (3.91)     2,373.7149 (12.06)    608.5050 (3.86)      58.7923 (23.27)
test_binops_operator[Index-__and__] (now)               506.3789 (3.32)     3,787.2849 (19.24)    519.4113 (3.30)      88.6736 (35.10)
test_binops_operator[Index-__eq__] (branch-0.20)        220.9859 (1.45)       868.1382 (4.41)     224.6789 (1.43)      11.8319 (4.68)
test_binops_operator[Index-__eq__] (now)                158.7330 (1.04)       196.8220 (1.0)      162.3566 (1.03)       2.5262 (1.0)
test_binops_operator[Index-__mul__] (branch-0.20)       571.9608 (3.75)     3,250.0541 (16.51)    582.9092 (3.70)      75.1271 (29.74)
test_binops_operator[Index-__mul__] (now)               490.6100 (3.22)     3,420.5390 (17.38)    503.0430 (3.19)      87.9270 (34.81)
---------------------------------------------------------------------------------------------------------------------------------------

python/cudf/cudf/core/frame.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/frame.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/frame.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/frame.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/frame.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/frame.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/index.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/index.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/index.py Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented May 5, 2021

Codecov Report

Merging #8166 (a930a11) into branch-0.20 (51336df) will decrease coverage by 0.02%.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.20    #8166      +/-   ##
===============================================
- Coverage        82.88%   82.86%   -0.03%     
===============================================
  Files              103      104       +1     
  Lines            17668    17884     +216     
===============================================
+ Hits             14645    14820     +175     
- Misses            3023     3064      +41     
Impacted Files Coverage Δ
python/cudf/cudf/core/abc.py 91.66% <ø> (+0.17%) ⬆️
python/cudf/cudf/core/algorithms.py 82.35% <ø> (ø)
python/cudf/cudf/core/column/__init__.py 100.00% <ø> (ø)
python/cudf/cudf/core/column/categorical.py 92.37% <ø> (+0.13%) ⬆️
python/cudf/cudf/core/column/column.py 88.20% <ø> (-0.44%) ⬇️
python/cudf/cudf/core/column/datetime.py 88.03% <ø> (-1.88%) ⬇️
python/cudf/cudf/core/column/decimal.py 91.04% <ø> (-1.89%) ⬇️
python/cudf/cudf/core/column/interval.py 91.66% <ø> (+0.55%) ⬆️
python/cudf/cudf/core/column/lists.py 86.98% <ø> (-0.43%) ⬇️
python/cudf/cudf/core/column/numerical.py 94.72% <ø> (+0.29%) ⬆️
... and 46 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5d50cde...a930a11. Read the comment docs.

@vyasr vyasr requested a review from kkraus14 May 5, 2021 23:58
@kkraus14 kkraus14 added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels May 6, 2021
@kkraus14
Copy link
Collaborator

kkraus14 commented May 6, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 202fff1 into rapidsai:branch-0.20 May 6, 2021
rapids-bot bot pushed a commit that referenced this pull request May 18, 2021
Continuation of #8115 and #8166. Moves more logic out of the Index/Series classes into the new common parent class to reduce code duplication and ensure feature parity.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - https://github.com/brandon-b-miller
  - Michael Wang (https://github.com/isVoid)

URL: #8253
rapids-bot bot pushed a commit that referenced this pull request Jul 15, 2021
…e support (#8598)

This PR builds on #8166 to introduce a common code path for binary operations between all Frame types, not just Index and Series. The central component is a new internal API for operations between a sequence of pairs of columns (or column-like objects) that subclasses of Frame can call after performing any preprocessing necessary on the operands. The result is (in addition to the benefits of reducing total code and addressing some tech debt) substantial performance improvements for DataFrame binops, new features in the form of supporting binary operations between a much wider class of data types, and some minor bug fixes that were previously quite challenging to address. 

Resolves #4536.

***Performance***
By avoiding calling through to Series binary operations to perform binops between DataFrames (which requires lots of ColumnAccessor operations, construction of Series objects, and lots of indexing), we gain substantial improvements in DataFrame binops. On average I see at minimum a 20% improvement in binops between DataFrames, with significantly larger improvements (up to 60%) depending on data size, nullability, and other related factors. The performance impact of the new code paths for Series and Index binops is negligible, since the changes effectively amount to a loop over a single element dictionary instead of just operating on that element directly.

***New features***
The centralized binop code made it easy to insert an `as_column` at an appropriate point for the right operand to a binop, which enables support for a wide range of other objects. For instance, it is now possible to add numpy arrays or pandas Series objects to a cudf Series, which was not previously possible. This code also appropriately handles the return of `NotImplemented` for all data types, enabling reflected operations for other libraries that build on `cudf`.

***Bug fixes***
This PR improves the handling of NULL and NaN columns in a number of scenarios, for instance where a column is missing from one input or the other. Additionally, this PR fixes the behavior of RangeIndexes when they are multiplied by integers (pandas returns RangeIndex objects in that case). This fix is largely independent of the new code since it essentially intercepts scalar ints up front and reconstructs a new object.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - https://github.com/brandon-b-miller

URL: #8598
@vyasr vyasr added this to the cuDF Python Refactoring milestone Jul 22, 2021
@vyasr vyasr deleted the refactor/frame_binops branch January 14, 2022 18:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Performance Performance related issue Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants