Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactoring column logic Part 1 #8081

Merged
merged 12 commits into from
Apr 28, 2021

Conversation

vyasr
Copy link
Contributor

@vyasr vyasr commented Apr 27, 2021

This PR is a first pass at refactoring ColumnBase and its subclasses to reduce redundancy and improve performance by avoiding runtime type checking. Many functions are implemented in the top-level class but dispatch on dtype, which can instead be accomplished via ducktyping. Additionally, other parts of cudf require various methods to be implemented by a column, but ColumnBase does not currently clearly delineate an interface, making it difficult to know what to rely on in classes like Frame and pushing dynamic type dispatch upstream in the call stack where it is even less efficient and causes substantial code duplication. This PR moves specialized implementations of certain methods into the appropriate subclasses of ColumnBase and establishes a base API in the parent class where appropriate. Since this change will be large, I plan to split it into a few different PRs. This PR primarily modifies to_pandas, to_arrow, and __cuda_array_interface__, along with a few other minor improvements.

@vyasr vyasr added 3 - Ready for Review Ready for review by team code quality Python Affects Python cuDF API. Performance Performance related issue improvement Improvement / enhancement to an existing function labels Apr 27, 2021
@vyasr vyasr self-assigned this Apr 27, 2021
@vyasr vyasr requested a review from a team as a code owner April 27, 2021 16:35
@vyasr vyasr added the non-breaking Non-breaking change label Apr 27, 2021
Copy link
Contributor

@shwina shwina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

python/cudf/cudf/core/column/column.py Show resolved Hide resolved
python/cudf/cudf/core/column/column.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/column/datetime.py Show resolved Hide resolved
python/cudf/cudf/core/column/interval.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/column/string.py Show resolved Hide resolved
python/cudf/cudf/core/column/string.py Show resolved Hide resolved
python/cudf/cudf/core/column/numerical.py Show resolved Hide resolved
python/cudf/cudf/core/column/numerical.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/column/numerical.py Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Apr 27, 2021

Codecov Report

Merging #8081 (381d15f) into branch-0.20 (51336df) will decrease coverage by 0.01%.
The diff coverage is 87.66%.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.20    #8081      +/-   ##
===============================================
- Coverage        82.88%   82.87%   -0.02%     
===============================================
  Files              103      103              
  Lines            17668    17835     +167     
===============================================
+ Hits             14645    14781     +136     
- Misses            3023     3054      +31     
Impacted Files Coverage Δ
python/cudf/cudf/core/column/__init__.py 100.00% <ø> (ø)
python/cudf/cudf/io/orc.py 86.89% <ø> (ø)
python/cudf/cudf/utils/cudautils.py 57.75% <25.00%> (ø)
python/cudf/cudf/utils/dtypes.py 81.87% <41.66%> (-1.57%) ⬇️
python/cudf/cudf/core/tools/datetimes.py 80.42% <75.32%> (-4.11%) ⬇️
python/cudf/cudf/core/column/column.py 88.20% <76.19%> (-0.44%) ⬇️
python/cudf/cudf/core/groupby/groupby.py 91.55% <76.92%> (+0.11%) ⬆️
python/cudf/cudf/core/column/decimal.py 90.29% <77.27%> (-2.64%) ⬇️
python/dask_cudf/dask_cudf/backends.py 89.51% <80.00%> (-0.08%) ⬇️
python/cudf/cudf/core/column/numerical.py 94.72% <91.17%> (+0.29%) ⬆️
... and 34 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2bfd6fe...381d15f. Read the comment docs.

@vyasr vyasr requested a review from kkraus14 April 27, 2021 23:15
@vyasr vyasr requested a review from kkraus14 April 28, 2021 16:23
@vyasr
Copy link
Contributor Author

vyasr commented Apr 28, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 0ca0e69 into rapidsai:branch-0.20 Apr 28, 2021
@vyasr vyasr mentioned this pull request Apr 30, 2021
rapids-bot bot pushed a commit that referenced this pull request May 17, 2021
Follow up to #8081. This PR 1) inlines the binary op functions in different columns (which I presume were mainly created for profiling with nvtx), 2) moves all categorical column concatenation logic into the CategoricalColumn class, which allows the deletion of certain unimplemented functions in ColumnBase that only existed to satisfy mypy, and 3) performs some cleanup in the form of removing unused properties and using the `cached_property` decorator where appropriate.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Keith Kraus (https://github.com/kkraus14)
  - Ashwin Srinath (https://github.com/shwina)
  - Michael Wang (https://github.com/isVoid)

URL: #8130
@vyasr vyasr added this to the cuDF Python Refactoring milestone Jul 22, 2021
@vyasr vyasr deleted the refactor/reduce_column_dup branch January 14, 2022 18:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Performance Performance related issue Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants