Misc Python/Cython optimizations #7686

Merged
merged 36 commits into from
Mar 24, 2021

Contributor

@shwina shwina commented Mar 23, 2021

This PR introduces several small optimizations that reduce common Python overhead. See #7454 (comment) for the motivation behind these optimizations and some benchmarks.

Merge after: #7660

Summary:

  • Adds a way to initialize a ColumnAccessor (_init_unsafe) without validating its input. This is useful when converting a cudf::table to a Frame, where the columns are guaranteed to be well formed.
  • Makes is_numerical_dtype faster.
  • Prioritizes the check for numeric dtypes in astype() and build_column(). Numeric types are presumably more common, so checking for them first avoids the more expensive checks for other dtypes.
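
The first bullet describes a common pattern: an alternate constructor that skips validation when the caller already guarantees well-formed input. Below is a minimal sketch of that pattern. The class `ColumnAccessorSketch` and its validation rules are hypothetical stand-ins for illustration and are not cuDF's actual ColumnAccessor; only the name `_init_unsafe` comes from this PR.

```python
from collections.abc import Mapping


class ColumnAccessorSketch:
    """Hypothetical stand-in for a column container (not cuDF's class)."""

    def __init__(self, data):
        # Validating path: check the container type and column lengths.
        if not isinstance(data, Mapping):
            raise TypeError("data must be a mapping of name -> column")
        lengths = {len(col) for col in data.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self._data = dict(data)

    @classmethod
    def _init_unsafe(cls, data):
        # Fast path: bypass __init__ (and its checks) and trust the caller.
        obj = cls.__new__(cls)
        obj._data = data
        return obj


columns = {"a": [1, 2, 3], "b": [4, 5, 6]}
checked = ColumnAccessorSketch(columns)
unchecked = ColumnAccessorSketch._init_unsafe(columns)
```

The fast path is only appropriate for internal callers that construct the columns themselves; anything that accepts user input should still go through the validating constructor.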

vyasr and others added 27 commits March 19, 2021 13:48
@github-actions github-actions bot added the Python Affects Python cuDF API. label Mar 23, 2021
@@ -1017,7 +1017,9 @@ def distinct_count(
         return cpp_distinct_count(self, ignore_nulls=dropna)
 
     def astype(self, dtype: Dtype, **kwargs) -> ColumnBase:
-        if is_categorical_dtype(dtype):
+        if is_numerical_dtype(dtype):
Contributor Author

Since numerical types are the most common [[citation needed]] and is_numerical_dtype is now quite fast, it makes sense to do this check first.

Contributor

@vyasr vyasr Mar 23, 2021

Still definitely pro doing this, but I'm working on prototyping the (amortized) constant-time approach I suggested and I'll update you once that's done. Hopefully that will make ordering concerns here largely moot.
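
One way to read "amortized constant-time" is to cache the result of dtype classification, so the chain of checks runs once per distinct dtype and later lookups are a single hash-table hit. The sketch below is an assumption about what such an approach could look like and is not the prototype mentioned above. It uses plain strings as stand-in dtype keys, and `classify_dtype`, `is_numerical`, and `_NUMERIC_NAMES` are hypothetical names.

```python
from functools import lru_cache

# Hypothetical set of numeric dtype names, used only for this sketch.
_NUMERIC_NAMES = frozenset(
    {"int8", "int16", "int32", "int64",
     "uint8", "uint16", "uint32", "uint64",
     "float32", "float64"}
)


@lru_cache(maxsize=None)
def classify_dtype(name: str) -> str:
    """Return a coarse category; the checks below run once per distinct name."""
    if name in _NUMERIC_NAMES:
        return "numeric"
    if name == "category":
        return "categorical"
    if name.startswith("datetime64"):
        return "datetime"
    return "other"


def is_numerical(name: str) -> bool:
    # After the first call for a given name this is a cached lookup.
    return classify_dtype(name) == "numeric"
```

With a cache like this, the order of the individual checks matters only on the first lookup for each dtype, which is why ordering concerns would become largely moot.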

@shwina shwina added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Mar 23, 2021
    # TODO: we should handle objects with a `.dtype` attribute,
    # e.g., arrays, here.
    try:
        dtype = np.dtype(obj)
Collaborator

What if someone gives us a Pandas nullable integer type?

Contributor Author

We certainly aren't handling this currently. On branch-0.19:

>>> cudf.Series([1, 2, 3], dtype=pd.Int64Dtype())  # TypeError
>>> cudf.utils.dtypes.is_numerical_dtype(pd.Int64Dtype()) # TypeError

I agree we should support this, but how to do so efficiently is a difficult question. @vyasr and I were talking about this a couple of days ago, and he has some ideas for how to make dtype introspection faster/cheaper. We can perhaps take on this problem there?

codecov bot commented Mar 23, 2021

Codecov Report

Merging #7686 (ec1c5c4) into branch-0.19 (7871e7a) will increase coverage by 0.66%.
The diff coverage is n/a.

❗ Current head ec1c5c4 differs from pull request most recent head eadcc9c. Consider uploading reports for the commit eadcc9c to get more accurate results

@@               Coverage Diff               @@
##           branch-0.19    #7686      +/-   ##
===============================================
+ Coverage        81.86%   82.52%   +0.66%     
===============================================
  Files              101      101              
  Lines            16884    17444     +560     
===============================================
+ Hits             13822    14396     +574     
+ Misses            3062     3048      -14     
Impacted Files                               Coverage Δ
python/cudf/cudf/core/buffer.py              84.21% <ø> (+4.96%) ⬆️
python/cudf/cudf/core/column/categorical.py  91.97% <ø> (+0.58%) ⬆️
python/cudf/cudf/core/column/column.py       87.61% <ø> (-0.15%) ⬇️
python/cudf/cudf/core/column/datetime.py     89.63% <ø> (+0.54%) ⬆️
python/cudf/cudf/core/column/decimal.py      92.75% <ø> (-2.12%) ⬇️
python/cudf/cudf/core/column/lists.py        90.00% <ø> (-1.40%) ⬇️
python/cudf/cudf/core/column/numerical.py    94.83% <ø> (-0.20%) ⬇️
python/cudf/cudf/core/column/string.py       86.79% <ø> (+0.30%) ⬆️
python/cudf/cudf/core/column/timedelta.py    88.57% <ø> (+0.33%) ⬆️
python/cudf/cudf/core/column_accessor.py     96.13% <ø> (+0.82%) ⬆️
... and 54 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7f9f8f5...eadcc9c.

@shwina shwina marked this pull request as ready for review March 24, 2021 00:22
@shwina shwina requested a review from a team as a code owner March 24, 2021 00:22
@kkraus14 kkraus14 added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Mar 24, 2021
@kkraus14
Collaborator

@gpucibot merge

@rapids-bot rapids-bot bot merged commit e73fff0 into rapidsai:branch-0.19 Mar 24, 2021