Misc Python/Cython optimizations #7686

shwina · 2021-03-23T16:10:46Z

This PR introduces several small optimizations that should reduce common Python overhead. See #7454 (comment) for the motivation behind these optimizations and some benchmarks.

Merge after: #7660

Summary:

Adds a way to initialize a ColumnAccessor (_init_unsafe) without validating its input. This is useful when converting a cudf::table to a Frame, where we're guaranteed the columns are well formed.
Makes is_numerical_dtype faster.
Prioritizes the check for numeric dtypes in astype() and build_column(). Numeric types are presumably more common, so checking for them first avoids expensive checks for other dtypes.
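
As a rough illustration of the first item, here is a minimal sketch of a container with a validating constructor and an alternate constructor that skips validation. `TinyAccessor` and `_create_unsafe` are hypothetical names used only for this example; this is not cuDF's ColumnAccessor and does not reflect its actual code.

```python
from typing import Any, Dict, Mapping


class TinyAccessor:
    """Hypothetical stand-in for a column container (not cuDF's ColumnAccessor)."""

    def __init__(self, data: Mapping[str, Any]):
        # Safe path: copy the input and validate every column before storing it.
        columns = dict(data)
        self._validate(columns)
        self._data = columns

    @staticmethod
    def _validate(columns: Dict[str, Any]) -> None:
        # Assumed invariant for this sketch: all columns have the same length.
        lengths = {len(col) for col in columns.values()}
        if len(lengths) > 1:
            raise ValueError("All columns must be of equal length")

    @classmethod
    def _create_unsafe(cls, data: Dict[str, Any]) -> "TinyAccessor":
        # Fast path: bypass __init__ (and therefore validation) entirely.
        # Only appropriate when the caller already guarantees well-formed input.
        obj = cls.__new__(cls)
        obj._data = data
        return obj
```

The trade-off is that the unsafe path trusts its caller completely, so it only makes sense at internal call sites where the input is known to be well formed.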

shwina · 2021-03-23T16:14:55Z

python/cudf/cudf/core/column/column.py

@@ -1017,7 +1017,9 @@ def distinct_count(
        return cpp_distinct_count(self, ignore_nulls=dropna)

    def astype(self, dtype: Dtype, **kwargs) -> ColumnBase:
-        if is_categorical_dtype(dtype):
+        if is_numerical_dtype(dtype):


Numerical types being the most common [[citation needed]], and is_numerical_dtype now being quite fast, it makes sense to do this check first.

Still definitely pro doing this, but I'm working on prototyping the (amortized) constant-time approach I suggested and I'll update you once that's done. Hopefully that will make ordering concerns here largely moot.
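
To make the ordering argument concrete, here is a small sketch in which the cheapest and presumably most common check runs first, so the common case never reaches the later predicates. The predicates operate on plain dtype names and are hypothetical placeholders, not cuDF's real dtype checks.

```python
from typing import Callable, List, Tuple

calls: List[str] = []  # records which predicates ran, for illustration only


def is_numeric(name: str) -> bool:
    calls.append("numeric")
    return name in {"bool", "int64", "float64"}


def is_categorical(name: str) -> bool:
    calls.append("categorical")
    return name == "category"


def is_datetime(name: str) -> bool:
    calls.append("datetime")
    return name.startswith("datetime64")


# Checks are tried in order; the first match wins.
CHECKS: List[Tuple[Callable[[str], bool], str]] = [
    (is_numeric, "numeric"),
    (is_categorical, "categorical"),
    (is_datetime, "datetime"),
]


def classify(name: str) -> str:
    for predicate, label in CHECKS:
        if predicate(name):
            return label
    raise TypeError(f"unsupported dtype: {name!r}")
```

With this ordering, classifying a numeric name runs exactly one predicate, while a categorical name pays for the numeric check plus its own.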

kkraus14 · 2021-03-23T22:10:52Z

python/cudf/cudf/utils/dtypes.py

+    # TODO: we should handle objects with a `.dtype` attribute,
+    # e.g., arrays, here.
+    try:
+        dtype = np.dtype(obj)


What if someone gives us a Pandas nullable integer type?

We certainly aren't handling this currently. On branch-0.19:

>>> cudf.Series([1, 2, 3], dtype=pd.Int64Dtype())  # TypeError
>>> cudf.utils.dtypes.is_numerical_dtype(pd.Int64Dtype())  # TypeError

I agree we should support this, but doing so efficiently is a difficult question. @vyasr and I were talking about this a couple of days ago, and he has some ideas for how to make dtype introspection faster/cheaper. Perhaps we can take on this problem there?
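
For context, a minimal sketch of the kind of `np.dtype`-based check under discussion, showing where objects that NumPy cannot interpret fall through. The function name, the set of "numeric" kinds, and the `FakeNullableInt` class are assumptions made for this example; the actual is_numerical_dtype in cuDF may differ.

```python
import numpy as np

# Assumption for this sketch: "numeric" means bool, signed/unsigned int, float.
_NUMERIC_KINDS = frozenset("biuf")


def is_numerical_dtype_sketch(obj) -> bool:
    try:
        dtype = np.dtype(obj)
    except TypeError:
        # NumPy cannot interpret this object, for example an extension
        # dtype instance, so this sketch reports it as non-numeric.
        return False
    return dtype.kind in _NUMERIC_KINDS


class FakeNullableInt:
    """Hypothetical stand-in for an extension dtype NumPy does not understand."""
```

Under these assumptions a nullable-integer-like object is reported as non-numeric, which is the gap raised in this thread; supporting such types would need a separate check.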

codecov · 2021-03-23T23:50:58Z

Codecov Report

Merging #7686 (ec1c5c4) into branch-0.19 (7871e7a) will increase coverage by 0.66%.
The diff coverage is n/a.

❗ Current head ec1c5c4 differs from pull request most recent head eadcc9c. Consider uploading reports for the commit eadcc9c to get more accurate results

@@               Coverage Diff               @@
##           branch-0.19    #7686      +/-   ##
===============================================
+ Coverage        81.86%   82.52%   +0.66%     
===============================================
  Files              101      101              
  Lines            16884    17444     +560     
===============================================
+ Hits             13822    14396     +574     
+ Misses            3062     3048      -14

Impacted Files	Coverage Δ
python/cudf/cudf/core/buffer.py	`84.21% <ø> (+4.96%)`	⬆️
python/cudf/cudf/core/column/categorical.py	`91.97% <ø> (+0.58%)`	⬆️
python/cudf/cudf/core/column/column.py	`87.61% <ø> (-0.15%)`	⬇️
python/cudf/cudf/core/column/datetime.py	`89.63% <ø> (+0.54%)`	⬆️
python/cudf/cudf/core/column/decimal.py	`92.75% <ø> (-2.12%)`	⬇️
python/cudf/cudf/core/column/lists.py	`90.00% <ø> (-1.40%)`	⬇️
python/cudf/cudf/core/column/numerical.py	`94.83% <ø> (-0.20%)`	⬇️
python/cudf/cudf/core/column/string.py	`86.79% <ø> (+0.30%)`	⬆️
python/cudf/cudf/core/column/timedelta.py	`88.57% <ø> (+0.33%)`	⬆️
python/cudf/cudf/core/column_accessor.py	`96.13% <ø> (+0.82%)`	⬆️
... and 54 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7f9f8f5...eadcc9c.

kkraus14 · 2021-03-24T16:26:09Z

@gpucibot merge

vyasr and others added 27 commits March 19, 2021 13:48

Move validation directly into set_by_label and use a raw dict to store the columns in the accessor.

935648b

Remove all references to OrderedColumnDict.

806a3ef

Move validation to separate method and use in both set_by_label and constructor.

40a7b17

Format with black.

a1c576e

Expose parameter to make validation optional.

788d9d6

Coerce constructor input to dict before calling items.

6a64285

Make construction safe.

e7d0981

Final cleanup and documentation.

c39932c

Address style issues.

4ff09fc

Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into feature/optimize_accessor_copy

9433582

Merge remote-tracking branch 'origin/branch-0.19' into feature/optimize_accessor_copy

74f2884

CA fix

0178127

Prioritize numeric columns

efea63d

Lazily compute and delete column length on demand.

c3b6444

Remove redundant clear cache in setitem.

01b2cf5

Remove mypy annotation for column length.

8899258

Merge branch 'feature/optimize_accessor_copy' of github.com:vyasr/cudf into feature/optimize_accessor_copy

3507785

Undo

7f8e1cd

Don't validate when copying type metadata

f2e4609

Prioritize numeric dtypes in is_numerical_dtype

72598fb

Add unsafe CA ctor

fa220b6

Revert "Prioritize numeric dtypes in is_numerical_dtype"

3760077

This reverts commit 72598fb.

Change error message back so that tests pass.

de9ca28

Faster is_numerical_dtype

e35d03b

Faster is_numerical_dtype

e2fd533

Even faster is_numerical_dtype

64ca702

Enable fast path for constructing a Buffer from a DeviceBuffer

749edf1

github-actions bot added the Python (Affects Python cuDF API) label Mar 23, 2021

shwina commented Mar 23, 2021

kkraus14 reviewed Mar 23, 2021

Add validation option to insert and standardize error message.

739ec57

shwina added the improvement (Improvement / enhancement to an existing function) and non-breaking (Non-breaking change) labels Mar 23, 2021

vyasr and others added 5 commits March 23, 2021 10:22

Fix style.

498b70e

Merge remote-tracking branch 'vyasr/feature/optimize_accessor_copy' into various-py-optimizations

3cd012b

Merge branch 'branch-0.19' of https://github.com/rapidsai/cudf into various-py-optimizations

c28866c

Undo formatting change

01e13fa

Add TODO

89a0301

kkraus14 reviewed Mar 23, 2021

init->create + doc

5e73de7

shwina marked this pull request as ready for review March 24, 2021 00:22

shwina requested a review from a team as a code owner March 24, 2021 00:22

shwina requested review from cwharris and galipremsagar March 24, 2021 00:22

kkraus14 reviewed Mar 24, 2021

shwina added 2 commits March 24, 2021 10:28

Use dict comprehension instead of building list

a4fe7b4

Use enumeration instead

eadcc9c

kkraus14 approved these changes Mar 24, 2021

kkraus14 added the 5 - Ready to Merge (Testing and reviews complete, ready to merge) label Mar 24, 2021

rapids-bot bot merged commit e73fff0 into rapidsai:branch-0.19 Mar 24, 2021
