-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Misc Python/Cython optimizations #7686
Misc Python/Cython optimizations #7686
Conversation
…e the columns in the accessor.
…eature/optimize_accessor_copy
…f into feature/optimize_accessor_copy
This reverts commit 72598fb.
@@ -1017,7 +1017,9 @@ def distinct_count( | |||
return cpp_distinct_count(self, ignore_nulls=dropna) | |||
|
|||
def astype(self, dtype: Dtype, **kwargs) -> ColumnBase: | |||
if is_categorical_dtype(dtype): | |||
if is_numerical_dtype(dtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Numerical types being the most common [[citation needed]], and is_numerical_dtype
now being quite fast, it makes sense to do this check first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still definitely pro doing this, but I'm working on prototyping the (amortized) constant-time approach I suggested and I'll update you once that's done. Hopefully that will make ordering concerns here largely moot.
…nto various-py-optimizations
…arious-py-optimizations
# TODO: we should handle objects with a `.dtype` attribute, | ||
# e.g., arrays, here. | ||
try: | ||
dtype = np.dtype(obj) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if someone gives us a Pandas nullable integer type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We certainly aren't handling this currently. On branch-0.19
>>> cudf.Series([1, 2, 3], dtype=pd.Int64Dtype()) # TypeError
>>> cudf.utils.dtypes.is_numerical_dtype(pd.Int64Dtype()) # TypeError
I agree we shouldl support this. But how to do so in an efficient way is a difficult question. @vyasr and I were talking about this a couple of days ago, and he has some ideas for how to make dtype introspection faster/cheaper. We can perhaps take on this problem there?
Codecov Report
@@ Coverage Diff @@
## branch-0.19 #7686 +/- ##
===============================================
+ Coverage 81.86% 82.52% +0.66%
===============================================
Files 101 101
Lines 16884 17444 +560
===============================================
+ Hits 13822 14396 +574
+ Misses 3062 3048 -14
Continue to review full report at Codecov.
|
@gpucibot merge |
This PR introduces various small optimizations that should generally improve various common Python overhead. See #7454 (comment) for the motivation behind these optimizations and some benchmarks.
Merge after: #7660
Summary:
cudf::table
to aFrame
, where we're guaranteed the columns are well formedis_numerical_dtype
astype()
andbuild_column()
. Numeric types are presumably more common, and we can avoid expensive checks for other dtypes this way.