-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[REVIEW] DataFrame insert
and creation optimizations
#10285
[REVIEW] DataFrame insert
and creation optimizations
#10285
Conversation
Codecov Report
@@ Coverage Diff @@
## branch-22.04 #10285 +/- ##
================================================
+ Coverage 10.42% 10.67% +0.24%
================================================
Files 119 122 +3
Lines 20603 20878 +275
================================================
+ Hits 2148 2228 +80
- Misses 18455 18650 +195
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Mostly just suggestions to remove the default parameter since we changed it and that should be the default expected behavior for _insert
. Other than that, one suggestion where you might be able to optimize a constructor. I'll leave it to you to finalize.
value = Series(value, nan_as_null=nan_as_null)._align_to_index( | ||
self._index, how="right", sort=False | ||
) | ||
value = Series(value, nan_as_null=nan_as_null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's probably a faster way to construct this if we know the input is a cudf.Series
. I'm not sure how much we could save for a pandas Series by handling it manually, I don't think we typically do anything special for those. That may be worth exploring in a future PR (basically seeing if we can implement Series.from_pandas
in a more efficient manner than just calling the constructor), but is out of scope for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think it is out of scope for this PR. The reason is we have other places in the code-base which use similar patterns we might be better off tackling that in a separate PR all at a time.
@gpucibot merge |
This PR removes double index equality & reindexing checks in
DataFrame
construction andinsert
code-flows.scalar_broadcast_to
, a 2.2x speedup forstr
dtype, numeric types perf remains the same:assign
, a 1.2x-1.4x speedup:DataFrame constructor, a 2x speedup: