[REVIEW] Optimize DataFrame creation across code-base #10236
Why do we need to support this constructor? Can we not just use |
To be clear, there are a ton of places in cudf that are doing inefficient things right now when it comes to materializing more intermediates than we need. We need to do a comprehensive review to remove things like this.
After an offline discussion with @vyasr we decided to raise an error if someone passes a |
Codecov Report
@@             Coverage Diff              @@
##           branch-22.04   #10236    +/-   ##
================================================
+ Coverage       10.42%   10.47%   +0.04%
================================================
  Files             119      122      +3
  Lines           20603    20506     -97
================================================
- Hits             2148     2147      -1
+ Misses          18455    18359     -96
Awesome happy with these changes.
Nice. Perf improvements look awesome, too!
@gpucibot merge
After internal profiling, it appears that `_init_from_dict_like` was being called repeatedly even for `ColumnAccessor` inputs. This is unnecessary, and the method is expensive because it performs proper index re-alignment. This PR addresses that by raising an error when a developer tries to create a `DataFrame` with a `ColumnAccessor`, and by changing the existing call sites that did so to use `_from_data` instead, which avoids going through `_init_from_dict_like` for `ColumnAccessor` input. For a dataframe of shape `30_00_000 x 3`, the speedup is about 2.5x to 4x.

On `branch-22.04`:

This PR:
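As an illustration only (this is neither cudf code nor the PR's benchmark), the pattern in the description can be sketched with toy classes in plain Python. `ToyColumnAccessor` and `ToyFrame` are hypothetical names invented for this sketch. The assumption is that the public constructor pays for index alignment, while a `_from_data`-style classmethod trusts input that is already aligned and skips that work.

```python
class ToyColumnAccessor(dict):
    """Hypothetical stand-in for a container of already-aligned columns."""


class ToyFrame:
    """Toy frame, not cudf.DataFrame: shows a slow public path and a fast internal path."""

    def __init__(self, data=None, index=None):
        # Mirror the behaviour described in the PR: reject pre-built column
        # containers here so internal callers are pushed to the fast path.
        if isinstance(data, ToyColumnAccessor):
            raise TypeError(
                "ToyColumnAccessor input is not supported here; "
                "use ToyFrame._from_data instead"
            )
        data = data or {}
        # Slow path: each column maps label -> value, and every column is
        # re-aligned against the union of all labels.
        if index is None:
            index = sorted({label for col in data.values() for label in col})
        self._index = list(index)
        self._data = ToyColumnAccessor(
            {
                name: [col.get(label) for label in self._index]
                for name, col in data.items()
            }
        )

    @classmethod
    def _from_data(cls, data, index):
        # Fast path: the caller guarantees the columns are already aligned
        # to `index`, so no alignment work is done and no columns are copied.
        obj = cls.__new__(cls)
        obj._data = data
        obj._index = list(index)
        return obj


# The slow path aligns two columns with different labels.
slow = ToyFrame({"a": {0: 1, 1: 2}, "b": {1: 10, 2: 20}})

# The fast path reuses the aligned columns directly.
fast = ToyFrame._from_data(slow._data, slow._index)
```

Raising in the constructor, rather than silently accepting the container, makes any internal call site that still takes the expensive route fail loudly, which is the trade-off the description hints at.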