-
Notifications
You must be signed in to change notification settings - Fork 655
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
REFACTOR-#3780: Remove code duplication for PandasOnDaskDataframe
#3781
Conversation
633716e
to
db6ada2
Compare
Codecov Report
@@ Coverage Diff @@
## master #3781 +/- ##
==========================================
+ Coverage 85.18% 89.83% +4.64%
==========================================
Files 265 266 +1
Lines 19626 19890 +264
==========================================
+ Hits 16719 17868 +1149
+ Misses 2907 2022 -885
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but did you run any benchmarks? While general code might work, it might do so slower.
No. |
I would rather saw this proved, rather than believe in the docs... I saw too much docs being liars to believe anything but firm fact measured by someone I can trust (like you 🙃) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing this will probably make things slower. Instead of gathering all at once, this will get each one individually.
Good point. So the best solution here would be to take all the lengths at a time, through the appropriate engine interface. After which it will be necessary to fill the cache of separate partitions in a loop. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just add a check for the cache into these functions engine-specific? The lengths/widths cache should usually be there, but in the case it's not we'll see very expensive operation times.
…funcs in 'PandasOnDaskDataframe' Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
This reverts commit d6dfe3f.
Signed-off-by: Myachev <[email protected]>
PandasOnDaskDataframe
@YarShev @vnlitvinov @mvashishtha please review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM @anmyachev ! Thank you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, nice find!
@mvashishtha let's merge it |
Signed-off-by: Anatoly Myachev [email protected]
What do these changes do?
Now there is no need to override methods
_row_length
and_column_width
for Dask separately, since the default implementation allows us to gather remote objects at once #3780 (comment)flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
_row_lengths
and_column_widths
functions inPandasOnDaskDataframe
. #3780addedand passingdocs/developer/architecture.rst
is up-to-date