Dask-CuDF: use default Dask Dataframe optimizer #8581
cc @rjzamora
Interestingly this
rerun tests
Codecov Report
@@ Coverage Diff @@
## branch-21.08 #8581 +/- ##
===============================================
Coverage ? 82.61%
===============================================
Files ? 109
Lines ? 17850
Branches ? 0
===============================================
Hits ? 14747
Misses ? 3103
Partials ? 0
dsk,
keys,
dependencies=dependencies,
ave_width=_globals.get("fuse_ave_width", 1),
Was wondering if we should be setting this somehow to ensure it is 1. However, that appears to be the default anyway, so that doesn't seem needed.
The other culling steps happen naturally as part of optimize anyway, so there is no need to reproduce them either.
@gpucibot merge
Thanks Mads! 😄
In order to use the new HighLevelGraph optimization work in Dask/Distributed, this PR makes dask_cudf.Dataframes use the default Dask optimizer. Previously, we explicitly materialized the HighLevelGraphs when calling submit() and compute() on dask_cudf.Dataframes.
Overall, this should improve performance, but low-level task optimizations are disabled by default, which might have a negative impact. High-level optimizations are done in any case, and we are working on moving all low-level optimizations to the high level, but currently some optimizations, such as array slicing, are only supported at the low level.
I don't think we will be missing any low-level optimizations related to Dataframes so I think we should follow Dask on this one and disable low-level optimizations by default.
It is possible to enable low-level optimizations explicitly by setting the Dask config like:
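A minimal sketch of such a setting, assuming the relevant key is `optimization.fuse.active` (the Dask config option that toggles low-level task fusion); the exact key used in this PR is not shown on this page:

```python
import dask

# Assumption: "optimization.fuse.active" is the Dask config key that
# controls low-level task fusion; setting it to True enables it globally.
dask.config.set({"optimization.fuse.active": True})

# Read the value back to confirm that the setting took effect.
print(dask.config.get("optimization.fuse.active"))
```

`dask.config.set` can also be used as a context manager (`with dask.config.set(...):`) to limit the setting to a single block of computations.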
cc. @jakirkham, @quasiben, @beckernick, @VibhuJawa