
Dask-CuDF: use default Dask Dataframe optimizer #8581

Merged

Conversation

madsbk
Member

@madsbk madsbk commented Jun 22, 2021

In order to use the new HighLevelGraph optimization work in Dask/Distributed, this PR makes dask_cudf.DataFrame use the default Dask optimizer.
Previously, we explicitly materialized the HighLevelGraphs when calling submit() and compute() on dask_cudf.DataFrames.

Overall, this should improve performance, but low-level task optimizations are now disabled by default, which might have a negative impact in some cases. High-level optimizations are applied in any case, and we are working on moving all low-level optimizations to the high level, but some optimizations, such as array slicing, are currently only supported at the low level.

I don't think we will be missing any low-level optimizations related to DataFrames, so I think we should follow Dask on this one and disable low-level optimizations by default.
It is possible to enable low-level optimizations explicitly by setting the Dask config like:

dask.config.set({"optimization.fuse.active": True})
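As a minimal sketch (assuming only a plain Dask installation, no GPUs), the same flag can also be scoped to a block of code with a context manager instead of being set globally:

```python
import dask

# Low-level graph fusion is off by default under the default optimizer;
# it can be re-enabled globally, or scoped with a context manager as here.
with dask.config.set({"optimization.fuse.active": True}):
    assert dask.config.get("optimization.fuse.active") is True

# Outside the context manager the previous (default) setting applies again.
```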

cc. @jakirkham, @quasiben, @beckernick, @VibhuJawa

@madsbk madsbk added the 2 - In Progress (Currently a work in progress), Performance (Performance related issue), improvement (Improvement / enhancement to an existing function), and non-breaking (Non-breaking change) labels Jun 22, 2021
@github-actions github-actions bot added the Python (Affects Python cuDF API) label Jun 22, 2021
@jakirkham
Member

cc @rjzamora

@jakirkham
Member

Interestingly, this optimize function has been around since Jim initially set up Dask-cuDF (called Dask-GDF at the time). I'm curious why it was needed. @jcrist, do you happen to remember? If not, no worries 🙂

@madsbk
Member Author

madsbk commented Jun 23, 2021

rerun tests

@codecov

codecov bot commented Jun 23, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.08@aa67b13).
The diff coverage is n/a.

❗ Current head 75194d7 differs from the pull request's most recent head 3fc25c8. Consider uploading reports for the commit 3fc25c8 to get more accurate results.

@@               Coverage Diff               @@
##             branch-21.08    #8581   +/-   ##
===============================================
  Coverage                ?   82.61%           
===============================================
  Files                   ?      109           
  Lines                   ?    17850           
  Branches                ?        0           
===============================================
  Hits                    ?    14747           
  Misses                  ?     3103           
  Partials                ?        0           

Powered by Codecov. Last update aa67b13...3fc25c8.

@madsbk madsbk marked this pull request as ready for review June 23, 2021 10:03
@madsbk madsbk requested a review from a team as a code owner June 23, 2021 10:03
@madsbk madsbk added the 3 - Ready for Review (Ready for review by team) label and removed the 2 - In Progress (Currently a work in progress) label Jun 23, 2021
dsk,
keys,
dependencies=dependencies,
ave_width=_globals.get("fuse_ave_width", 1),
Member

Was wondering if we should be setting this somehow to ensure it is 1. However, that appears to be the default anyway, so that doesn't seem needed.

The other culling steps happen naturally as part of optimize anyway, so there is no need to reproduce them otherwise either.
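For context, a minimal sketch of the low-level fusion the quoted code invoked (the graph here is illustrative; assumes a plain Dask installation). dask.optimization.fuse collapses linear chains in a materialized graph, and ave_width=1, the default discussed above, restricts fusion to simple single-input chains:

```python
from operator import add, mul

import dask
from dask.optimization import fuse

# A tiny materialized (low-level) task graph: a -> b -> c is a linear chain.
dsk = {"a": 1, "b": (add, "a", 1), "c": (mul, "b", 2)}

# ave_width=1 fuses only linear chains; wider fan-ins are left alone.
fused, dependencies = fuse(dsk, keys=["c"], ave_width=1)

# The requested output key "c" survives fusion and still evaluates correctly.
assert dask.core.get(fused, "c") == 4
```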

@jakirkham
Member

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 99808ab into rapidsai:branch-21.08 Jun 23, 2021
@jakirkham
Member

Thanks Mads! 😄

@madsbk madsbk deleted the dask_cudf_use_default_dask_optimizer branch April 5, 2022 10:35
Labels
3 - Ready for Review (Ready for review by team), improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change), Performance (Performance related issue), Python (Affects Python cuDF API)