Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce basic "cudf" backend for Dask Expressions #14805

Merged
merged 136 commits into from
Mar 11, 2024

Conversation

rjzamora
Copy link
Member

@rjzamora rjzamora commented Jan 19, 2024

Description

Mostly addresses #15027

dask/dask-expr#728 exposed the necessary mechanisms for us to define a custom dask-expr backend for cudf. The new dispatching mechanisms are effectively the same as those in dask.dataframe. The only difference is that we are now registering/implementing "expression-based" collections.

This PR does the following:

  • Defines a basic DataFrameBackendEntrypoint class for collection creation, and registers new collections using get_collection_type.
  • Refactors the dask_cudf import structure to properly support the "dataframe.query-planning" configuration.
  • Modifies CI to test dask-expr support for some of the dask_cudf tests. This coverage can be expanded in follow-up work.

Experimental Change: This PR patches dask_expr._expr.Expr.__new__ to enable type-based dispatching. This effectively allows us to surgically replace problematic Expr subclasses that do not work for cudf-backed data. For example, this PR replaces the upstream TakeLast expression to avoid using squeeze (since this method is not supported by cudf). This particular fix can be moved upstream relatively easily. However, having this kind of "patching" mechanism may be valuable for more complicated pandas/cudf discrepancies.

Usage example

from dask import config
config.set({"dataframe.query-planning": True})
import dask_cudf

df = dask_cudf.DataFrame.from_dict(
    {"x": range(100), "y":  [1, 2, 3, 4] * 25, "z": ["1", "2"] * 50},
    npartitions=10,
)
df["y2"] = df["x"] + df["y"]
agg = df.groupby("y").agg({"y2": "mean"})["y2"]
agg.simplify().pprint()

Dask cuDF should now be using dask-expr for "query planning":

Projection: columns='y2'
  GroupbyAggregation: arg={'y2': 'mean'} observed=True split_out=1'y'
    Assign: y2=
      Projection: columns=['y']
        FromPandas: frame='<dataframe>' npartitions=10 columns=['x', 'y']
      Add:
        Projection: columns='x'
          FromPandas: frame='<dataframe>' npartitions=10 columns=['x', 'y']
        Projection: columns='y'
          FromPandas: frame='<dataframe>' npartitions=10 columns=['x', 'y']

TODO

  • Add basic tests
  • Confirm that general design makes sense

Follow Up Work:

  • Expand dask-expr test coverage
  • Fix local and upstream bugs
  • Add documentation once "critical mass" is reached

@rjzamora rjzamora added 2 - In Progress Currently a work in progress improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Jan 19, 2024
@rjzamora rjzamora self-assigned this Jan 19, 2024
@github-actions github-actions bot added the Python Affects Python cuDF API. label Jan 19, 2024
Copy link

copy-pr-bot bot commented Jan 29, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added CMake CMake build issue conda labels Jan 29, 2024
@rjzamora rjzamora added 3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond and removed 2 - In Progress Currently a work in progress 3 - Ready for Review Ready for review by team labels Mar 5, 2024
Copy link
Member

@raydouglass raydouglass left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ops-codeowners changes look good

@rjzamora
Copy link
Member Author

/merge

@rjzamora rjzamora added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 4 - Needs Review Waiting for reviewer to review or respond labels Mar 10, 2024
@wence-
Copy link
Contributor

wence- commented Mar 11, 2024

Test fails are due to a combo of #15261 and #15265.

@bdice
Copy link
Contributor

bdice commented Mar 11, 2024

I'll merge the upstream so that the /merge command is triggered. (It was blocked by #15265)

@rjzamora
Copy link
Member Author

/merge

@rapids-bot rapids-bot bot merged commit e2fcf12 into rapidsai:branch-24.04 Mar 11, 2024
73 checks passed
@rjzamora rjzamora deleted the new-dask-expr-backend branch March 11, 2024 22:04
@bdice bdice mentioned this pull request Mar 19, 2024
3 tasks
rapids-bot bot pushed a commit that referenced this pull request Mar 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants