[FEA] Apply row-wise filtering for filters in cudf/dask_cudf.read_parquet #13324

Closed
rjzamora opened this issue May 9, 2023 · 0 comments · Fixed by #13334
Labels
dask Dask issue feature request New feature or request Python Affects Python cuDF API.

Comments

@rjzamora
Member

rjzamora commented May 9, 2023

Is your feature request related to a problem? Please describe.
The current cudf/dask_cudf.read_parquet APIs accept a filters argument, but these filters are only used to drop data at the row-group level. This means that cudf.read_parquet(path, filters=[("x", "==", 10)]) is not guaranteed to produce the same result as df = cudf.read_parquet(path); df[df["x"] == 10].

Although it makes sense to use parquet statistics to filter out data at the row-group level, I feel that cudf should enforce the filters on all rows before returning the data to the user (even if those rows still need to be read into memory first).
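To make the gap concrete, here is a minimal pure-Python sketch of the two behaviors. It does not call cudf at all: the row groups, their contents, and both helper functions are hypothetical, and the min/max check only approximates how parquet statistics are used to skip row groups.

```python
# Hypothetical row groups for a single column "x". Each inner list is one
# row group; its min/max play the role of the parquet statistics.
row_groups = [
    [1, 5, 9],     # range [1, 9] cannot contain 10, so the group is skipped
    [8, 10, 12],   # range [8, 12] may contain 10, so the group is read in full
    [20, 25, 30],  # range [20, 30] cannot contain 10, so the group is skipped
]


def read_with_row_group_filter(groups, value):
    """Keep every row of any group whose [min, max] range could hold `value`."""
    rows = []
    for group in groups:
        if min(group) <= value <= max(group):
            rows.extend(group)
    return rows


def read_with_row_wise_filter(groups, value):
    """Prune row groups first, then enforce the predicate on each remaining row."""
    return [x for x in read_with_row_group_filter(groups, value) if x == value]


coarse = read_with_row_group_filter(row_groups, 10)  # current behavior
exact = read_with_row_wise_filter(row_groups, 10)    # requested behavior
print(coarse, exact)
```

In this toy data the row-group filter still returns 8 and 12 alongside 10, which is the mismatch described above.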

Describe the solution you'd like
Although it would be nice to apply filters in cuio/libcudf, it seems perfectly reasonable to simply convert DNF-formatted filter expressions into cudf/python operations, and apply those operations on the data before returning.
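A rough sketch of what that conversion could look like is below. It is only an illustration, not the implementation that landed in #13334: the helper name apply_dnf_filters is hypothetical, pandas is used as a stand-in so the snippet runs without a GPU, and it assumes a cudf DataFrame supports the same comparison, isin, and boolean-mask operations.

```python
import operator
from functools import reduce

import pandas as pd  # stand-in for cudf (assumption: same DataFrame operations)

# Map DNF operator strings to element-wise Series operations.
_OPS = {
    "==": operator.eq,
    "=": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "in": lambda series, values: series.isin(list(values)),
    "not in": lambda series, values: ~series.isin(list(values)),
}


def apply_dnf_filters(df, filters):
    """Hypothetical helper: apply DNF-formatted filters row-wise to `df`.

    `filters` is either a flat list of (column, op, value) tuples, which are
    AND-ed together, or a list of such lists, which are OR-ed together.
    """
    if not filters:
        return df
    if isinstance(filters[0], tuple):
        # A flat list of tuples is a single conjunction.
        filters = [filters]
    conjunction_masks = []
    for conjunction in filters:
        masks = [_OPS[op](df[col], val) for col, op, val in conjunction]
        conjunction_masks.append(reduce(operator.and_, masks))
    return df[reduce(operator.or_, conjunction_masks)]


df = pd.DataFrame({"x": [1, 10, 10, 3], "y": ["a", "b", "c", "d"]})
# (x == 10 AND y != "c") OR (x < 2)
result = apply_dnf_filters(df, [[("x", "==", 10), ("y", "!=", "c")], [("x", "<", 2)]])
print(result)
```

Applying such a helper after the row-group-level read would give the same rows as filtering the full DataFrame by hand, at the cost of first materializing the surviving row groups in memory.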

Describe alternatives you've considered
The alternative is the status quo: passing in filters provides no guarantee that the returned data will satisfy them.

Additional context
The primary motivation here is the new Dask Expressions (dask-expr) library. It is much easier to implement a predicate-pushdown optimization if a Filter expression can be completely absorbed by a ReadParquet expression by converting the distinct filtering operation into a ReadParquet argument.

@rjzamora rjzamora added feature request New feature or request Needs Triage Need team to review and classify Python Affects Python cuDF API. dask Dask issue labels May 9, 2023
@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024