Is your feature request related to a problem? Please describe.
The current cudf/dask_cudf.read_parquet APIs accept a filters argument. These filters are only used to drop data at the row-group level. This means that cudf.read_parquet(path, filters=[("x", "==", 10)]) is not guaranteed to produce the same result as df = cudf.read_parquet(path); df[df["x"] == 10].
Although it makes sense to use Parquet statistics to filter out data at the row-group level, I feel that cudf should enforce the filters on all rows before returning the data to the user (even if those rows still need to be read into memory first).
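To make the gap concrete, here is a minimal sketch of why row-group pruning alone can return non-matching rows. It emulates a Parquet file as a list of in-memory "row groups" with min/max statistics, using pandas purely for illustration (the data and variable names are hypothetical, not cudf internals):

```python
import pandas as pd

# Emulate a Parquet file with two row groups (hypothetical data).
row_groups = [
    pd.DataFrame({"x": [1, 5, 10]}),    # min=1,  max=10 -- may contain x == 10
    pd.DataFrame({"x": [20, 30, 40]}),  # min=20, max=40 -- cannot contain x == 10
]

# Row-group-level pruning: keep any group whose [min, max] range could
# satisfy x == 10. This is all that statistics-based filtering guarantees.
pruned = [g for g in row_groups if g["x"].min() <= 10 <= g["x"].max()]
coarse = pd.concat(pruned, ignore_index=True)
print(coarse["x"].tolist())  # [1, 5, 10] -- non-matching rows survive

# Row-level enforcement (what this issue requests): apply the predicate
# to the surviving rows before returning them to the user.
exact = coarse[coarse["x"] == 10].reset_index(drop=True)
print(exact["x"].tolist())  # [10]
```

The second group is pruned entirely, but the first group is kept whole, so rows with x == 1 and x == 5 leak through unless the filter is re-applied at the row level.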
Describe the solution you'd like
Although it would be nice to apply filters in cuio/libcudf, it seems perfectly reasonable to simply convert DNF-formatted filter expressions into cudf/python operations, and apply those operations on the data before returning.
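A rough sketch of that conversion follows. `apply_dnf_filters` is a hypothetical helper name, not an existing cudf API, and pandas stands in for cudf here since the two share this indexing interface; it handles the usual DNF shapes (a single AND-conjunction of tuples, or an OR-list of such conjunctions):

```python
import operator
import pandas as pd

# Map pyarrow-style comparison strings to Python operators (a subset;
# "in"/"not in" would need extra handling).
_OPS = {"==": operator.eq, "!=": operator.ne, "<": operator.lt,
        "<=": operator.le, ">": operator.gt, ">=": operator.ge}

def apply_dnf_filters(df, filters):
    """Apply DNF filters: an OR of AND-conjunctions of (col, op, value).

    Accepts either [(col, op, val), ...] (a single conjunction) or
    [[(col, op, val), ...], ...] (a disjunction of conjunctions).
    """
    if filters and isinstance(filters[0], tuple):
        filters = [filters]  # normalize a bare conjunction to a one-term OR
    mask = pd.Series(False, index=df.index)
    for conjunction in filters:
        conj_mask = pd.Series(True, index=df.index)
        for col, op, value in conjunction:
            conj_mask &= _OPS[op](df[col], value)
        mask |= conj_mask
    return df[mask]

df = pd.DataFrame({"x": [1, 10, 10, 3], "y": ["a", "a", "b", "b"]})
out = apply_dnf_filters(df, [("x", "==", 10), ("y", "==", "a")])
print(out["x"].tolist())  # [10]
```

Since the mask is built from ordinary column comparisons, the same code path works on a cudf DataFrame, which is what makes a pure-Python post-filter step attractive relative to waiting on cuio/libcudf support.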
Describe alternatives you've considered
The alternative is the status quo: passing in filters provides no guarantee that the returned data will satisfy the provided filters.
Additional context
The primary motivation here is the new Dask Expressions (dask-expr) library. It is much easier to implement a predicate-pushdown optimization if a Filter expression can be completely absorbed by a ReadParquet expression by converting the distinct filtering operation into a ReadParquet argument.