# Add row-wise filtering step to `read_parquet` (#13334)
This PR adds a `post_filters` argument to `cudf.read_parquet`, which is set equal to the `filters` argument by default. When this argument is set, the specified DNF (disjunctive normal form) filter expression is applied to the in-memory `cudf.DataFrame` object after IO is performed. The overall result of this PR is that the behavior of `cudf.read_parquet` becomes more consistent with that of `pd.read_parquet`, in the sense that the default result now enforces filters at row-wise granularity for both libraries.

### Note on the "need" for distinct `filters` and `post_filters` arguments

My hope is that `post_filters` will eventually go away. However, I added a distinct argument for two general reasons:

1. PyArrow does not yet support `"is"` and `"is not"` operands in `filters`. Therefore, we cannot pass along **all** filters from `dask`/`dask_cudf` down to `cudf.read_parquet` using the existing `filters` argument, because we rely on PyArrow to filter out row-groups (note that Dask implements its own filter-conversion utility to avoid this problem). I'm hoping PyArrow will eventually adopt these comparison types (xref: apache/arrow#34504).
2. When `cudf.read_parquet` is called from `dask_cudf.read_parquet`, row-group filtering will have already been applied. Therefore, it is convenient to specify that you only need cudf to perform the post-IO row-wise filtering step. Otherwise, we are effectively duplicating some metadata processing.

My primary concern with adding `post_filters` is the possibility that row-wise filtering *could* be added at the cuio/libcudf level in the future. In that (hypothetical) case, `post_filters` wouldn't really be providing any value, but we would probably be able to deprecate it without much pain (if any).

## Additional Context

This feature is ultimately **needed** to support general predicate-pushdown optimizations in Dask Expressions (dask-expr). This is because the high-level optimization logic becomes much simpler when a filter-based operation on a `ReadParquet` expression can be iteratively "absorbed" into the root `ReadParquet` expression.

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Lawrence Mitchell (https://github.com/wence-)

URL: #13334
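As a usage illustration, here is a minimal sketch of the row-wise filtering behavior described above. It assumes cudf is installed and that `cudf.read_parquet` accepts the `filters`/`post_filters` arguments added by this PR; the file path and column names are hypothetical.

```python
import cudf

# Write a small example dataset (hypothetical path/columns).
df = cudf.DataFrame({"a": list(range(10)), "b": ["x", "y"] * 5})
df.to_parquet("example.parquet")

# DNF filter: keep only rows where a > 5. With this PR, the filter is
# also applied row-wise to the in-memory DataFrame after IO, rather
# than only pruning row-groups, matching pandas' behavior.
filtered = cudf.read_parquet("example.parquet", filters=[("a", ">", 5)])

# Alternatively, apply only the post-IO row-wise step (useful when
# row-group filtering has already been done upstream, e.g. by
# dask_cudf.read_parquet).
post_only = cudf.read_parquet("example.parquet", post_filters=[("a", ">", 5)])
```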