[Python] Add support for "is" and "is not" to `pyarrow.parquet.filters_to_expression` #34504

rjzamora · 2023-03-08T18:30:38Z

Describe the enhancement requested

As discussed in dask/dask#9845, it is not currently possible to pass a List[Tuple]-formatted filter to filters_to_expression that can be converted into an expression that will filter null/nan values correctly. This is because something like [("a", "!=", np.nan) will result in <pyarrow.compute.Expression (a != nan)> instead of <pyarrow.compute.Expression invert(is_null(a, {nan_is_null=true}))>.

@j-bennet suggested that we add support for "is" and "is not" operators for this, and @jorisvandenbossche seemed receptive to this idea.

Since this was a blocking issue for some RAPIDS users, dask has temporarily added its own custom implementation of filters_to_expression. Is there any interest in adopting a similar version of this utility in pyarrow?

Note: It would be nice if the filters_to_expression utility made it possible to mimic pandas behavior and avoid null propagation for predicates like [("a", "!=", 1)] (by taking such a predicate to mean <pyarrow.compute.Expression (is_null(a, {nan_is_null=true}) or (a != 1))> on the arrow side). Dask's custom implementation adds the propagate_null argument for this purpose.

Component(s)

Parquet, Python

The text was updated successfully, but these errors were encountered:

This PR adds a `post_filters` argument to `cudf.read_parquet`, which is set equal to the `filters` argument by default. When this argument is set, the specified DNF (disjunctive normal form) filter expression will be applied to the in-memory `cudf.DataFrame` object after IO is performed. The overal result of this PR is that the behavior of `cudf.read_parquet` becomes more consistent with that of `pd.read_parquet` in the sense that the default result will now enforce filters at a row-wise granularity for both libraries. ### Note on the "need" for distinct `filters` and `post_filters` arguments My hope is that `post_filters` will eventually go away. However, I added a distinct argument for two general reasons: 1. PyArrow does not yet support `"is"` and `"is not"` operands in `filters`. Therefore, we can not pass along **all** filters from `dask`/`dask_cudf` down to `cudf.read_parquet` using the existing `filters` argument, because we rely on pyarrow to filter out row-groups (note that dask implements its own filter-conversion utility to avoid this problem). I'm hoping pyarrow will eventually adopt these comparison types (xref: apache/arrow#34504) 2. When `cudf.read_parquet` is called from `dask_cudf.read_parquet`, row-group filtering will have already been applied. Therefore, it is convenient to specify that you only need cudf to provide the post-IO row-wise filtering step. Otherwise, we are effectively duplicating some metadata processing. My primary concern with adding `post_filters` is the idea that row-wise filtering *could* be added at the cuio/libcudf level in the future. In that (hypothetical) case, `post_filters` wouldn't really be providing any value, but we would probably be able to deprecate it without much pain (if any). ## Additional Context This feature is ultimately **needed** to support general predicate-pushdown optimizations in Dask Expressions (dask-expr). This is because the high-level optimization logic becomes much simpler when a filter-based operation of a `ReadParquet` expression can be iteratively "absorbed" into the root `ReadParquet` expression. Authors: - Richard (Rick) Zamora (https://github.com/rjzamora) Approvers: - Bradley Dice (https://github.com/bdice) - Lawrence Mitchell (https://github.com/wence-) URL: #13334

JacekPliszka · 2023-11-16T22:29:47Z

Hmm, I suggested something similar #38750 just not for 'is' or 'is not' but
for is_nan, is_null and is_valid as they are already implemented in pyarrow.dataset.Expression https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Expression.html

rjzamora · 2023-11-17T15:01:35Z

Hmm, I suggested something similar #38750 just not for 'is' or 'is not' but
for is_nan, is_null and is_valid as they are already implemented in pyarrow.dataset.Expression https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Expression.html

Thanks for linking @JacekPliszka - I believe that solution would also satisfy our needs as well.

JacekPliszka · 2023-11-18T18:19:25Z

I made the change and prepared PR but this functions is marked with DeprecationWarning so I have not even worked on the tests.
Possibly it will not be approved and the function removed.

I am using the function in production code so I will just move it to my project where I have it extended a bit to handle type casting

rjzamora added the Type: enhancement label Mar 8, 2023

github-actions bot added Component: Parquet Component: Python labels Mar 8, 2023

rjzamora mentioned this issue Mar 8, 2023

read_parquet filter bug with nulls dask/dask#9845

Closed

rjzamora mentioned this issue May 10, 2023

Add row-wise filtering step to read_parquet rapidsai/cudf#13334

Merged

3 tasks

JacekPliszka mentioned this issue Nov 18, 2023

GH-38750: [Python] Adding 3 operators to DNF filter expression #38774

Open

giacomorebecchi mentioned this issue Jun 14, 2024

Allow pyarrow.dataset.Expression in filters kwarg of deltalake.DeltaTable.to_pyarrow_table delta-io/delta-rs#2597

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Python] Add support for "is" and "is not" to `pyarrow.parquet.filters_to_expression` #34504

[Python] Add support for "is" and "is not" to `pyarrow.parquet.filters_to_expression` #34504

rjzamora commented Mar 8, 2023

JacekPliszka commented Nov 16, 2023

rjzamora commented Nov 17, 2023

JacekPliszka commented Nov 18, 2023

[Python] Add support for "is" and "is not" to pyarrow.parquet.filters_to_expression #34504

[Python] Add support for "is" and "is not" to pyarrow.parquet.filters_to_expression #34504

Comments

rjzamora commented Mar 8, 2023

Describe the enhancement requested

Component(s)

JacekPliszka commented Nov 16, 2023

rjzamora commented Nov 17, 2023

JacekPliszka commented Nov 18, 2023

[Python] Add support for "is" and "is not" to `pyarrow.parquet.filters_to_expression` #34504

[Python] Add support for "is" and "is not" to `pyarrow.parquet.filters_to_expression` #34504