Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python] Add support for "is" and "is not" to pyarrow.parquet.filters_to_expression #34504

Open
rjzamora opened this issue Mar 8, 2023 · 3 comments

Comments

@rjzamora
Copy link
Contributor

rjzamora commented Mar 8, 2023

Describe the enhancement requested

As discussed in dask/dask#9845, it is not currently possible to pass a List[Tuple]-formatted filter to filters_to_expression that can be converted into an expression that will filter null/nan values correctly. This is because something like [("a", "!=", np.nan) will result in <pyarrow.compute.Expression (a != nan)> instead of <pyarrow.compute.Expression invert(is_null(a, {nan_is_null=true}))>.

@j-bennet suggested that we add support for "is" and "is not" operators for this, and @jorisvandenbossche seemed receptive to this idea.

Since this was a blocking issue for some RAPIDS users, dask has temporarily added its own custom implementation of filters_to_expression. Is there any interest in adopting a similar version of this utility in pyarrow?

Note: It would be nice if the filters_to_expression utility made it possible to mimic pandas behavior and avoid null propagation for predicates like [("a", "!=", 1)] (by taking such a predicate to mean <pyarrow.compute.Expression (is_null(a, {nan_is_null=true}) or (a != 1))> on the arrow side). Dask's custom implementation adds the propagate_null argument for this purpose.

Component(s)

Parquet, Python

rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue May 17, 2023
This PR adds a `post_filters` argument to `cudf.read_parquet`, which is set equal to the `filters` argument by default. When this argument is set, the specified DNF (disjunctive normal form) filter expression will be applied to the in-memory `cudf.DataFrame` object after IO is performed.

The overal result of this PR is that the behavior of `cudf.read_parquet` becomes more consistent with that of `pd.read_parquet` in the sense that the default result will now enforce filters at a row-wise granularity for both libraries.

### Note on the "need" for distinct `filters` and `post_filters` arguments

My hope is that `post_filters` will eventually go away. However, I added a distinct argument for two general reasons:

1. PyArrow does not yet support `"is"` and `"is not"` operands in `filters`.  Therefore, we can not pass along **all** filters from `dask`/`dask_cudf` down to `cudf.read_parquet` using the existing `filters` argument, because we rely on pyarrow to filter out row-groups (note that dask implements its own filter-conversion utility to avoid this problem). I'm hoping pyarrow will eventually adopt these comparison types (xref: apache/arrow#34504)
2. When `cudf.read_parquet` is called from `dask_cudf.read_parquet`, row-group filtering will have already been applied. Therefore, it is convenient to specify that you only need cudf to provide the post-IO row-wise filtering step. Otherwise, we are effectively duplicating some metadata processing.

My primary concern with adding `post_filters` is the idea that row-wise filtering *could* be added at the cuio/libcudf level in the future. In that (hypothetical) case, `post_filters` wouldn't really be providing any value, but we would probably  be able to deprecate it without much pain (if any).

## Additional Context

This feature is ultimately **needed** to support general predicate-pushdown optimizations in Dask Expressions (dask-expr). This is because the high-level optimization logic becomes much simpler when a filter-based operation of a `ReadParquet` expression can be iteratively "absorbed" into the root `ReadParquet` expression.

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Lawrence Mitchell (https://github.com/wence-)

URL: #13334
@JacekPliszka
Copy link
Contributor

Hmm, I suggested something similar #38750 just not for 'is' or 'is not' but
for is_nan, is_null and is_valid as they are already implemented in pyarrow.dataset.Expression https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Expression.html

@rjzamora
Copy link
Contributor Author

Hmm, I suggested something similar #38750 just not for 'is' or 'is not' but
for is_nan, is_null and is_valid as they are already implemented in pyarrow.dataset.Expression https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Expression.html

Thanks for linking @JacekPliszka - I believe that solution would also satisfy our needs as well.

@JacekPliszka
Copy link
Contributor

I made the change and prepared PR but this functions is marked with DeprecationWarning so I have not even worked on the tests.
Possibly it will not be approved and the function removed.

I am using the function in production code so I will just move it to my project where I have it extended a bit to handle type casting

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants