# [Python] Add support for "is" and "is not" to `pyarrow.parquet.filters_to_expression` (#34504)
This PR adds a `post_filters` argument to `cudf.read_parquet`, which is set equal to the `filters` argument by default. When this argument is set, the specified DNF (disjunctive normal form) filter expression will be applied to the in-memory `cudf.DataFrame` object after IO is performed. The overall result of this PR is that the behavior of `cudf.read_parquet` becomes more consistent with that of `pd.read_parquet`, in the sense that the default result will now enforce filters at a row-wise granularity for both libraries.

### Note on the "need" for distinct `filters` and `post_filters` arguments

My hope is that `post_filters` will eventually go away. However, I added a distinct argument for two general reasons:

1. PyArrow does not yet support `"is"` and `"is not"` operands in `filters`. Therefore, we cannot pass along **all** filters from `dask`/`dask_cudf` down to `cudf.read_parquet` using the existing `filters` argument, because we rely on pyarrow to filter out row-groups (note that dask implements its own filter-conversion utility to avoid this problem). I'm hoping pyarrow will eventually adopt these comparison types (xref: apache/arrow#34504).
2. When `cudf.read_parquet` is called from `dask_cudf.read_parquet`, row-group filtering will have already been applied. Therefore, it is convenient to specify that you only need cudf to provide the post-IO row-wise filtering step. Otherwise, we are effectively duplicating some metadata processing.

My primary concern with adding `post_filters` is the idea that row-wise filtering *could* be added at the cuio/libcudf level in the future. In that (hypothetical) case, `post_filters` wouldn't really be providing any value, but we would probably be able to deprecate it without much pain (if any).

## Additional Context

This feature is ultimately **needed** to support general predicate-pushdown optimizations in Dask Expressions (dask-expr). This is because the high-level optimization logic becomes much simpler when a filter-based operation on a `ReadParquet` expression can be iteratively "absorbed" into the root `ReadParquet` expression.

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Lawrence Mitchell (https://github.com/wence-)

URL: #13334
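To make the post-IO filtering step concrete, here is a minimal sketch of applying a DNF filter (an OR-list of AND-lists of `(column, op, value)` tuples) to an in-memory DataFrame. This uses plain pandas and a hypothetical helper name (`apply_dnf_filters`); it illustrates the semantics described above, not the actual cudf implementation.

```python
import operator

import pandas as pd

# Map DNF operator strings to row-wise comparisons on a Series.
_OPS = {
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
    "in": lambda s, v: s.isin(v),
    "not in": lambda s, v: ~s.isin(v),
}

def apply_dnf_filters(df, filters):
    """Hypothetical sketch: apply a DNF filter to an in-memory DataFrame."""
    # A single conjunction like [("a", ">", 1)] is treated as [[("a", ">", 1)]].
    if filters and isinstance(filters[0], tuple):
        filters = [filters]
    mask = pd.Series(False, index=df.index)
    for conjunction in filters:  # OR over conjunctions
        conj_mask = pd.Series(True, index=df.index)
        for col, op, val in conjunction:  # AND within a conjunction
            conj_mask &= _OPS[op](df[col], val)
        mask |= conj_mask
    return df[mask]

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
out = apply_dnf_filters(df, [("a", ">", 1)])
```

Note that this post-filter only removes rows from data already read; row-group filtering during IO (what pyarrow's `filters` provides) remains a separate, coarser step.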
Hmm, I suggested something similar in #38750, just not for 'is' or 'is not'.
Thanks for linking @JacekPliszka - I believe that solution would satisfy our needs as well.
I made the change and prepared a PR, but this function is marked with a DeprecationWarning, so I have not even worked on the tests. I am using the function in production code, so I will just move it to my project, where I have extended it a bit to handle type casting.
### Describe the enhancement requested

As discussed in dask/dask#9845, it is not currently possible to pass a `List[Tuple]`-formatted filter to `filters_to_expression` that can be converted into an expression that will filter null/nan values correctly. This is because something like `[("a", "!=", np.nan)]` will result in `<pyarrow.compute.Expression (a != nan)>` instead of `<pyarrow.compute.Expression invert(is_null(a, {nan_is_null=true}))>`.

@j-bennet suggested that we add support for `"is"` and `"is not"` operators for this, and @jorisvandenbossche seemed receptive to this idea.

Since this was a blocking issue for some RAPIDS users, dask has temporarily added its own custom implementation of `filters_to_expression`. Is there any interest in adopting a similar version of this utility in pyarrow?

Note: It would be nice if the `filters_to_expression` utility made it possible to mimic pandas behavior and avoid null propagation for predicates like `[("a", "!=", 1)]` (by taking such a predicate to mean `<pyarrow.compute.Expression (is_null(a, {nan_is_null=true}) or (a != 1))>` on the arrow side). Dask's custom implementation adds the `propagate_null` argument for this purpose.

### Component(s)

Parquet, Python