Is your feature request related to a problem? Please describe.
The current cudf/dask_cudf.read_parquet APIs accept a filters argument. These filters are only used to drop data at the row-group level. This means that cudf.read_parquet(path, filters=[("x", "==", 10)]) is not guaranteed to produce the same result as df = cudf.read_parquet(path); df[df["x"] == 10].
Although it makes sense to use Parquet statistics to filter out data at the row-group level, I feel that cudf should enforce the filters on all rows before returning the data to the user (even if those rows still need to be read into memory first).
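To make the gap concrete, here is a minimal sketch of why row-group pruning alone can return non-matching rows. It emulates a Parquet file as a list of in-memory "row groups" with min/max statistics, using pandas purely for illustration (the data and variable names are hypothetical, not cudf internals):

```python
import pandas as pd

# Emulate a Parquet file with two row groups (hypothetical data).
row_groups = [
    pd.DataFrame({"x": [1, 5, 10]}),    # min=1,  max=10 -- may contain x == 10
    pd.DataFrame({"x": [20, 30, 40]}),  # min=20, max=40 -- cannot contain x == 10
]

# Row-group-level pruning: keep any group whose [min, max] range could
# satisfy x == 10. This is all that statistics-based filtering guarantees.
pruned = [g for g in row_groups if g["x"].min() <= 10 <= g["x"].max()]
coarse = pd.concat(pruned, ignore_index=True)
print(coarse["x"].tolist())  # [1, 5, 10] -- non-matching rows survive

# Row-level enforcement (what this issue requests): apply the predicate
# to the surviving rows before returning them to the user.
exact = coarse[coarse["x"] == 10].reset_index(drop=True)
print(exact["x"].tolist())  # [10]
```

The second group is pruned entirely, but the first group is kept whole, so rows with x == 1 and x == 5 leak through unless the filter is re-applied at the row level.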
Describe the solution you'd like
Although it would be nice to apply filters in cuio/libcudf, it seems perfectly reasonable to simply convert DNF-formatted filter expressions into cudf/python operations, and apply those operations on the data before returning.
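A rough sketch of that conversion follows. `apply_dnf_filters` is a hypothetical helper name, not an existing cudf API, and pandas stands in for cudf here since the two share this indexing interface; it handles the usual DNF shapes (a single AND-conjunction of tuples, or an OR-list of such conjunctions):

```python
import operator
import pandas as pd

# Map pyarrow-style comparison strings to Python operators (a subset;
# "in"/"not in" would need extra handling).
_OPS = {"==": operator.eq, "!=": operator.ne, "<": operator.lt,
        "<=": operator.le, ">": operator.gt, ">=": operator.ge}

def apply_dnf_filters(df, filters):
    """Apply DNF filters: an OR of AND-conjunctions of (col, op, value).

    Accepts either [(col, op, val), ...] (a single conjunction) or
    [[(col, op, val), ...], ...] (a disjunction of conjunctions).
    """
    if filters and isinstance(filters[0], tuple):
        filters = [filters]  # normalize a bare conjunction to a one-term OR
    mask = pd.Series(False, index=df.index)
    for conjunction in filters:
        conj_mask = pd.Series(True, index=df.index)
        for col, op, value in conjunction:
            conj_mask &= _OPS[op](df[col], value)
        mask |= conj_mask
    return df[mask]

df = pd.DataFrame({"x": [1, 10, 10, 3], "y": ["a", "a", "b", "b"]})
out = apply_dnf_filters(df, [("x", "==", 10), ("y", "==", "a")])
print(out["x"].tolist())  # [10]
```

Since the mask is built from ordinary column comparisons, the same code path works on a cudf DataFrame, which is what makes a pure-Python post-filter step attractive relative to waiting on cuio/libcudf support.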
Describe alternatives you've considered
The alternative is the status quo: passing in filters provides no guarantee that the returned data will satisfy the provided filters.
Additional context
The primary motivation here is the new Dask Expressions (dask-expr) library. It is much easier to implement a predicate-pushdown optimization if a Filter expression can be completely absorbed by a ReadParquet expression by converting the distinct filtering operation into a ReadParquet argument.