[BUG] read_parquet/read_orc with filters do not filter specific rows #12512
Labels: 0 - Backlog, bug, cuIO, feature request, libcudf, Python
Describe the bug
When cudf.read_parquet or read_orc is called with the filters argument to filter out rows based on certain predicates, the methods currently only skip row groups (or stripes) that can be eliminated entirely based on the given condition. They return all rows from the row groups that are read, without applying the given filters again at the row level. This behavior can be confusing to users who assume that all non-matching data has already been filtered out, and it is contrary to how dask, dask-cuDF and PyArrow behave today.
Example:
Data:
Would return 1, 5, 1, which is all elements from row group 0 (RG0); row group 1 (RG1) gets filtered out.
Expected output would be 1, 1.
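The difference between the two behaviors can be sketched in plain Python, without cuDF. The row-group contents below are hypothetical (the issue's original data is not shown here), and the helper names are made up for illustration; the min/max check only stands in for the statistics-based pruning a reader might perform.

```python
# Hypothetical data: each inner list is one row group of an integer column "a".
row_groups = [
    [1, 5, 1],  # RG0: min=1, max=5
    [2, 3, 4],  # RG1: min=2, max=4
]

def may_contain(values, target):
    """Statistics check: could this row group contain `target` at all?"""
    return min(values) <= target <= max(values)

def read_with_pruning_only(groups, target):
    """Skip row groups ruled out by min/max, but keep every row of the rest."""
    out = []
    for values in groups:
        if may_contain(values, target):
            out.extend(values)
    return out

def read_with_row_filter(groups, target):
    """Prune row groups first, then apply the predicate to each remaining row."""
    return [v for v in read_with_pruning_only(groups, target) if v == target]

pruned = read_with_pruning_only(row_groups, 1)   # behavior described in this issue
filtered = read_with_row_filter(row_groups, 1)   # behavior users expect
print(pruned)    # [1, 5, 1]
print(filtered)  # [1, 1]
```

RG1 is dropped in both cases because its minimum is above 1; only the second function also removes the 5 that shares a row group with matching rows.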
Steps/Code to reproduce bug
Expected behavior
The 5's from row group 0 are also filtered out, so only the 1's are returned, which is in line with how PyArrow and dask/dask-cudf return the result.
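As a rough illustration of what applying the filters again at the row level could look like, here is a minimal pure-Python sketch. It assumes the filters use the list-of-tuples (or list-of-lists for OR) convention that PyArrow accepts; apply_filters and row_matches are hypothetical helpers, not part of cuDF, and the rows are made-up data.

```python
import operator

# Supported comparison operators for (column, op, value) predicates.
_OPS = {
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
    "in": lambda a, b: a in b,
    "not in": lambda a, b: a not in b,
}

def row_matches(row, filters):
    """Evaluate filters in disjunctive normal form against one row (a dict)."""
    # A flat list of tuples is treated as a single AND-ed conjunction.
    if filters and isinstance(filters[0], tuple):
        filters = [filters]
    # Outer list is OR-ed, each inner list is AND-ed.
    return any(
        all(_OPS[op](row[col], val) for col, op, val in conjunction)
        for conjunction in filters
    )

def apply_filters(rows, filters):
    """Keep only the rows that satisfy the filters."""
    return [row for row in rows if row_matches(row, filters)]

# Hypothetical rows, as if read from a row group that survived pruning.
rows = [{"a": 1}, {"a": 5}, {"a": 1}]
result = apply_filters(rows, [("a", "==", 1)])
print(result)  # [{'a': 1}, {'a': 1}]
```

A real implementation would evaluate the predicate on whole columns on the GPU rather than row by row; the sketch only shows the semantics of the second filtering pass.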