[BUG] read_parquet/read_orc with filters do not filter specific rows #12512

ayushdg · 2023-01-10T13:15:09Z

Describe the bug
When cudf.read_parquet or cudf.read_orc is called with the filters argument to filter rows by a predicate, the readers currently only skip row groups (or stripes) that can be eliminated entirely based on the condition. They return all rows from the remaining row groups without applying the filters to individual rows. This can confuse users who assume that all non-matching rows have been removed, and it differs from how dask, dask-cuDF and PyArrow behave today.

Example:

Data:

Col Name: a
Row Group 0: 1,5,1
Row Group 1: 5,5,5

cudf.read_parquet("data", filters=[('a','!=',5)])

This returns 1, 5, 1, which is every element of row group 0 (row group 1 is skipped).
The expected output is 1, 1.
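
The difference between row-group pruning and row-wise filtering can be sketched in plain Python, without cuDF. This is only an illustrative model, not the reader's actual implementation: the helper names (`may_match`, `read_with_pruning_only`, `read_with_row_filter`) are hypothetical, and it assumes pruning decisions are made from per-row-group min/max statistics.

```python
import operator

# Comparison operators supported by this sketch (a subset of the filters grammar).
OPS = {"==": operator.eq, "!=": operator.ne, "<": operator.lt,
       "<=": operator.le, ">": operator.gt, ">=": operator.ge}

def may_match(lo, hi, op, val):
    """Return False only if min/max statistics prove that no row can match."""
    if op == "==":
        return lo <= val <= hi
    if op == "!=":
        return not (lo == hi == val)  # every value in the group equals val
    if op == "<":
        return lo < val
    if op == "<=":
        return lo <= val
    if op == ">":
        return hi > val
    if op == ">=":
        return hi >= val
    raise ValueError(f"unsupported operator: {op}")

def read_with_pruning_only(row_groups, op, val):
    """Model of the reported behavior: skip whole row groups, keep every row of the rest."""
    out = []
    for rg in row_groups:
        if may_match(min(rg), max(rg), op, val):
            out.extend(rg)
    return out

def read_with_row_filter(row_groups, op, val):
    """Model of the expected behavior: prune row groups, then filter the surviving rows."""
    kept = read_with_pruning_only(row_groups, op, val)
    return [x for x in kept if OPS[op](x, val)]

row_groups = [[1, 5, 1], [5, 5, 5]]
pruned = read_with_pruning_only(row_groups, "!=", 5)
filtered = read_with_row_filter(row_groups, "!=", 5)
print(pruned)    # [1, 5, 1]
print(filtered)  # [1, 1]
```

Row group 1 has min == max == 5, so it can be skipped safely. Row group 0 contains both matching and non-matching values, so only a second, row-wise pass removes its 5.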

Steps/Code to reproduce bug

import cudf

df = cudf.DataFrame()
df["a"] = [1, 5] * 2500 + [5] * 5000

# Two row groups of 5000 rows each: [1, 5, 1, 5, ...] and [5, 5, 5, ...]
df.to_parquet("rg_test.parquet", row_group_size_rows=5000)

cudf.read_parquet("rg_test.parquet")
# [10000 rows x 1 columns]

cudf.read_parquet("rg_test.parquet", filters=[("a", "!=", 5)])
# [5000 rows x 1 columns]

Expected behavior
The 5s in row group 0 should also be filtered out, so that only 1s are returned. This is in line with the results returned by PyArrow and dask/dask-cudf.

Environment overview (please complete the following information)

Environment location: bare-metal
Method of cuDF install: conda

GregoryKimball · 2023-04-19T22:42:13Z

Hello, I would like to update this issue now that cuDF's DataFrame.query supports libcudf ASTs. I propose adding a row-wise filtering step to cudf.read_parquet when the filters argument is present, similar to the following approach:

import cudf

df = cudf.DataFrame({'a': range(10), 'b': range(10, 20)})
df.to_parquet('test.parquet')

# DNF filters: (a > 7 AND b > 15) OR (a < 2)
filters = [
    [('a', '>', 7), ('b', '>', 15)],
    [('a', '<', 2)],
]
df = cudf.read_parquet('test.parquet', filters=filters)

# Normalize a flat list of tuples into a single conjunction
assert isinstance(filters, list) and len(filters) > 0, "Invalid filters"
if isinstance(filters[0], tuple):
    filters = [filters]

# Build "((a > 7) and (b > 15)) or ((a < 2))" and apply it row-wise
expr = ' or '.join(
    '((' + ') and ('.join(f'{col} {op} {val}' for col, op, val in conj) + '))'
    for conj in filters
)
df_filtered = df.query(expr)

Edit: now that libcudf ASTs support string scalars, we might want to add a pattern for double-quoting string values.

See pyarrow.parquet.read_table for more information about the grammar of filters. The grammar is single-column disjunctive normal form (DNF), which is a subset of what ASTs can represent. We are missing the in and not in operators, but these could be converted: in becomes ORed == comparisons and not in becomes ANDed != comparisons.
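
A possible extension of the snippet above, covering the string quoting and the in / not in expansion, is sketched here. It only builds the expression string. The helper names (`_literal`, `_predicate`, `filters_to_expr`) are hypothetical, string values are assumed to contain no double quotes, and whether DataFrame.query accepts every expression produced this way is an assumption that would need to be checked against cuDF.

```python
def _literal(val):
    # Double-quote string values; assumes they contain no double quotes themselves.
    if isinstance(val, str):
        return f'"{val}"'
    return repr(val)

def _predicate(col, op, val):
    # Expand membership operators into comparisons the expression grammar already has.
    if op in ("in", "not in"):
        values = list(val)
        if not values:
            raise ValueError(f"empty value list for {op!r} on column {col!r}")
        cmp_op, joiner = ("==", " or ") if op == "in" else ("!=", " and ")
        return "(" + joiner.join(f"{col} {cmp_op} {_literal(v)}" for v in values) + ")"
    return f"({col} {op} {_literal(val)})"

def filters_to_expr(filters):
    """Convert DNF filters (a list of tuples, or a list of lists of tuples) to a string."""
    if not isinstance(filters, list) or len(filters) == 0:
        raise ValueError("Invalid filters")
    if isinstance(filters[0], tuple):
        filters = [filters]  # a flat list is a single conjunction
    return " or ".join(
        "(" + " and ".join(_predicate(*pred) for pred in conj) + ")"
        for conj in filters
    )

filters = [
    [("a", ">", 7), ("b", ">", 15)],
    [("a", "<", 2)],
]
expr = filters_to_expr(filters)
print(expr)  # ((a > 7) and (b > 15)) or ((a < 2))
# df_filtered = df.query(expr)  # assumption: query accepts this expression
```

For example, `filters_to_expr([("a", "in", [1, 2])])` yields `((a == 1 or a == 2))`, and a not in filter over strings yields ANDed `!=` comparisons with double-quoted values.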

GregoryKimball · 2023-05-17T20:14:46Z

For the parquet reader, this issue was addressed in #13334. We still need to verify/modify the ORC reader.

ayushdg added the bug ("Something isn't working") and Needs Triage ("Need team to review and classify") labels Jan 10, 2023

GregoryKimball added this to the Parquet continuous improvement milestone Apr 2, 2023

GregoryKimball added the libcudf label ("Affects libcudf (C++/CUDA) code.") Apr 2, 2023

GregoryKimball added this to libcudf Apr 20, 2023

GregoryKimball moved this to Needs owner in libcudf Apr 20, 2023

GregoryKimball mentioned this issue May 11, 2023: Add row-wise filtering step to read_parquet #13334 (Merged)

GregoryKimball modified the milestones: Parquet continuous improvement, ORC continuous improvement Jun 6, 2023

GregoryKimball mentioned this issue Jun 7, 2023: [FEA] Rename filters= argument to row_group_filters= in read_parquet and read_orc and provide examples that show its use #13370 (Open)

GregoryKimball mentioned this issue Sep 10, 2023: [FEA] Improve ORC reader filtering and performance #13882 (Open)

GregoryKimball removed the status in libcudf Sep 25, 2023

GregoryKimball removed this from libcudf Oct 26, 2023

vyasr added this to cuDF Python Nov 5, 2024

github-project-automation bot moved this to Todo in cuDF Python Nov 5, 2024
