Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] read_parquet/read_orc with filters do not filter specific rows #12512

Open
ayushdg opened this issue Jan 10, 2023 · 2 comments
Open

[BUG] read_parquet/read_orc with filters do not filter specific rows #12512

ayushdg opened this issue Jan 10, 2023 · 2 comments
Labels
0 - Backlog In queue waiting for assignment bug Something isn't working cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.

Comments

@ayushdg
Copy link
Member

ayushdg commented Jan 10, 2023

Describe the bug
When using cudf.read_parquet or read_orc with the filters argument to filter out rows based on certain predicates, the methods today just filter out reading row groups (or stripes) that can be completely eliminated based on the given condition, but does return all rows from the read row groups without applying the given filters again. This behavior can be confusing to users assuming that all the relevant data has already been filtered out and is contrary to how dask, dask-cuDF and PyArrow behave today.

Example:

Data:

Col Name: A
Row Group 0: 1,5,1
Row Group 1: 5,5,5
cudf.read_parquet("data", filters=[('a','!=',5)])

Would return 1 , 5, 1 which is all elements from RG0 (RG1 gets filtered out).
Expected output would be 1,1

Steps/Code to reproduce bug

df = cudf.DataFrame()

In [6]: df["a"] = [1,5]*2500 + [5]*5000

In [7]: df.to_parquet("rg_test.parquet", row_group_size_rows=5000)

In [8]: cudf.read_parquet("rg_test.parquet")
[10000 rows x 1 columns]

In [9]: cudf.read_parquet("rg_test.parquet", filters=[("a", "!=", 5)])
[5000 rows x 1 columns]

Expected behavior
The 5's from row group 0 also get filtered returning only 1's, which is inline with how pyarrow, dask/dask-cudf return return the result.

Environment overview (please complete the following information)

  • Environment location: bare-metal
  • Method of cuDF install: conda
    • If method of install is [Docker], provide docker pull & docker run commands used

Environment details
Please run and paste the output of the cudf/print_env.sh script here, to gather any other relevant environment details

Additional context
Add any other context about the problem here.

@ayushdg ayushdg added bug Something isn't working Needs Triage Need team to review and classify labels Jan 10, 2023
@GregoryKimball GregoryKimball added feature request New feature or request 0 - Backlog In queue waiting for assignment libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. cuIO cuIO issue and removed Needs Triage Need team to review and classify libcudf Affects libcudf (C++/CUDA) code. labels Apr 2, 2023
@GregoryKimball GregoryKimball added the libcudf Affects libcudf (C++/CUDA) code. label Apr 2, 2023
@GregoryKimball
Copy link
Contributor

GregoryKimball commented Apr 19, 2023

Hello I would like to update this issue now that we have support for libcudf ASTs in cuDF's DataFrame.query. I propose that we add a filtering step to cudf.read_parquet if the filters argument is present, similar to the following approach:

df = cudf.DataFrame({'a': range(10), 'b': range(10,20)})
df.to_parquet('test.parquet')
filters = [
    [('a', '>', 7),('b', '>', 15)],
    [('a', '<', 2)],
]
df = cudf.read_parquet('test.parquet', filters=filters)

assert isinstance(filters, list) and len(filters) > 0, "Invalid filters"
if isinstance(filters[0], tuple):
    filters = [filters]
expr = ' or '.join([f'(({") and (".join([f"{col} {o} {val}" for col, o, val in f])}))' for f in filters])
df_filtered = df.query(expr)
   a   b
0  0  10
1  1  11
8  8  18
9  9  19

Edit: now that we have string scalar support in libcudf ASTs we might want to add a pattern for double-quoting string values

See pyarrow.parquet.read_table for more information about the grammar of filters. The grammar is single-column disjunctive normal form (DNF) and a subset of what ASTs can represent. We are missing in and not in operators but these could be converted to ANDed == or !=.

@GregoryKimball
Copy link
Contributor

For the parquet reader, this issue was addressed in #13334. We still need to verify/modify the ORC reader.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0 - Backlog In queue waiting for assignment bug Something isn't working cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.
Projects
Status: Todo
Development

No branches or pull requests

2 participants