Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect results with parquet filtering pushdown enabled #4005

Closed
Tracked by #3463 ...
alamb opened this issue Oct 28, 2022 · 3 comments · Fixed by #4021
Closed
Tracked by #3463 ...

Incorrect results with parquet filtering pushdown enabled #4005

alamb opened this issue Oct 28, 2022 · 3 comments · Fixed by #4021
Labels
bug Something isn't working

Comments

@alamb
Copy link
Contributor

alamb commented Oct 28, 2022

Describe the bug
DataFusion gets different answers when parquet pushdown is enabled

NOTE that pushdown filtering is not enabled by default (as we are still working on it) so this issue will not likely affect users:

To Reproduce

  1. Download data from
    repro.zip
  2. Run datafusion CLI

The query run is

select count(*) from foo where container = 'backend_container_0' OR pod = 'aqcathnxqsphdhgjtgvxsfyiwbmhlmg';

Expected behavior
Same answer should be produced with and without page index filtering enabled. However, the answers are different

Without filter pushdown 39982 rows are produced

$ DATAFUSION_EXECUTION_PARQUET_PUSHDOWN_FILTERS=false datafusion-cli -f script.sql
...
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 39982           |
+-----------------+

With it enabled:

DATAFUSION_EXECUTION_PARQUET_PUSHDOWN_FILTERS=true datafusion-cli -f script.sql
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 0               |
+-----------------+
1 row in set. Query took 0.004 seconds.

Additional context
Found by the test here #3976

@tustvold
Copy link
Contributor

tustvold commented Oct 29, 2022

I wonder if this relates to pushing down an empty projection, it is possible there is a bug there, I don't remember seeing a test in arrow-rs for this.

Edit: Sadly not 😢

Edit Edit: I think I have found it

@tustvold
Copy link
Contributor

Ok so the bug is that DatafusionArrowPredicate is assuming that ProjectionMask is order preserving, and so is not remapping the columns before passing them to PhysicalExpr. In reality ProjectionMask is not order preserving, and so the batch passed to ArrowPredicate::evaluate has columns in the order of the file's schema. In the case of this query, this results in it evaluating the predicates against the wrong columns, resulting in no rows that pass the predicate.

You can see this, by reordering the predicates in the query

select count(*) from foo where pod = 'aqcathnxqsphdhgjtgvxsfyiwbmhlmg' OR container = 'backend_container_0';
+-----------------+
| COUNT(UInt8(1)) |
+-----------------+
| 39982           |
+-----------------+
1 row in set. Query took 0.107 seconds.

This is likely also the issue behind #4006

@alamb
Copy link
Contributor Author

alamb commented Oct 31, 2022

Thank you for debugging / fixing this @tustvold

jimexist pushed a commit to jimexist/arrow-datafusion that referenced this issue Oct 31, 2022
…edicate (apache#4005) (apache#4006) (apache#4021)

* Project columns within DatafusionArrowPredicate (apache#4005) (apache#4006)

* Add test

* Format

* Fix merge blunder

Co-authored-by: Andrew Lamb <[email protected]>
Dandandan pushed a commit to yuuch/arrow-datafusion that referenced this issue Nov 5, 2022
…edicate (apache#4005) (apache#4006) (apache#4021)

* Project columns within DatafusionArrowPredicate (apache#4005) (apache#4006)

* Add test

* Format

* Fix merge blunder

Co-authored-by: Andrew Lamb <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
2 participants