Regression: bloom filters are not being used in Parquet queries #8685

Closed
alamb opened this issue Dec 30, 2023 · 7 comments · Fixed by #8732
Labels
bug Something isn't working

Comments

@alamb
Contributor

alamb commented Dec 30, 2023

> How do you know the bloom filter isn't being used? Is there a reproducer (a parquet file) you can share?

It appears that there is no good way to know if the bloom filter code is working via logging or metrics 🤔

https://github.com/apache/arrow-datafusion/blob/f39c040ace0b34b0775827907aa01d6bb71cbb14/datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs#L111-L168

I conducted a test locally by writing 200GB of data. When using a Bloom filter for queries, I observed that the query only takes 0.1 seconds, whereas without using the Bloom filter, the query takes 1 second. If a query takes 1 second, I can infer that it is not using the Bloom filter because using the Bloom filter should yield results within 0.1 seconds.

Originally posted by @domyway in #8436 (comment)
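
The timing argument above depends on the reader being able to skip whole row groups when the filter reports that a value is definitely absent. Below is a minimal sketch of that pruning decision, assuming a plain Bloom filter built on SHA-256; Parquet's actual filter is a split-block variant with a different hash, and the names `ToyBloomFilter` and `can_prune` as well as the stored sample values are hypothetical, for illustration only.

```python
import hashlib


class ToyBloomFilter:
    """A plain Bloom filter (illustrative; not Parquet's split-block format)."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, value: str):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def can_prune(row_group_filter: ToyBloomFilter, in_list: list) -> bool:
    """A row group can be skipped only if no IN-list value might be present."""
    return not any(row_group_filter.might_contain(v) for v in in_list)


# Hypothetical row group contents.
row_group = ToyBloomFilter()
for stored in ["a", "b", "c"]:
    row_group.add(stored)

# "a" was inserted, so the filter cannot rule it out and the group must be read.
print(can_prune(row_group, ["a"]))  # → False
```

A Bloom filter never gives false negatives, so pruning is safe; it can give false positives, in which case the row group is read unnecessarily but the result is still correct.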

@alamb
Contributor Author

alamb commented Dec 30, 2023

I think the next step to proceed here would be to get some sort of reproducer so we can debug further.

alamb added the bug label on Dec 30, 2023
@my-vegetable-has-exploded
Contributor

I tried to add a more detailed metric for bloom filters. The code is here: https://github.com/apache/arrow-datafusion/compare/main...my-vegetable-has-exploded:arrow-datafusion:metric-sbbf?expand=1. It works well in unit tests, but when I build datafusion-cli, it fails to execute the EXPLAIN ANALYZE command:

❯ EXPLAIN ANALYZE SELECT * FROM taxi WHERE (taxi."String" IN ('a'));
Internal error: Optimization not supported for ANALYZE.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

@alamb
Contributor Author

alamb commented Dec 30, 2023

> it works well on unit tests. But when I build datafusion-cli, it fails to execute EXPLAIN command.

I filed #8690 to track

@alamb
Contributor Author

alamb commented Jan 2, 2024

> it works well on unit tests. But when I build datafusion-cli, it fails to execute EXPLAIN command.
>
> I filed #8690 to track

The issue has been fixed now

@my-vegetable-has-exploded
Contributor

Good catch, @domyway.

@alamb
Contributor Author

alamb commented Jan 3, 2024

For anyone following along, the fix is #8732

alamb changed the title from "Report that bloom filters are not being used in Parquet queries" to "Regression: bloom filters are not being used in Parquet queries" on Jan 3, 2024
@my-vegetable-has-exploded
Contributor

Hi @domyway, you can now check whether the bloom filter is working by looking at the row_groups_pruned_bloom_filter metric in the EXPLAIN ANALYZE output.
In my environment, the bloom filter works:

❯ CREATE EXTERNAL TABLE taxi
STORED AS PARQUET
LOCATION '/home/deepin/rust/arrow-datafusion/parquet-testing/data/data_index_bloom_encoding_stats.parquet'
;
0 rows in set. Query took 0.002 seconds.

❯ SET datafusion.execution.parquet.bloom_filter_enabled to true;
0 rows in set. Query took 0.001 seconds.

❯ EXPLAIN ANALYZE SELECT * FROM taxi WHERE (taxi."String" IN ('bb', 'bbc', 'bba', 'bbd', 'bbg', 'bbf', 'bbn', 'nnfa', 'bbnfd', 'bbx', 'bbxda', 'badfas', 'afd', 'adfas', 'adfa', 'asdfer', 'sefarj', 'erseioio', 'uioosdf', '0ba24'));
....row_groups_pruned_bloom_filter=1, .....

Thank you for finding it.
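
For scripted checks, the metric can be pulled out of the EXPLAIN ANALYZE text instead of being read by eye. This is a minimal sketch, assuming the plan text contains the metric in the `name=value` form with a plain integer, as in the output above; the helper name `bloom_filter_pruned_count` is hypothetical, and the sample string is just the elided line from the session above.

```python
import re
from typing import Optional


def bloom_filter_pruned_count(explain_output: str) -> Optional[int]:
    """Sum every row_groups_pruned_bloom_filter=N occurrence in the plan text.

    Returns None when the metric does not appear at all, which is different
    from the metric being present with a value of 0.
    """
    counts = re.findall(r"\brow_groups_pruned_bloom_filter=(\d+)", explain_output)
    return sum(int(c) for c in counts) if counts else None


# The elided metric line from the session above, used as sample input.
sample = "....row_groups_pruned_bloom_filter=1, ....."
print(bloom_filter_pruned_count(sample))  # → 1
```

A result of None suggests the build does not report the metric, while 0 means the metric is reported but no row group was pruned by a bloom filter for that query.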
