Convert row filter to arrow filter #265

Closed
Tracked by #153
liurenjie1024 opened this issue Mar 14, 2024 · 5 comments · Fixed by #295

liurenjie1024 (Contributor) commented Mar 14, 2024

The parquet crate supports pushing a filter down into the file reader: https://arrow.apache.org/rust/parquet/arrow/arrow_reader/type.ParquetRecordBatchReaderBuilder.html

We should convert our row filter into an arrow RowFilter so that we can avoid reading as much data as possible.
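A minimal sketch of what this could look like against a recent parquet crate (the file path, leaf column index, and predicate value are placeholders, not iceberg-rust code); the async ParquetRecordBatchStreamBuilder exposes the same with_row_filter hook:

```rust
use std::fs::File;

use arrow::array::{Scalar, StringArray};
use arrow::compute::kernels::cmp::eq;
use parquet::arrow::arrow_reader::{ArrowPredicateFn, ParquetRecordBatchReaderBuilder, RowFilter};
use parquet::arrow::ProjectionMask;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // "data.parquet" and leaf column index 0 are placeholders.
    let file = File::open("data.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // Project only the column the predicate touches, so only it is decoded
    // for predicate evaluation.
    let mask = ProjectionMask::leaves(builder.parquet_schema(), [0]);

    // The predicate receives batches containing just the projected column and
    // returns a BooleanArray marking the rows that survive.
    let wanted = Scalar::new(StringArray::from_iter_values(["KR"]));
    let predicate = ArrowPredicateFn::new(mask, move |batch| eq(batch.column(0), &wanted));

    let reader = builder
        .with_row_filter(RowFilter::new(vec![Box::new(predicate)]))
        .build()?;

    for batch in reader {
        println!("{} rows", batch?.num_rows());
    }
    Ok(())
}
```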

viirya (Member) commented Mar 19, 2024

I'll look into this.

a-agmon (Contributor) commented Mar 21, 2024

Hi @viirya
Perhaps a bit off-topic, but I'm wondering what you think.
I have been testing this a bit, and while I have always seen performance improvements from using ParquetRecordBatchStream over ParquetRecordBatchReader, the benefit of using RowFilter really depended on the predicate and the data. Sometimes it even had a negative impact on performance (even compared to the non-async reader); that seemed to happen when filtering for very "common" values.
Is there any conventional wisdom about when it shouldn't be used?

viirya (Member) commented Mar 21, 2024

Hmm, I wonder if the filtering itself costs too much time on these so-called common values? Is the filter predicate very complicated? Normally I'd expect filtering during the scan to boost performance. In Spark, I don't remember seeing cases where a scan without predicate pushdown performs better than one with it.

Is there a specific filter predicate causing this?

a-agmon (Contributor) commented Mar 21, 2024

Perhaps I am missing something, but I was running a simple test on a small Parquet file (65MB) with a simple predicate (on a country-code column).
These are the results I saw:

Predicate KR - row count: 12660 with_filter: true => time taken: 656.518875ms
Predicate KR - row count: 12660 with_filter: false => time taken: 844.822917ms
Predicate US - row count: 158015 with_filter: true => time taken: 1.085824833s
Predicate US - row count: 158015 with_filter: false => time taken: 862.845125ms

As you can see, when the value is "less common" (as with the KR predicate), where I guess skipping is beneficial, the row filter improves performance. But when the predicate value is very common (as with the US predicate), and I guess matches rows in almost every batch, the row filter actually has a negative impact.
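For context, a hedged sketch of this kind of with/without comparison using the async reader (the file name, leaf column index, and tokio/futures scaffolding are assumptions, not the exact harness used here):

```rust
use std::time::Instant;

use arrow::array::{Scalar, StringArray};
use arrow::compute::kernels::cmp::eq;
use futures::StreamExt;
use parquet::arrow::arrow_reader::{ArrowPredicateFn, RowFilter};
use parquet::arrow::{ParquetRecordBatchStreamBuilder, ProjectionMask};

async fn scan(country: &str, with_filter: bool) -> Result<usize, Box<dyn std::error::Error>> {
    // "events.parquet" and leaf index 0 stand in for the real file and column.
    let file = tokio::fs::File::open("events.parquet").await?;
    let mut builder = ParquetRecordBatchStreamBuilder::new(file).await?;

    if with_filter {
        let mask = ProjectionMask::leaves(builder.parquet_schema(), [0]);
        let wanted = Scalar::new(StringArray::from_iter_values([country]));
        let predicate = ArrowPredicateFn::new(mask, move |batch| eq(batch.column(0), &wanted));
        builder = builder.with_row_filter(RowFilter::new(vec![Box::new(predicate)]));
    }

    let start = Instant::now();
    let mut stream = builder.build()?;
    let mut rows = 0;
    while let Some(batch) = stream.next().await {
        // In the no-pushdown case the predicate would be applied here instead,
        // e.g. with arrow::compute::filter_record_batch.
        rows += batch?.num_rows();
    }
    println!("with_filter: {with_filter} => time taken: {:?}", start.elapsed());
    Ok(rows)
}
```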

liurenjie1024 (Contributor, Author) commented
I think this depends on the selectivity, and also on the implementation. To achieve the best performance, the scan reader needs to perform vectorized execution, converting the filter into a selection vector (or visibility bitmap). I'm not 100% sure how the parquet reader achieves this, but the interface suggests it compares values one by one, which may actually be slow.
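As a toy illustration of the selection-vector idea (standalone arrow code, not the parquet reader's internals): evaluate the predicate over the whole column in one kernel call to produce a boolean mask, then apply the mask in a single vectorized pass rather than testing rows individually:

```rust
use std::sync::Arc;

use arrow::array::{Int32Array, Scalar, StringArray};
use arrow::compute::{filter_record_batch, kernels::cmp::eq};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("country", DataType::Utf8, false),
        Field::new("clicks", DataType::Int32, false),
    ]));
    let batch = RecordBatch::try_new(
        schema,
        vec![
            Arc::new(StringArray::from(vec!["KR", "US", "KR", "DE"])),
            Arc::new(Int32Array::from(vec![1, 2, 3, 4])),
        ],
    )?;

    // The selection vector / visibility bitmap: one boolean per row, computed
    // over the whole column in one vectorized kernel call.
    let mask = eq(batch.column(0), &Scalar::new(StringArray::from_iter_values(["KR"])))?;
    let selected = filter_record_batch(&batch, &mask)?;
    assert_eq!(selected.num_rows(), 2);
    Ok(())
}
```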
