Convert row filter to arrow filter #265

The parquet crate supports pushing filters down into the file reader: https://arrow.apache.org/rust/parquet/arrow/arrow_reader/type.ParquetRecordBatchReaderBuilder.html
We should convert our row filter to an arrow row filter so that we can avoid reading as much data as possible.
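For concreteness, a minimal sketch of what that pushdown could look like with the arrow_reader API linked above. The file name, the leaf index, and the "KR" predicate are all placeholders, not part of this issue:

```rust
use std::fs::File;

use arrow::array::{Array, BooleanArray, StringArray};
use parquet::arrow::arrow_reader::{
    ArrowPredicateFn, ParquetRecordBatchReaderBuilder, RowFilter,
};
use parquet::arrow::ProjectionMask;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("data.parquet")?; // hypothetical file
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // Only decode the column(s) the predicate needs; leaf index 0 is a
    // placeholder for the filtered column's position in the schema.
    let mask = ProjectionMask::leaves(builder.parquet_schema(), [0]);

    // The predicate sees each batch of the projected columns and returns a
    // boolean selection; rows it rejects are never materialized for the
    // remaining columns.
    let predicate = ArrowPredicateFn::new(mask, |batch| {
        let col = batch
            .column(0)
            .as_any()
            .downcast_ref::<StringArray>()
            .expect("expected a utf8 column");
        Ok(BooleanArray::from_iter(
            col.iter().map(|v| Some(v == Some("KR"))),
        ))
    });

    let reader = builder
        .with_row_filter(RowFilter::new(vec![Box::new(predicate)]))
        .build()?;

    for batch in reader {
        println!("{} rows after pushdown", batch?.num_rows());
    }
    Ok(())
}
```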
Comments
I'll look into this.
Hi @viirya
Hmm, I wonder if the filtering takes too much time on these so-called common values? Is the filter predicate very complicated? Normally I'd expect filtering during the scan to boost performance. In Spark, I don't remember seeing similar cases where a non-pushdown scan performs better than the pushdown case. Is a specific filter predicate causing that?
Perhaps I am missing something, but I was running this simple test on a small parquet file (65 MB) with a simple predicate (on a country-code column).
As you can see, when the values are "less common" (as with the KR predicate), where I guess skipping is beneficial, the row filter improves performance. But when the predicate value is very common (as with the US predicate), where I guess it matches in almost every batch, the row filter in fact has a negative impact.
I think this depends on the selectivity, and also on the implementation. To achieve the best performance, the scan reader needs to perform vectorized execution, converting the filter into a selection vector (or visibility bitmap). I'm not 100% sure how the parquet reader achieves this, but the interface suggests it compares values one by one? That may actually be slow.
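As an illustration of that vectorized shape, a minimal sketch using the arrow crate's filter kernel: the predicate is evaluated once per batch into a boolean selection vector, then applied to all columns at once. The schema and data here are made up:

```rust
use std::sync::Arc;

use arrow::array::{Array, BooleanArray, Int32Array, StringArray};
use arrow::compute::filter_record_batch;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), ArrowError> {
    // A toy batch with a country-code column, standing in for one
    // RecordBatch produced by the parquet scan.
    let schema = Arc::new(Schema::new(vec![
        Field::new("country", DataType::Utf8, false),
        Field::new("value", DataType::Int32, false),
    ]));
    let batch = RecordBatch::try_new(
        schema,
        vec![
            Arc::new(StringArray::from(vec!["US", "KR", "US", "DE"])),
            Arc::new(Int32Array::from(vec![1, 2, 3, 4])),
        ],
    )?;

    // Build the selection vector (visibility bitmap) in one pass over the
    // column rather than driving control flow per row.
    let country = batch
        .column(0)
        .as_any()
        .downcast_ref::<StringArray>()
        .expect("expected a utf8 column");
    let mask = BooleanArray::from_iter(country.iter().map(|v| Some(v == Some("KR"))));

    // Apply the mask to every column of the batch at once.
    let filtered = filter_record_batch(&batch, &mask)?;
    assert_eq!(filtered.num_rows(), 1);
    Ok(())
}
```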