Convert row filter to arrow filter #265

Closed
Tracked by #153
liurenjie1024 opened this issue Mar 14, 2024 · 5 comments · Fixed by #295

liurenjie1024 (Contributor) commented Mar 14, 2024

The parquet crate supports pushing a filter down into the file reader: https://arrow.apache.org/rust/parquet/arrow/arrow_reader/type.ParquetRecordBatchReaderBuilder.html

We should convert our row filter into an arrow RowFilter so that we can avoid reading as much data as possible.
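A minimal sketch of what this could look like against a recent parquet crate (the file path, leaf column index, and predicate value are placeholders, not iceberg-rust code); the async ParquetRecordBatchStreamBuilder exposes the same with_row_filter hook:

```rust
use std::fs::File;

use arrow::array::{Scalar, StringArray};
use arrow::compute::kernels::cmp::eq;
use parquet::arrow::arrow_reader::{ArrowPredicateFn, ParquetRecordBatchReaderBuilder, RowFilter};
use parquet::arrow::ProjectionMask;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // "data.parquet" and leaf column index 0 are placeholders.
    let file = File::open("data.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // Project only the column the predicate touches, so only it is decoded
    // for predicate evaluation.
    let mask = ProjectionMask::leaves(builder.parquet_schema(), [0]);

    // The predicate receives batches containing just the projected column and
    // returns a BooleanArray marking the rows that survive.
    let wanted = Scalar::new(StringArray::from_iter_values(["KR"]));
    let predicate = ArrowPredicateFn::new(mask, move |batch| eq(batch.column(0), &wanted));

    let reader = builder
        .with_row_filter(RowFilter::new(vec![Box::new(predicate)]))
        .build()?;

    for batch in reader {
        println!("{} rows", batch?.num_rows());
    }
    Ok(())
}
```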

viirya (Member) commented Mar 19, 2024

I'll look into this.

a-agmon (Contributor) commented Mar 21, 2024

Hi @viirya
Perhaps a bit off-topic, but I'm wondering what you think.
I have been testing this a bit, and while I have always seen performance improvements from using ParquetRecordBatchStream over ParquetRecordBatchReader, the benefit of using RowFilter really depended on the predicate and the data. Sometimes it even had a negative impact on performance (even compared to the non-async reader); that seemed to happen when filtering for very "common" values.
Is there any conventional wisdom about when it shouldn't be used?

viirya (Member) commented Mar 21, 2024

Hmm, I wonder if the filtering itself costs too much time on these so-called common values? Is the filter predicate very complicated? Normally I'd expect filtering during the scan to boost performance. In Spark, I don't remember seeing cases where a scan without predicate pushdown performs better than one with it.

Is there a specific filter predicate causing this?

a-agmon (Contributor) commented Mar 21, 2024

Perhaps I am missing something, but I was running a simple test on a small Parquet file (65MB) with a simple predicate (on a country-code column).
These are the results I saw:

Predicate KR - row count: 12660 with_filter: true => time taken: 656.518875ms
Predicate KR - row count: 12660 with_filter: false => time taken: 844.822917ms
Predicate US - row count: 158015 with_filter: true => time taken: 1.085824833s
Predicate US - row count: 158015 with_filter: false => time taken: 862.845125ms

As you can see, when the value is "less common" (as with the KR predicate), where I guess skipping is beneficial, the row filter improves performance. But when the predicate value is very common (as with the US predicate), and I guess matches rows in almost every batch, the row filter actually has a negative impact.
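For context, a hedged sketch of this kind of with/without comparison using the async reader (the file name, leaf column index, and tokio/futures scaffolding are assumptions, not the exact harness used here):

```rust
use std::time::Instant;

use arrow::array::{Scalar, StringArray};
use arrow::compute::kernels::cmp::eq;
use futures::StreamExt;
use parquet::arrow::arrow_reader::{ArrowPredicateFn, RowFilter};
use parquet::arrow::{ParquetRecordBatchStreamBuilder, ProjectionMask};

async fn scan(country: &str, with_filter: bool) -> Result<usize, Box<dyn std::error::Error>> {
    // "events.parquet" and leaf index 0 stand in for the real file and column.
    let file = tokio::fs::File::open("events.parquet").await?;
    let mut builder = ParquetRecordBatchStreamBuilder::new(file).await?;

    if with_filter {
        let mask = ProjectionMask::leaves(builder.parquet_schema(), [0]);
        let wanted = Scalar::new(StringArray::from_iter_values([country]));
        let predicate = ArrowPredicateFn::new(mask, move |batch| eq(batch.column(0), &wanted));
        builder = builder.with_row_filter(RowFilter::new(vec![Box::new(predicate)]));
    }

    let start = Instant::now();
    let mut stream = builder.build()?;
    let mut rows = 0;
    while let Some(batch) = stream.next().await {
        // In the no-pushdown case the predicate would be applied here instead,
        // e.g. with arrow::compute::filter_record_batch.
        rows += batch?.num_rows();
    }
    println!("with_filter: {with_filter} => time taken: {:?}", start.elapsed());
    Ok(rows)
}
```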

liurenjie1024 (Contributor, Author) commented
I think this depends on the selectivity, and also on the implementation. To achieve the best performance, the scan reader needs to perform vectorized execution, converting the filter into a selection vector (or visibility bitmap). I'm not 100% sure how the parquet reader achieves this, but the interface suggests it compares values one by one, which may actually be slow.
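As a toy illustration of the selection-vector idea (standalone arrow code, not the parquet reader's internals): evaluate the predicate over the whole column in one kernel call to produce a boolean mask, then apply the mask in a single vectorized pass rather than testing rows individually:

```rust
use std::sync::Arc;

use arrow::array::{Int32Array, Scalar, StringArray};
use arrow::compute::{filter_record_batch, kernels::cmp::eq};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![
        Field::new("country", DataType::Utf8, false),
        Field::new("clicks", DataType::Int32, false),
    ]));
    let batch = RecordBatch::try_new(
        schema,
        vec![
            Arc::new(StringArray::from(vec!["KR", "US", "KR", "DE"])),
            Arc::new(Int32Array::from(vec![1, 2, 3, 4])),
        ],
    )?;

    // The selection vector / visibility bitmap: one boolean per row, computed
    // over the whole column in one vectorized kernel call.
    let mask = eq(batch.column(0), &Scalar::new(StringArray::from_iter_values(["KR"])))?;
    let selected = filter_record_batch(&batch, &mask)?;
    assert_eq!(selected.num_rows(), 2);
    Ok(())
}
```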
