-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update to arrow/parquet 11.0 #2048
Conversation
I think this will need some of the changes in #1990 to |
Yes, I've introduced a parquet row group filtering API change in parquet 11. I can port that part from #1990 to your branch. |
@@ -262,7 +262,7 @@ async fn prune_int32_scalar_fun() { | |||
println!("{}", output.description()); | |||
// This should prune out groups with error, because there is not col to | |||
// prune the row groups. | |||
assert_eq!(output.predicate_evaluation_errors(), Some(1)); | |||
assert_eq!(output.predicate_evaluation_errors(), Some(4)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are evaluating the filter for each row group now. I think it's an expected change for the number of evaluation errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I agree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the changes to the parquet reader look good to me @yjshen -- thank you. I can't really approve my own PR but I suppose I'll leave this one up until tomorrow to see if there is any more feedback
Otherwise we can merge it in
🚀
row_group_metadata, | ||
parquet_schema, | ||
}; | ||
let predicate_values = pruning_predicate.prune(&pruning_stats); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is probably some overhead here related to calling prune
once per row group vs calling it once per file, but I think it will be ok and we can further optimize it in the future if it shows up in traces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah... I just stumbled across this whilst updating #1617 - in IOx we found the prune method had non-trivial overheads when run in a non-columnar fashion as this is doing. Admittedly that was likely with more containers than there are likely to be row groups in a file.
I do wonder if we need to take a step back from extending the parquet arrow-rs interface, and take a more holistic look at what the desired end-state should be. I worry a bit that we're painting ourselves into a corner, I'll see if I can get my thoughts into some tickets
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about we change ReadOptions like:
pub struct ReadOptions {
predicates: Vec<Box<dyn Fn(&[RowGroupMetaData]) -> vec<bool>>>,
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would definitely be one option, but I'm not sure why it needs to be lazy. SerializedFileReader
already exposes the ParquetMetadata
which in turn exposes the [RowGroupMetaData]
. Why wouldn't the caller just specify the row groups to scan, much like it specifies the column indexes for a projection? Would this not be both simpler and more flexible?
@@ -262,7 +262,7 @@ async fn prune_int32_scalar_fun() { | |||
println!("{}", output.description()); | |||
// This should prune out groups with error, because there is not col to | |||
// prune the row groups. | |||
assert_eq!(output.predicate_evaluation_errors(), Some(1)); | |||
assert_eq!(output.predicate_evaluation_errors(), Some(4)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I agree.
$self | ||
.row_group_metadata | ||
.column(column_index) | ||
.statistics() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Update datafusion to latest arrow and parquet release to unblock things like #1990 from @yjshen