Update to arrow/parquet 11.0 #2048

alamb · 2022-03-21T14:05:13Z

Update datafusion to latest arrow and parquet release to unblock things like #1990 from @yjshen

alamb · 2022-03-21T14:09:42Z

I think this will need some of the changes in #1990 to datafusion/src/physical_plan/file_format/parquet.rs in order to compile. I'll try to get around to it later this week if no one else beats me to it.

yjshen · 2022-03-21T14:13:14Z

Yes, I've introduced a parquet row group filtering API change in parquet 11. I can port that part from #1990 to your branch.

yjshen · 2022-03-21T15:19:50Z

datafusion/tests/parquet_pruning.rs

@@ -262,7 +262,7 @@ async fn prune_int32_scalar_fun() {
    println!("{}", output.description());
    // This should prune out groups with error, because there is not col to
    // prune the row groups.
-    assert_eq!(output.predicate_evaluation_errors(), Some(1));
+    assert_eq!(output.predicate_evaluation_errors(), Some(4));


We are evaluating the filter for each row group now. I think it's an expected change for the number of evaluation errors.

yeah, I agree.
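A minimal sketch of why the count moves from 1 to 4, assuming the test file has four row groups and the predicate can never be evaluated from statistics. `RowGroupStats`, `evaluate`, and `prune_per_row_group` are hypothetical stand-ins for illustration, not the DataFusion or parquet APIs:

```rust
/// Hypothetical stand-in for one row group's min/max statistics.
struct RowGroupStats {
    min: i32,
    max: i32,
}

/// Hypothetical predicate `col = 0` evaluated against statistics.
/// `Err` models a predicate that cannot be evaluated (no usable column).
fn evaluate(stats: &RowGroupStats, evaluable: bool) -> Result<bool, String> {
    if !evaluable {
        return Err("no column statistics to prune with".to_string());
    }
    Ok(stats.min <= 0 && 0 <= stats.max)
}

/// Evaluate the predicate once per row group. On error the row group is
/// kept (it cannot be pruned safely) and the error counter is bumped,
/// so a file with N row groups can report up to N errors.
fn prune_per_row_group(groups: &[RowGroupStats], evaluable: bool) -> (Vec<bool>, usize) {
    let mut errors = 0;
    let keep: Vec<bool> = groups
        .iter()
        .map(|g| match evaluate(g, evaluable) {
            Ok(k) => k,
            Err(_) => {
                errors += 1;
                true
            }
        })
        .collect();
    (keep, errors)
}
```

With four row groups and an unevaluable predicate, this sketch keeps all four groups and reports four errors, one per evaluation.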

alamb

the changes to the parquet reader look good to me @yjshen -- thank you. I can't really approve my own PR but I suppose I'll leave this one up until tomorrow to see if there is any more feedback

Otherwise we can merge it in

🚀

alamb · 2022-03-21T18:21:19Z

datafusion/src/physical_plan/file_format/parquet.rs

+                row_group_metadata,
+                parquet_schema,
+            };
+            let predicate_values = pruning_predicate.prune(&pruning_stats);


there is probably some overhead here related to calling prune once per row group vs calling it once per file, but I think it will be ok and we can further optimize it in the future if it shows up in traces.

tustvold · 2022-03-22

Yeah... I just stumbled across this whilst updating #1617. In IOx we found the prune method had non-trivial overhead when run in a non-columnar fashion, as this is doing. Admittedly, that was probably with more containers than there are likely to be row groups in a file.

I do wonder if we need to take a step back from extending the parquet arrow-rs interface and take a more holistic look at what the desired end state should be. I worry a bit that we're painting ourselves into a corner. I'll see if I can get my thoughts into some tickets.
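To make the overhead concern concrete, here is a rough sketch under stated assumptions: `Pruner` is a hypothetical type whose `prune` pays a fixed setup cost per call (standing in for array construction and expression setup). It is not the real `PruningPredicate`. Calling it once per container repeats that setup, while one columnar call pays it once:

```rust
/// Hypothetical pruner that counts how often its per-call setup runs.
struct Pruner {
    threshold: i64,
    setup_calls: usize,
}

impl Pruner {
    /// One call handles a whole batch of containers, given as a
    /// column of per-container max values.
    fn prune(&mut self, maxes: &[i64]) -> Vec<bool> {
        self.setup_calls += 1; // fixed per-call cost, paid regardless of batch size
        maxes.iter().map(|&m| m >= self.threshold).collect()
    }
}

/// Non-columnar use: one `prune` call per container.
fn prune_each(p: &mut Pruner, maxes: &[i64]) -> Vec<bool> {
    maxes.iter().flat_map(|&m| p.prune(&[m])).collect()
}

/// Columnar use: a single `prune` call over all containers.
fn prune_all(p: &mut Pruner, maxes: &[i64]) -> Vec<bool> {
    p.prune(maxes)
}
```

Both paths return the same keep/skip answers; only the number of setup invocations differs, which is why the cost grows with the number of containers.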

yjshen · 2022-03-22

How about we change ReadOptions to something like:

pub struct ReadOptions { predicates: Vec<Box<dyn Fn(&[RowGroupMetaData]) -> Vec<bool>>>, }
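A sketch of how such a batch-style predicate list might be applied, assuming a row group is kept only when every predicate keeps it. The `RowGroupMetaData` here is a hypothetical stand-in with a single field, and `apply` is an illustrative helper, not part of the parquet crate:

```rust
/// Hypothetical stand-in for parquet's RowGroupMetaData.
struct RowGroupMetaData {
    num_rows: i64,
}

/// Mirrors the shape proposed above: each predicate sees all row groups
/// at once and returns one keep/skip flag per row group.
struct ReadOptions {
    predicates: Vec<Box<dyn Fn(&[RowGroupMetaData]) -> Vec<bool>>>,
}

/// Combine the predicates: a row group survives only if all of them keep it.
fn apply(options: &ReadOptions, groups: &[RowGroupMetaData]) -> Vec<bool> {
    let mut keep = vec![true; groups.len()];
    for predicate in &options.predicates {
        for (k, p) in keep.iter_mut().zip(predicate(groups)) {
            *k = *k && p;
        }
    }
    keep
}
```

Because each predicate receives the full slice, it is invoked once per file rather than once per row group.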

tustvold · 2022-03-22

That would definitely be one option, but I'm not sure why it needs to be lazy. SerializedFileReader already exposes the ParquetMetaData, which in turn exposes the [RowGroupMetaData]. Why wouldn't the caller just specify the row groups to scan, much like it specifies the column indexes for a projection? Would this not be both simpler and more flexible?
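The eager alternative could look roughly like this: the caller inspects the metadata up front and passes explicit row group indexes alongside the projection. All names here (`ScanRequest`, `plan_scan`, the `max_value` field) are hypothetical and only illustrate the idea; they are not an existing parquet or DataFusion interface:

```rust
/// Hypothetical stand-in for per-row-group metadata exposed by a file reader.
struct RowGroupMetaData {
    max_value: i64,
}

/// Hypothetical scan request: explicit row group and column indexes.
struct ScanRequest {
    row_groups: Vec<usize>,
    projection: Vec<usize>,
}

/// Caller-side selection: read the metadata eagerly, then name the row
/// groups to scan, the same way a projection names the columns.
fn plan_scan(metadata: &[RowGroupMetaData], min_wanted: i64, projection: Vec<usize>) -> ScanRequest {
    let row_groups = metadata
        .iter()
        .enumerate()
        .filter(|(_, rg)| rg.max_value >= min_wanted)
        .map(|(i, _)| i)
        .collect();
    ScanRequest { row_groups, projection }
}
```

In this shape the reader needs no predicate callbacks at all; any pruning strategy, columnar or otherwise, stays entirely on the caller's side.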

alamb · 2022-03-21T18:22:06Z

yeah, I agree.

Dandandan · 2022-03-21T20:32:02Z

datafusion/src/physical_plan/file_format/parquet.rs

+            $self
+                .row_group_metadata
+                .column(column_index)
+                .statistics()


Update to arrow/parquet 11

dd495cb

github-actions bot added the ballista and datafusion labels Mar 21, 2022

yjshen added 2 commits March 21, 2022 23:05

Adapt to API changes

6f9f646

macro fmt

5ee2d27

yjshen reviewed Mar 21, 2022

alamb mentioned this pull request Mar 21, 2022

Allow register_catalog to return an Error #2051

Closed

alamb commented Mar 21, 2022

This was referenced Mar 21, 2022

[WIP] Arrow (file) datasource #1858

Closed

Add write_ipc to ExecutionContext #1893

Closed

Dandandan reviewed Mar 21, 2022

👍

This was referenced Mar 22, 2022

Update parquet requirement from 10.0 to 11.0 #2055

Closed

Update arrow-flight requirement from 10.0 to 11.0 #2057

Closed

Update arrow requirement from 10.0 to 11.0 #2056

Closed

alamb merged commit 2e6833c into apache:master Mar 22, 2022

alamb deleted the alamb/update_arrow_11 branch March 22, 2022 10:51
