Add metrics for parquet page level skipping #4105

Merged: 4 commits, Nov 7, 2022

Ted-Jiang
Member

Signed-off-by: yangjiang [email protected]

Which issue does this PR close?

Closes #4086.

Rationale for this change

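With this change, ParquetExec reports page-level metrics such as page_index_rows_filtered and page_index_eval_time, for example: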

ParquetExec: limit=None, partitions=[Users/yangjiang/test-data/1g_tpch_pageIndex/lineitem/part-00000-7d2abab2-a301-4452-9f1d-c641e7f15af4-c000.snappy.parquet],
  predicate=l_shipdate_min@0 <= 8037, projection=[l_quantity, l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus, l_shipdate], metrics=[output_rows=1020000, elapsed_compute=1ns, spill_count=0, spilled_bytes=0, mem_used=0, 
  pushdown_rows_filtered{filename=Users/yangjiang/test-data/1g_tpch_pageIndex/lineitem/part-00000-7d2abab2-a301-4452-9f1d-c641e7f15af4-c000.snappy.parquet}=0, 
   page_index_rows_filtered{filename=Users/yangjiang/test-data/1g_tpch_pageIndex/lineitem/part-00000-7d2abab2-a301-4452-9f1d-c641e7f15af4-c000.snappy.parquet}=5000000, predicate_evaluation_errors{filename=Users/yangjiang/test-data/1g_tpch_pageIndex/lineitem/part-00000-7d2abab2-a301-4452-9f1d-c641e7f15af4-c000.snappy.parquet}=0, row_groups_pruned{filename=Users/yangjiang/test-data/1g_tpch_pageIndex/lineitem/part-00000-7d2abab2-a301-4452-9f1d-c641e7f15af4-c000.snappy.parquet}=0, bytes_scanned{filename=Users/yangjiang/test-data/1g_tpch_pageIndex/lineitem/part-00000-7d2abab2-a301-4452-9f1d-c641e7f15af4-c000.snappy.parquet}=13198873, num_predicate_creation_errors=0, time_elapsed_scanning=114.532731ms, time_elapsed_processing=3.561757295s, pushdown_eval_time{filename=Users/yangjiang/test-data/1g_tpch_pageIndex/lineitem/part-00000-7d2abab2-a301-4452-9f1d-c641e7f15af4-c000.snappy.parquet}=2ns, 

   page_index_eval_time{filename=Users/yangjiang/test-data/1g_tpch_pageIndex/lineitem/part-00000-7d2abab2-a301-4452-9f1d-c641e7f15af4-c000.snappy.parquet}=475.372µs, time_elapsed_opening=58.666766ms] |
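To compare metrics such as page_index_rows_filtered across runs, it can help to pull individual values out of a metrics line like the one above. The following is a minimal sketch, not part of this PR: the helper name `metric_value` is hypothetical, and it assumes the `name{labels}=value` layout shown in the output, with values terminated by `,` or `]`.

```rust
/// Hypothetical helper (not a DataFusion API): returns the raw text of the
/// first metric called `name` in a `metrics=[...]` line, skipping an optional
/// `{label=...}` block between the name and the `=` sign.
fn metric_value<'a>(metrics: &'a str, name: &str) -> Option<&'a str> {
    let mut search_from = 0;
    while let Some(pos) = metrics[search_from..].find(name) {
        let start = search_from + pos;
        let end = start + name.len();
        // Only accept a match that starts a metric name, so that
        // "rows_filtered" does not match inside "page_index_rows_filtered".
        let at_boundary = metrics[..start]
            .chars()
            .next_back()
            .map_or(true, |c| !(c.is_alphanumeric() || c == '_'));
        let mut rest = &metrics[end..];
        // Skip the label block, e.g. `{filename=...}`, if one is present.
        if rest.starts_with('{') {
            match rest.find('}') {
                Some(close) => rest = &rest[close + 1..],
                None => return None,
            }
        }
        if at_boundary {
            if let Some(value) = rest.strip_prefix('=') {
                let stop = value
                    .find(|c: char| c == ',' || c == ']')
                    .unwrap_or(value.len());
                return Some(value[..stop].trim());
            }
        }
        search_from = end;
    }
    None
}
```

The value is returned as text because some metrics are counts (`5000000`) and others are durations with units (`475.372µs`); parsing is left to the caller.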

What changes are included in this PR?

Are there any user-facing changes?

When I tried to create a single column with multiple pages, I ran into what I think is a bug in the page size check. However, I found that apache/arrow-rs#2941 adds a page row count check (not yet released).

@Ted-Jiang Ted-Jiang marked this pull request as draft November 4, 2022 10:18
@github-actions github-actions bot added the core Core DataFusion crate label Nov 4, 2022
@Ted-Jiang Ted-Jiang marked this pull request as ready for review November 4, 2022 10:24

let metrics = rt.parquet_exec.metrics().unwrap();

// todo fix this https://github.com/apache/arrow-rs/issues/2941 release change to row limit.
Member Author

https://github.com/apache/arrow-rs/pull/2942/files#r1013838557
I think there is a bug in should_add_data_page: self.encoder.num_values() is always zero.
So no matter how data_pagesize_limit and write_batch_size are set, the writer always produces 1 page per column chunk.
🤔 I think someone mentioned this before.

Member Author

But with a real-world file, this metric works fine 😂

Contributor

in #4039

@Ted-Jiang Ted-Jiang requested a review from alamb November 4, 2022 10:28
@@ -449,7 +449,7 @@ impl FileOpener for ParquetOpener {
// page index pruning: if all data on individual pages can
// be ruled using page metadata, rows from other columns
// with that range can be skipped as well
- if let Some(row_selection) = enable_page_index
+ if let Some(row_selection) = (enable_page_index && !row_groups.is_empty())
Member Author

@Ted-Jiang Ted-Jiang Nov 4, 2022

When all row groups have already been pruned by the row group metadata (min/max statistics), we skip page index evaluation entirely as a fast path.
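A minimal sketch of this fast path, for illustration only. The diff shows just the condition, so the `.then(..).flatten()` continuation is an assumption, and `RowSelection`, `prune_pages`, and `page_index_selection` are hypothetical stand-ins rather than the real DataFusion or parquet types. The `calls` counter exists only to make the skip observable.

```rust
/// Hypothetical stand-in for the row selection produced by page index pruning.
#[derive(Debug, PartialEq)]
struct RowSelection {
    rows_to_scan: usize,
}

/// Hypothetical stand-in for page index evaluation. It counts how often it
/// runs so the fast path can be observed.
fn prune_pages(row_groups: &[usize], calls: &mut usize) -> Option<RowSelection> {
    *calls += 1;
    Some(RowSelection {
        rows_to_scan: row_groups.len() * 100,
    })
}

/// Evaluate the page index only when it is enabled AND at least one row group
/// survived min/max pruning; otherwise return `None` without doing any work.
fn page_index_selection(
    enable_page_index: bool,
    row_groups: &[usize],
    calls: &mut usize,
) -> Option<RowSelection> {
    (enable_page_index && !row_groups.is_empty())
        .then(|| prune_pages(row_groups, calls))
        .flatten()
}
```

Because `bool::then` runs its closure only when the condition is true, an empty `row_groups` never reaches the page index evaluation.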

Contributor

@alamb alamb left a comment

Looks good to me @Ted-Jiang -- thank you 🎉

Note that apache/arrow-rs#2941 should be included as part of #4039 so hopefully we can fix it soon.

I think it would be fine to either merge this PR as is and fix the test as a follow on, or else wait for #4039 to merge and then update this PR with the better stats.

@@ -63,13 +67,22 @@ impl ParquetFileMetrics {
let pushdown_eval_time = MetricBuilder::new(metrics)
.with_new_label("filename", filename.to_string())
.subset_time("pushdown_eval_time", partition);
let page_index_rows_filtered = MetricBuilder::new(metrics)
Contributor

👍
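As a rough illustration of what the new per-file metrics could record, the sketch below sums the rows marked as skipped in a page-level row selection and accumulates the evaluation time. This is an assumption about how page_index_rows_filtered is derived, not the PR's code: `RowSelector`, `PageIndexMetrics`, and `record_selection` are simplified, hypothetical stand-ins, and plain fields replace the metric handles built by MetricBuilder.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for one run in a row selection: `row_count`
/// consecutive rows that are either skipped or read.
struct RowSelector {
    row_count: usize,
    skip: bool,
}

/// Simplified per-file metrics; the field names mirror the metric names
/// shown in the PR description.
#[derive(Default)]
struct PageIndexMetrics {
    page_index_rows_filtered: usize,
    page_index_eval_time: Duration,
}

/// Add the rows skipped by `selectors` to the filtered-rows counter and the
/// time elapsed since `started` to the evaluation timer.
fn record_selection(
    metrics: &mut PageIndexMetrics,
    selectors: &[RowSelector],
    started: Instant,
) {
    let skipped: usize = selectors
        .iter()
        .filter(|s| s.skip)
        .map(|s| s.row_count)
        .sum();
    metrics.page_index_rows_filtered += skipped;
    metrics.page_index_eval_time += started.elapsed();
}
```

Calling `record_selection` once per row group keeps a running total per file, which matches the per-filename labels in the example output.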

@Dandandan Dandandan changed the title Add statistics for parquet page level skipping Add metrics for parquet page level skipping Nov 4, 2022
@github-actions github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions sql SQL Planner labels Nov 5, 2022
Signed-off-by: yangjiang <[email protected]>
@github-actions github-actions bot removed logical-expr Logical plan and expressions physical-expr Physical Expressions optimizer Optimizer rules sql SQL Planner labels Nov 5, 2022

Ted-Jiang commented Nov 5, 2022

@alamb I fixed the unit test to use the page row limit. Arrow just happened to be released 😂

Signed-off-by: yangjiang <[email protected]>
Signed-off-by: yangjiang <[email protected]>
@Ted-Jiang Ted-Jiang requested a review from alamb November 7, 2022 02:33
Contributor

@alamb alamb left a comment

@alamb alamb merged commit 4d23cae into apache:master Nov 7, 2022

alamb commented Nov 7, 2022

Thanks a lot @Ted-Jiang


ursabot commented Nov 7, 2022

Benchmark runs are scheduled for baseline = b7a3331 and contender = 4d23cae. 4d23cae is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
