parquet: Make page_index/pushdown metrics consistent with row_group metrics #12545

Merged
merged 3 commits into from
Sep 22, 2024

@progval progval commented Sep 20, 2024

Which issue does this PR close?

Closes #12543.
Closes #12544.

What changes are included in this PR?

  1. Rename {pushdown,page_index}_filtered to {pushdown,page_index}_pruned
  2. Add {pushdown,page_index}_matched
  3. Add documentation for existing pushdown-related metrics

Rationale for this change

The new {pushdown,page_index}_matched metrics make it clearer in EXPLAIN ANALYZE output when the Page Index was not checked because the corresponding row groups had already been eliminated (by a Bloom filter or row group statistics).
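
To illustrate the rationale, here is a minimal sketch of why a `matched` counter next to a `pruned` counter removes an ambiguity. `PruningCounts`, `record`, and `was_consulted` are hypothetical names invented for this example; they are not DataFusion's metric types. With only a `pruned` count, a value of 0 could mean either "the filter ran and kept every row" or "the filter never ran". With both counts, the second case shows up as two zeros.

```rust
/// Hypothetical pair of counters mirroring the `_pruned` / `_matched` naming.
/// This is an illustration only, not DataFusion's actual metric types.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
struct PruningCounts {
    /// Rows eliminated by the filter.
    pruned: usize,
    /// Rows that passed the filter.
    matched: usize,
}

impl PruningCounts {
    /// Record one filtering step that kept `selected` rows out of `total`.
    fn record(&mut self, total: usize, selected: usize) {
        // Clamp so that a bad caller cannot cause an underflow below.
        let selected = selected.min(total);
        self.matched += selected;
        self.pruned += total - selected;
    }

    /// True if the filter was consulted on at least one row.
    fn was_consulted(&self) -> bool {
        self.pruned + self.matched > 0
    }
}
```

In this sketch, a filter that ran over 100 rows and kept all of them reports `pruned=0, matched=100`, while a filter that was never reached (for example because earlier row group pruning removed everything) reports `pruned=0, matched=0`.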

Are these changes tested?

Yes.

Are there any user-facing changes?

Renamed and new metrics in EXPLAIN ANALYZE output, documented in docs/source/user-guide/explain-usage.md

…etrics

1. Rename `{pushdown,page_index}_filtered` to `{pushdown,page_index}_pruned`
2. Add `{pushdown,page_index}_matched`

The latter makes it clearer in EXPLAIN ANALYZE when the Page Index is
not checked because their row groups were already eliminated
(with a Bloom Filter or row group statistics).
@github-actions bot added the documentation (Improvements or additions to documentation) and core (Core DataFusion crate) labels on Sep 20, 2024
@alamb added the api change (Changes the API exposed to users of the crate) label on Sep 20, 2024
@alamb alamb left a comment

Thank you @progval -- this looks like a very nice improvement to me. I left some small suggestions but I don't think they are required to merge this PR

@@ -276,6 +281,14 @@ fn rows_skipped(selection: &RowSelection) -> usize {
.fold(0, |acc, x| if x.skip { acc + x.row_count } else { acc })
}

/// returns the number of rows not skipped in the selection
/// TODO should this be upstreamed to RowSelection?
This looks the same as https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.RowSelection.html#method.row_count

It would be great to upstream this and rows_skipped to parquet -- any chance you are willing to file a ticket to do so?
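
For readers without the full diff, the two helpers under discussion can be sketched as follows. This is a self-contained approximation, not the PR's code: `RowSelector` here is a hypothetical stand-in with the same `row_count` and `skip` fields used in the hunk above, the functions take a slice rather than a `RowSelection`, and the name `rows_matched` is an assumption for the "rows not skipped" helper.

```rust
/// Hypothetical stand-in for a row selector: a run of `row_count`
/// consecutive rows that is either skipped or selected.
#[derive(Debug, Clone, Copy)]
struct RowSelector {
    row_count: usize,
    skip: bool,
}

/// Number of rows excluded by the selection (the "pruned" side).
fn rows_skipped(selection: &[RowSelector]) -> usize {
    selection
        .iter()
        .filter(|s| s.skip)
        .map(|s| s.row_count)
        .sum()
}

/// Number of rows kept by the selection (the "matched" side).
/// The function name is assumed for this sketch.
fn rows_matched(selection: &[RowSelector]) -> usize {
    selection
        .iter()
        .filter(|s| !s.skip)
        .map(|s| s.row_count)
        .sum()
}
```

For a selection that skips 5 rows, selects 3, then skips 2, `rows_skipped` gives 7 and `rows_matched` gives 3; the two always add up to the total number of rows the selection covers.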

@@ -223,6 +223,21 @@ Again, reading from bottom up:
- `SortPreservingMergeExec`
- `output_rows=5`, `elapsed_compute=2.375µs`: Produced the final 5 rows in 2.375µs (microseconds)

When predicate pushdown is enabled, `ParquetExec` gains the following metrics:
❤️

@alamb alamb merged commit 300a39b into apache:main Sep 22, 2024
25 checks passed
alamb commented Sep 22, 2024

Thanks again @progval

bgjackma pushed a commit to bgjackma/datafusion that referenced this pull request Sep 25, 2024
…etrics (apache#12545)

* parquet: Make page_index/pushdown metrics consistent with row_group metrics

1. Rename `{pushdown,page_index}_filtered` to `{pushdown,page_index}_pruned`
2. Add `{pushdown,page_index}_matched`

The latter makes it clearer in EXPLAIN ANALYZE when the Page Index is
not checked because their row groups were already eliminated
(with a Bloom Filter or row group statistics).

* Add missing metric definitions in the docs

Co-authored-by: Andrew Lamb <[email protected]>

* s/pass/select/

---------

Co-authored-by: Andrew Lamb <[email protected]>