Improve documentation about ParquetExec / Parquet predicate pushdown #11994 (Merged)
6 commits:

- 087d937 Minor: improve ParquetExec docs (alamb)
- d88ad71 typo (alamb)
- 8bcfa59 clippy (alamb)
- 5ab345a fix whitespace so rustdoc does not treat as tests (alamb)
- d9f37a4 Apply suggestions from code review (alamb)
- c0b9012 expound upon column rewriting in the context of schema evolution (alamb)
@@ -116,13 +116,12 @@ pub use writer::plan_to_parquet;
 ///
 /// Supports the following optimizations:
 ///
-/// * Concurrent reads: Can read from one or more files in parallel as multiple
+/// * Concurrent reads: reads from one or more files in parallel as multiple
 ///   partitions, including concurrently reading multiple row groups from a single
 ///   file.
 ///
-/// * Predicate push down: skips row groups and pages based on
-///   min/max/null_counts in the row group metadata, the page index and bloom
-///   filters.
+/// * Predicate push down: skips row groups, pages, and rows based on metadata
+///   and late materialization. See "Predicate Pushdown" below.
 ///
 /// * Projection pushdown: reads and decodes only the columns required.
 ///
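The concurrent-read optimization can be sketched with stdlib threads. This is a sketch only, not DataFusion code: `read_partitions_in_parallel` is a hypothetical name, and each "partition" (a vector of numbers to sum) stands in for a file, or a set of row groups within one file, being decoded independently.

```rust
use std::thread;

/// Process several "row groups" in parallel, one thread per partition; the
/// per-partition work (summing a slice) is a stand-in for decoding a row
/// group into record batches.
fn read_partitions_in_parallel(row_groups: Vec<Vec<i64>>) -> Vec<i64> {
    let handles: Vec<_> = row_groups
        .into_iter()
        .map(|rg| thread::spawn(move || rg.iter().sum::<i64>()))
        .collect();
    // Joining in order preserves the partition order of the results.
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    // Three partitions, e.g. three files or three row groups of one file.
    let totals = read_partitions_in_parallel(vec![vec![1, 2], vec![3, 4], vec![5, 6]]);
    println!("{totals:?}"); // one result per partition: [3, 7, 11]
}
```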
@@ -132,9 +131,8 @@ pub use writer::plan_to_parquet;
 ///   coalesce I/O operations, etc. See [`ParquetFileReaderFactory`] for more
 ///   details.
 ///
-/// * Schema adapters: read parquet files with different schemas into a unified
-///   table schema. This can be used to implement "schema evolution". See
-///   [`SchemaAdapterFactory`] for more details.
+/// * Schema evolution: read parquet files with different schemas into a unified
+///   table schema. See [`SchemaAdapterFactory`] for more details.
 ///
 /// * metadata_size_hint: controls the number of bytes read from the end of the
 ///   file in the initial I/O when the default [`ParquetFileReaderFactory`]. If a
@@ -144,6 +142,29 @@ pub use writer::plan_to_parquet;
 /// * User provided [`ParquetAccessPlan`]s to skip row groups and/or pages
 ///   based on external information. See "Implementing External Indexes" below
 ///
+/// # Predicate Pushdown
+///
+/// `ParquetExec` uses the provided [`PhysicalExpr`] predicate as a filter to
+/// skip reading data and improve query performance using several techniques:
+///
+/// * Row group pruning: skips entire row groups based on min/max statistics
+///   found in [`ParquetMetaData`] and any Bloom filters that are present.
+///
+/// * Page pruning: skips individual pages within a ColumnChunk using the
+///   [Parquet PageIndex], if present.
+///
+/// * Row filtering: skips rows within a page using a form of late
+///   materialization. When possible, predicates are applied by the parquet
+///   decoder *during* decode (see [`ArrowPredicate`] and [`RowFilter`] for more
+///   details). This is only enabled if `pushdown_filters` is set to true.
+///
+/// Note: If the predicate can not be used to accelerate the scan, it is ignored
+/// (no error is raised on predicate evaluation errors).
+///
+/// [`ArrowPredicate`]: parquet::arrow::arrow_reader::ArrowPredicate
+/// [`RowFilter`]: parquet::arrow::arrow_reader::RowFilter
+/// [Parquet PageIndex]: https://github.com/apache/parquet-format/blob/master/PageIndex.md
+///
 /// # Implementing External Indexes
 ///
 /// It is possible to restrict the row groups and selections within those row

Review comment: I tried to consolidate the description of what predicate pushdown is done in the ParquetExec.
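The row group pruning described in the hunk above can be illustrated with a stdlib-only sketch. `RowGroupStats` and `prune_row_groups` are hypothetical names, standing in for the per-row-group min/max statistics that `ParquetExec` reads from the parquet footer ([`ParquetMetaData`]):

```rust
// Hypothetical per-row-group statistics for a single column, standing in
// for the information stored in the parquet file footer.
struct RowGroupStats {
    min: i64, // would be consulted for predicates like `col < x`
    max: i64,
}

/// Returns the indices of row groups that *might* contain rows satisfying
/// `col > threshold`; all other row groups are skipped without being decoded.
fn prune_row_groups(stats: &[RowGroupStats], threshold: i64) -> Vec<usize> {
    stats
        .iter()
        .enumerate()
        // If the maximum value in a row group is <= threshold, no row in it
        // can satisfy `col > threshold`, so the whole group is pruned.
        .filter(|(_, s)| s.max > threshold)
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let stats = vec![
        RowGroupStats { min: 0, max: 99 },    // pruned: max <= 120
        RowGroupStats { min: 100, max: 199 }, // kept
        RowGroupStats { min: 50, max: 150 },  // kept (range overlaps)
    ];
    println!("{:?}", prune_row_groups(&stats, 120)); // [1, 2]
}
```

Page pruning works the same way, just at the finer granularity of individual data pages via the PageIndex.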
@@ -199,10 +220,11 @@ pub use writer::plan_to_parquet;
 ///   applying predicates to metadata. The plan and projections are used to
 ///   determine what pages must be read.
 ///
-/// * Step 4: The stream begins reading data, fetching the required pages
-///   and incrementally decoding them.
+/// * Step 4: The stream begins reading data, fetching the required parquet
+///   pages, incrementally decoding them, and applying any row filters (see
+///   [`Self::with_pushdown_filters`]).
 ///
-/// * Step 5: As each [`RecordBatch]` is read, it may be adapted by a
+/// * Step 5: As each [`RecordBatch`] is read, it may be adapted by a
 ///   [`SchemaAdapter`] to match the table schema. By default missing columns are
 ///   filled with nulls, but this can be customized via [`SchemaAdapterFactory`].
 ///
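Step 5's default behavior (missing columns filled with nulls) can be sketched in plain Rust. The types below are illustrative stand-ins for Arrow arrays and schemas, not DataFusion's actual `SchemaAdapter` API:

```rust
use std::collections::HashMap;

// Illustrative stand-in for an Arrow column: a vector of optional values,
// where `None` models a null.
type Column = Vec<Option<i64>>;

/// Adapt a batch read from one file to the unified table schema: columns
/// present in the table schema but missing from the file are filled
/// entirely with nulls, mirroring the default behavior described above.
fn adapt_to_table_schema(
    table_schema: &[&str],
    file_batch: &HashMap<String, Column>,
    num_rows: usize,
) -> Vec<(String, Column)> {
    table_schema
        .iter()
        .map(|&name| {
            let col = file_batch
                .get(name)
                .cloned()
                // Column missing from this file: fill with nulls.
                .unwrap_or_else(|| vec![None; num_rows]);
            (name.to_string(), col)
        })
        .collect()
}

fn main() {
    // A file written before column "b" was added to the table.
    let mut file_batch = HashMap::new();
    file_batch.insert("a".to_string(), vec![Some(1), Some(2)]);

    let adapted = adapt_to_table_schema(&["a", "b"], &file_batch, 2);
    println!("{adapted:?}"); // "b" is filled with nulls: [None, None]
}
```

A custom `SchemaAdapterFactory` could instead fill missing columns with a default value, or cast columns whose types changed between file versions.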
@@ -268,13 +290,10 @@ impl ParquetExecBuilder {
         }
     }
 
-    /// Set the predicate for the scan.
-    ///
-    /// The ParquetExec uses this predicate to filter row groups and data pages
-    /// using the Parquet statistics and bloom filters.
+    /// Set the filter predicate when reading.
     ///
-    /// If the predicate can not be used to prune the scan, it is ignored (no
-    /// error is raised).
+    /// See the "Predicate Pushdown" section of the [`ParquetExec`] documentation
+    /// for more details.
     pub fn with_predicate(mut self, predicate: Arc<dyn PhysicalExpr>) -> Self {
         self.predicate = Some(predicate);
         self
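To show how a predicate stored by a builder flows into a scan, here is a minimal stand-in mirroring the `with_predicate` shape above. `ScanBuilder` and its row-level `PhysicalExpr` alias are hypothetical simplifications, not DataFusion's `ParquetExecBuilder` or its `PhysicalExpr` trait:

```rust
use std::sync::Arc;

// Stand-in for a physical predicate: a shareable row-level test.
type PhysicalExpr = dyn Fn(i64) -> bool + Send + Sync;

// Hypothetical builder mirroring the `with_predicate` method above.
struct ScanBuilder {
    predicate: Option<Arc<PhysicalExpr>>,
}

impl ScanBuilder {
    fn new() -> Self {
        Self { predicate: None }
    }

    /// Store the filter predicate for the scan, consuming and returning
    /// the builder (the same shape as `with_predicate` in the diff).
    fn with_predicate(mut self, predicate: Arc<PhysicalExpr>) -> Self {
        self.predicate = Some(predicate);
        self
    }

    /// "Scan": keep only rows passing the predicate, if one was set;
    /// with no predicate, all rows are returned.
    fn scan(&self, rows: &[i64]) -> Vec<i64> {
        match self.predicate.as_deref() {
            Some(p) => rows.iter().copied().filter(|&v| p(v)).collect(),
            None => rows.to_vec(),
        }
    }
}

fn main() {
    let builder = ScanBuilder::new().with_predicate(Arc::new(|v: i64| v > 10));
    println!("{:?}", builder.scan(&[5, 15, 25])); // [15, 25]
}
```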
@@ -291,7 +310,7 @@ impl ParquetExecBuilder {
         self
     }
 
-    /// Set the table parquet options that control how the ParquetExec reads.
+    /// Set the options for controlling how the ParquetExec reads parquet files.
     ///
     /// See also [`Self::new_with_options`]
     pub fn with_table_parquet_options(
@@ -480,11 +499,8 @@ impl ParquetExec {
         self
     }
 
-    /// If true, any filter [`Expr`]s on the scan will converted to a
-    /// [`RowFilter`](parquet::arrow::arrow_reader::RowFilter) in the
-    /// `ParquetRecordBatchStream`. These filters are applied by the
-    /// parquet decoder to skip unecessairly decoding other columns
-    /// which would not pass the predicate. Defaults to false
+    /// If true, the predicate will be used during the parquet scan.
+    /// Defaults to false
     ///
     /// [`Expr`]: datafusion_expr::Expr
     pub fn with_pushdown_filters(mut self, pushdown_filters: bool) -> Self {
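The row filtering enabled by `with_pushdown_filters` (late materialization) can be modeled with a stdlib sketch: evaluate the predicate on the cheap filter column first, then decode the remaining columns only for rows that passed. The function names here are hypothetical; in the real scan this work is done by the parquet decoder via `RowFilter`:

```rust
/// Evaluate the predicate against the filter column only, producing a
/// row selection mask (a simplified model of building a `RowFilter`).
fn build_row_mask(filter_col: &[i64], predicate: impl Fn(i64) -> bool) -> Vec<bool> {
    filter_col.iter().map(|&v| predicate(v)).collect()
}

/// "Decode" another column only for the selected rows; in the real
/// decoder the skipped rows are never materialized at all, which is
/// where the performance win comes from.
fn decode_selected(col: &[i64], mask: &[bool]) -> Vec<i64> {
    col.iter()
        .zip(mask.iter())
        .filter(|&(_, &keep)| keep)
        .map(|(&v, _)| v)
        .collect()
}

fn main() {
    let a = vec![1, 5, 3, 8];     // cheap filter column, decoded first
    let b = vec![10, 20, 30, 40]; // other column, decoded late
    let mask = build_row_mask(&a, |v| v > 2); // predicate: a > 2
    println!("{:?}", decode_selected(&b, &mask)); // rows where a > 2: [20, 30, 40]
}
```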
Review comment: should we add an example of it? 🤔