-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prune pages are all null in ParquetExec by row_counts and fix NOT NULL prune #10051
Conversation
and fix NOT NULL prune
None | ||
// see https://github.com/apache/arrow-rs/blob/91f0b1771308609ca27db0fb1d2d49571b3980d8/parquet/src/file/metadata.rs#L979-L982 | ||
let mut first_row_index = self | ||
.col_offset_indexes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can do this same calculation without allocating intermediate vec s-- something like this:
let row_count_per_page = self
.col_offset_indexes
.windows(2)
.map(|location| Some(location[1].first_row_index - location[0].first_row_index));
Some(Arc::new(Int64Array::from_iter(row_count_per_page)))
🤔 the name col_offset_indexes
is somewhat confusing to me as they are PageLocation
s --
col_offset_indexes: &'a Vec<PageLocation>,
maybe we could rename that field to page_locations
🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Ted-Jiang -- this is looking good. I think it could be merged as is, though I also have several suggestions that I think will make the code and comments better. We can do so as a follow on PR though.
@@ -824,9 +863,17 @@ fn create_data_batch(scenario: Scenario) -> Vec<RecordBatch> { | |||
} | |||
Scenario::WithNullValues => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please also add the "some null/some not null" null values case to this scenario as well?
make_int_batches_with_null(5, 1, 6),
} | ||
Scenario::WithNullValuesPageLevel => { | ||
vec![ | ||
make_int_batches_with_null(5, 1, 6), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this scenario should have at least one batch of all nulls (e.g. make_int_batches_with_nulls(5, 0, 0)
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this will be prune by row group -> column level
no_null_values_start: usize, | ||
no_null_values_end: usize, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you might be able to simplify this method by sending in 2 parameters. Right now it looks like it interleaves nulls arbitrarily but really the nulls are always at the start and non nulls are at the end
num_nulls: usize,
non_nulls: usize
// row_group4: page1(1~5), page2(All Null) | ||
// total 40 rows | ||
async fn test_pages_with_null_values() { | ||
test_prune( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It took me a while to convince myself that this was actually setting up the scenario as described. I eventually found it here:
I wonder if it would be possible to add some better comments to test_prune
mentioning that the created parquet files have 5 rows per page
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i will make another pr to improve this tests
Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Thanks again @Ted-Jiang -- I took another look through this PR and it is very nice 👌 |
…L prune (apache#10051) * Prune pages are all null in ParquetExec by row_counts and fix NOT NULL prune * fix clippy * Update datafusion/core/src/physical_optimizer/pruning.rs Co-authored-by: Andrew Lamb <[email protected]> * Update datafusion/core/tests/parquet/page_pruning.rs Co-authored-by: Andrew Lamb <[email protected]> * Update datafusion/core/tests/parquet/page_pruning.rs Co-authored-by: Andrew Lamb <[email protected]> * Update datafusion/core/tests/parquet/page_pruning.rs Co-authored-by: Andrew Lamb <[email protected]> * Update datafusion/core/tests/parquet/page_pruning.rs Co-authored-by: Andrew Lamb <[email protected]> * remove allocate vec * better way avoid allocate vec * simply expr --------- Co-authored-by: Andrew Lamb <[email protected]>
and fix NOT NULL prune
Which issue does this PR close?
Closes #9961
Closes #10049.
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?