GH-34138: [C++][Parquet] Fix parsing stats from min_value/max_value #34112
Merged
@@ -211,11 +211,24 @@ EncodedStatistics ExtractStatsFromHeader(const H& header) {
     return page_statistics;
   }
   const format::Statistics& stats = header.statistics;
-  if (stats.__isset.max) {
-    page_statistics.set_max(stats.max);
-  }
-  if (stats.__isset.min) {
-    page_statistics.set_min(stats.min);
+  // Use the new V2 min-max statistics over the former one if it is filled
+  if (stats.__isset.max_value || stats.__isset.min_value) {
+    // TODO: check if the column_order is TYPE_DEFINED_ORDER.
+    if (stats.__isset.max_value) {
+      page_statistics.set_max(stats.max_value);
+    }
+    if (stats.__isset.min_value) {
+      page_statistics.set_min(stats.min_value);
+    }
+  } else if (stats.__isset.max || stats.__isset.min) {
+    // TODO: check created_by to see if it is corrupted for some types.
Review comment: I have no problem here, but I'm just curious: does this code mean the parquet-mr's …
+    // TODO: check if the sort_order is SIGNED.
+    if (stats.__isset.max) {
+      page_statistics.set_max(stats.max);
+    }
+    if (stats.__isset.min) {
+      page_statistics.set_min(stats.min);
+    }
   }
   if (stats.__isset.null_count) {
     page_statistics.set_null_count(stats.null_count);
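The precedence rule in the hunk above can be tried in isolation with a small sketch. Everything here is a stand-in: `Stats`, `Encoded`, and `PickMinMax` are hypothetical names, and `std::optional` is assumed as a substitute for the Thrift `__isset` flags. It is not the Arrow implementation, only an illustration of the branch structure.

```cpp
#include <optional>
#include <string>

// Hypothetical stand-in for the Thrift-generated format::Statistics.
// An empty optional plays the role of an unset __isset flag.
struct Stats {
  std::optional<std::string> max, min;              // older fields
  std::optional<std::string> max_value, min_value;  // newer fields
};

// Hypothetical stand-in for EncodedStatistics.
struct Encoded {
  std::optional<std::string> max, min;
};

// Mirrors the branch structure of the patch: if either newer field is set,
// only the newer fields are read; the older fields are a fallback that is
// consulted only when neither newer field is set.
Encoded PickMinMax(const Stats& s) {
  Encoded out;
  if (s.max_value || s.min_value) {
    out.max = s.max_value;
    out.min = s.min_value;
  } else if (s.max || s.min) {
    out.max = s.max;
    out.min = s.min;
  }
  return out;
}
```

One consequence of this structure is that the two generations of fields are never mixed: if only `max_value` is set alongside an older `min`, the result carries a max but no min.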
Previously, page_statistics handled min and max separately. This patch changes it to use min-max only when both are present; otherwise, min-max cannot be used.
Yes, page stats that are missing either min or max are corrupted and cannot be used. See parquet-mr for details: https://github.com/apache/parquet-mr/blob/5290bd5e0ee5dc30db0576e2bfc6eea335c465cf/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L797
In parquet.thrift, though, min and max are independent, so only one of them may be present. Still, I think handling it like this is OK.
So is it OK to revert to the previous behavior, where having only min or only max is accepted?
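The two policies debated in this thread can be contrasted with a short sketch. `Bound`, `KeepAny`, and `KeepOnlyPairs` are hypothetical names chosen for illustration; neither function is taken from Arrow or parquet-mr.

```cpp
#include <optional>
#include <string>
#include <utility>

// Hypothetical alias: an encoded bound that may be absent.
using Bound = std::optional<std::string>;

// Lenient policy: pass through whichever bound is present.
std::pair<Bound, Bound> KeepAny(const Bound& min, const Bound& max) {
  return {min, max};
}

// Strict policy: treat statistics with only one bound as unusable
// and drop both.
std::pair<Bound, Bound> KeepOnlyPairs(const Bound& min, const Bound& max) {
  if (min && max) {
    return {min, max};
  }
  return {std::nullopt, std::nullopt};
}
```

The lenient variant matches a format in which each bound is optional on its own; the strict variant matches the view that a lone bound signals corrupted statistics.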