GH-34138: [C++][Parquet] Fix parsing stats from min_value/max_value #34112
@@ -211,10 +211,15 @@ EncodedStatistics ExtractStatsFromHeader(const H& header) {
     return page_statistics;
   }
   const format::Statistics& stats = header.statistics;
-  if (stats.__isset.max) {
+  // Use the new V2 min-max statistics over the former one if it is filled
+  if (stats.__isset.max_value && stats.__isset.min_value) {
+    // TODO: check if the column_order is TYPE_DEFINED_ORDER.
+    page_statistics.set_max(stats.max_value);
+    page_statistics.set_min(stats.min_value);

Suggested change

Fixed. Please take a look again. Thanks @wjones127!

+  } else if (stats.__isset.max && stats.__isset.min) {
+    // TODO: check created_by to see if it is corrupted for some types.

I've no problem here, but just curious, does this code meaning the parquet-mr's

+    // TODO: check if the sort_order is SIGNED.
     page_statistics.set_max(stats.max);
-  }
-  if (stats.__isset.min) {
     page_statistics.set_min(stats.min);
   }
   if (stats.__isset.null_count) {
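
The selection rule in the hunk above can be tried in isolation. The following is only a minimal sketch under stated assumptions: `RawStats`, `PageStats`, and `ExtractMinMax` are hypothetical stand-ins (not Arrow or Thrift types), and `std::optional` plays the role of the Thrift `__isset` flags. It prefers the V2 `min_value`/`max_value` pair when both are set, falls back to the legacy `min`/`max` pair when both are set, and otherwise leaves min-max unset.

```cpp
#include <optional>
#include <string>

// Hypothetical stand-in for the Thrift-generated format::Statistics.
// An empty optional corresponds to an unset __isset flag.
struct RawStats {
  std::optional<std::string> min, max;              // legacy fields
  std::optional<std::string> min_value, max_value;  // V2 fields
};

// Hypothetical stand-in for the extracted page statistics.
struct PageStats {
  std::optional<std::string> min, max;
  bool has_min_max() const { return min.has_value() && max.has_value(); }
};

// Mirrors the branch structure of the patch: a pair is used only when
// both of its bounds are present, and the V2 pair takes precedence.
PageStats ExtractMinMax(const RawStats& s) {
  PageStats out;
  if (s.min_value && s.max_value) {
    out.min = s.min_value;
    out.max = s.max_value;
  } else if (s.min && s.max) {
    out.min = s.min;
    out.max = s.max;
  }
  return out;
}
```

With this sketch, a header carrying only `min_value` plus a complete legacy pair falls through to the legacy values, and a header carrying a single bound of either kind yields no min-max at all.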
Previously, page_statistics handled min and max separately. This patch changes it so that min-max is used only when both values are present; otherwise, neither is used.
Yes, page stats that lack either min or max are corrupted and cannot be used. See parquet-mr for details: https://github.com/apache/parquet-mr/blob/5290bd5e0ee5dc30db0576e2bfc6eea335c465cf/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L797
Although parquet.thrift allows only one of min and max to be set, I think handling it like this is OK.
So should we revert to the previous behavior, where having only min or only max is OK?
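
To make the two options in this thread concrete, here is a small sketch contrasting them on legacy statistics. The types and function names (`LegacyStats`, `Extracted`, `ExtractPerField`, `ExtractBothOrNone`) are hypothetical and exist only for illustration; the "per field" variant follows the removed lines of the diff, and the "both or none" variant follows the added `else if` branch.

```cpp
#include <optional>
#include <string>

// Hypothetical stand-ins; an empty optional means the field was not set.
struct LegacyStats {
  std::optional<std::string> min, max;
};
struct Extracted {
  std::optional<std::string> min, max;
};

// Previous behavior: each bound is copied independently, so a header
// with only min (or only max) still yields that single bound.
Extracted ExtractPerField(const LegacyStats& s) {
  return {s.min, s.max};
}

// Behavior in this patch: the bounds are copied only when both are set.
Extracted ExtractBothOrNone(const LegacyStats& s) {
  if (s.min && s.max) {
    return {s.min, s.max};
  }
  return {};
}
```

For a header that sets only `min`, `ExtractPerField` keeps that bound while `ExtractBothOrNone` drops it, which is the difference being discussed here.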