-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[C++][Parquet] Fix parsing stats from min_value/max_value #34138
Comments
wjones127
pushed a commit
that referenced
this issue
Feb 17, 2023
…34112) ### Rationale for this change The code below does not read from stats.min_value/max_value at all. ```cpp // Extracts encoded statistics from V1 and V2 data page headers template <typename H> EncodedStatistics ExtractStatsFromHeader(const H& header) { EncodedStatistics page_statistics; if (!header.__isset.statistics) { return page_statistics; } const format::Statistics& stats = header.statistics; if (stats.__isset.max) { page_statistics.set_max(stats.max); } if (stats.__isset.min) { page_statistics.set_min(stats.min); } if (stats.__isset.null_count) { page_statistics.set_null_count(stats.null_count); } if (stats.__isset.distinct_count) { page_statistics.set_distinct_count(stats.distinct_count); } return page_statistics; } ``` ### What changes are included in this PR? Do similar thing from parquet-mr to check and read min_value/max_value from thrift stats. ### Are these changes tested? Some test cases fail after the fix. Fixed them to make sure it is covered. ### Are there any user-facing changes? No. * Closes: #34138 Authored-by: Gang Wu <[email protected]> Signed-off-by: Will Jones <[email protected]>
gringasalpastor
pushed a commit
to gringasalpastor/arrow
that referenced
this issue
Feb 17, 2023
…alue (apache#34112) ### Rationale for this change The code below does not read from stats.min_value/max_value at all. ```cpp // Extracts encoded statistics from V1 and V2 data page headers template <typename H> EncodedStatistics ExtractStatsFromHeader(const H& header) { EncodedStatistics page_statistics; if (!header.__isset.statistics) { return page_statistics; } const format::Statistics& stats = header.statistics; if (stats.__isset.max) { page_statistics.set_max(stats.max); } if (stats.__isset.min) { page_statistics.set_min(stats.min); } if (stats.__isset.null_count) { page_statistics.set_null_count(stats.null_count); } if (stats.__isset.distinct_count) { page_statistics.set_distinct_count(stats.distinct_count); } return page_statistics; } ``` ### What changes are included in this PR? Do similar thing from parquet-mr to check and read min_value/max_value from thrift stats. ### Are these changes tested? Some test cases fail after the fix. Fixed them to make sure it is covered. ### Are there any user-facing changes? No. * Closes: apache#34138 Authored-by: Gang Wu <[email protected]> Signed-off-by: Will Jones <[email protected]>
fatemehp
pushed a commit
to fatemehp/arrow
that referenced
this issue
Feb 24, 2023
…alue (apache#34112) ### Rationale for this change The code below does not read from stats.min_value/max_value at all. ```cpp // Extracts encoded statistics from V1 and V2 data page headers template <typename H> EncodedStatistics ExtractStatsFromHeader(const H& header) { EncodedStatistics page_statistics; if (!header.__isset.statistics) { return page_statistics; } const format::Statistics& stats = header.statistics; if (stats.__isset.max) { page_statistics.set_max(stats.max); } if (stats.__isset.min) { page_statistics.set_min(stats.min); } if (stats.__isset.null_count) { page_statistics.set_null_count(stats.null_count); } if (stats.__isset.distinct_count) { page_statistics.set_distinct_count(stats.distinct_count); } return page_statistics; } ``` ### What changes are included in this PR? Do similar thing from parquet-mr to check and read min_value/max_value from thrift stats. ### Are these changes tested? Some test cases fail after the fix. Fixed them to make sure it is covered. ### Are there any user-facing changes? No. * Closes: apache#34138 Authored-by: Gang Wu <[email protected]> Signed-off-by: Will Jones <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Describe the bug, including details regarding any error messages, version, and platform.
The code below does not check and read from stats.min_value/max_value. If reading from a parquet file where the stats apply min_value/max_value, we are unable to read any column statistics at all.
Component(s)
C++, Parquet
The text was updated successfully, but these errors were encountered: