GH-34138: [C++][Parquet] Fix parsing stats from min_value/max_value #34112

Merged 2 commits on Feb 17, 2023

24 changes: 22 additions & 2 deletions cpp/src/parquet/arrow/arrow_reader_writer_test.cc
@@ -4083,6 +4083,16 @@ TEST_P(TestArrowWriteDictionary, Statistics) {
std::vector<std::vector<std::string>> expected_min_max_ = {
{"a", "b"}, {"b", "c"}, {"a", "d"}, {"", ""}};

const std::vector<std::vector<std::vector<std::string>>> expected_min_by_page = {
{{"b", "a"}, {"b", "a"}}, {{"b", "b"}, {"b", "b"}}, {{"c", "a"}, {"c", "a"}}};
const std::vector<std::vector<std::vector<std::string>>> expected_max_by_page = {
{{"b", "a"}, {"b", "a"}}, {{"c", "c"}, {"c", "c"}}, {{"d", "a"}, {"d", "a"}}};
const std::vector<std::vector<std::vector<bool>>> expected_has_min_max_by_page = {
{{true, true}, {true, true}},
{{true, true}, {true, true}},
{{true, true}, {true, true}},
{{false}, {false}}};

for (std::size_t case_index = 0; case_index < test_dictionaries.size(); case_index++) {
SCOPED_TRACE(test_dictionaries[case_index]->type()->ToString());
ASSERT_OK_AND_ASSIGN(std::shared_ptr<::arrow::Array> dict_encoded,
@@ -4143,8 +4153,18 @@ TEST_P(TestArrowWriteDictionary, Statistics) {
DataPage* data_page = (DataPage*)page.get();
const EncodedStatistics& stats = data_page->statistics();
EXPECT_EQ(stats.null_count, expected_null_by_page[case_index][page_index]);
EXPECT_EQ(stats.has_min, false);
EXPECT_EQ(stats.has_max, false);

auto expect_has_min_max =
expected_has_min_max_by_page[case_index][row_group_index][page_index];
EXPECT_EQ(stats.has_min, expect_has_min_max);
EXPECT_EQ(stats.has_max, expect_has_min_max);
if (expect_has_min_max) {
EXPECT_EQ(stats.min(),
expected_min_by_page[case_index][row_group_index][page_index]);
EXPECT_EQ(stats.max(),
expected_max_by_page[case_index][row_group_index][page_index]);
}

EXPECT_EQ(data_page->num_values(),
expected_valid_by_page[case_index][page_index] +
expected_null_by_page[case_index][page_index]);
11 changes: 8 additions & 3 deletions cpp/src/parquet/column_reader.cc
@@ -211,10 +211,15 @@ EncodedStatistics ExtractStatsFromHeader(const H& header) {
return page_statistics;
}
const format::Statistics& stats = header.statistics;
if (stats.__isset.max) {
// Use the new V2 min-max statistics over the former one if it is filled
Member commented:

Previously, page_statistics handled min and max separately. This patch changes that: min-max is used only when both are set; otherwise neither is used.

Member commented:

Although parquet.thrift permits a file to carry just one of min or max, I think handling it like this is OK.

Member commented:

So is it OK to revert to the previous behavior, where having only min or only max is accepted?

if (stats.__isset.max_value && stats.__isset.min_value) {
// TODO: check if the column_order is TYPE_DEFINED_ORDER.
page_statistics.set_max(stats.max_value);
page_statistics.set_min(stats.min_value);
Member commented:

Suggested change
- if (stats.__isset.max_value && stats.__isset.min_value) {
- // TODO: check if the column_order is TYPE_DEFINED_ORDER.
- page_statistics.set_max(stats.max_value);
- page_statistics.set_min(stats.min_value);
+ if (stats.__isset.max_value || stats.__isset.min_value) {
+ // TODO: check if the column_order is TYPE_DEFINED_ORDER.
+ if (stats.__isset.max_value) {
+ page_statistics.set_max(stats.max_value);
+ }
+ if (stats.__isset.min_value) {
+ page_statistics.set_min(stats.min_value);
+ }

Member Author commented:

Fixed. Please take a look again. Thanks @wjones127!
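
For readers following the thread, here is a minimal, self-contained sketch of the precedence logic in the suggested change above: use min_value/max_value when either is set, copying each bound independently, and only otherwise fall back to the legacy min/max. The types RawStats and PageStats and the functions ExtractMinMax and DemoExtractMinMax are hypothetical stand-ins written for this illustration, not the real format::Statistics or EncodedStatistics. Treating the legacy branch as "either is set" is also an assumption; the thread only shows the suggestion for the new fields.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical stand-in for the Thrift-generated statistics struct.
// std::optional models the per-field __isset flags.
struct RawStats {
  std::optional<std::string> min, max;              // legacy fields
  std::optional<std::string> min_value, max_value;  // newer fields
};

// Hypothetical stand-in for the extracted page statistics.
struct PageStats {
  std::optional<std::string> min, max;
  bool has_min() const { return min.has_value(); }
  bool has_max() const { return max.has_value(); }
};

// Prefer the newer fields when either one is present; otherwise fall back
// to the legacy fields. Each bound is copied on its own, so a header with
// only one bound still yields that bound.
PageStats ExtractMinMax(const RawStats& raw) {
  PageStats out;
  if (raw.max_value || raw.min_value) {
    out.max = raw.max_value;
    out.min = raw.min_value;
  } else if (raw.max || raw.min) {  // assumption: legacy branch mirrors the new one
    out.max = raw.max;
    out.min = raw.min;
  }
  return out;
}

// Usage example: the newer max_value wins over both legacy fields, and the
// legacy min is not mixed in.
inline void DemoExtractMinMax() {
  RawStats raw;
  raw.min = "a";
  raw.max = "b";
  raw.max_value = "z";
  PageStats stats = ExtractMinMax(raw);
  assert(stats.has_max() && *stats.max == "z");
  assert(!stats.has_min());
}
```

The point of the design is that the two generations of fields are never mixed: once any newer field is present, the legacy pair is ignored entirely.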

} else if (stats.__isset.max && stats.__isset.min) {
// TODO: check created_by to see if it is corrupted for some types.
Member commented:

I have no problem here, but just curious: does this code correspond to parquet-mr's CorruptStatistics.shouldIgnoreStatistics? (It's really tricky...)

// TODO: check if the sort_order is SIGNED.
page_statistics.set_max(stats.max);
}
if (stats.__isset.min) {
page_statistics.set_min(stats.min);
}
if (stats.__isset.null_count) {
6 changes: 5 additions & 1 deletion cpp/src/parquet/file_deserialize_test.cc
@@ -514,7 +514,11 @@ TEST_F(TestPageSerde, DataPageV2) {

TEST_F(TestPageSerde, TestLargePageHeaders) {
int stats_size = 256 * 1024; // 256 KB
AddDummyStats(stats_size, data_page_header_);
AddDummyStats(stats_size, data_page_header_, /*fill_all_stats=*/false);

// AddDummyStats() above has only set max which results in an invalid statistics.
// Set min explicitly here to make it valid.
data_page_header_.statistics.__set_min("");

// Any number to verify metadata roundtrip
const int32_t num_rows = 4141;
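
As a rough illustration of why the test above sets min explicitly: under the both-or-nothing check shown in this commit's column_reader.cc diff, a header carrying only max yields no usable min/max at all. The sketch below uses a hypothetical LegacyStats type and hypothetical HasUsableMinMax and DemoStrictCheck helpers; it is not the Parquet implementation.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical stand-in for the legacy min/max fields of a page header.
// std::optional models the __isset flags.
struct LegacyStats {
  std::optional<std::string> min, max;
};

// Both-or-nothing check: min/max are usable only when both are set.
// An empty string still counts as "set".
bool HasUsableMinMax(const LegacyStats& stats) {
  return stats.min.has_value() && stats.max.has_value();
}

// Usage example mirroring the test: a large max alone is rejected, and
// setting min to an empty string makes the pair valid.
inline void DemoStrictCheck() {
  LegacyStats stats;
  stats.max = std::string(1024, 'x');
  assert(!HasUsableMinMax(stats));
  stats.min = "";
  assert(HasUsableMinMax(stats));
}
```

Note that "set" and "non-empty" are different things here, which is why an empty min is enough to satisfy the check.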