Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix calculation of null counts for Parquet statistics #12938

Merged
merged 10 commits into from
Mar 17, 2023
1 change: 1 addition & 0 deletions cpp/src/io/parquet/page_enc.cu
Original file line number Diff line number Diff line change
Expand Up @@ -282,6 +282,7 @@ __global__ void __launch_bounds__(128)
g.col = ck_g->col_desc;
g.start_row = fragments[frag_id].start_value_idx;
g.num_rows = fragments[frag_id].num_leaf_values;
g.non_leaf_nulls = fragments[frag_id].num_values - g.num_rows;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will account for all nulls above the leaf level, even if we have nested lists?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it should. For instance, here's a dump of the first test case:

value 1: R:0 D:2 V:<null>
value 2: R:1 D:3 V:2
value 3: R:1 D:2 V:<null>
value 4: R:0 D:1 V:<null>
value 5: R:0 D:3 V:4
value 6: R:1 D:3 V:5
value 7: R:0 D:0 V:<null>

The R:0:D1 value is an empty list, while R0:D0 is a null list. The D2 nulls are leaves. Likewise for list of list:

value 1:  R:0 D:5 V:1
value 2:  R:2 D:5 V:2
value 3:  R:2 D:5 V:3
value 4:  R:1 D:3 V:<null>
value 5:  R:1 D:5 V:4
value 6:  R:2 D:5 V:5
value 7:  R:1 D:2 V:<null>
value 8:  R:1 D:4 V:<null>
value 9:  R:2 D:5 V:6
value 10: R:2 D:4 V:<null>
value 11: R:0 D:5 V:7
value 12: R:2 D:5 V:8
value 13: R:0 D:1 V:<null>
value 14: R:0 D:0 V:<null>

groups[frag_id] = g;
}
}
Expand Down
10 changes: 8 additions & 2 deletions cpp/src/io/statistics/column_statistics.cuh
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2021-2022, NVIDIA CORPORATION.
* Copyright (c) 2021-2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
Expand Down Expand Up @@ -129,7 +129,13 @@ struct calculate_group_statistics_functor {

chunk = block_reduce(chunk, storage);

if (t == 0) { s.ck = get_untyped_chunk(chunk); }
if (t == 0) {
// parquet wants total null count in stats, not just count of null leaf values
if constexpr (IO == detail::io_file_format::PARQUET) {
chunk.null_count += s.group.non_leaf_nulls;
}
s.ck = get_untyped_chunk(chunk);
}
}

template <typename T,
Expand Down
3 changes: 2 additions & 1 deletion cpp/src/io/statistics/statistics.cuh
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2019-2022, NVIDIA CORPORATION.
* Copyright (c) 2019-2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
Expand Down Expand Up @@ -111,6 +111,7 @@ struct statistics_group {
const stats_column_desc* col; //!< Column information
uint32_t start_row; //!< Start row of this group
uint32_t num_rows; //!< Number of rows in group
uint32_t non_leaf_nulls; //!< Number of null non-leaf values in the group
};

struct statistics_merge_group {
Expand Down
138 changes: 138 additions & 0 deletions cpp/tests/io/parquet_test.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -4279,6 +4279,9 @@ TEST_F(ParquetWriterTest, CheckColumnOffsetIndexNulls)
auto const ci = read_column_index(source, chunk);
auto const stats = parse_statistics(chunk);

// should be half nulls, except no nulls in column 0
EXPECT_EQ(stats.null_count, c > 0 ? num_rows / 2 : 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I weakly prefer this spelling:

Suggested change
EXPECT_EQ(stats.null_count, c > 0 ? num_rows / 2 : 0);
EXPECT_EQ(stats.null_count, c == 0 ? 0 : num_rows / 2);


// schema indexing starts at 1
auto const ptype = fmd.schema[c + 1].type;
auto const ctype = fmd.schema[c + 1].converted_type;
Expand Down Expand Up @@ -4364,6 +4367,9 @@ TEST_F(ParquetWriterTest, CheckColumnOffsetIndexNullColumn)
auto const ci = read_column_index(source, chunk);
auto const stats = parse_statistics(chunk);

// there should be no nulls except column 1 which is all nulls
EXPECT_EQ(stats.null_count, c == 1 ? num_rows : 0);

// schema indexing starts at 1
auto const ptype = fmd.schema[c + 1].type;
auto const ctype = fmd.schema[c + 1].converted_type;
Expand Down Expand Up @@ -4465,6 +4471,138 @@ TEST_F(ParquetWriterTest, CheckColumnOffsetIndexStruct)
}
}

TEST_F(ParquetWriterTest, CheckColumnIndexListWithNulls)
{
auto null_at_even_idx =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the places where this is used are only nulling out one or two values. Could we just write using cudf::test::iterators::nulls_at; and then nulls_at({0, 2}) or similar in those locations? If there was a longer list, then I would support using an iterator, but not for 1-2 values where being explicit is still concise. I don't feel strongly here, so we could leave this as-is if you prefer.

cudf::detail::make_counting_transform_iterator(0, [](auto i) { return i % 2; });
auto null_at_idx_3 = cudf::test::iterators::null_at(3);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we typically write using cudf::test::iterators::null_at; and then write null_at(3) in the appropriate locations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should I get rid of the declaration here and just use null_at(3) inline everywhere? (as with nulls_at({0, 2}) below?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, that would be fine.


using lcw = cudf::test::lists_column_wrapper<int32_t>;

// 4 nulls
// [NULL, 2, NULL]
// []
// [4, 5]
// NULL
lcw col0{{{{1, 2, 3}, null_at_even_idx}, {}, {4, 5}, {}}, null_at_idx_3};

// 4 nulls
// [[1, 2, 3], [], [4, 5], [], [0, 6, 0]]
// [[7, 8]]
// []
// [[]]
lcw col1{{{1, 2, 3}, {}, {4, 5}, {}, {0, 6, 0}}, {{7, 8}}, lcw{}, lcw{lcw{}}};

// 4 nulls
// [[1, 2, 3], [], [4, 5], NULL, [0, 6, 0]]
// [[7, 8]]
// []
// [[]]
lcw col2{{{{1, 2, 3}, {}, {4, 5}, {}, {0, 6, 0}}, null_at_idx_3}, {{7, 8}}, lcw{}, lcw{lcw{}}};

// 6 nulls
// [[1, 2, 3], [], [4, 5], NULL, [NULL, 6, NULL]]
// [[7, 8]]
// []
// [[]]
using dlcw = cudf::test::lists_column_wrapper<double>;
dlcw col3{{{{1., 2., 3.}, {}, {4., 5.}, {}, {{0., 6., 0.}, null_at_even_idx}}, null_at_idx_3},
{{7., 8.}},
dlcw{},
dlcw{dlcw{}}};

// 4 nulls
// [[1, 2, 3], [], [4, 5], NULL, [0, 6, 0]]
// [[7, 8]]
// []
// NULL
using ui16lcw = cudf::test::lists_column_wrapper<uint16_t>;
cudf::test::lists_column_wrapper<uint16_t> col4{
{{{{1, 2, 3}, {}, {4, 5}, {}, {0, 6, 0}}, null_at_idx_3},
{{7, 8}},
ui16lcw{},
ui16lcw{ui16lcw{}}},
null_at_idx_3};

// 6 nulls
// [[1, 2, 3], [], [4, 5], NULL, [NULL, 6, NULL]]
// [[7, 8]]
// []
// NULL
lcw col5{{{{{1, 2, 3}, {}, {4, 5}, {}, {{0, 6, 0}, null_at_even_idx}}, null_at_idx_3},
{{7, 8}},
lcw{},
lcw{lcw{}}},
null_at_idx_3};

// 4 nulls
using strlcw = cudf::test::lists_column_wrapper<cudf::string_view>;
cudf::test::lists_column_wrapper<cudf::string_view> col6{
{{"Monday", "Monday", "Friday"}, {}, {"Monday", "Friday"}, {}, {"Sunday", "Funday"}},
bdice marked this conversation as resolved.
Show resolved Hide resolved
{{"bee", "sting"}},
strlcw{},
strlcw{strlcw{}}};

// 11 nulls
// [[[NULL,2,NULL,4]], [[NULL,6,NULL], [8,9]]]
// [NULL, [[13],[14,15,16]], NULL]
// [NULL, [], NULL, [[]]]
// NULL
lcw col7{{
{{{{1, 2, 3, 4}, null_at_even_idx}}, {{{5, 6, 7}, null_at_even_idx}, {8, 9}}},
{{{{10, 11}, {12}}, {{13}, {14, 15, 16}}, {{17, 18}}}, null_at_even_idx},
{{lcw{lcw{}}, lcw{}, lcw{}, lcw{lcw{}}}, null_at_even_idx},
lcw{lcw{lcw{}}},
},
null_at_idx_3};

table_view expected({col0, col1, col2, col3, col4, col5, col6, col7});

int64_t const expected_null_counts[] = {4, 4, 4, 6, 4, 6, 4, 11};

auto const filepath = temp_env->get_temp_filepath("ColumnIndexListWithNulls.parquet");
auto out_opts = cudf::io::parquet_writer_options::builder(cudf::io::sink_info{filepath}, expected)
.stats_level(cudf::io::statistics_freq::STATISTICS_COLUMN)
.compression(cudf::io::compression_type::NONE);

cudf::io::write_parquet(out_opts);

auto const source = cudf::io::datasource::create(filepath);
cudf::io::parquet::FileMetaData fmd;

read_footer(source, &fmd);

for (size_t r = 0; r < fmd.row_groups.size(); r++) {
auto const& rg = fmd.row_groups[r];
for (size_t c = 0; c < rg.columns.size(); c++) {
auto const& chunk = rg.columns[c];

// loop over offsets, read each page header, make sure it's a data page and that
// the first row index is correct
auto const oi = read_offset_index(source, chunk);

int64_t num_vals = 0;
for (size_t o = 0; o < oi.page_locations.size(); o++) {
auto const& page_loc = oi.page_locations[o];
auto const ph = read_page_header(source, page_loc);
EXPECT_EQ(ph.type, cudf::io::parquet::PageType::DATA_PAGE);
// last column has 2 values per row
EXPECT_EQ(page_loc.first_row_index * (c == rg.columns.size() - 1 ? 2 : 1), num_vals);
num_vals += ph.data_page_header.num_values;
}

// check null counts in column chunk stats and page indexes
auto const ci = read_column_index(source, chunk);
auto const stats = parse_statistics(chunk);
EXPECT_EQ(stats.null_count, expected_null_counts[c]);

// should only be one page
EXPECT_FALSE(ci.null_pages[0]);
EXPECT_EQ(ci.null_counts[0], expected_null_counts[c]);
}
}
}

TEST_F(ParquetWriterTest, CheckColumnIndexTruncation)
{
const char* coldata[] = {
Expand Down