-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix calculation of null counts for Parquet statistics #12938
Fix calculation of null counts for Parquet statistics #12938
Conversation
…feature/parquet_nulls
Pull requests from external contributors require approval from a |
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, just a few nitpicks in tests
cpp/tests/io/parquet_test.cpp
Outdated
TEST_F(ParquetWriterTest, CheckColumnIndexListWithNulls) | ||
{ | ||
auto valids = cudf::detail::make_counting_transform_iterator(0, [](auto i) { return i % 2; }); | ||
auto valids2 = cudf::detail::make_counting_transform_iterator(0, [](auto i) { return i != 3; }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should be able to use our special test iterator here
auto valids2 = cudf::detail::make_counting_transform_iterator(0, [](auto i) { return i != 3; }); | |
auto null_at_3 = cudf::test::iterators::null_at(3); |
@@ -282,6 +282,7 @@ __global__ void __launch_bounds__(128) | |||
g.col = ck_g->col_desc; | |||
g.start_row = fragments[frag_id].start_value_idx; | |||
g.num_rows = fragments[frag_id].num_leaf_values; | |||
g.non_leaf_nulls = fragments[frag_id].num_values - g.num_rows; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will account for all nulls above the leaf level, even if we have nested lists?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it should. For instance, here's a dump of the first test case:
value 1: R:0 D:2 V:<null>
value 2: R:1 D:3 V:2
value 3: R:1 D:2 V:<null>
value 4: R:0 D:1 V:<null>
value 5: R:0 D:3 V:4
value 6: R:1 D:3 V:5
value 7: R:0 D:0 V:<null>
The R:0:D1 value is an empty list, while R0:D0 is a null list. The D2 nulls are leaves. Likewise for list of list:
value 1: R:0 D:5 V:1
value 2: R:2 D:5 V:2
value 3: R:2 D:5 V:3
value 4: R:1 D:3 V:<null>
value 5: R:1 D:5 V:4
value 6: R:2 D:5 V:5
value 7: R:1 D:2 V:<null>
value 8: R:1 D:4 V:<null>
value 9: R:2 D:5 V:6
value 10: R:2 D:4 V:<null>
value 11: R:0 D:5 V:7
value 12: R:2 D:5 V:8
value 13: R:0 D:1 V:<null>
value 14: R:0 D:0 V:<null>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few small suggestions, mostly for tests. Nothing is blocking here, so I'll approve.
cpp/tests/io/parquet_test.cpp
Outdated
@@ -4279,6 +4279,9 @@ TEST_F(ParquetWriterTest, CheckColumnOffsetIndexNulls) | |||
auto const ci = read_column_index(source, chunk); | |||
auto const stats = parse_statistics(chunk); | |||
|
|||
// should be half nulls, except no nulls in column 0 | |||
EXPECT_EQ(stats.null_count, c > 0 ? num_rows / 2 : 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I weakly prefer this spelling:
EXPECT_EQ(stats.null_count, c > 0 ? num_rows / 2 : 0); | |
EXPECT_EQ(stats.null_count, c == 0 ? 0 : num_rows / 2); |
cpp/tests/io/parquet_test.cpp
Outdated
{ | ||
auto null_at_even_idx = | ||
cudf::detail::make_counting_transform_iterator(0, [](auto i) { return i % 2; }); | ||
auto null_at_idx_3 = cudf::test::iterators::null_at(3); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we typically write using cudf::test::iterators::null_at;
and then write null_at(3)
in the appropriate locations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should I get rid of the declaration here and just use null_at(3)
inline everywhere? (as with nulls_at({0, 2})
below?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, that would be fine.
cpp/tests/io/parquet_test.cpp
Outdated
@@ -4465,6 +4471,138 @@ TEST_F(ParquetWriterTest, CheckColumnOffsetIndexStruct) | |||
} | |||
} | |||
|
|||
TEST_F(ParquetWriterTest, CheckColumnIndexListWithNulls) | |||
{ | |||
auto null_at_even_idx = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the places where this is used are only nulling out one or two values. Could we just write using cudf::test::iterators::nulls_at;
and then nulls_at({0, 2})
or similar in those locations? If there was a longer list, then I would support using an iterator, but not for 1-2 values where being explicit is still concise. I don't feel strongly here, so we could leave this as-is if you prefer.
/ok to test |
/merge |
Description
The current Parquet writer sometimes generates wrong values for
null_count
in the column chunk statistics and page indexes. This occurs for nested schemas when nulls occur at a level above the leaf values. This PR fixes the calculation by adding anon_leaf_nulls
field to thestatistics_group
struct. This field is added to the chunknull_count
calculated over leaf values ingpu_calculate_group_statistics()
.Checklist