Fix calculation of null counts for Parquet statistics #12938

Merged
merged 10 commits into from
Mar 17, 2023

Contributor

@etseidl etseidl commented Mar 13, 2023

Description

The current Parquet writer sometimes generates wrong values for null_count in the column chunk statistics and page indexes. This occurs for nested schemas when nulls occur at a level above the leaf values. This PR fixes the calculation by adding a non_leaf_nulls field to the statistics_group struct. This field is added to the chunk null_count calculated over leaf values in gpu_calculate_group_statistics().
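As a rough illustration of the arithmetic described above (not the actual libcudf kernel), the sketch below adds the leaf-level nulls and `non_leaf_nulls` of each statistics group to get a chunk-level count. The names `GroupSketch`, `leaf_nulls`, and `chunk_null_count` are hypothetical; only `non_leaf_nulls` is a name taken from this PR.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one statistics group. Only non_leaf_nulls mirrors
// a field named in this PR; everything else is an assumption for illustration.
struct GroupSketch {
  std::uint32_t leaf_nulls;      // nulls found among the leaf values
  std::uint32_t non_leaf_nulls;  // entries that terminate above the leaf level
};

// Chunk null_count = sum over groups of (leaf nulls + non-leaf nulls).
inline std::uint32_t chunk_null_count(std::vector<GroupSketch> const& groups)
{
  std::uint32_t total = 0;
  for (auto const& g : groups) {
    total += g.leaf_nulls + g.non_leaf_nulls;
  }
  return total;
}
```

Counting only `leaf_nulls` would miss rows whose list is itself null, which is the undercount this PR addresses.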

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@etseidl etseidl requested a review from a team as a code owner March 13, 2023 23:40
@etseidl etseidl requested review from harrism and vuule March 13, 2023 23:40

rapids-bot bot commented Mar 13, 2023

Pull requests from external contributors require approval from a rapidsai organization member with write or admin permissions before CI can begin.

@github-actions github-actions bot added the libcudf label Mar 13, 2023
@vyasr vyasr added the bug, 3 - Ready for Review, and non-breaking labels Mar 13, 2023
Contributor

vyasr commented Mar 13, 2023

/ok to test

Contributor

@vuule vuule left a comment

Looks good, just a few nitpicks in tests

TEST_F(ParquetWriterTest, CheckColumnIndexListWithNulls)
{
auto valids = cudf::detail::make_counting_transform_iterator(0, [](auto i) { return i % 2; });
auto valids2 = cudf::detail::make_counting_transform_iterator(0, [](auto i) { return i != 3; });
Contributor

You should be able to use our special test iterator here

Suggested change
- auto valids2 = cudf::detail::make_counting_transform_iterator(0, [](auto i) { return i != 3; });
+ auto null_at_3 = cudf::test::iterators::null_at(3);

@@ -282,6 +282,7 @@ __global__ void __launch_bounds__(128)
  g.col = ck_g->col_desc;
  g.start_row = fragments[frag_id].start_value_idx;
  g.num_rows = fragments[frag_id].num_leaf_values;
+ g.non_leaf_nulls = fragments[frag_id].num_values - g.num_rows;
Contributor

This will account for all nulls above the leaf level, even if we have nested lists?

Contributor Author

Yes, it should. For instance, here's a dump of the first test case:

value 1: R:0 D:2 V:<null>
value 2: R:1 D:3 V:2
value 3: R:1 D:2 V:<null>
value 4: R:0 D:1 V:<null>
value 5: R:0 D:3 V:4
value 6: R:1 D:3 V:5
value 7: R:0 D:0 V:<null>

The R:0 D:1 value is an empty list, while R:0 D:0 is a null list. The D:2 nulls are leaves. Likewise for a list of lists:

value 1:  R:0 D:5 V:1
value 2:  R:2 D:5 V:2
value 3:  R:2 D:5 V:3
value 4:  R:1 D:3 V:<null>
value 5:  R:1 D:5 V:4
value 6:  R:2 D:5 V:5
value 7:  R:1 D:2 V:<null>
value 8:  R:1 D:4 V:<null>
value 9:  R:2 D:5 V:6
value 10: R:2 D:4 V:<null>
value 11: R:0 D:5 V:7
value 12: R:2 D:5 V:8
value 13: R:0 D:1 V:<null>
value 14: R:0 D:0 V:<null>
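
The two dumps above can be turned into a small worked check. The sketch below is one reading of them under stated assumptions, not libcudf code: it takes the maximum definition level to be the level at which values appear (3 and 5 here), treats an entry with definition level of at least `max_def - 1` as reaching the leaf level (which assumes the leaf itself is nullable), and treats anything lower as terminating above the leaf. `LevelCounts` and `count_levels` are hypothetical names.

```cpp
#include <vector>

// Hypothetical tally of the (R, D) entries in a dump like the ones above.
struct LevelCounts {
  int num_values      = 0;  // every entry in the dump
  int num_leaf_values = 0;  // entries that reach the leaf level
  int leaf_nulls      = 0;  // leaf-level entries that carry no value

  // Same subtraction as the line added in this PR: num_values - num_leaf_values.
  int non_leaf_nulls() const { return num_values - num_leaf_values; }
  int null_count() const { return leaf_nulls + non_leaf_nulls(); }
};

// Assumes a nullable leaf, so leaf-level entries have D >= max_def - 1.
inline LevelCounts count_levels(std::vector<int> const& def_levels, int max_def)
{
  LevelCounts c;
  for (int d : def_levels) {
    ++c.num_values;
    if (d >= max_def - 1) {
      ++c.num_leaf_values;
      if (d < max_def) { ++c.leaf_nulls; }
    }
  }
  return c;
}
```

With the definition levels of the first dump, `{2, 3, 2, 1, 3, 3, 0}` and a maximum of 3, this gives 5 leaf values, 2 leaf nulls, and 2 non-leaf nulls. The list-of-lists dump, with a maximum of 5, gives 10 leaf values, 2 leaf nulls, and 4 non-leaf nulls.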

Contributor

@bdice bdice left a comment

A few small suggestions, mostly for tests. Nothing is blocking here, so I'll approve.

@@ -4279,6 +4279,9 @@ TEST_F(ParquetWriterTest, CheckColumnOffsetIndexNulls)
auto const ci = read_column_index(source, chunk);
auto const stats = parse_statistics(chunk);

// should be half nulls, except no nulls in column 0
EXPECT_EQ(stats.null_count, c > 0 ? num_rows / 2 : 0);
Contributor

I weakly prefer this spelling:

Suggested change
- EXPECT_EQ(stats.null_count, c > 0 ? num_rows / 2 : 0);
+ EXPECT_EQ(stats.null_count, c == 0 ? 0 : num_rows / 2);

{
auto null_at_even_idx =
cudf::detail::make_counting_transform_iterator(0, [](auto i) { return i % 2; });
auto null_at_idx_3 = cudf::test::iterators::null_at(3);
Contributor

I think we typically write using cudf::test::iterators::null_at; and then write null_at(3) in the appropriate locations.

Contributor Author

Should I get rid of the declaration here and just use null_at(3) inline everywhere (as with nulls_at({0, 2}) below)?

Contributor

Yup, that would be fine.
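
For readers unfamiliar with the validity convention used in this thread: the predicate returns true where a row is valid, so `i % 2` marks the even indices as null. The sketch below is a plain C++ stand-in, not the cudf test utilities themselves. `ValidityFn`, `null_at_sketch`, `nulls_at_sketch`, and `count_nulls` are hypothetical names, and they use simple predicates where the real helpers are iterator-based.

```cpp
#include <algorithm>
#include <functional>
#include <utility>
#include <vector>

using ValidityFn = std::function<bool(int)>;  // true means "row i is valid"

// Every row is valid except the one at idx.
inline ValidityFn null_at_sketch(int idx)
{
  return [idx](int i) { return i != idx; };
}

// Every row is valid except those listed in indices.
inline ValidityFn nulls_at_sketch(std::vector<int> indices)
{
  return [indices = std::move(indices)](int i) {
    return std::find(indices.begin(), indices.end(), i) == indices.end();
  };
}

// Count the rows in [0, num_rows) that the predicate marks as null.
inline int count_nulls(ValidityFn const& is_valid, int num_rows)
{
  int nulls = 0;
  for (int i = 0; i < num_rows; ++i) {
    if (!is_valid(i)) { ++nulls; }
  }
  return nulls;
}
```

Naming the null positions directly, as in `null_at_sketch(3)` or `nulls_at_sketch({0, 2})`, keeps short cases explicit, which is the point the reviewers make about the real helpers.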

@@ -4465,6 +4471,138 @@ TEST_F(ParquetWriterTest, CheckColumnOffsetIndexStruct)
}
}

TEST_F(ParquetWriterTest, CheckColumnIndexListWithNulls)
{
auto null_at_even_idx =
Contributor

It looks like the places where this is used are only nulling out one or two values. Could we just write using cudf::test::iterators::nulls_at; and then nulls_at({0, 2}) or similar in those locations? If there was a longer list, then I would support using an iterator, but not for 1-2 values where being explicit is still concise. I don't feel strongly here, so we could leave this as-is if you prefer.

@harrism harrism removed their request for review March 15, 2023 04:45
Contributor

ttnghia commented Mar 16, 2023

/ok to test

Contributor

vuule commented Mar 17, 2023

/merge

@rapids-bot rapids-bot bot merged commit 3540613 into rapidsai:branch-23.04 Mar 17, 2023
@etseidl etseidl deleted the feature/parquet_nulls branch March 17, 2023 15:55
Labels
3 - Ready for Review, bug, libcudf, non-breaking
6 participants