Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix null_count of columns returned by chunked_parquet_reader #13111

Merged
merged 10 commits into from
Apr 13, 2023
10 changes: 6 additions & 4 deletions cpp/src/io/parquet/page_data.cu
Original file line number Diff line number Diff line change
Expand Up @@ -1018,23 +1018,25 @@ static __device__ bool setupLocalPageInfo(page_state_s* const s,
s->page.nesting_decode[thread_depth].start_depth;
s->nesting_decode_cache[thread_depth].end_depth =
s->page.nesting_decode[thread_depth].end_depth;
s->nesting_decode_cache[thread_depth].valid_count = 0;
s->nesting_decode_cache[thread_depth].value_count = 0;
s->nesting_decode_cache[thread_depth].null_count = 0;
}
depth += blockDim.x;
}
}
if (!t) {
s->nesting_info = can_use_decode_cache ? s->nesting_decode_cache : s->page.nesting_decode;
}
__syncthreads();
vuule marked this conversation as resolved.
Show resolved Hide resolved

// zero counts
int depth = 0;
while (depth < s->page.num_output_nesting_levels) {
int const thread_depth = depth + t;
if (thread_depth < s->page.num_output_nesting_levels) {
s->nesting_info[thread_depth].valid_count = 0;
s->nesting_info[thread_depth].value_count = 0;
s->nesting_info[thread_depth].null_count = 0;
s->page.nesting_decode[thread_depth].valid_count = 0;
Copy link
Contributor

@nvdbaranec nvdbaranec Apr 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand this fix . s->nesting_info points to either the values in the actual pages, or to the decode cache. Up on line 1026 this gets set. So with the original code, this loop will be clearing the correct one. In the case where we're using the cache, the value computed for null_count gets copied back to the data in the pages at the very end here:

if (s->nesting_info == s->nesting_decode_cache) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Then I suspect something is wrong with the back copy. I'll look into it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your help reworking the fix :)
Please take another look.

s->page.nesting_decode[thread_depth].value_count = 0;
s->page.nesting_decode[thread_depth].null_count = 0;
}
depth += blockDim.x;
}
Expand Down
17 changes: 4 additions & 13 deletions cpp/src/io/utilities/column_buffer.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -241,13 +241,8 @@ std::unique_ptr<column> empty_like(column_buffer& buffer,
auto child = empty_like(buffer.children[0], child_info, stream, mr);

// make the final list column
return make_lists_column(0,
std::move(offsets),
std::move(child),
buffer._null_count,
std::move(buffer._null_mask),
stream,
mr);
return make_lists_column(
0, std::move(offsets), std::move(child), 0, rmm::device_buffer{0, stream, mr}, stream, mr);
} break;

case type_id::STRUCT: {
Expand All @@ -265,12 +260,8 @@ std::unique_ptr<column> empty_like(column_buffer& buffer,
return empty_like(col, child_info, stream, mr);
});

return make_structs_column(0,
std::move(output_children),
buffer._null_count,
std::move(buffer._null_mask),
stream,
mr);
return make_structs_column(
0, std::move(output_children), 0, rmm::device_buffer{0, stream, mr}, stream, mr);
} break;

default: return cudf::make_empty_column(buffer.type);
Expand Down
31 changes: 31 additions & 0 deletions cpp/tests/io/parquet_chunked_reader_test.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -920,3 +920,34 @@ TEST_F(ParquetChunkedReaderTest, TestChunkedReadWithListsOfStructs)
CUDF_TEST_EXPECT_TABLES_EQUAL(*expected_with_nulls, *result);
}
}

TEST_F(ParquetChunkedReaderTest, TestChunkedReadNullCount)
{
auto constexpr num_rows = 100'000;

auto const sequence = cudf::detail::make_counting_transform_iterator(0, [](auto i) { return 1; });
auto const validity =
cudf::detail::make_counting_transform_iterator(0, [](auto i) { return i % 4 != 3; });
cudf::test::fixed_width_column_wrapper<int32_t> col{sequence, sequence + num_rows, validity};
std::vector<std::unique_ptr<cudf::column>> cols;
cols.push_back(col.release());
auto const expected = std::make_unique<cudf::table>(std::move(cols));

auto const filepath = temp_env->get_temp_filepath("chunked_reader_null_count.parquet");
auto const page_limit_rows = num_rows / 5;
auto const write_opts =
cudf::io::parquet_writer_options::builder(cudf::io::sink_info{filepath}, *expected)
.max_page_size_rows(page_limit_rows) // 20k rows per page
.build();
cudf::io::write_parquet(write_opts);

auto const byte_limit = page_limit_rows * sizeof(int);
auto const read_opts =
cudf::io::parquet_reader_options::builder(cudf::io::source_info{filepath}).build();
auto reader = cudf::io::chunked_parquet_reader(byte_limit, read_opts);

do {
// Every fourth row is null
EXPECT_EQ(reader.read_chunk().tbl->get_column(0).null_count(), page_limit_rows / 4);
} while (reader.has_next());
}