Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix page size estimation in Parquet writer #13364

Merged
merged 6 commits into from
May 18, 2023
Merged
Show file tree
Hide file tree
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
9 changes: 7 additions & 2 deletions cpp/src/io/parquet/page_enc.cu
Original file line number Diff line number Diff line change
Expand Up @@ -115,8 +115,10 @@ constexpr uint32_t max_RLE_page_size(uint8_t value_bit_width, uint32_t num_value
{
if (value_bit_width == 0) return 0;

// Run length = 4, max(rle/bitpack header) = 5, add one byte per 256 values for overhead
return 4 + 5 + util::div_rounding_up_unsafe(num_values * value_bit_width, 8) + (num_values / 256);
// Run length = 4, max(rle/bitpack header) = 5. bitpacking worst case is one byte every 8 values
// (because bitpacked runs are a multiple of 8). Don't need to round up the last term since that
// overhead is accounted for in the '5'.
return 4 + 5 + util::div_rounding_up_unsafe(num_values * value_bit_width, 8) + (num_values / 8);
}

// subtract b from a, but return 0 if this would underflow
Expand Down Expand Up @@ -1302,6 +1304,9 @@ __global__ void __launch_bounds__(128, 8)
if (t == 0) {
uint8_t* base = s->page.page_data + s->page.max_hdr_size;
auto actual_data_size = static_cast<uint32_t>(s->cur - base);
if (actual_data_size > s->page.max_data_size) {
CUDF_UNREACHABLE("detected possible page data corruption");
}
s->page.max_data_size = actual_data_size;
if (not comp_in.empty()) {
comp_in[blockIdx.x] = {base, actual_data_size};
Expand Down
37 changes: 37 additions & 0 deletions cpp/tests/io/parquet_test.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -5210,6 +5210,43 @@ TEST_F(ParquetWriterTest, DictionaryAlwaysTest)
EXPECT_TRUE(used_dict(1));
}

TEST_F(ParquetWriterTest, DictionaryPageSizeEst)
{
// one page
constexpr unsigned int nrows = 20'000U;

// this test is creating a pattern of repeating then non-repeating values to trigger
// a "worst-case" for page size estimation in the presence of a dictionary. have confirmed
// that this fails for values over 16 in the final term of `max_RLE_page_size()`.
// The output of the iterator will be 'CCCCCRRRRRCCCCCRRRRR...` where 'C' is a changing
// value, and 'R' repeats. The encoder will turn this into a literal run of 8 values
// (`CCCCCRRR`) followed by a repeated run of 2 (`RR`). This pattern then repeats, getting
// as close as possible to a condition of repeated 8 value literal runs.
auto elements0 = cudf::detail::make_counting_transform_iterator(0, [](auto i) {
if ((i / 5) % 2 == 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is the pattern changed every five elements?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It worked 🤣 This was totally empirical...I changed the divisor until I started getting overwrites in the page encoder. Printing out the run lengths, it seemed to give literal runs of 8 followed by repeated runs of 2. Pretty close to ideal for the purposes of the test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seemed to give literal runs of 8 followed by repeated runs of 2

Can you please add this info to the comment above?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking seriously about it, the pattern is CCCCCRRRRRCCCCCRRRRR..., where C changes, R is repeating. The encoder will take the first CCCCCRRR and emit a literal run of 8, followed by a repeated run of RR. Then this repeats with another CCCCCRRR, RR. I'd like to say this was rationally designed, but I got lucky 😄

return std::string("non-unique string");
} else {
return "a unique string value suffixed with " + std::to_string(i);
}
});
auto const col0 = cudf::test::strings_column_wrapper(elements0, elements0 + nrows);

auto const expected = table_view{{col0}};

auto const filepath = temp_env->get_temp_filepath("DictionaryPageSizeEst.parquet");
cudf::io::parquet_writer_options out_opts =
cudf::io::parquet_writer_options::builder(cudf::io::sink_info{filepath}, expected)
.compression(cudf::io::compression_type::ZSTD)
.dictionary_policy(cudf::io::dictionary_policy::ALWAYS);
cudf::io::write_parquet(out_opts);

cudf::io::parquet_reader_options default_in_opts =
cudf::io::parquet_reader_options::builder(cudf::io::source_info{filepath});
auto const result = cudf::io::read_parquet(default_in_opts);

CUDF_TEST_EXPECT_TABLES_EQUAL(expected, result.tbl->view());
}

TEST_P(ParquetSizedTest, DictionaryTest)
{
const unsigned int cardinality = (1 << (GetParam() - 1)) + 1;
Expand Down