Fix page size estimation in Parquet writer (#13364)
Fixes #13250. The page size estimate for dictionary-encoded pages includes a term for the overhead bytes of the `bit-packed-header` that precedes each bit-packed literal run. This term originally divided by `256`, a value with no clear justification. This PR changes the divisor to `8`: the minimum length of a literal run is 8 values, so the worst case is back-to-back runs of 8, giving a header overhead of `num_values/8` bytes. It also adds a test that has been verified to fail when the divisor in the problematic term exceeds 16.

Authors:
  - Ed Seidl (https://github.com/etseidl)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Karthikeyan (https://github.com/karthikeyann)
  - Nghia Truong (https://github.com/ttnghia)

URL: #13364
etseidl authored May 18, 2023
1 parent 58d1c0e commit 8c2f76e
Showing 2 changed files with 51 additions and 2 deletions.
16 changes: 14 additions & 2 deletions cpp/src/io/parquet/page_enc.cu
@@ -115,8 +115,17 @@ constexpr uint32_t max_RLE_page_size(uint8_t value_bit_width, uint32_t num_value
{
if (value_bit_width == 0) return 0;

// Run length = 4, max(rle/bitpack header) = 5, add one byte per 256 values for overhead
return 4 + 5 + util::div_rounding_up_unsafe(num_values * value_bit_width, 8) + (num_values / 256);
// Run length = 4, max(rle/bitpack header) = 5. bitpacking worst case is one byte every 8 values
// (because bitpacked runs are a multiple of 8). Don't need to round up the last term since that
// overhead is accounted for in the '5'.
// TODO: this formula does not take into account the data for RLE runs. The worst realistic case
// is repeated runs of 8 bitpacked, 2 RLE values. In this case, the formula would be
// 0.8 * (num_values * bw / 8 + num_values / 8) + 0.2 * (num_values / 2 * (1 + (bw+7)/8))
// for bw < 8 the above value will be larger than below, but in testing it seems like for low
// bitwidths it's hard to get the pathological 8:2 split.
// If the encoder starts printing the data corruption warning, then this will need to be
// revisited.
return 4 + 5 + util::div_rounding_up_unsafe(num_values * value_bit_width, 8) + (num_values / 8);
}

// subtract b from a, but return 0 if this would underflow
@@ -1302,6 +1311,9 @@ __global__ void __launch_bounds__(128, 8)
if (t == 0) {
uint8_t* base = s->page.page_data + s->page.max_hdr_size;
auto actual_data_size = static_cast<uint32_t>(s->cur - base);
if (actual_data_size > s->page.max_data_size) {
CUDF_UNREACHABLE("detected possible page data corruption");
}
s->page.max_data_size = actual_data_size;
if (not comp_in.empty()) {
comp_in[blockIdx.x] = {base, actual_data_size};
37 changes: 37 additions & 0 deletions cpp/tests/io/parquet_test.cpp
@@ -5210,6 +5210,43 @@ TEST_F(ParquetWriterTest, DictionaryAlwaysTest)
EXPECT_TRUE(used_dict(1));
}

TEST_F(ParquetWriterTest, DictionaryPageSizeEst)
{
// one page
constexpr unsigned int nrows = 20'000U;

// this test is creating a pattern of repeating then non-repeating values to trigger
// a "worst-case" for page size estimation in the presence of a dictionary. have confirmed
// that this fails for values over 16 in the final term of `max_RLE_page_size()`.
// The output of the iterator will be 'CCCCCRRRRRCCCCCRRRRR...` where 'C' is a changing
// value, and 'R' repeats. The encoder will turn this into a literal run of 8 values
// (`CCCCCRRR`) followed by a repeated run of 2 (`RR`). This pattern then repeats, getting
// as close as possible to a condition of repeated 8 value literal runs.
auto elements0 = cudf::detail::make_counting_transform_iterator(0, [](auto i) {
if ((i / 5) % 2 == 1) {
return std::string("non-unique string");
} else {
return "a unique string value suffixed with " + std::to_string(i);
}
});
auto const col0 = cudf::test::strings_column_wrapper(elements0, elements0 + nrows);

auto const expected = table_view{{col0}};

auto const filepath = temp_env->get_temp_filepath("DictionaryPageSizeEst.parquet");
cudf::io::parquet_writer_options out_opts =
cudf::io::parquet_writer_options::builder(cudf::io::sink_info{filepath}, expected)
.compression(cudf::io::compression_type::ZSTD)
.dictionary_policy(cudf::io::dictionary_policy::ALWAYS);
cudf::io::write_parquet(out_opts);

cudf::io::parquet_reader_options default_in_opts =
cudf::io::parquet_reader_options::builder(cudf::io::source_info{filepath});
auto const result = cudf::io::read_parquet(default_in_opts);

CUDF_TEST_EXPECT_TABLES_EQUAL(expected, result.tbl->view());
}

TEST_P(ParquetSizedTest, DictionaryTest)
{
const unsigned int cardinality = (1 << (GetParam() - 1)) + 1;
