
Work around incompatibilities between V2 page header handling and zStandard compression in Parquet writer #14772

Merged: 14 commits, Jan 22, 2024
14 changes: 12 additions & 2 deletions cpp/src/io/parquet/page_enc.cu
@@ -57,6 +57,13 @@ constexpr int num_encode_warps = encode_block_size / cudf::detail::warp_size;

constexpr int rolling_idx(int pos) { return rolling_index<rle_buffer_size>(pos); }

// max V1 header size (27 rounded up to 8 byte boundary)
// also valid for dict page header (V1 or V2)
constexpr int MAX_V1_HDR_SIZE = 32;

// max V2 header size (49 rounded up to 8 byte boundary)
constexpr int MAX_V2_HDR_SIZE = 56;

// do not truncate statistics
constexpr int32_t NO_TRUNC_STATS = 0;

@@ -534,6 +541,9 @@ CUDF_KERNEL void __launch_bounds__(128)
uint32_t const t = threadIdx.x;
auto const data_page_type = write_v2_headers ? PageType::DATA_PAGE_V2 : PageType::DATA_PAGE;

// Max page header size excluding statistics
auto const max_data_page_hdr_size = write_v2_headers ? MAX_V2_HDR_SIZE : MAX_V1_HDR_SIZE;

if (t == 0) {
col_g = col_desc[blockIdx.x];
ck_g = chunks[blockIdx.y][blockIdx.x];
@@ -584,7 +594,7 @@ CUDF_KERNEL void __launch_bounds__(128)
page_g.chunk = &chunks[blockIdx.y][blockIdx.x];
page_g.chunk_id = blockIdx.y * num_columns + blockIdx.x;
page_g.hdr_size = 0;
-    page_g.max_hdr_size = 32;
+    page_g.max_hdr_size = MAX_V1_HDR_SIZE;
page_g.max_data_size = ck_g.uniq_data_size;
page_g.start_row = cur_row;
page_g.num_rows = ck_g.num_dict_entries;
@@ -684,7 +694,7 @@ CUDF_KERNEL void __launch_bounds__(128)
page_g.chunk_id = blockIdx.y * num_columns + blockIdx.x;
page_g.page_type = data_page_type;
page_g.hdr_size = 0;
-    page_g.max_hdr_size = 32;  // Max size excluding statistics
+    page_g.max_hdr_size = max_data_page_hdr_size;  // Max size excluding statistics
if (ck_g.stats) {
uint32_t stats_hdr_len = 16;
if (col_g.stats_dtype == dtype_string || col_g.stats_dtype == dtype_byte_array) {
3 changes: 3 additions & 0 deletions cpp/src/io/parquet/writer_impl.cu
@@ -2184,6 +2184,9 @@ writer::impl::impl(std::vector<std::unique_ptr<data_sink>> sinks,
if (options.get_metadata()) {
_table_meta = std::make_unique<table_input_metadata>(*options.get_metadata());
}
if (_write_v2_headers and _compression == Compression::ZSTD) {
CUDF_FAIL("V2 page headers cannot be used with ZSTD compression");
Contributor:
We need the same check for the chunked case as well.

Contributor Author:
Oops. Or should the test (and the lines above it) go in init_state() instead? (And maybe change the name of init_state.)

Contributor:
Looks like I'm late with the answer :)
Side note: the current code in init_state should not be there (it breaks the strong exception guarantee).

}
init_state();
}
