Skip to content

Commit

Permalink
Fix bugs in handling of delta encodings (#15075)
Browse files Browse the repository at this point in the history
Part of #14938 was fixing two bugs discovered during testing. One is in the encoding of DELTA_BINARY_PACKED data where the first non-null value in a page to be encoded is not in the first batch of 129 values. The second is an error in decoding of DELTA_BYTE_ARRAY pages where, again, the first non-null value is not in the first block to be decoded.

This PR includes a test for the former, but the latter cannot be easily tested because the python API still lacks `skip_rows`, and we cannot generate DELTA_BYTE_ARRAY encoded data without the changes in #14938. A test for the latter will be added later, but the fix has been validated with data on hand locally.

Authors:
  - Ed Seidl (https://github.com/etseidl)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - David Wendt (https://github.com/davidwendt)

URL: #15075
  • Loading branch information
etseidl authored Feb 22, 2024
1 parent c8dc33c commit 90b763c
Show file tree
Hide file tree
Showing 3 changed files with 31 additions and 2 deletions.
4 changes: 2 additions & 2 deletions cpp/src/io/parquet/delta_enc.cuh
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2023, NVIDIA CORPORATION.
* Copyright (c) 2023-2024, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
Expand Down Expand Up @@ -201,7 +201,7 @@ class delta_binary_packer {
if (is_valid) { _buffer[delta::rolling_idx(pos + _current_idx + _values_in_buffer)] = value; }
__syncthreads();

if (threadIdx.x == 0) {
if (num_valid > 0 && threadIdx.x == 0) {
_values_in_buffer += num_valid;
// if first pass write header
if (_current_idx == 0) {
Expand Down
3 changes: 3 additions & 0 deletions cpp/src/io/parquet/page_string_decode.cu
Original file line number Diff line number Diff line change
Expand Up @@ -535,6 +535,9 @@ __device__ thrust::pair<size_t, size_t> totalDeltaByteArraySize(uint8_t const* d
uint32_t const idx = db->current_value_idx + i + lane_id;
if (idx >= start_value && idx < end_value && idx < db->value_count) {
lane_sum += db->value[rolling_index<delta_rolling_buf_size>(idx)];
}
// need lane_max over all values, not just in bounds
if (idx < db->value_count) {
lane_max = max(lane_max, db->value[rolling_index<delta_rolling_buf_size>(idx)]);
}
}
Expand Down
26 changes: 26 additions & 0 deletions cpp/tests/io/parquet_writer_test.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1426,6 +1426,32 @@ TEST_F(ParquetWriterTest, RowGroupMetadata)
static_cast<int64_t>(num_rows * sizeof(column_type)));
}

TEST_F(ParquetWriterTest, DeltaBinaryStartsWithNulls)
{
// test that the DELTA_BINARY_PACKED writer can properly encode a column that begins with
// more than 129 nulls
constexpr int num_rows = 500;
constexpr int num_nulls = 150;

auto const ones = thrust::make_constant_iterator(1);
auto valids = cudf::detail::make_counting_transform_iterator(
0, [num_nulls](auto i) { return i >= num_nulls; });
auto const col = cudf::test::fixed_width_column_wrapper<int>{ones, ones + num_rows, valids};
auto const expected = table_view({col});

auto const filepath = temp_env->get_temp_filepath("DeltaBinaryStartsWithNulls.parquet");
cudf::io::parquet_writer_options out_opts =
cudf::io::parquet_writer_options::builder(cudf::io::sink_info{filepath}, expected)
.write_v2_headers(true)
.dictionary_policy(cudf::io::dictionary_policy::NEVER);
cudf::io::write_parquet(out_opts);

cudf::io::parquet_reader_options in_opts =
cudf::io::parquet_reader_options::builder(cudf::io::source_info{filepath});
auto result = cudf::io::read_parquet(in_opts);
CUDF_TEST_EXPECT_TABLES_EQUAL(expected, result.tbl->view());
}

/////////////////////////////////////////////////////////////
// custom mem mapped data sink that supports device writes
template <bool supports_device_writes>
Expand Down

0 comments on commit 90b763c

Please sign in to comment.