Encode DELTA_BYTE_ARRAY in Parquet writer #14938

Closed
wants to merge 25 commits into from
Changes from 17 commits
25 commits
3b1b960
encode delta_byte_array
etseidl Jan 30, 2024
6b1886b
fix decl of gpuEncodeDeltaByteArrayPages
etseidl Jan 31, 2024
dc7b0e9
fix bug in delta binary encoder. wasn't handling long runs of nulls
etseidl Feb 1, 2024
7579b64
suggestion from review
etseidl Feb 1, 2024
cc657e1
Merge branch 'encode_delta_ba' of github.com:etseidl/cudf into encode…
etseidl Feb 1, 2024
4812132
make it pythonic
etseidl Feb 1, 2024
dfbe6d8
Merge branch 'encode_delta_ba' of github.com:etseidl/cudf into encode…
etseidl Feb 1, 2024
07d1af7
change variable per suggestion
etseidl Feb 1, 2024
985d50a
more review changes
etseidl Feb 1, 2024
77e3789
change another variable name
etseidl Feb 1, 2024
388ae86
add explanation of when to choose delta_byte_array
etseidl Feb 1, 2024
8709b9c
Merge branch 'branch-24.04' into encode_delta_ba
etseidl Feb 1, 2024
6b2ef14
fix the other bool assignment
etseidl Feb 1, 2024
46c5425
add delta binary test
etseidl Feb 1, 2024
b955c37
use struct rather than tuple for byte arrays
etseidl Feb 2, 2024
fb6385b
Merge remote-tracking branch 'origin/branch-24.04' into encode_delta_ba
etseidl Feb 2, 2024
5645b8b
Merge branch 'branch-24.04' into encode_delta_ba
etseidl Feb 2, 2024
858b4c9
fix bug in delta_byte_array reader
etseidl Feb 2, 2024
cf30cae
Apply suggestions from code review
etseidl Feb 2, 2024
ad873c6
more suggestions
etseidl Feb 2, 2024
9883af7
a few more cleanups
etseidl Feb 2, 2024
81f4ee2
lost a change somehow
etseidl Feb 2, 2024
3e8767b
Merge branch 'branch-24.04' into encode_delta_ba
etseidl Feb 6, 2024
8378d64
Merge branch 'rapidsai:branch-24.04' into encode_delta_ba
etseidl Feb 7, 2024
f0eccd0
Merge branch 'branch-24.04' into encode_delta_ba
etseidl Feb 14, 2024
62 changes: 62 additions & 0 deletions cpp/include/cudf/io/parquet.hpp
vuule marked this conversation as resolved.
@@ -563,6 +563,9 @@ class parquet_writer_options {
std::shared_ptr<writer_compression_statistics> _compression_stats;
// write V2 page headers?
bool _v2_page_headers = false;
// Use DELTA_BYTE_ARRAY when dictionary encoding is not available rather than the default
// DELTA_LENGTH_BYTE_ARRAY
bool _prefer_delta_byte_array = false;

/**
* @brief Constructor from sink and table.
@@ -761,6 +764,13 @@ class parquet_writer_options {
*/
[[nodiscard]] auto is_enabled_write_v2_headers() const { return _v2_page_headers; }

/**
* @brief Returns `true` if DELTA_BYTE_ARRAY is the preferred string encoding.
*
* @return `true` if DELTA_BYTE_ARRAY is the preferred string encoding.
*/
[[nodiscard]] auto is_enabled_prefer_dba() const { return _prefer_delta_byte_array; }

/**
* @brief Sets partitions.
*
@@ -892,6 +902,13 @@ class parquet_writer_options {
* @param val Boolean value to enable/disable writing of V2 page headers.
*/
void enable_write_v2_headers(bool val) { _v2_page_headers = val; }

/**
* @brief Sets preference for delta encoding.
*
* @param val Boolean value to enable/disable use of DELTA_BYTE_ARRAY encoding.
*/
void enable_prefer_dba(bool val) { _prefer_delta_byte_array = val; }
};

/**
@@ -1143,6 +1160,20 @@ class parquet_writer_options_builder {
*/
parquet_writer_options_builder& write_v2_headers(bool enabled);

/**
* @brief Set to true if DELTA_BYTE_ARRAY encoding should be used.
*
* The default encoding for all columns is dictionary encoding. When dictionary encoding
* cannot be used (it was disabled, or the dictionary is too large), the parquet writer
* will usually fall back to PLAIN encoding. If V2 headers are enabled, however, the
* choice for fall back is DELTA_LENGTH_BYTE_ARRAY. Setting this to `true` will use
* DELTA_BYTE_ARRAY encoding instead. This will apply to all string columns.
*
* @param enabled Boolean value to enable/disable use of DELTA_BYTE_ARRAY encoding.
* @return this for chaining
*/
parquet_writer_options_builder& prefer_dba(bool enabled);

/**
* @brief move parquet_writer_options member once it's built.
*/
@@ -1230,6 +1261,9 @@ class chunked_parquet_writer_options {
std::shared_ptr<writer_compression_statistics> _compression_stats;
// write V2 page headers?
bool _v2_page_headers = false;
// Use DELTA_BYTE_ARRAY when dictionary encoding is not available rather than the default
// DELTA_LENGTH_BYTE_ARRAY
bool _prefer_delta_byte_array = false;

/**
* @brief Constructor from sink.
@@ -1384,6 +1418,13 @@ class chunked_parquet_writer_options {
*/
[[nodiscard]] auto is_enabled_write_v2_headers() const { return _v2_page_headers; }

/**
* @brief Returns `true` if DELTA_BYTE_ARRAY is the preferred string encoding.
*
* @return `true` if DELTA_BYTE_ARRAY is the preferred string encoding.
*/
[[nodiscard]] auto is_enabled_prefer_dba() const { return _prefer_delta_byte_array; }

/**
* @brief Sets metadata.
*
@@ -1501,6 +1542,13 @@ class chunked_parquet_writer_options {
*/
void enable_write_v2_headers(bool val) { _v2_page_headers = val; }

/**
* @brief Sets preference for delta encoding.
*
* @param val Boolean value to enable/disable use of DELTA_BYTE_ARRAY encoding.
*/
void enable_prefer_dba(bool val) { _prefer_delta_byte_array = val; }

/**
* @brief creates builder to build chunked_parquet_writer_options.
*
@@ -1612,6 +1660,20 @@ class chunked_parquet_writer_options_builder {
*/
chunked_parquet_writer_options_builder& write_v2_headers(bool enabled);

/**
* @brief Set to true if DELTA_BYTE_ARRAY encoding should be used.
*
* The default encoding for all columns is dictionary encoding. When dictionary encoding
* cannot be used (it was disabled, or the dictionary is too large), the parquet writer
* will usually fall back to PLAIN encoding. If V2 headers are enabled, however, the
* choice for fall back is DELTA_LENGTH_BYTE_ARRAY. Setting this to `true` will use
* DELTA_BYTE_ARRAY encoding instead. This will apply to all string columns.
*
* @param enabled Boolean value to enable/disable use of DELTA_BYTE_ARRAY encoding.
* @return this for chaining
*/
chunked_parquet_writer_options_builder& prefer_dba(bool enabled);

/**
* @brief Sets the maximum row group size, in bytes.
*
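The fallback rule described in the `prefer_dba` doc comments can be summarized in a small standalone sketch. It does not use libcudf: the `encoding` enum and `choose_string_encoding` are hypothetical names introduced here for illustration. It also assumes that `prefer_dba` only takes effect when V2 page headers are enabled, which is one reading of the doc comment and is not confirmed by the hunks shown.

```cpp
// Hypothetical stand-in for the encodings named in the doc comment.
enum class encoding { DICTIONARY, PLAIN, DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY };

// Sketch of the selection rule for a string column, as described in the
// prefer_dba documentation. Not the writer's actual code.
inline encoding choose_string_encoding(bool dictionary_usable,
                                       bool v2_page_headers,
                                       bool prefer_dba)
{
  // Dictionary encoding is the default whenever it can be used.
  if (dictionary_usable) { return encoding::DICTIONARY; }
  // Without V2 headers the usual fallback is PLAIN.
  // ASSUMPTION: prefer_dba is ignored in this case.
  if (!v2_page_headers) { return encoding::PLAIN; }
  // With V2 headers the fallback is DELTA_LENGTH_BYTE_ARRAY unless the
  // caller asked for DELTA_BYTE_ARRAY.
  return prefer_dba ? encoding::DELTA_BYTE_ARRAY : encoding::DELTA_LENGTH_BYTE_ARRAY;
}
```

For example, `choose_string_encoding(false, true, true)` yields `encoding::DELTA_BYTE_ARRAY`, while the same call with `prefer_dba` set to `false` yields `encoding::DELTA_LENGTH_BYTE_ARRAY`.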
13 changes: 13 additions & 0 deletions cpp/src/io/functions.cpp
@@ -809,6 +809,12 @@ parquet_writer_options_builder& parquet_writer_options_builder::write_v2_headers
return *this;
}

parquet_writer_options_builder& parquet_writer_options_builder::prefer_dba(bool enabled)
{
options.enable_prefer_dba(enabled);
return *this;
}

void chunked_parquet_writer_options::set_key_value_metadata(
std::vector<std::map<std::string, std::string>> metadata)
{
@@ -897,6 +903,13 @@ chunked_parquet_writer_options_builder& chunked_parquet_writer_options_builder::
return *this;
}

chunked_parquet_writer_options_builder& chunked_parquet_writer_options_builder::prefer_dba(
bool enabled)
{
options.enable_prefer_dba(enabled);
return *this;
}

chunked_parquet_writer_options_builder&
chunked_parquet_writer_options_builder::max_page_fragment_size(size_type val)
{
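The builder methods above forward to the options setter and return `*this` so that calls can be chained. A minimal mock of that pattern is sketched below. The `writer_options_sketch` and `writer_options_builder_sketch` classes and the `build()` signature are hypothetical stand-ins, not the libcudf types; only the `prefer_dba`, `enable_prefer_dba`, `is_enabled_prefer_dba`, and V2 header method names are taken from the diff.

```cpp
// Hypothetical mock of an options class with the two flags from the diff.
class writer_options_sketch {
  bool _v2_page_headers         = false;
  bool _prefer_delta_byte_array = false;

 public:
  void enable_write_v2_headers(bool val) { _v2_page_headers = val; }
  void enable_prefer_dba(bool val) { _prefer_delta_byte_array = val; }
  [[nodiscard]] bool is_enabled_write_v2_headers() const { return _v2_page_headers; }
  [[nodiscard]] bool is_enabled_prefer_dba() const { return _prefer_delta_byte_array; }
};

// Hypothetical mock of the builder: each setter forwards to the options
// object and returns *this so calls can be chained.
class writer_options_builder_sketch {
  writer_options_sketch options;

 public:
  writer_options_builder_sketch& write_v2_headers(bool enabled)
  {
    options.enable_write_v2_headers(enabled);
    return *this;
  }

  writer_options_builder_sketch& prefer_dba(bool enabled)
  {
    options.enable_prefer_dba(enabled);
    return *this;
  }

  // Returns a copy of the accumulated options (simplified for the sketch).
  [[nodiscard]] writer_options_sketch build() const { return options; }
};
```

With this mock, `writer_options_builder_sketch{}.write_v2_headers(true).prefer_dba(true).build()` produces an options object with both flags set, while a default-built one leaves `prefer_dba` disabled.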
4 changes: 2 additions & 2 deletions cpp/src/io/parquet/delta_enc.cuh
@@ -1,5 +1,5 @@
/*
- * Copyright (c) 2023, NVIDIA CORPORATION.
+ * Copyright (c) 2023-2024, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -201,7 +201,7 @@ class delta_binary_packer {
if (is_valid) { _buffer[delta::rolling_idx(pos + _current_idx + _values_in_buffer)] = value; }
__syncthreads();

- if (threadIdx.x == 0) {
+ if (num_valid > 0 && threadIdx.x == 0) {
Contributor

would it be worthwhile to add a test for the fixed bug?

Contributor Author

Does nothing avoid the Eye of Sauron???? 🤣 Guess I'll whip something up 🧑‍🍳

Contributor Author

done

Contributor

if it weren't the only change in the file maybe it could have snuck by :D

_values_in_buffer += num_valid;
// if first pass write header
if (_current_idx == 0) {
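For context on what this PR encodes: in the Parquet format, DELTA_BYTE_ARRAY stores each value as the length of the prefix it shares with the previous value plus the remaining suffix. The prefix lengths are then DELTA_BINARY_PACKED and the suffixes are written as DELTA_LENGTH_BYTE_ARRAY, which is why the delta binary packer is involved. The host-side sketch below shows only the prefix/suffix split, not the binary packing and not the GPU kernel from this PR. `prefix_suffix` and `split_prefix_suffix` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// One encoded entry: bytes shared with the previous value, plus the rest.
struct prefix_suffix {
  std::size_t prefix_len;
  std::string suffix;
};

// Splits each value into (shared prefix length, suffix) relative to the
// value before it. The first value has no predecessor, so its prefix is 0.
inline std::vector<prefix_suffix> split_prefix_suffix(std::vector<std::string> const& values)
{
  std::vector<prefix_suffix> out;
  out.reserve(values.size());
  std::string const* prev = nullptr;
  for (auto const& cur : values) {
    std::size_t n = 0;
    if (prev != nullptr) {
      // Count matching leading bytes, bounded by the shorter string.
      auto const limit = std::min(prev->size(), cur.size());
      while (n < limit && (*prev)[n] == cur[n]) { ++n; }
    }
    out.push_back({n, cur.substr(n)});
    prev = &cur;
  }
  return out;
}
```

For the input `{"apple", "applesauce", "apply", "banana"}` this yields prefix lengths 0, 5, 4, 0 with suffixes `"apple"`, `"sauce"`, `"y"`, `"banana"`. Sorted or otherwise prefix-heavy string columns therefore tend to benefit most from this encoding.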