Encode DELTA_BYTE_ARRAY in Parquet writer #14938

Closed
etseidl wants to merge 25 commits

Contributor

@etseidl etseidl commented Jan 30, 2024

Description

The last piece of #13501. This adds the ability to encode Parquet pages as DELTA_BYTE_ARRAY.
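
For readers unfamiliar with the encoding: per the Parquet format specification, DELTA_BYTE_ARRAY stores each value as the length of the prefix it shares with the previous value, plus the remaining suffix. The following is a minimal Python sketch of that split, not the CUDA implementation in this PR. The function names are illustrative, and the real encoding additionally packs the prefix lengths with DELTA_BINARY_PACKED and the suffixes with DELTA_LENGTH_BYTE_ARRAY, which is omitted here.

```python
def common_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def delta_byte_array_split(values):
    """Split byte strings into (prefix_lengths, suffixes).

    Each prefix length is measured against the previous value;
    the first value is compared against the empty string.
    """
    prefix_lengths, suffixes = [], []
    prev = b""
    for v in values:
        p = common_prefix_len(prev, v)
        prefix_lengths.append(p)
        suffixes.append(v[p:])
        prev = v
    return prefix_lengths, suffixes


def delta_byte_array_join(prefix_lengths, suffixes):
    """Inverse of delta_byte_array_split."""
    out, prev = [], b""
    for p, s in zip(prefix_lengths, suffixes):
        v = prev[:p] + s
        out.append(v)
        prev = v
    return out


values = [b"axis", b"axle", b"babble", b"babyhood"]
prefixes, suffixes = delta_byte_array_split(values)
# prefixes == [0, 2, 0, 3]
# suffixes == [b"axis", b"le", b"babble", b"yhood"]
```

The scheme pays off when neighbouring values share long prefixes (sorted strings, URLs, paths); with no shared prefixes it degenerates to storing every value in full plus a column of zeros.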

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@etseidl etseidl requested review from a team as code owners January 30, 2024 22:36

copy-pr-bot bot commented Jan 30, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added labels on Jan 30, 2024: libcudf (affects libcudf C++/CUDA code), Python (affects Python cuDF API)
@PointKernel PointKernel added labels on Jan 30, 2024: non-breaking (non-breaking change), feature request (new feature or request), cuIO (cuIO issue), 4 - Needs cuIO Reviewer
@PointKernel
Member

/ok to test

@PointKernel PointKernel requested a review from vuule January 31, 2024 18:43
@vuule
Contributor

vuule commented Jan 31, 2024

CC @galipremsagar for the Python API changes

Contributor

@vuule vuule left a comment

partial review

python/cudf/cudf/utils/ioutils.py (outdated, resolved)
cpp/include/cudf/io/parquet.hpp (outdated, resolved)
python/cudf/cudf/tests/test_parquet.py (outdated, resolved)
cpp/tests/io/parquet_reader_test.cpp (outdated, resolved)
cpp/include/cudf/io/parquet.hpp (resolved)
cpp/tests/io/parquet_reader_test.cpp (outdated, resolved)
cpp/src/io/parquet/writer_impl.cu (outdated, resolved)
cpp/src/io/parquet/page_enc.cu (outdated, resolved)
@vuule vuule self-requested a review February 1, 2024 21:18
Contributor

@vuule vuule left a comment

This is amazing. Builds so well on the previous delta encoding work.
Flushing the few potentially relevant comments I got; will make another pass over the core encode implementation, but it's looking great so far.

@@ -201,7 +201,7 @@ class delta_binary_packer {
       if (is_valid) { _buffer[delta::rolling_idx(pos + _current_idx + _values_in_buffer)] = value; }
       __syncthreads();

-      if (threadIdx.x == 0) {
+      if (num_valid > 0 && threadIdx.x == 0) {
Contributor

would it be worthwhile to add a test for the fixed bug?

Contributor Author

Does nothing avoid the Eye of Sauron???? 🤣 Guess I'll whip something up 🧑‍🍳

Contributor Author

done

Contributor

if it weren't the only change in the file maybe it could have snuck by :D

};

/*
some timing results. remove when done
Contributor

leaving a TODO for later in review process


auto const type_id = s->col.leaf_column->type().id();

auto const get_string_tuple = [type_id, s](int idx) -> thrust::pair<size_type, uint8_t const*> {
Contributor

can this return a cudf::string_view? It's basically a (char const*, size_type) pair

Contributor Author

string_view comes with UTF-8 baggage...I'd be more inclined to use byte_array_view since that's more in line with the physical type. But then that's using std::byte, which is just not something you'd want to iterate over. I'd be willing to return a struct that has the overlap calculation as a method...
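
One way to picture the struct floated in this reply is a thin wrapper over raw bytes that exposes the overlap calculation as a method. This is a hypothetical Python sketch: the names `ByteArrayRef` and `common_prefix_length` are invented for illustration and are not cudf APIs. It compares bytes, not UTF-8 characters, which is the point of avoiding string_view here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ByteArrayRef:
    """Hypothetical (pointer, length)-style view over raw bytes."""

    data: bytes

    @property
    def size_bytes(self) -> int:
        return len(self.data)

    def common_prefix_length(self, other: "ByteArrayRef") -> int:
        """Count leading bytes shared with `other` (byte-wise, no UTF-8 decoding)."""
        limit = min(len(self.data), len(other.data))
        n = 0
        while n < limit and self.data[n] == other.data[n]:
            n += 1
        return n


# Two different characters whose UTF-8 encodings share their first byte:
e_acute = ByteArrayRef(b"\xc3\xa9")  # U+00E9
e_grave = ByteArrayRef(b"\xc3\xa8")  # U+00E8
overlap = e_acute.common_prefix_length(e_grave)  # 1 byte, although 0 characters match
```

Keeping the method on the struct means the caller never needs to know whether the bytes came from a string column or a byte-array list column.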

Contributor

@vuule vuule left a comment

few more nitpicks, mostly related to naming

cpp/src/io/parquet/page_enc.cu (outdated, resolved)
Comment on lines +2297 to +2300
return {reinterpret_cast<uint8_t const*>(str.data()), str.size_bytes()};
} else if (s->col.output_as_byte_array && type_id == type_id::LIST) {
auto const str = get_element<statistics::byte_array_view>(*s->col.leaf_column, idx);
return {reinterpret_cast<uint8_t const*>(str.data()),
Contributor

Why does byte_array use uint8_t* instead of std::byte* or char*? I think either of the two is less UBish when it comes to reinterpret_casting, compared to uint8_t*.

Contributor Author

Parquet uses unsigned bytes for BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY. We're doing == rather than compare, but I'd prefer to keep the byte array representation in line with what a byte array in Parquet is.
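
The signed/unsigned distinction matters for ordering but not for equality, which a small Python sketch can show. Python `bytes` already compare as unsigned values; `as_signed`, `less_unsigned`, and `less_signed` are helpers introduced here only to mimic a signed `char` interpretation and are not part of cudf or Parquet.

```python
def as_signed(b: int) -> int:
    """Reinterpret an unsigned byte value (0..255) as a signed 8-bit value."""
    return b - 256 if b >= 128 else b


def less_unsigned(a: bytes, b: bytes) -> bool:
    """Lexicographic order over unsigned bytes (how Python compares bytes)."""
    return a < b


def less_signed(a: bytes, b: bytes) -> bool:
    """Lexicographic order if each byte were read as a signed 8-bit value."""
    return [as_signed(x) for x in a] < [as_signed(x) for x in b]


lo, hi = b"\x7f", b"\x80"
unsigned_order = less_unsigned(lo, hi)  # True: 127 < 128
signed_order = less_signed(lo, hi)      # False: 127 is not < -128
# Equality gives the same answer under either interpretation.
same_either_way = (lo == hi) == (
    [as_signed(x) for x in lo] == [as_signed(x) for x in hi]
)
```

Since the prefix computation only tests bytes for equality, the choice of `uint8_t` over `char` does not change its result; it only keeps the type consistent with how Parquet describes BYTE_ARRAY.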

cpp/src/io/parquet/page_enc.cu (outdated, resolved)
cpp/src/io/parquet/page_enc.cu (outdated, resolved)
cpp/src/io/parquet/page_enc.cu (outdated, resolved)
cpp/src/io/parquet/page_enc.cu (outdated, resolved)
Contributor

@vuule vuule left a comment

Thank you for addressing all suggestions. Looks great.

@vuule
Contributor

vuule commented Feb 3, 2024

/ok to test

Member

@PointKernel PointKernel left a comment

LGTM

Contributor

@bdice bdice left a comment

I don't think prefer_dba is a very good name. Is this really a "preference"? Can the code choose to disregard this preference? I would prefer something like use_delta_byte_array or enable_delta_byte_array, since it appears that this always uses DELTA_BYTE_ARRAY rather than DELTA_LENGTH_BYTE_ARRAY when the option is enabled.

Also, I want to see dba expanded to delta_byte_array throughout the code, to make it easier to search for this option (dba is not self-explanatory and we rarely use abbreviations). It's currently inconsistent between dba/delta_byte_array between internal/external APIs.

@@ -1370,6 +1381,7 @@ def test_delta_byte_array_roundtrip(
     nrows, add_nulls, max_string_length, str_encoding, tmpdir
 ):
     null_frequency = 0.25 if add_nulls else 0
+    prefer_dba = str_encoding == "DELTA_BYTE_ARRAY"
Contributor

I'd prefer the full name everywhere in this PR. prefer_dba -> prefer_delta_byte_array throughout.

Suggested change
-prefer_dba = str_encoding == "DELTA_BYTE_ARRAY"
+prefer_delta_byte_array = str_encoding == "DELTA_BYTE_ARRAY"

@etseidl
Contributor Author

etseidl commented Feb 16, 2024

rethinking how to select encodings...closing for now

@etseidl etseidl closed this Feb 16, 2024
rapids-bot bot pushed a commit that referenced this pull request Feb 22, 2024
Part of #14938 was fixing two bugs discovered during testing. One is in the encoding of DELTA_BINARY_PACKED data where the first non-null value in a page to be encoded is not in the first batch of 129 values. The second is an error in decoding of DELTA_BYTE_ARRAY pages where, again, the first non-null value is not in the first block to be decoded.

This PR includes a test for the former, but the latter cannot be easily tested because the python API still lacks `skip_rows`, and we cannot generate DELTA_BYTE_ARRAY encoded data without the changes in #14938. A test for the latter will be added later, but the fix has been validated with data on hand locally.
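
To make the first bug concrete, here is a toy Python model, not the libcudf kernel. It assumes (the excerpt quoted in #14938 does not show this) that the encoder records the page's first value the first time a batch is processed, and it uses 0 as a stand-in for whatever stale value an unguarded encoder would read from an empty buffer. Batches here are tiny rather than 129 values, and `encode_deltas` / `decode_deltas` are hypothetical names.

```python
def encode_deltas(batches, guard_empty_batches):
    """Toy delta encoder over batches of ints, with None marking a null.

    Returns (first_value, deltas).
    """
    first_value = None
    header_written = False
    pending = []
    for batch in batches:
        valid = [v for v in batch if v is not None]
        pending.extend(valid)
        if guard_empty_batches and not valid:
            continue  # nothing was buffered, so skip the bookkeeping
        if not header_written:
            # Assumption: the first processed batch supplies the first value.
            # Without the guard, an all-null batch reaches this point and
            # reads a slot that was never written; 0 models that stale slot.
            first_value = pending.pop(0) if pending else 0
            header_written = True
    deltas, prev = [], first_value
    for v in pending:
        deltas.append(v - prev)
        prev = v
    return first_value, deltas


def decode_deltas(first_value, deltas):
    """Rebuild the values from the first value and successive differences."""
    out = [first_value]
    for d in deltas:
        out.append(out[-1] + d)
    return out


# The first batch is entirely null, so the first real value arrives later.
batches = [[None, None], [7, None, 9], [12]]
fixed = decode_deltas(*encode_deltas(batches, guard_empty_batches=True))
buggy = decode_deltas(*encode_deltas(batches, guard_empty_batches=False))
# fixed == [7, 9, 12]; buggy == [0, 7, 9, 12] (a spurious leading value)
```

In this model the guard simply defers the first-value bookkeeping until a batch actually contributes a non-null value, which mirrors the intent of the `num_valid > 0` check.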

Authors:
  - Ed Seidl (https://github.com/etseidl)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - David Wendt (https://github.com/davidwendt)

URL: #15075
@vyasr vyasr added the 4 - Needs Review label (waiting for reviewer to review or respond) and removed the 4 - Needs cuIO Reviewer label on Feb 23, 2024
rapids-bot bot pushed a commit that referenced this pull request Mar 8, 2024
Re-submission of #14938. Final (delta) piece of #13501.

Adds the ability to encode Parquet pages as DELTA_BYTE_ARRAY. Python testing will be added as a follow-on when per-column encoding selection is added to the python API (ref this [comment](#15081 (comment))).

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #15239