Encode DELTA_BYTE_ARRAY in Parquet writer #14938

etseidl · 2024-01-30T22:36:51Z

Description

The last piece of #13501. This adds the ability to encode Parquet pages as DELTA_BYTE_ARRAY.
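For readers unfamiliar with the encoding: DELTA_BYTE_ARRAY stores, for each value, the length of the prefix it shares with the previous value plus the remaining suffix. The following is a minimal sketch of that split in plain Python, not the libcudf kernel; the function names are hypothetical, and the real format additionally packs the prefix lengths with DELTA_BINARY_PACKED and the suffixes with DELTA_LENGTH_BYTE_ARRAY, which is omitted here.

```python
import os


def delta_byte_array_split(values):
    """Split byte strings into prefix lengths and suffixes (hypothetical helper)."""
    prefix_lengths, suffixes = [], []
    previous = b""
    for value in values:
        # Number of leading bytes shared with the previous value.
        shared = len(os.path.commonprefix([previous, value]))
        prefix_lengths.append(shared)
        suffixes.append(value[shared:])
        previous = value
    return prefix_lengths, suffixes


def delta_byte_array_join(prefix_lengths, suffixes):
    """Rebuild the original values from prefix lengths and suffixes."""
    values, previous = [], b""
    for shared, suffix in zip(prefix_lengths, suffixes):
        previous = previous[:shared] + suffix
        values.append(previous)
    return values


values = [b"axis", b"axle", b"babble", b"babyhood"]
prefix_lengths, suffixes = delta_byte_array_split(values)
roundtrip = delta_byte_array_join(prefix_lengths, suffixes)
```

The encoding pays off when neighbouring values share long prefixes (sorted strings, URLs, paths); with no shared prefixes it degenerates to storing every value whole.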

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2024-01-30T22:36:55Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

PointKernel · 2024-01-31T18:08:08Z

/ok to test

cpp/src/io/parquet/page_enc.cu

vuule · 2024-01-31T20:19:17Z

CC @galipremsagar for the Python API changes

vuule

partial review

python/cudf/cudf/utils/ioutils.py

cpp/include/cudf/io/parquet.hpp

python/cudf/cudf/tests/test_parquet.py

cpp/tests/io/parquet_reader_test.cpp

cpp/include/cudf/io/parquet.hpp

cpp/tests/io/parquet_reader_test.cpp

cpp/src/io/parquet/writer_impl.cu

cpp/src/io/parquet/page_enc.cu


vuule

This is amazing. Builds so well on the previous delta encoding work.
Flushing the few potentially relevant comments I got; will make another pass over the core encode implementation, but it's looking great so far.

vuule · 2024-02-01T21:17:50Z

cpp/src/io/parquet/delta_enc.cuh

@@ -201,7 +201,7 @@ class delta_binary_packer {
    if (is_valid) { _buffer[delta::rolling_idx(pos + _current_idx + _values_in_buffer)] = value; }
    __syncthreads();

-    if (threadIdx.x == 0) {
+    if (num_valid > 0 && threadIdx.x == 0) {


would it be worthwhile to add a test for the fixed bug?

Does nothing avoid the Eye of Sauron???? 🤣 Guess I'll whip something up 🧑‍🍳

if it weren't the only change in the file maybe it could have snuck by :D
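A toy model of why a guard like `num_valid > 0` matters, assuming (as the commit message suggests) that the problem shows up when a page starts with a long run of nulls. This is a single-threaded Python sketch of one plausible failure mode, not the CUDA code in delta_enc.cuh; the class, its fields, and the placeholder value are all hypothetical.

```python
class ToyDeltaPacker:
    """Toy stand-in for a delta packer: the first valid value goes in the header."""

    def __init__(self, skip_empty_batches):
        self.skip_empty_batches = skip_empty_batches
        self.first_value = None  # value that would be written to the page header
        self.pending = []        # values waiting to be delta-encoded

    def add_batch(self, batch):
        valid = [v for v in batch if v is not None]
        if self.skip_empty_batches and not valid:
            return  # analogous to the `num_valid > 0` guard in the diff
        self.pending.extend(valid)
        if self.first_value is None:
            # First pass: the header takes the first buffered value. If the batch
            # had no valid values, a placeholder (0 here) is recorded instead.
            self.first_value = self.pending.pop(0) if self.pending else 0


def pack(batches, skip_empty_batches):
    packer = ToyDeltaPacker(skip_empty_batches)
    for batch in batches:
        packer.add_batch(batch)
    return packer.first_value, packer.pending


batches = [[None, None, None], [None, 7, 9]]
guarded = pack(batches, skip_empty_batches=True)
unguarded = pack(batches, skip_empty_batches=False)
```

In this model the guarded packer records 7 as the first value, while the unguarded one commits a placeholder header before any real value has arrived.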

vuule · 2024-02-01T22:19:06Z

cpp/src/io/parquet/page_enc.cu

+  };
+
+  /*
+    some timing results. remove when done


leaving a TODO for later in review process

vuule · 2024-02-01T22:24:59Z

cpp/src/io/parquet/page_enc.cu

+
+  auto const type_id = s->col.leaf_column->type().id();
+
+  auto const get_string_tuple = [type_id, s](int idx) -> thrust::pair<size_type, uint8_t const*> {


can this return a cudf::string_view? It's basically a (char const*, size_type) pair

string_view comes with UTF8 baggage... I'd be more inclined to use the byte_array_view since that's more in line with the physical type. But then that's using std::byte, which is just not something you'd want to iterate over. I'd be willing to return a struct that has the overlap calculation as a method...
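A rough Python sketch of the "struct with the overlap calculation as a method" idea, to make the trade-off concrete. The class and method names are hypothetical and do not reflect what was eventually committed; the point is only that the overlap is computed on raw bytes, so it is indifferent to UTF-8 character boundaries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ByteArrayView:
    """Hypothetical (pointer, length)-style view over raw bytes."""

    data: bytes

    def __len__(self):
        return len(self.data)

    def common_prefix_length(self, other):
        """Count the leading bytes shared with another view."""
        count = 0
        for mine, theirs in zip(self.data, other.data):
            if mine != theirs:
                break
            count += 1
        return count


ascii_overlap = ByteArrayView(b"babble").common_prefix_length(ByteArrayView(b"babyhood"))

# U+00E9 and U+00E8 share their first UTF-8 byte (0xC3), so the byte-level
# overlap is 2 even though only one whole character ("h") matches.
left = ByteArrayView("h\u00e9llo".encode("utf-8"))
right = ByteArrayView("h\u00e8llo".encode("utf-8"))
utf8_overlap = left.common_prefix_length(right)
```

Because the shared prefix may end in the middle of a multi-byte character, a character-aware string type adds nothing here, which is the argument for a byte-oriented view.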

vuule

few more nitpicks, mostly related to naming

cpp/src/io/parquet/page_enc.cu

vuule · 2024-02-02T20:49:12Z

cpp/src/io/parquet/page_enc.cu

+      return {reinterpret_cast<uint8_t const*>(str.data()), str.size_bytes()};
+    } else if (s->col.output_as_byte_array && type_id == type_id::LIST) {
+      auto const str = get_element<statistics::byte_array_view>(*s->col.leaf_column, idx);
+      return {reinterpret_cast<uint8_t const*>(str.data()),


Why does byte_array use uint8_t* instead of std::byte* or char*? I think either of the two is less UBish when it comes to reinterpret_casting, compared to uint8_t*.

Parquet uses unsigned bytes for BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY. We're doing == rather than compare, but I'd prefer to keep the byte array representation in line with what a byte array in Parquet is.
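To illustrate the "== rather than compare" remark: equality of two bytes does not depend on whether they are read as signed or unsigned, but ordering does. A small Python sketch (Python's `bytes` compares as unsigned; `struct` format `"b"` reads a signed 8-bit value):

```python
import struct

low, high = bytes([0x7F]), bytes([0x80])

# Unsigned interpretation: 0x7F (127) sorts before 0x80 (128).
unsigned_less = low < high

# Signed interpretation: 0x80 becomes -128, so the order flips.
signed_low = struct.unpack("b", low)[0]
signed_high = struct.unpack("b", high)[0]
signed_less = signed_low < signed_high

# Equality gives the same answer under either interpretation.
equality_agrees = (low == high) == (signed_low == signed_high)
```

So the choice of representation cannot change the result of the prefix-matching equality checks; it would only matter for code that orders byte arrays, such as min/max statistics.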

cpp/src/io/parquet/page_enc.cu

Co-authored-by: Vukasin Milovanovic <[email protected]>

vuule

Thank you for addressing all suggestions. Looks great.

vuule · 2024-02-03T00:33:14Z

/ok to test

PointKernel

LGTM

bdice

I don't think prefer_dba is a very good name. Is this really a "preference"? Can the code choose to disregard this preference? I would prefer something like use_delta_byte_array or enable_delta_byte_array, since it appears that this always uses DELTA_BYTE_ARRAY rather than DELTA_LENGTH_BYTE_ARRAY when the option is enabled.

Also, I want to see dba expanded to delta_byte_array throughout the code, to make it easier to search for this option (dba is not self-explanatory and we rarely use abbreviations). It's currently inconsistent between dba/delta_byte_array between internal/external APIs.

bdice · 2024-02-07T18:28:34Z

python/cudf/cudf/tests/test_parquet.py

@@ -1370,6 +1381,7 @@ def test_delta_byte_array_roundtrip(
    nrows, add_nulls, max_string_length, str_encoding, tmpdir
 ):
    null_frequency = 0.25 if add_nulls else 0
+    prefer_dba = str_encoding == "DELTA_BYTE_ARRAY"


I'd prefer the full name everywhere in this PR. prefer_dba -> prefer_delta_byte_array throughout.

Suggested change

-    prefer_dba = str_encoding == "DELTA_BYTE_ARRAY"
+    prefer_delta_byte_array = str_encoding == "DELTA_BYTE_ARRAY"

etseidl · 2024-02-16T16:52:52Z

rethinking how to select encodings...closing for now

Part of #14938 was fixing two bugs discovered during testing. One is in the encoding of DELTA_BINARY_PACKED data where the first non-null value in a page to be encoded is not in the first batch of 129 values. The second is an error in decoding of DELTA_BYTE_ARRAY pages where, again, the first non-null value is not in the first block to be decoded. This PR includes a test for the former, but the latter cannot be easily tested because the python API still lacks `skip_rows`, and we cannot generate DELTA_BYTE_ARRAY encoded data without the changes in #14938. A test for the latter will be added later, but the fix has been validated with data on hand locally.

Authors:
- Ed Seidl (https://github.com/etseidl)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- David Wendt (https://github.com/davidwendt)

URL: #15075

Re-submission of #14938. Final (delta) piece of #13501. Adds the ability to encode Parquet pages as DELTA_BYTE_ARRAY. Python testing will be added as a follow-on when per-column encoding selection is added to the Python API (ref this comment on #15081).

Authors:
- Ed Seidl (https://github.com/etseidl)
- Vukasin Milovanovic (https://github.com/vuule)
- Yunsong Wang (https://github.com/PointKernel)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- Yunsong Wang (https://github.com/PointKernel)

URL: #15239

encode delta_byte_array

3b1b960

etseidl requested review from a team as code owners January 30, 2024 22:36

etseidl requested review from shwina, charlesbluca, harrism and PointKernel January 30, 2024 22:36

github-actions bot added libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. labels Jan 30, 2024

PointKernel added non-breaking Non-breaking change feature request New feature or request cuIO cuIO issue 4 - Needs cuIO Reviewer labels Jan 30, 2024

PointKernel requested a review from vuule January 31, 2024 18:43

fix decl of gpuEncodeDeltaByteArrayPages

6b1886b

etseidl commented Jan 31, 2024


vuule reviewed Feb 1, 2024


etseidl and others added 10 commits January 31, 2024 21:33

fix bug in delta binary encoder. wasn't handling long runs of nulls

dc7b0e9

at the beginning of a page correctly

suggestion from review

7579b64

Co-authored-by: Vukasin Milovanovic <[email protected]>

Merge branch 'encode_delta_ba' of github.com:etseidl/cudf into encode…

cc657e1

…_delta_ba

make it pythonic

4812132

Co-authored-by: Vukasin Milovanovic <[email protected]>

Merge branch 'encode_delta_ba' of github.com:etseidl/cudf into encode…

dfbe6d8

…_delta_ba

change variable per suggestion

07d1af7

more review changes

985d50a

change another variable name

77e3789

add explanation of when to choose delta_byte_array

388ae86

Merge branch 'branch-24.04' into encode_delta_ba

8709b9c

fix the other bool assignment

6b2ef14

vuule self-requested a review February 1, 2024 21:18

vuule reviewed Feb 1, 2024


etseidl and others added 5 commits February 1, 2024 15:36

add delta binary test

46c5425

use struct rather than tuple for byte arrays

b955c37

Merge remote-tracking branch 'origin/branch-24.04' into encode_delta_ba

fb6385b

Merge branch 'branch-24.04' into encode_delta_ba

5645b8b

fix bug in delta_byte_array reader

858b4c9

vuule reviewed Feb 2, 2024


etseidl and others added 4 commits February 2, 2024 14:41

Apply suggestions from code review

cf30cae

Co-authored-by: Vukasin Milovanovic <[email protected]>

more suggestions

ad873c6

a few more cleanups

9883af7

lost a change somehow

81f4ee2

vuule approved these changes Feb 2, 2024


Merge branch 'branch-24.04' into encode_delta_ba

3e8767b

PointKernel approved these changes Feb 7, 2024


Merge branch 'rapidsai:branch-24.04' into encode_delta_ba

8378d64

shwina approved these changes Feb 7, 2024


bdice requested changes Feb 7, 2024


Merge branch 'branch-24.04' into encode_delta_ba

f0eccd0

vuule mentioned this pull request Feb 16, 2024

[FEA] Support V2 encodings in Parquet reader and writer #13501

Closed

etseidl closed this Feb 16, 2024

etseidl mentioned this pull request Feb 16, 2024

Fix bugs in handling of delta encodings #15075

Merged

3 tasks

vyasr added 4 - Needs Review Waiting for reviewer to review or respond and removed 4 - Needs cuIO Reviewer labels Feb 23, 2024

etseidl mentioned this pull request Mar 6, 2024

Add DELTA_BYTE_ARRAY encoder for Parquet #15239

Merged

3 tasks
