Add DELTA_LENGTH_BYTE_ARRAY encoder and decoder for Parquet #14590

etseidl · 2023-12-06T22:16:40Z

Description

Part of #13501. This adds the ability to read and write Parquet pages with DELTA_LENGTH_BYTE_ARRAY encoding.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2023-12-06T22:16:44Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

etseidl · 2023-12-06T22:26:29Z

Unlike past PRs, in this PR I've done both the encoder and decoder. One reason is that the changes here are smaller than in previous delta work. A better reason is that by adding both, I can test the reader and writer at the same time without having to rely on python tests.

That said, this PR does expand the existing python tests anyway...perhaps too much. Reviewers, please let me know if the delta tests in test_parquet.py are a little overboard.

A note on the encoding itself: DELTA_LENGTH_BYTE_ARRAY is much like PLAIN encoding for strings. The difference is that all of the lengths are encoded first, followed by all of the string data. The theory is that separating the two will lead to better compression. The other difference is that the lengths are DELTA_BINARY_PACKED encoded, so there should be some space savings there as well.
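
As a rough illustration of the layout difference described above, here is a minimal Python sketch. It is not the libcudf implementation. The function names `plain_layout` and `split_layout` are hypothetical, and the lengths are left as a plain list rather than being DELTA_BINARY_PACKED encoded, which the real encoding requires.

```python
import struct


def plain_layout(values):
    """PLAIN BYTE_ARRAY: each value is a 4-byte little-endian length
    immediately followed by that value's bytes (lengths and data interleaved)."""
    return b"".join(struct.pack("<I", len(v)) + v for v in values)


def split_layout(values):
    """DELTA_LENGTH_BYTE_ARRAY-style split: all lengths first, then all bytes.

    Assumption for illustration: lengths are returned as a Python list.
    A real page would DELTA_BINARY_PACKED encode them before the data.
    """
    lengths = [len(v) for v in values]
    data = b"".join(values)  # string bytes stored back to back in one chunk
    return lengths, data


values = [b"Hello", b"World", b"Foobar", b"ABCDEF"]
plain = plain_layout(values)
lengths, data = split_layout(values)
```

With the lengths pulled out, the string bytes form one contiguous run, which is what makes the simple copy on the decode side possible.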

Because the string data is in one big chunk per page, the processing is much simpler than with DELTA_BYTE_ARRAY or even PLAIN. On the decode side, all we need to do is decode the length data into the offsets child column, and then do an exclusive scan on it to turn the lengths into offsets. The character data can then be parallel memcpy'd straight into the chars child column. The encoding side is a similar story: first encode the length data, and then memcpy the chars data into the page buffer. Of course, skip_rows complicates things somewhat, but not too much.
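
The lengths-to-offsets step can be sketched on the host in a few lines of Python. This is only an illustration under stated assumptions: the helper names `lengths_to_offsets` and `decode_strings` are hypothetical, the lengths are assumed to be already decoded, and the real decoder does this work in CUDA kernels rather than with a sequential scan.

```python
from itertools import accumulate


def lengths_to_offsets(lengths):
    """Exclusive scan of the lengths, with the total appended at the end,
    giving n + 1 offsets for n strings."""
    return [0, *accumulate(lengths)]


def decode_strings(lengths, data):
    """Slice the contiguous character data using the computed offsets."""
    offsets = lengths_to_offsets(lengths)
    return [data[offsets[i]:offsets[i + 1]] for i in range(len(lengths))]


lengths = [5, 5, 6, 6]
data = b"HelloWorldFoobarABCDEF"
offsets = lengths_to_offsets(lengths)
strings = decode_strings(lengths, data)
```

Because the character data needs no per-string transformation, the slicing above is only there to check the offsets. In a columnar layout the bytes can be copied into the chars buffer as one block.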

vuule · 2023-12-07T00:06:15Z

/ok to test

vuule

full pass!

cpp/src/io/parquet/page_delta_decode.cu

cpp/src/io/parquet/page_enc.cu

cpp/src/io/parquet/page_string_decode.cu

Co-authored-by: Vukasin Milovanovic <[email protected]>

ttnghia · 2023-12-18T23:12:11Z

/ok to test

vuule · 2023-12-19T21:24:54Z

/ok to test

vuule

🔥 🔥

isVoid

One small nitpick. Non-blocking.

isVoid · 2023-12-20T15:58:01Z

python/cudf/cudf/tests/test_parquet.py

@@ -1352,8 +1352,13 @@ def test_delta_binary(nrows, add_nulls, dtype, tmpdir):

 @pytest.mark.parametrize("nrows", delta_num_rows())
 @pytest.mark.parametrize("add_nulls", [True, False])
-@pytest.mark.parametrize("str_encoding", ["DELTA_BYTE_ARRAY"])
-def test_delta_byte_array_roundtrip(nrows, add_nulls, str_encoding, tmpdir):
+@pytest.mark.parametrize("string_len", [12, 48, 96, 128])


Since this parameter is used for max_string_length below, I would suggest keeping the name consistent here.

vuule · 2023-12-20T17:53:43Z

/ok to test

vuule · 2023-12-20T20:02:13Z

/merge

etseidl added 6 commits December 6, 2023 13:16

add delta length byte array encoder/decoder

a95259f

change encoding in file

27dfcdd

rename some things

73573f3

a few cleanups

b454b76

Merge remote-tracking branch 'origin/branch-24.02' into dlba_enc_dec

4821128

finish merge of size statistics

fb50fcb

etseidl requested review from a team as code owners December 6, 2023 22:16

etseidl requested review from isVoid, charlesbluca, PointKernel and divyegala December 6, 2023 22:16

github-actions bot added libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. labels Dec 6, 2023

vuule self-requested a review December 7, 2023 00:05

vuule added feature request New feature or request cuIO cuIO issue non-breaking Non-breaking change labels Dec 7, 2023

etseidl and others added 9 commits December 8, 2023 10:07

Merge branch 'rapidsai:branch-24.02' into dlba_enc_dec

394fe81

Merge remote-tracking branch 'origin/branch-24.02' into dlba_enc_dec

a2181bc

add syncthreads

07af187

Merge branch 'rapidsai:branch-24.02' into dlba_enc_dec

fae7388

Merge branch 'branch-24.02' into dlba_enc_dec

aa94791

Merge branch 'branch-24.02' into dlba_enc_dec

f806a48

change unsupported encoding to 15

080e89d

Merge branch 'dlba_enc_dec' of github.com:etseidl/cudf into dlba_enc_dec

0827fd3

Merge branch 'branch-24.02' into dlba_enc_dec

7e9d752

etseidl and others added 2 commits December 14, 2023 17:07

change skip_values_and_sum to run on a single warp

ad42470

Merge branch 'branch-24.02' into dlba_enc_dec

c552ffb

vuule reviewed Dec 15, 2023

etseidl and others added 7 commits December 15, 2023 15:13

implement suggestion from review

ace5be3

Co-authored-by: Vukasin Milovanovic <[email protected]>

move delta char len calculation

3523a36

a few cleanups

c028f2a

remove some outdated TODOs and superfluous threadfences

4c2dc56

handle non-string byte arrays

c07bea2

parquet-mr does not like duplicate column names

62f8b4a

Merge branch 'branch-24.02' into dlba_enc_dec

36478bf

vuule self-requested a review December 19, 2023 19:38

etseidl added 5 commits December 19, 2023 12:00

fix for writing all-null column

b4e6999

fix for reading single null row

385dce1

Merge remote-tracking branch 'origin/branch-24.02' into dlba_enc_dec

edd3c13

finish merge

6a25a6a

make sure header is written if all values are null

0d0c95f

vuule approved these changes Dec 19, 2023

add extra delta tests

e0d8cf1

isVoid approved these changes Dec 20, 2023

etseidl and others added 3 commits December 20, 2023 08:17

change param name to match use

a65f512

Merge branch 'branch-24.02' into dlba_enc_dec

b77b9ee

try 2 at consistent naming

fc282fe

ttnghia approved these changes Dec 20, 2023

vuule added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Dec 20, 2023

rapids-bot bot merged commit f1ff424 into rapidsai:branch-24.02 Dec 20, 2023
67 checks passed

etseidl deleted the dlba_enc_dec branch December 20, 2023 23:21
