Add BYTE_STREAM_SPLIT support to Parquet #15311

etseidl · 2024-03-14T21:40:59Z

Description

Closes #15226. Part of #13501. Adds support for reading and writing BYTE_STREAM_SPLIT encoded Parquet data. Includes a "microkernel" version like those introduced by #15159.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2024-03-14T21:41:02Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

vuule · 2024-03-14T21:48:59Z

/ok to test

etseidl · 2024-03-14T21:56:46Z

BYTE_STREAM_SPLIT was originally just for floating point, but it is being expanded to all fixed-width types (apache/parquet-format#229). I did a quick comparison with some mixed double/integer data to demonstrate the utility of the new encoding. While the uncompressed size of the data matches PLAIN encoding, when coupled with compression, BYTE_STREAM_SPLIT can outperform the other encoding options in terms of both speed and compressed data size. The one exception is uncompressed DELTA_BINARY_PACKED, which achieves a data reduction comparable to ZSTD compression at a far lower computational cost.

                   dict   plain   delta     bss   delta (no comp)
decompress (ms)     214      52     217      26                 0
decode (ms)           8       6      10       9                10
encode (ms)          13       6      13       7                17
compress (ms)       438     369     333     223                 0
round trip (ms)    1155     800     972     611               465
uncomp size (MB)     25      40      18      40                18
comp size (MB)       16      15      16      14                18
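
For readers unfamiliar with the encoding, here is a minimal host-side sketch of the byte rearrangement that BYTE_STREAM_SPLIT performs, as described in the Parquet format specification. It is not the cuDF kernel code from this PR; the function names `byte_stream_split_encode` and `byte_stream_split_decode` are hypothetical, and the sketch assumes a little-endian host so that the in-memory bytes match the PLAIN byte order.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper (not part of cuDF): scatter byte k of value i to
// position k * n + i, producing sizeof(T) streams of n bytes each.
// Assumes a little-endian host so bytes match Parquet's PLAIN layout.
template <typename T>
std::vector<std::uint8_t> byte_stream_split_encode(std::vector<T> const& values)
{
  constexpr std::size_t width = sizeof(T);
  std::size_t const n         = values.size();
  std::vector<std::uint8_t> out(n * width);
  for (std::size_t i = 0; i < n; ++i) {
    std::uint8_t bytes[width];
    std::memcpy(bytes, &values[i], width);
    for (std::size_t k = 0; k < width; ++k) {
      out[k * n + i] = bytes[k];
    }
  }
  return out;
}

// Hypothetical helper (not part of cuDF): gather byte k of value i back
// from stream k to rebuild the original fixed-width values.
template <typename T>
std::vector<T> byte_stream_split_decode(std::vector<std::uint8_t> const& encoded)
{
  constexpr std::size_t width = sizeof(T);
  std::size_t const n         = encoded.size() / width;
  std::vector<T> values(n);
  for (std::size_t i = 0; i < n; ++i) {
    std::uint8_t bytes[width];
    for (std::size_t k = 0; k < width; ++k) {
      bytes[k] = encoded[k * n + i];
    }
    std::memcpy(&values[i], bytes, width);
  }
  return values;
}
```

The output has exactly the same number of bytes as the input, which is consistent with the uncompressed size matching PLAIN in the table above. Any size benefit comes from the compressor seeing the corresponding bytes of neighboring values grouped together.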

mhaseeb123 · 2024-04-09T18:23:57Z

/ok to test

cpp/src/io/parquet/decode_fixed.cu

mhaseeb123

A few comments about the use of ts_scale for the TIME_MILLIS type, since seconds and days are encoded as millis in Parquet, plus the corresponding time units in the tests.

cpp/tests/io/parquet_writer_test.cpp

cpp/src/io/parquet/page_data.cu

mhaseeb123 · 2024-04-10T00:39:28Z

Looks good to me. Thanks for the effort @etseidl

mhaseeb123 · 2024-04-10T00:40:01Z

/ok to test

mhaseeb123 · 2024-04-16T20:00:04Z

/ok to test

vuule

looks great, just a few small comments

cpp/src/io/parquet/page_data.cuh

vuule · 2024-04-17T02:43:56Z

cpp/src/io/parquet/page_enc.cu

+      is_split_stream ? Encoding::BYTE_STREAM_SPLIT
+                      : determine_encoding(
+                          s->page.page_type, physical_type, s->ck.use_dictionary, write_v2_headers);


optional: consider moving the is_split_stream check into determine_encoding. This way we have the entire logic in determine_encoding.
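
A rough sketch of what this optional suggestion could look like is below. It is only an illustration: the enums are simplified stand-ins for the real cuDF types, the `physical_type` argument is dropped for brevity, the extra `is_split_stream` parameter is an assumption, and the fallback logic is a placeholder rather than the actual body of `determine_encoding`.

```cpp
#include <cstdint>

// Simplified, hypothetical stand-ins for the real cuDF Parquet enums.
enum class Encoding : std::uint8_t { PLAIN, PLAIN_DICTIONARY, RLE_DICTIONARY, BYTE_STREAM_SPLIT };
enum class PageType : std::uint8_t { DATA_PAGE, DATA_PAGE_V2, DICTIONARY_PAGE };

// Hypothetical variant of determine_encoding that takes the split-stream
// flag as a parameter, so callers no longer need a ternary at the call site.
constexpr Encoding determine_encoding(PageType page_type,
                                      bool use_dictionary,
                                      bool write_v2_headers,
                                      bool is_split_stream)
{
  // The check that previously lived at the call site.
  if (is_split_stream) { return Encoding::BYTE_STREAM_SPLIT; }

  // Placeholder for the pre-existing selection logic (assumed, not copied).
  if (page_type == PageType::DICTIONARY_PAGE) { return Encoding::PLAIN; }
  if (use_dictionary) {
    return write_v2_headers ? Encoding::RLE_DICTIONARY : Encoding::PLAIN_DICTIONARY;
  }
  return Encoding::PLAIN;
}
```

With this shape, the call site would pass `is_split_stream` as one more argument and the whole decision would live in a single function.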

cpp/src/io/parquet/page_data.cu

cpp/src/io/parquet/decode_fixed.cu

vuule · 2024-04-17T19:19:11Z

/ok to test

vuule

🔥

vuule · 2024-04-18T21:30:40Z

/ok to test

vuule · 2024-04-24T16:53:56Z

/ok to test

vuule · 2024-04-24T19:26:04Z

/merge

…ders (#15832)

BYTE_STREAM_SPLIT encoding was recently added to cuDF (#15311). The Parquet specification was recently changed (apache/parquet-format#229) to extend the datatypes that can be encoded as BYTE_STREAM_SPLIT, and this was only recently implemented in arrow (apache/arrow#40094). This PR adds a check that cuDF and arrow can produce compatible files using BYTE_STREAM_SPLIT encoding.

Authors:
- Ed Seidl (https://github.com/etseidl)

Approvers:
- Lawrence Mitchell (https://github.com/wence-)

URL: #15832

etseidl and others added 18 commits March 13, 2024 14:33

initial cut

39defae

checkpoint

a2bf4c5

leave room for new microkernels

fec05e9

checkpoint

fe10804

checkpoint

16e961a

checkpoint

9de287f

formatting

8b75a6d

int and float working

760ca0c

get decimals working

cfa51e3

add more tests

832428d

clean up some dead code

55ed69c

update comment

9316f6c

only update cur ptr on t0

cba0a33

Merge remote-tracking branch 'origin/branch-24.04' into byte_stream_split

6b991ac

rework kernel_mask_for_page

ee7919c

fix setting encoding on list children

d6f5569

add flat version of decoder

56365b5

Merge branch 'rapidsai:branch-24.04' into byte_stream_split

fb691bc

etseidl requested a review from a team as a code owner March 14, 2024 21:40

etseidl requested review from karthikeyann and vuule March 14, 2024 21:41

github-actions bot added the libcudf label (Affects libcudf (C++/CUDA) code) Mar 14, 2024

vuule added the feature request (New feature or request) and non-breaking (Non-breaking change) labels Mar 14, 2024

GregoryKimball mentioned this pull request Mar 15, 2024

[FEA] Support V2 encodings in Parquet reader and writer #13501

Closed

GregoryKimball requested review from mhaseeb123 and nvdbaranec March 15, 2024 17:13

etseidl added 3 commits April 6, 2024 12:57

Merge branch 'branch-24.06' into byte_stream_split

1d2f395

Merge branch 'rapidsai:branch-24.06' into byte_stream_split

dc766e6

Merge branch 'branch-24.06' into byte_stream_split

9141101

mhaseeb123 reviewed Apr 9, 2024

cpp/src/io/parquet/decode_fixed.cu

mhaseeb123 reviewed Apr 9, 2024

cpp/tests/io/parquet_writer_test.cpp (outdated)

cpp/src/io/parquet/page_data.cu

etseidl added 2 commits April 9, 2024 15:15

test more duration types and fix a small bug

6702bda

Merge remote-tracking branch 'origin/branch-24.06' into byte_stream_split

69ac97d

mhaseeb123 approved these changes Apr 10, 2024

Merge branch 'rapidsai:branch-24.06' into byte_stream_split

d58ef9a

vuule reviewed Apr 17, 2024

etseidl and others added 3 commits April 17, 2024 14:21

address review comments

47ca0af

Merge remote-tracking branch 'origin/branch-24.06' into byte_stream_split

2fd9dfd

Merge branch 'branch-24.06' into byte_stream_split

c329145

Merge branch 'rapidsai:branch-24.06' into byte_stream_split

25d2674

vuule approved these changes Apr 18, 2024

Merge branch 'branch-24.06' into byte_stream_split

6198644

etseidl and others added 2 commits April 23, 2024 09:11

Merge branch 'branch-24.06' into byte_stream_split

daeb886

Merge remote-tracking branch 'origin/branch-24.06' into byte_stream_split

d147c70

rapids-bot bot merged commit 117eff6 into rapidsai:branch-24.06 Apr 24, 2024
71 checks passed

etseidl deleted the byte_stream_split branch April 24, 2024 19:31

etseidl mentioned this pull request May 22, 2024

Add test of interoperability of cuDF and arrow BYTE_STREAM_SPLIT encoders #15832

Merged

3 tasks
