-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add BYTE_STREAM_SPLIT support to Parquet #15311
Conversation
/ok to test |
|
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments about the use of ts_scale
in TIME_MILLIS type as seconds and days are encoded as millis in Parquet plus corresponding time units in tests.
Looks good to me. Thanks for the effort @etseidl |
/ok to test |
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks great, just a few small comments
cpp/src/io/parquet/page_enc.cu
Outdated
is_split_stream ? Encoding::BYTE_STREAM_SPLIT | ||
: determine_encoding( | ||
s->page.page_type, physical_type, s->ck.use_dictionary, write_v2_headers); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
optional: consider moving the is_split_stream check into determine_encoding
. This way we have the entire logic in determine_encoding
.
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔥
/ok to test |
/ok to test |
/merge |
…ders (#15832) BYTE_STREAM_SPLIT encoding was recently added to cuDF (#15311). The Parquet specification was recently changed (apache/parquet-format#229) to extend the datatypes that can be encoded as BYTE_STREAM_SPLIT, and this was only recently implemented in arrow (apache/arrow#40094). This PR adds a check that cuDF and arrow can produce compatible files using BYTE_STREAM_SPLIT encoding. Authors: - Ed Seidl (https://github.com/etseidl) Approvers: - Lawrence Mitchell (https://github.com/wence-) URL: #15832
Description
Closes #15226. Part of #13501. Adds support for reading and writing
BYTE_STREAM_SPLIT
encoded Parquet data. Includes a "microkernel" version like those introduced by #15159.Checklist