Round trip FIXED_LEN_BYTE_ARRAY data properly in Parquet writer #15600

etseidl · 2024-04-25T18:21:55Z

Description

#13437 added the ability to consume FIXED_LEN_BYTE_ARRAY encoded data and represent it as lists of UINT8. Writing this data back to Parquet has two problems: 1) the notion of fixed length is lost, and 2) the UINT8 data is written as a list of INT32, which can quadruple the storage required. This PR addresses both issues by adding fields to the input and output metadata so that the form of the original data can be preserved.
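To make the "quadruple" figure concrete, here is a rough sketch of the arithmetic. It assumes PLAIN encoding with no compression and ignores page headers and repetition/definition levels; the helper names and the 16-byte width are illustrative only and are not part of libcudf.

```python
import struct

def plain_int32_size(values: bytes) -> int:
    """Bytes used when every byte is PLAIN-encoded as a 4-byte little-endian INT32."""
    return len(b"".join(struct.pack("<i", v) for v in values))

def plain_flba_size(values: bytes, type_length: int) -> int:
    """Bytes used when the same data is PLAIN-encoded as FIXED_LEN_BYTE_ARRAY."""
    if len(values) % type_length != 0:
        raise ValueError("data length must be a multiple of type_length")
    # PLAIN FIXED_LEN_BYTE_ARRAY stores the raw bytes with no per-value length prefix.
    return len(values)

# Hypothetical column: 1000 rows of 16-byte values (for example, UUID-sized).
type_length = 16
num_rows = 1000
data = bytes(range(type_length)) * num_rows

int32_bytes = plain_int32_size(data)
flba_bytes = plain_flba_size(data, type_length)
ratio = int32_bytes / flba_bytes
print(int32_bytes, flba_bytes, ratio)
```

With other encodings or with compression enabled the gap may be smaller, which is presumably why the description says "can quadruple" rather than "quadruples".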

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2024-04-25T18:21:58Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

etseidl · 2024-04-25T18:22:31Z

CC @vuule

vuule · 2024-04-25T19:03:09Z

/ok to test

cpp/src/io/parquet/reader_impl_helpers.cpp

vuule

just a few small things. Looks great overall.

cpp/src/io/parquet/writer_impl.cu

cpp/tests/io/parquet_writer_test.cpp

vuule · 2024-05-03T19:09:26Z

/ok to test

vuule · 2024-05-08T00:00:54Z

/ok to test

vuule · 2024-05-08T03:15:46Z

/merge

…PI (#15613)

Several recent PRs (#15081, #15411, #15600) added the ability to control some aspects of Parquet file writing on a per-column basis. During discussion of #15081 it was [suggested](#15081 (comment)) that these options be exposed by cuDF-python in a manner similar to pyarrow. This PR adds the ability to control per-column encoding, compression, binary output, and fixed-length data width, using fully qualified Parquet column names. For example, given a cuDF table with an integer column 'a', and a `list<int32>` column 'b', the fully qualified column names would be 'a' and 'b.list.element'.

Addresses "Add cuDF-python API support for specifying encodings" task in #13501.

Authors:
- Ed Seidl (https://github.com/etseidl)
- Vukasin Milovanovic (https://github.com/vuule)
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- Muhammad Haseeb (https://github.com/mhaseeb123)
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #15613
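The fully qualified naming described above can be sketched in plain Python. This is only an illustration of the naming rule, assuming Parquet's standard three-level list layout (`<name>.list.element`); the `leaf_paths` helper and the tuple/dict schema description are hypothetical and are not cuDF APIs.

```python
def leaf_paths(name, dtype):
    """Return the fully qualified Parquet leaf column names under `name`.

    `dtype` is a hypothetical type description:
      - a string for a primitive type, e.g. "int32"
      - ("list", child_dtype) for a list column
      - a dict of {field_name: child_dtype} for a struct column
    """
    if isinstance(dtype, tuple) and dtype[0] == "list":
        # Lists add the intermediate "list" group and the "element" leaf.
        return leaf_paths(f"{name}.list.element", dtype[1])
    if isinstance(dtype, dict):
        paths = []
        for child, child_dtype in dtype.items():
            paths.extend(leaf_paths(f"{name}.{child}", child_dtype))
        return paths
    return [name]

# The example from the text: integer column 'a' and list<int32> column 'b'.
schema = {"a": "int32", "b": ("list", "int32")}
names = [path for col, dtype in schema.items() for path in leaf_paths(col, dtype)]
print(names)
```

Names built this way would be the keys used to select a column when setting a per-column option; a nested `list<list<int32>>` column 'c' would, under the same assumption, come out as 'c.list.element.list.element'.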

round trip fixed_len_byte_array data properly

9bb6c6a

etseidl requested a review from a team as a code owner April 25, 2024 18:21

etseidl requested review from mythrocks and shrshi April 25, 2024 18:21

github-actions bot added the libcudf (Affects libcudf (C++/CUDA) code) label Apr 25, 2024

vuule requested review from mhaseeb123 and vuule April 25, 2024 19:02

vuule added the feature request (New feature or request), cuIO (cuIO issue), and non-breaking (Non-breaking change) labels Apr 25, 2024

vuule reviewed Apr 25, 2024

cpp/src/io/parquet/reader_impl_helpers.cpp (resolved)

vuule reviewed Apr 25, 2024

cpp/src/io/parquet/writer_impl.cu (outdated, resolved)

cpp/tests/io/parquet_writer_test.cpp (outdated, resolved)

etseidl and others added 3 commits April 25, 2024 20:42

Merge remote-tracking branch 'origin/branch-24.06' into fixed_len_rou…

3a3ea7b

…ndtrip

address review comments

b475084

Merge branch 'rapidsai:branch-24.06' into fixed_len_roundtrip

eae6b04

etseidl mentioned this pull request Apr 29, 2024

Expose some Parquet per-column configuration options via the python API #15613

Merged

3 tasks

etseidl added 3 commits April 30, 2024 13:09

Merge branch 'branch-24.06' into fixed_len_roundtrip

79a93c3

Merge branch 'rapidsai:branch-24.06' into fixed_len_roundtrip

e41cd04

Merge branch 'branch-24.06' into fixed_len_roundtrip

7934e31

etseidl requested a review from vuule May 3, 2024 15:29

vuule approved these changes May 3, 2024

mhaseeb123 approved these changes May 3, 2024

Merge branch 'branch-24.06' into fixed_len_roundtrip

64dc418

etseidl and others added 2 commits May 7, 2024 18:43

Merge remote-tracking branch 'origin/branch-24.06' into fixed_len_rou…

c59b090

…ndtrip

Merge branch 'branch-24.06' into fixed_len_roundtrip

ed3622e

rapids-bot bot merged commit 5f1f0dd into rapidsai:branch-24.06 May 8, 2024
70 checks passed

etseidl deleted the fixed_len_roundtrip branch May 8, 2024 05:32
