Round trip FIXED_LEN_BYTE_ARRAY data properly in Parquet writer #15600
Merged: rapids-bot merged 10 commits into rapidsai:branch-24.06 from etseidl:fixed_len_roundtrip on May 8, 2024
Conversation
CC @vuule
vuule added the feature request, cuIO, and non-breaking labels on Apr 25, 2024
/ok to test
vuule
reviewed
Apr 25, 2024
vuule
reviewed
Apr 25, 2024
Just a few small things. Looks great overall.
3 tasks
vuule
approved these changes
May 3, 2024
mhaseeb123
approved these changes
May 3, 2024
/ok to test
/ok to test
/merge
rapids-bot bot pushed a commit that referenced this pull request on May 22, 2024:

…PI (#15613)

Several recent PRs (#15081, #15411, #15600) added the ability to control some aspects of Parquet file writing on a per-column basis. During discussion of #15081 it was [suggested](#15081 (comment)) that these options be exposed by cuDF-python in a manner similar to pyarrow. This PR adds the ability to control per-column encoding, compression, binary output, and fixed-length data width, using fully qualified Parquet column names. For example, given a cuDF table with an integer column 'a', and a `list<int32>` column 'b', the fully qualified column names would be 'a' and 'b.list.element'.

Addresses "Add cuDF-python API support for specifying encodings" task in #13501.

Authors:
- Ed Seidl (https://github.com/etseidl)
- Vukasin Milovanovic (https://github.com/vuule)
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- Muhammad Haseeb (https://github.com/mhaseeb123)
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #15613
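The fully qualified column names mentioned in the commit message above can be illustrated with a small, self-contained sketch. It does not use cuDF or pyarrow; the function `qualified_leaf_names` and the toy schema format (a dict mapping column names to a type string, a one-element Python list for a list column, or a nested dict for a struct) are hypothetical and exist only for illustration. The sketch assumes the standard Parquet three-level list layout, in which a list column `b` has its values stored under `b.list.element`.

```python
def qualified_leaf_names(schema, prefix=""):
    """Return fully qualified Parquet leaf names for a toy schema description.

    Hypothetical schema format (not a cuDF or pyarrow API):
      - "int32"            -> primitive column
      - ["int32"]          -> list column with int32 elements
      - {"x": "int32"}     -> struct column with named fields
    """
    names = []
    for name, dtype in schema.items():
        path = f"{prefix}{name}"
        if isinstance(dtype, list):
            # Three-level list layout: <column>.list.element
            names.extend(
                qualified_leaf_names({"element": dtype[0]}, prefix=f"{path}.list.")
            )
        elif isinstance(dtype, dict):
            # Struct fields are addressed as <column>.<field>
            names.extend(qualified_leaf_names(dtype, prefix=f"{path}."))
        else:
            names.append(path)
    return names


# The example from the commit message: integer column 'a', list<int32> column 'b'.
schema = {"a": "int32", "b": ["int32"]}
names = qualified_leaf_names(schema)
print(names)  # ['a', 'b.list.element']
```

Names of this form are what the per-column options described in the commit message are keyed on, so a list column is addressed through its leaf (`b.list.element`) rather than its top-level name (`b`).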
Labels
cuIO
cuIO issue
feature request
New feature or request
libcudf
Affects libcudf (C++/CUDA) code.
non-breaking
Non-breaking change
Description
#13437 added the ability to consume FIXED_LEN_BYTE_ARRAY encoded data and represent it as lists of UINT8. When trying to write this data back to Parquet there are two problems: 1) the notion of fixed length is lost, and 2) the UINT8 data is written as a list of INT32, which can quadruple the storage required. This PR addresses both issues by adding fields to the input and output metadata to allow for preserving the form of the original data.
Checklist