-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ability to request Parquet encodings on a per-column basis #15081
Merged
Merged
Changes from 9 commits
Commits
Show all changes
30 commits
Select commit
Hold shift + click to select a range
94d6401
add ability to request encodings per column
etseidl c77ecb1
remove out of date fixme
etseidl d5c2fdd
fix delta_byte_array case
etseidl dfe853b
clean up definition of is_use_delta
etseidl 5b4c368
Merge branch 'branch-24.04' into select_encodings
etseidl db821e2
Merge branch 'rapidsai:branch-24.04' into select_encodings
etseidl 650376b
Merge remote-tracking branch 'origin/branch-24.04' into select_encodings
etseidl 684ca61
Merge branch 'branch-24.04' into select_encodings
etseidl 1bf5c98
Merge branch 'branch-24.04' into select_encodings
etseidl 3445f59
refactor to use enum rather than strings for setting encodings
etseidl 64e3234
clean up leftover cruft
etseidl 67e85de
and a little more cruft
etseidl 3989641
clean up some boolean logic
etseidl c0f769f
Merge branch 'branch-24.04' into select_encodings
etseidl d5b451e
warn on DELTA_BYTE_ARRAY
etseidl 45a9edc
suggested changes to warnings
etseidl 4a3d434
Merge branch 'branch-24.04' into select_encodings
etseidl fc38013
Merge branch 'branch-24.04' into select_encodings
etseidl 1620728
Merge branch 'branch-24.04' into select_encodings
etseidl 0e3e8f5
Merge branch 'branch-24.04' into select_encodings
etseidl 72ff916
Merge branch 'branch-24.04' into select_encodings
etseidl 44473b0
catch corner case where dict encoding is requested, but cannot be used
etseidl 7201a9b
Merge branch 'branch-24.04' into select_encodings
etseidl 5405ddf
add test and refactor UserRequestedEncodings test
etseidl bf3ee39
Merge branch 'branch-24.04' into select_encodings
etseidl eb8631d
Merge branch 'branch-24.04' into select_encodings
vuule 9882133
change enum name per review comment
etseidl 810d951
Merge remote-tracking branch 'origin/branch-24.04' into select_encodings
etseidl ea63ec2
implement suggestion from review
etseidl 26794f0
Merge branch 'branch-24.04' into select_encodings
etseidl File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note on my thinking here (or lack thereof)...I was thinking using strings to specify the desired encoding would be better than an enum since the
column_input_metadata
is shared between multiple encoders, and it would be more natural to use strings with a CLI or through the python interface. And if ORC has different encoding names, then we could add another set of constants for that.But a user interface could translate string values to an enum, and the enum could just add as many fields as necessary, some not relevant to one implementation or the other, so maybe this is silly. This acts like a scoped enum already, so I'm not opposed to switching.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does seem like this could be an enum :) I don't think a lot of code would change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I rushed a bit with this comment. Would this become just
encoding
and contain a superset of all encoding types?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, probably, or
page_encoding
maybe. But it would have to apply to all encoders that use the metadata (which I assume is just parquet and orc for now).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
page_encoding
isn't a great name for ORC...how aboutcolumn_encoding
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't believe I forgot my boy ORC.
sounds good!