Add ability to request Parquet encodings on a per-column basis #15081
Conversation
cpp/include/cudf/io/types.hpp
```c++
/**
 * @brief Valid parquet encodings for use with `column_in_metadata::set_encoding()`
 */
struct parquet_encoding {
```
Note on my thinking here (or lack thereof)... I was thinking using strings to specify the desired encoding would be better than an enum, since the `column_input_metadata` is shared between multiple encoders, and it would be more natural to use strings with a CLI or through the python interface. And if ORC has different encoding names, then we could add another set of constants for that.

But a user interface could translate string values to an enum, and the enum could just add as many fields as necessary, some not relevant to one implementation or the other, so maybe this is silly. This acts like a scoped enum already, so I'm not opposed to switching.
It does seem like this could be an enum :) I don't think a lot of code would change.
Actually I rushed a bit with this comment. Would this become just `encoding` and contain a superset of all encoding types?
Yeah, probably, or `page_encoding` maybe. But it would have to apply to all encoders that use the metadata (which I assume is just parquet and orc for now).
`page_encoding` isn't a great name for ORC... how about `column_encoding`?
Can't believe I forgot my boy ORC. Sounds good!
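For reference, here is a minimal sketch of the scoped enum the thread converges on. Only `NOT_SET` appears in the diff later in this thread; the other enumerators are assumptions based on standard Parquet page encodings, not the PR's final definition.

```c++
// Hypothetical sketch of the column_encoding enum under discussion.
// NOT_SET comes from the diff below; the remaining enumerators are
// assumed standard Parquet page encodings.
enum class column_encoding {
  NOT_SET,              // let the writer choose an encoding
  PLAIN,
  DICTIONARY,
  DELTA_BINARY_PACKED,  // delta encoding for integer columns
  DELTA_BYTE_ARRAY      // incremental (prefix) encoding for strings
};
```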
Suggested expanded logging to ensure we never silently ignore a requested encoding.
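A self-contained sketch of that idea follows; the helper name, enum values, and message text are my assumptions rather than the PR's actual code.

```c++
#include <cstdio>
#include <string>

// Assumed stand-in for the PR's column_encoding enum.
enum class column_encoding { NOT_SET, DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY };

// If a requested encoding cannot be honored for a column, warn and fall
// back to the default instead of silently ignoring the request.
column_encoding resolve_encoding(column_encoding requested,
                                 bool supported_for_column,
                                 std::string const& column_name)
{
  if (requested != column_encoding::NOT_SET && !supported_for_column) {
    std::fprintf(stderr,
                 "requested encoding is not supported for column '%s'; using default\n",
                 column_name.c_str());
    return column_encoding::NOT_SET;
  }
  return requested;
}
```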
Test checked in... but the last CI run failed the style check because of some Python?
Yeah, that's a new one. I think there was a two-hour window when CI passed 💀
/ok to test
cpp/include/cudf/io/types.hpp
```diff
@@ -585,6 +605,7 @@ class column_in_metadata {
   std::optional<uint8_t> _decimal_precision;
   std::optional<int32_t> _parquet_field_id;
   std::vector<column_in_metadata> children;
+  column_encoding _encoding = column_encoding::NOT_SET;
```
```diff
-column_encoding _encoding = column_encoding::NOT_SET;
+column_encoding _encoding{column_encoding::NOT_SET};
```
Nothing else in the struct (or AFAICT in this file) uses an initializer list...maybe clean this up separately?
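As a general C++ aside (not specific to this PR): the two default-member-initializer forms are nearly equivalent, but the braced form rejects narrowing conversions.

```c++
// General C++ illustration of the two initializer styles discussed above.
struct example {
  int a = 1;     // copy-initialization syntax
  int b{2};      // braced (list) initialization
  // int c{2.5};   // error: narrowing double -> int is rejected
  // int d = 2.5;  // compiles: value is silently truncated to 2
};
```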
Looks great. Feel free to merge once you have the second approval.
/ok to test
Some nits but otherwise LGTM.
Perhaps not in this PR, but we will ultimately want to expose this in cuDF-python. In pandas, …
That was my original plan, but that's a heavier lift and I just wanted the bare minimum to at least be able to test new encoders. The pain point with how pyarrow does it is knowing in advance what the column names will be, especially for nested columns (see #14539 for instance).
Co-authored-by: Nghia Truong <[email protected]>
/ok to test
/merge
Re-submission of #14938. Final (delta) piece of #13501. Adds the ability to encode Parquet pages as `DELTA_BYTE_ARRAY`. Python testing will be added as a follow-on when per-column encoding selection is added to the python API (ref this [comment](#15081 (comment))).

Authors:
- Ed Seidl (https://github.com/etseidl)
- Vukasin Milovanovic (https://github.com/vuule)
- Yunsong Wang (https://github.com/PointKernel)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- Yunsong Wang (https://github.com/PointKernel)

URL: #15239
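As a hedged illustration (the column name and index are placeholders, and the call pattern is borrowed from the #15411 example below), requesting the new encoding for a string column might look like:

```c++
// Sketch only: assumes `table` is a cudf::table_view whose first column
// holds strings; set_encoding follows the pattern shown for #15411 below.
cudf::io::table_input_metadata table_metadata(table);
table_metadata.column_metadata[0]
  .set_name("string_col")
  .set_encoding(cudf::io::column_encoding::DELTA_BYTE_ARRAY);
```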
…15411) #15081 added the ability to select per-column encodings in the Parquet writer. Some Parquet encodings (e.g. `DELTA_BINARY_PACKED`) do not mix well with compression (see [PARQUET-2414](https://issues.apache.org/jira/browse/PARQUET-2414) for example). This PR adds the ability to turn off compression for select columns. This uses the same mechanism as encoding selection, so an example use would be:

```c++
cudf::io::table_input_metadata table_metadata(table);
table_metadata.column_metadata[0]
  .set_name("int_delta_binary")
  .set_encoding(cudf::io::column_encoding::DELTA_BINARY_PACKED)
  .set_skip_compression(true);
```

Authors:
- Ed Seidl (https://github.com/etseidl)
- Bradley Dice (https://github.com/bdice)

Approvers:
- Muhammad Haseeb (https://github.com/mhaseeb123)
- Bradley Dice (https://github.com/bdice)

URL: #15411
…PI (#15613) Several recent PRs (#15081, #15411, #15600) added the ability to control some aspects of Parquet file writing on a per-column basis. During discussion of #15081 it was [suggested](#15081 (comment)) that these options be exposed by cuDF-python in a manner similar to pyarrow. This PR adds the ability to control per-column encoding, compression, binary output, and fixed-length data width, using fully qualified Parquet column names. For example, given a cuDF table with an integer column 'a' and a `list<int32>` column 'b', the fully qualified column names would be 'a' and 'b.list.element'.

Addresses the "Add cuDF-python API support for specifying encodings" task in #13501.

Authors:
- Ed Seidl (https://github.com/etseidl)
- Vukasin Milovanovic (https://github.com/vuule)
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- Muhammad Haseeb (https://github.com/mhaseeb123)
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #15613
Description
Allows users to request specific page encodings to use on a column-by-column basis. This is accomplished by adding an `encoding` property to the `column_in_metadata` struct. This is a necessary change before adding `DELTA_BYTE_ARRAY` encoding.
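A minimal end-to-end sketch of the write path with a requested encoding, assuming `table` is a `cudf::table_view`; treat the exact builder calls as my reading of the libcudf API rather than code from this PR:

```c++
#include <cudf/io/parquet.hpp>

// Request DELTA_BINARY_PACKED for the first column (name is a placeholder).
cudf::io::table_input_metadata metadata(table);
metadata.column_metadata[0]
  .set_name("ints")
  .set_encoding(cudf::io::column_encoding::DELTA_BINARY_PACKED);

// Build writer options pointing at a local file and attach the metadata.
auto options =
  cudf::io::parquet_writer_options::builder(cudf::io::sink_info{"out.parquet"}, table)
    .metadata(std::move(metadata))
    .build();

cudf::io::write_parquet(options);
```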