Add ability to request Parquet encodings on a per-column basis #15081
Conversation
cpp/include/cudf/io/types.hpp
```c++
/**
 * @brief Valid parquet encodings for use with `column_in_metadata::set_encoding()`
 */
struct parquet_encoding {
```
Note on my thinking here (or lack thereof)... I was thinking using strings to specify the desired encoding would be better than an enum, since the `column_input_metadata` is shared between multiple encoders, and it would be more natural to use strings with a CLI or through the python interface. And if ORC has different encoding names, then we could add another set of constants for that.

But a user interface could translate string values to an enum, and the enum could just add as many fields as necessary, some not relevant to one implementation or the other, so maybe this is silly. This acts like a scoped enum already, so I'm not opposed to switching.
It does seem like this could be an enum :) I don't think a lot of code would change.
Actually I rushed a bit with this comment. Would this become just `encoding` and contain a superset of all encoding types?
Yeah, probably, or `page_encoding` maybe. But it would have to apply to all encoders that use the metadata (which I assume is just parquet and orc for now).
`page_encoding` isn't a great name for ORC... how about `column_encoding`?
Can't believe I forgot my boy ORC. Sounds good!
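For reference, here is a minimal sketch of the scoped enum the thread converges on. Only `NOT_SET` appears in the diff later in this thread; the other enumerators are assumptions based on standard Parquet page encodings, not the PR's final definition.

```c++
// Hypothetical sketch of the column_encoding enum under discussion.
// NOT_SET comes from the diff below; the remaining enumerators are
// assumed standard Parquet page encodings.
enum class column_encoding {
  NOT_SET,              // let the writer choose an encoding
  PLAIN,
  DICTIONARY,
  DELTA_BINARY_PACKED,  // delta encoding for integer columns
  DELTA_BYTE_ARRAY      // incremental (prefix) encoding for strings
};
```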
Suggested expanded logging to ensure we never silently ignore a requested encoding.
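A self-contained sketch of that idea follows; the helper name, enum values, and message text are my assumptions rather than the PR's actual code.

```c++
#include <cstdio>
#include <string>

// Assumed stand-in for the PR's column_encoding enum.
enum class column_encoding { NOT_SET, DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY };

// If a requested encoding cannot be honored for a column, warn and fall
// back to the default instead of silently ignoring the request.
column_encoding resolve_encoding(column_encoding requested,
                                 bool supported_for_column,
                                 std::string const& column_name)
{
  if (requested != column_encoding::NOT_SET && !supported_for_column) {
    std::fprintf(stderr,
                 "requested encoding is not supported for column '%s'; using default\n",
                 column_name.c_str());
    return column_encoding::NOT_SET;
  }
  return requested;
}
```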
Test checked in... but the last CI run failed the style check because of some Python?
Yeah, that's a new one. I think there was a two-hour window when CI passed 💀
/ok to test
cpp/include/cudf/io/types.hpp
```diff
@@ -585,6 +605,7 @@ class column_in_metadata {
   std::optional<uint8_t> _decimal_precision;
   std::optional<int32_t> _parquet_field_id;
   std::vector<column_in_metadata> children;
+  column_encoding _encoding = column_encoding::NOT_SET;
```
```diff
-column_encoding _encoding = column_encoding::NOT_SET;
+column_encoding _encoding{column_encoding::NOT_SET};
```
Nothing else in the struct (or AFAICT in this file) uses an initializer list...maybe clean this up separately?
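As a general C++ aside (not specific to this PR): the two default-member-initializer forms are nearly equivalent, but the braced form rejects narrowing conversions.

```c++
// General C++ illustration of the two initializer styles discussed above.
struct example {
  int a = 1;     // copy-initialization syntax
  int b{2};      // braced (list) initialization
  // int c{2.5};   // error: narrowing double -> int is rejected
  // int d = 2.5;  // compiles: value is silently truncated to 2
};
```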
Looks great. Feel free to merge once you have the second approval.
/ok to test
Some nits but otherwise LGTM.
Perhaps not in this PR, but we will ultimately want to expose this in cuDF-python. In pandas, …
That was my original plan, but that's a heavier lift and I just wanted the bare minimum to at least be able to test new encoders. The pain point with how pyarrow does it is knowing in advance what the column names will be, especially for nested columns (see #14539 for instance).
Co-authored-by: Nghia Truong <[email protected]>
/ok to test
/merge
Re-submission of #14938. Final (delta) piece of #13501. Adds the ability to encode Parquet pages as `DELTA_BYTE_ARRAY`. Python testing will be added as a follow-on when per-column encoding selection is added to the python API (ref this [comment](#15081 (comment))).

Authors:
- Ed Seidl (https://github.com/etseidl)
- Vukasin Milovanovic (https://github.com/vuule)
- Yunsong Wang (https://github.com/PointKernel)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- Yunsong Wang (https://github.com/PointKernel)

URL: #15239
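As a hedged illustration (the column name and index are placeholders, and the call pattern is borrowed from the #15411 example below), requesting the new encoding for a string column might look like:

```c++
// Sketch only: assumes `table` is a cudf::table_view whose first column
// holds strings; set_encoding follows the pattern shown for #15411 below.
cudf::io::table_input_metadata table_metadata(table);
table_metadata.column_metadata[0]
  .set_name("string_col")
  .set_encoding(cudf::io::column_encoding::DELTA_BYTE_ARRAY);
```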
…15411) #15081 added the ability to select per-column encodings in the Parquet writer. Some Parquet encodings (e.g. `DELTA_BINARY_PACKED`) do not mix well with compression (see [PARQUET-2414](https://issues.apache.org/jira/browse/PARQUET-2414) for example). This PR adds the ability to turn off compression for select columns. This uses the same mechanism as encoding selection, so an example use would be:

```c++
cudf::io::table_input_metadata table_metadata(table);
table_metadata.column_metadata[0]
  .set_name("int_delta_binary")
  .set_encoding(cudf::io::column_encoding::DELTA_BINARY_PACKED)
  .set_skip_compression(true);
```

Authors:
- Ed Seidl (https://github.com/etseidl)
- Bradley Dice (https://github.com/bdice)

Approvers:
- Muhammad Haseeb (https://github.com/mhaseeb123)
- Bradley Dice (https://github.com/bdice)

URL: #15411
…PI (#15613) Several recent PRs (#15081, #15411, #15600) added the ability to control some aspects of Parquet file writing on a per-column basis. During discussion of #15081 it was [suggested](#15081 (comment)) that these options be exposed by cuDF-python in a manner similar to pyarrow. This PR adds the ability to control per-column encoding, compression, binary output, and fixed-length data width, using fully qualified Parquet column names. For example, given a cuDF table with an integer column 'a' and a `list<int32>` column 'b', the fully qualified column names would be 'a' and 'b.list.element'.

Addresses the "Add cuDF-python API support for specifying encodings" task in #13501.

Authors:
- Ed Seidl (https://github.com/etseidl)
- Vukasin Milovanovic (https://github.com/vuule)
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- Muhammad Haseeb (https://github.com/mhaseeb123)
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #15613
Description
Allows users to request specific page encodings to use on a column-by-column basis. This is accomplished by adding an `encoding` property to the `column_in_metadata` struct. This is a necessary change before adding `DELTA_BYTE_ARRAY` encoding.
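A minimal end-to-end sketch of the write path with a requested encoding, assuming `table` is a `cudf::table_view`; treat the exact builder calls as my reading of the libcudf API rather than code from this PR:

```c++
#include <cudf/io/parquet.hpp>

// Request DELTA_BINARY_PACKED for the first column (name is a placeholder).
cudf::io::table_input_metadata metadata(table);
metadata.column_metadata[0]
  .set_name("ints")
  .set_encoding(cudf::io::column_encoding::DELTA_BINARY_PACKED);

// Build writer options pointing at a local file and attach the metadata.
auto options =
  cudf::io::parquet_writer_options::builder(cudf::io::sink_info{"out.parquet"}, table)
    .metadata(std::move(metadata))
    .build();

cudf::io::write_parquet(options);
```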