Add option to Parquet writer to skip compressing individual columns #15411

etseidl · 2024-03-28T16:46:35Z

Description

#15081 added the ability to select per-column encodings in the Parquet writer. Some Parquet encodings (e.g., DELTA_BINARY_PACKED) do not mix well with compression (see PARQUET-2414 for an example). This PR adds the ability to turn off compression for selected columns. It uses the same mechanism as encoding selection, so an example use would be:

  cudf::io::table_input_metadata table_metadata(table);
  table_metadata.column_metadata[0]
    .set_name("int_delta_binary")
    .set_encoding(cudf::io::column_encoding::DELTA_BINARY_PACKED)
    .set_skip_compression(true);
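
To make the intended effect concrete, here is a minimal, self-contained sketch of the per-column decision, assuming a column that opts out is recorded as UNCOMPRESSED while every other column keeps the writer-level codec. The `codec`, `column_options`, and `resolve_codecs` names are hypothetical stand-ins for illustration and are not libcudf types or functions.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only; these are NOT the libcudf types.
enum class codec { UNCOMPRESSED, SNAPPY, ZSTD };

struct column_options {
  std::string name;
  std::optional<bool> skip_compression;  // unset means "follow the writer default"
};

// Resolve the codec for each column: columns that opt out of compression get
// UNCOMPRESSED, all other columns keep the writer-level codec.
std::vector<codec> resolve_codecs(std::vector<column_options> const& columns, codec writer_codec)
{
  std::vector<codec> result;
  result.reserve(columns.size());
  for (auto const& col : columns) {
    result.push_back(col.skip_compression.value_or(false) ? codec::UNCOMPRESSED : writer_codec);
  }
  return result;
}
```

Under this assumption, a table whose first column sets `skip_compression` and whose second column leaves it unset would resolve to one uncompressed column and one column using the writer's codec.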

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2024-03-28T16:46:39Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

etseidl · 2024-03-28T16:47:22Z

cc @GregoryKimball

mhaseeb123

Looks good. Thank you for the effort!

mhaseeb123 · 2024-04-08T18:50:56Z

/ok to test

bdice

This looks fine to me. @vuule Do you want to take a look before merging? If not, feel free to merge as-is.

bdice · 2024-04-18T19:35:00Z

/ok to test

vuule

What is this, a stealth feature? Not even a CC!? :(

Cool stuff, looks good, just have a few questions.

vuule · 2024-04-18T21:52:19Z

cpp/tests/io/parquet_writer_test.cpp

+  cudf::io::parquet::detail::FileMetaData fmd;
+  read_footer(source, &fmd);
+
+  EXPECT_EQ(fmd.row_groups[0].columns[0].meta_data.codec, cudf::io::parquet::detail::UNCOMPRESSED);


vuule

This can fail if the compressed size is, by chance, larger than the uncompressed size?

etseidl

Hmm... not that line, but the line below could 😟

vuule

Right, the other one.
Still, it shouldn't change randomly. The way the values are encoded and the way the compression works should be very stable.

cpp/src/io/parquet/page_enc.cu

vuule · 2024-04-18T23:50:22Z

/merge

…PI (#15613)

Several recent PRs (#15081, #15411, #15600) added the ability to control some aspects of Parquet file writing on a per-column basis. During discussion of #15081 it was [suggested](#15081 (comment)) that these options be exposed by cuDF-python in a manner similar to pyarrow. This PR adds the ability to control per-column encoding, compression, binary output, and fixed-length data width, using fully qualified Parquet column names. For example, given a cuDF table with an integer column 'a', and a `list<int32>` column 'b', the fully qualified column names would be 'a' and 'b.list.element'.

Addresses "Add cuDF-python API support for specifying encodings" task in #13501.

Authors:
- Ed Seidl (https://github.com/etseidl)
- Vukasin Milovanovic (https://github.com/vuule)
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- Muhammad Haseeb (https://github.com/mhaseeb123)
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #15613
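
The fully qualified names in that message ('a' and 'b.list.element') follow from joining the names along each path from a top-level column down to a leaf. The sketch below illustrates that idea under the assumption that a list column is laid out as `b` containing `list` containing `element`; `schema_node`, `collect_leaf_paths`, and `leaf_paths` are hypothetical helpers written for this example and are not part of cuDF or any Parquet library.

```cpp
#include <string>
#include <vector>

// Hypothetical schema node for illustration; not a cuDF or Parquet library type.
struct schema_node {
  std::string name;
  std::vector<schema_node> children;  // empty for leaf (data-carrying) columns
};

// Append the dotted path of every leaf reachable from `node` to `out`.
void collect_leaf_paths(schema_node const& node,
                        std::string const& prefix,
                        std::vector<std::string>& out)
{
  std::string const path = prefix.empty() ? node.name : prefix + "." + node.name;
  if (node.children.empty()) {
    out.push_back(path);
    return;
  }
  for (auto const& child : node.children) {
    collect_leaf_paths(child, path, out);
  }
}

// Return the dotted leaf paths for all top-level columns, in schema order.
std::vector<std::string> leaf_paths(std::vector<schema_node> const& top_level)
{
  std::vector<std::string> out;
  for (auto const& col : top_level) {
    collect_leaf_paths(col, "", out);
  }
  return out;
}
```

For a schema with a leaf column `a` and a column `b` nested as `b` > `list` > `element`, this yields the two names quoted in the commit message.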

etseidl and others added 5 commits March 27, 2024 22:15

initial pass

380437b

formatting

199260c

add skip_compression

0c8fd7d

add todo

0f9c8f6

add test

6aaebd7

etseidl requested a review from a team as a code owner March 28, 2024 16:46

etseidl requested review from bdice and mhaseeb123 March 28, 2024 16:46

github-actions bot added the libcudf label ("Affects libcudf (C++/CUDA) code.") Mar 28, 2024

mhaseeb123 reviewed Apr 1, 2024

mhaseeb123 approved these changes Apr 1, 2024

etseidl added 5 commits April 1, 2024 14:16

Merge branch 'branch-24.06' into select_comp

6f62d7d

Merge branch 'branch-24.06' into select_comp

1560b89

Merge branch 'branch-24.06' into select_comp

81db048

Merge branch 'branch-24.06' into select_comp

c77bc59

Merge branch 'rapidsai:branch-24.06' into select_comp

416b31f

etseidl added 6 commits April 9, 2024 12:59

Merge branch 'branch-24.06' into select_comp

4e463aa

Merge branch 'branch-24.06' into select_comp

d5e5974

Merge branch 'branch-24.06' into select_comp

8e7e86f

Merge branch 'branch-24.06' into select_comp

f1eeebe

Merge branch 'branch-24.06' into select_comp

7c9e67d

Merge branch 'branch-24.06' into select_comp

6859438

bdice approved these changes Apr 18, 2024

bdice added the feature request ("New feature or request") and non-breaking ("Non-breaking change") labels Apr 18, 2024

Merge branch 'branch-24.06' into select_comp

93d1a9e

bdice assigned etseidl Apr 18, 2024

bdice added this to the Parquet continuous improvement milestone Apr 18, 2024

vuule reviewed Apr 18, 2024

vuule added the cuIO label ("cuIO issue") Apr 18, 2024

rapids-bot bot merged commit e0c4280 into rapidsai:branch-24.06 Apr 18, 2024
75 checks passed

etseidl deleted the select_comp branch April 19, 2024 00:19

etseidl mentioned this pull request Apr 29, 2024

Expose some Parquet per-column configuration options via the python API #15613

Merged

3 tasks
