Add partitioning support in parquet writer #9810

devavret · 2021-12-01T12:23:23Z

Contributes to #5059

Adds libcudf support for writing partitioned datasets in the parquet writer. With the new API, one can specify a vector of {start_row, num_rows} structs along with a table, such that each slice of the input table is written to the corresponding sink.
Also adds multi-sink support in sink_info.
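
To make the slicing semantics above concrete, here is a minimal, self-contained sketch. It does not use libcudf: `partition_info`, `toy_table`, `toy_sink` and `write_partitioned` are hypothetical stand-ins invented for illustration, and the one-sink-per-partition check is an assumption rather than something this description states.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the {start_row, num_rows} struct described above.
struct partition_info {
  std::size_t start_row;
  std::size_t num_rows;
};

// Toy "table": one string per row. A real libcudf table is columnar.
using toy_table = std::vector<std::string>;
// Toy "sink": simply collects the rows written to it.
using toy_sink = std::vector<std::string>;

// Writes slice i of `table` to sink i.
// Assumption: exactly one sink is supplied per partition.
void write_partitioned(toy_table const& table,
                       std::vector<partition_info> const& partitions,
                       std::vector<toy_sink>& sinks)
{
  if (partitions.size() != sinks.size()) {
    throw std::invalid_argument("need exactly one sink per partition");
  }
  for (std::size_t i = 0; i < partitions.size(); ++i) {
    auto const& p = partitions[i];
    // Reject slices that run past the end of the table.
    if (p.start_row > table.size() || p.num_rows > table.size() - p.start_row) {
      throw std::out_of_range("partition exceeds table bounds");
    }
    auto const first = table.begin() + static_cast<std::ptrdiff_t>(p.start_row);
    auto const last  = first + static_cast<std::ptrdiff_t>(p.num_rows);
    sinks[i].insert(sinks[i].end(), first, last);
  }
}
```

With a five-row table and partitions {0, 2} and {2, 3}, the first sink receives rows 0 and 1 and the second receives rows 2 through 4.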

vuule · 2021-12-13T06:29:45Z

Please add description to the PR

vuule

Just a set of final nitpicks. Very happy with the PR overall.
Thank you for addressing the feedback to this extent (e.g. source_info update).

cpp/src/io/parquet/writer_impl.cu

cpp/include/cudf/io/types.hpp

vuule

🔥

galipremsagar

Looks good, minor comment..

python/cudf/cudf/_lib/orc.pyx

devavret · 2021-12-14T22:09:38Z

@gpucibot merge

bdice · 2021-12-15T00:54:56Z

cpp/include/cudf/io/parquet.hpp

   * @return this for chaining.
   */
-  parquet_writer_options_builder& column_chunks_file_path(std::string file_path)
+  parquet_writer_options_builder& column_chunks_file_paths(std::vector<std::string> file_paths)


This change breaks compilation of benchmarks, which were not updated in this PR:

cudf/cpp/benchmarks/io/parquet/parquet_writer_benchmark.cpp

Line 88 in fc2a32a

.column_chunks_file_path(file_path);

We should ensure that CI compiles the benchmarks even if they don't run.
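
As a rough illustration of why the old call site no longer compiles and how a single-path caller can adapt, here is a self-contained sketch. `mock_writer_options_builder` and `single_file_options` are hypothetical stand-ins that mirror only the renamed setter from the diff above; this is not the actual benchmark fix, and the real builder has many more members.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in mirroring only the renamed setter shown in the diff.
class mock_writer_options_builder {
 public:
  // The new setter takes a vector of paths instead of a single string.
  mock_writer_options_builder& column_chunks_file_paths(std::vector<std::string> file_paths)
  {
    paths_ = std::move(file_paths);
    return *this;  // returned for chaining, as in the documented signature
  }

  std::vector<std::string> const& paths() const { return paths_; }

 private:
  std::vector<std::string> paths_;
};

// A caller that used to pass one path now wraps it in a one-element vector.
inline mock_writer_options_builder single_file_options(std::string const& file_path)
{
  mock_writer_options_builder builder;
  builder.column_chunks_file_paths({file_path});
  return builder;
}
```

A call to the old name, `column_chunks_file_path(file_path)`, would fail against this mock for the same reason as in the benchmark: the member no longer exists under that name.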

I have the fix in #9776, just needs a small donation of one code review :P

@devavret

This fixes a compilation error introduced in #9810. Tagging @devavret @vuule for review. Feel free to push to this PR with any fixes.

Authors:
- Bradley Dice (https://github.com/bdice)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- Nghia Truong (https://github.com/ttnghia)

URL: #9905

revans2 · 2021-12-15T11:47:18Z

It also would have been nice to get a heads up that the java build broke

devavret · 2021-12-15T11:50:53Z

> It also would have been nice to get a heads up that the java build broke

Sorry about that 😞. I presumed that the Spark team would have been notified if a PR had a breaking label.

This fixes the java build after #9810 went in. There is a lot of copy/paste in this first draft, because I just wanted to get something to work. Not sure if it is worth going back to make it common everywhere.

Authors:
- Robert (Bobby) Evans (https://github.com/revans2)

Approvers:
- Jason Lowe (https://github.com/jlowe)

URL: #9908

Makes use of the efficient partitioned writing support added in #9810 to improve performance of partitioned parquet dataset writing. Closes #5059

Authors:
- Devavret Makkar (https://github.com/devavret)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Richard (Rick) Zamora (https://github.com/rjzamora)

URL: #9971

devavret added 13 commits October 23, 2021 03:26

First working version of partitioned write

7ca6570

Merge branch 'branch-22.02' into parq-partitioned-write

80e03a4

multiple sink API

21dc54b

partitions in write parquet API

d947abd

Fix a bug in frag causing incorrect num rows

360bf87

Merge branch 'branch-22.02' into parq-partitioned-write

942dd58

Dict encoding changes. Dict kernels now use frags

d454507

API cleanups

b2b44a6

Add a gtest and fix other tests by handling no partition case

0b6d33f

Add a guard to protect from an exception being thrown in impl dtor when a previous write(table) call failed

2beed73

Add per-sink user_data in table_input_metadata

4e21e99

Cleanups in dict code and replace index translating while LIST loop with function

e0d1f33

fix the returned metadata blob on close

54de724

devavret requested a review from vuule December 1, 2021 12:23

github-actions bot added the Python and libcudf labels Dec 1, 2021

Revert to using meta ctor without user_data in pyx

aa45827

Exception raised by writer needs to be same as that raised by pandas. If user_data is constructed earlier using pyarrow then the exception is raised early and is different

devavret added the 4 - Needs cuIO Reviewer, breaking, cuIO, and feature request labels Dec 1, 2021

devavret marked this pull request as ready for review December 1, 2021 18:35

devavret requested review from a team as code owners December 1, 2021 18:35

devavret requested review from nvdbaranec, shwina and brandon-b-miller December 1, 2021 18:35

devavret added 2 commits December 2, 2021 01:16

Remove num_rows param and docs cleanup

06b2643

orc use table meta ctor with single user_data

fffb41e

devavret added 2 commits December 11, 2021 00:19

revert tests that were changed for debugging

be8c19a

Add empty df tests, make review changes

b9b5c15

devavret requested a review from vuule December 10, 2021 21:12

vuule requested changes Dec 13, 2021

devavret added 3 commits December 13, 2021 17:07

Review changes: reduce line size by aliasing the variable I keep using

1e79453

source/sink_info member privatisation

be83945

aggregate metadata privatisation

e537314

devavret requested a review from vuule December 13, 2021 12:12

Merge branch 'branch-22.02' into parq-partitioned-write

91580b4

vuule approved these changes Dec 13, 2021

Fix a private member access

c53fea7

galipremsagar reviewed Dec 14, 2021

python/cudf/cudf/_lib/orc.pyx (resolved)

galipremsagar approved these changes Dec 14, 2021

rapids-bot bot merged commit 41f9956 into rapidsai:branch-22.02 Dec 14, 2021

bdice reviewed Dec 15, 2021

bdice mentioned this pull request Dec 15, 2021

Fix compilation of benchmark for parquet writer. #9905

Merged

revans2 mentioned this pull request Dec 15, 2021

Fix the java build after parquet partitioning support #9908

Merged

devavret mentioned this pull request Jan 4, 2022

Use new efficient partitioned parquet writing in cuDF #9971

Merged

bdice mentioned this pull request Jan 24, 2022

Remove benchmarks suffix #10112

Merged

vyasr added the 4 - Needs Review label and removed the 4 - Needs cuIO Reviewer label Feb 23, 2024
