Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add partitioning support in parquet writer #9810

Merged
merged 38 commits into from
Dec 14, 2021

Conversation

devavret
Copy link
Contributor

@devavret devavret commented Dec 1, 2021

Contributes to #5059

Adds libcudf support for writing partitioned datasets in parquet writer. With the new API, one can specify a vector of {start_row, num_rows} structs along with a table st slices of the input table gets written to the corresponding sink.
Adds Multi-sink support in sink_info

@devavret devavret requested a review from vuule December 1, 2021 12:23
@github-actions github-actions bot added Python Affects Python cuDF API. libcudf Affects libcudf (C++/CUDA) code. labels Dec 1, 2021
Exception raised by writer needs to be same as that raised by pandas.
If user_data is constructed earlier using pyarrow then the exception is
raised early and is different
@devavret devavret added 4 - Needs cuIO Reviewer breaking Breaking change cuIO cuIO issue feature request New feature or request labels Dec 1, 2021
@devavret devavret marked this pull request as ready for review December 1, 2021 18:35
@devavret devavret requested review from a team as code owners December 1, 2021 18:35
@devavret devavret requested a review from vuule December 10, 2021 21:12
@vuule
Copy link
Contributor

vuule commented Dec 13, 2021

Please add description to the PR

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a set of final nitpicks. Very happy with the PR overall.
Thank you for addressing the feedback to this extent (e.g. source_info update).

cpp/src/io/parquet/writer_impl.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/writer_impl.cu Show resolved Hide resolved
cpp/src/io/parquet/writer_impl.cu Show resolved Hide resolved
cpp/src/io/parquet/writer_impl.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/writer_impl.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/writer_impl.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/writer_impl.cu Outdated Show resolved Hide resolved
cpp/include/cudf/io/types.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/io/types.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/io/types.hpp Outdated Show resolved Hide resolved
@devavret devavret requested a review from vuule December 13, 2021 12:12
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥

Copy link
Contributor

@galipremsagar galipremsagar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, minor comment..

python/cudf/cudf/_lib/orc.pyx Show resolved Hide resolved
@devavret
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 41f9956 into rapidsai:branch-22.02 Dec 14, 2021
* @return this for chaining.
*/
parquet_writer_options_builder& column_chunks_file_path(std::string file_path)
parquet_writer_options_builder& column_chunks_file_paths(std::vector<std::string> file_paths)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change breaks compilation of benchmarks, which were not updated in this PR:

.column_chunks_file_path(file_path);

We should ensure that CI compiles the benchmarks even if they don't run.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the fix in #9776, just needs a small donation of one code review :P

rapids-bot bot pushed a commit that referenced this pull request Dec 15, 2021
This fixes a compilation error introduced in #9810. Tagging @devavret @vuule for review. Feel free to push to this PR with any fixes.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Nghia Truong (https://github.com/ttnghia)

URL: #9905
@revans2
Copy link
Contributor

revans2 commented Dec 15, 2021

It also would have been nice to get a heads up that the java build broke

@devavret
Copy link
Contributor Author

It also would have been nice to get a heads up that the java build broke

Sorry about that 😞 . I presumed that spark team must've been notified if a PR had a breaking label.

rapids-bot bot pushed a commit that referenced this pull request Dec 15, 2021
This fixes the java build after #9810 went in.

There is a lot of copy/paste in this first draft, because I just wanted to get something to work. Not sure if it is worth going back to make it common everywhere.

Authors:
  - Robert (Bobby) Evans (https://github.com/revans2)

Approvers:
  - Jason Lowe (https://github.com/jlowe)

URL: #9908
rapids-bot bot pushed a commit that referenced this pull request Jan 10, 2022
Makes use of the efficient partitioned writing support added in #9810 to improve performance of partitioned parquet dataset writing.

Closes #5059

Authors:
  - Devavret Makkar (https://github.com/devavret)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Richard (Rick) Zamora (https://github.com/rjzamora)

URL: #9971
@bdice bdice mentioned this pull request Jan 24, 2022
@vyasr vyasr added 4 - Needs Review Waiting for reviewer to review or respond and removed 4 - Needs cuIO Reviewer labels Feb 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
4 - Needs Review Waiting for reviewer to review or respond breaking Breaking change cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants