Add partitioning support in parquet writer #9810
…en a previous write(table) call failed
The exception raised by the writer needs to be the same as the one raised by pandas. If user_data is constructed earlier using pyarrow, the exception is raised early and is different.
Please add a description to the PR.
Just a set of final nitpicks. Very happy with the PR overall.
Thank you for addressing the feedback to this extent (e.g. source_info
update).
🔥
Looks good, minor comment..
@gpucibot merge
* @return this for chaining.
*/
parquet_writer_options_builder& column_chunks_file_path(std::string file_path)
parquet_writer_options_builder& column_chunks_file_paths(std::vector<std::string> file_paths)
This change breaks compilation of benchmarks, which were not updated in this PR:
.column_chunks_file_path(file_path);
We should ensure that CI compiles the benchmarks even if they don't run.
I have the fix in #9776, just needs a small donation of one code review :P
This fixes a compilation error introduced in #9810. Tagging @devavret @vuule for review. Feel free to push to this PR with any fixes.
Authors:
- Bradley Dice (https://github.com/bdice)
Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- Nghia Truong (https://github.com/ttnghia)
URL: #9905
It also would have been nice to get a heads up that the java build broke
Sorry about that 😞. I presumed that spark team must've been notified if a PR had a
This fixes the java build after #9810 went in. There is a lot of copy/paste in this first draft, because I just wanted to get something to work. Not sure if it is worth going back to make it common everywhere.
Authors:
- Robert (Bobby) Evans (https://github.com/revans2)
Approvers:
- Jason Lowe (https://github.com/jlowe)
URL: #9908
Makes use of the efficient partitioned writing support added in #9810 to improve performance of partitioned parquet dataset writing. Closes #5059
Authors:
- Devavret Makkar (https://github.com/devavret)
Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Richard (Rick) Zamora (https://github.com/rjzamora)
URL: #9971
Contributes to #5059
Adds libcudf support for writing partitioned datasets in the parquet writer. With the new API, one can specify a vector of {start_row, num_rows} structs along with a table, such that each slice of the input table gets written to the corresponding sink.
Adds multi-sink support in sink_info.
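To make the slice-to-sink mapping in the description concrete, here is a minimal standalone sketch. It does not call libcudf: the partition_info struct, the sink alias, and the write_partitioned and demo_partitioned_write functions are hypothetical stand-ins written for illustration. Rows are modeled as strings and sinks as in-memory vectors, and the sketch assumes exactly one sink per partition.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the {start_row, num_rows} struct described above.
struct partition_info {
  std::size_t start_row;
  std::size_t num_rows;
};

// Hypothetical stand-in for one output sink; a real sink would be a file or buffer.
using sink = std::vector<std::string>;

// Appends rows [start_row, start_row + num_rows) of `table` to the sink that has
// the same index as the partition. Assumes one sink per partition.
void write_partitioned(std::vector<std::string> const& table,
                       std::vector<partition_info> const& partitions,
                       std::vector<sink>& sinks)
{
  if (partitions.size() != sinks.size()) {
    throw std::invalid_argument("need exactly one sink per partition");
  }
  for (std::size_t i = 0; i < partitions.size(); ++i) {
    auto const& p = partitions[i];
    // Reject partitions that run past the end of the table.
    if (p.start_row > table.size() || p.num_rows > table.size() - p.start_row) {
      throw std::out_of_range("partition exceeds table bounds");
    }
    auto const first = table.begin() + static_cast<std::ptrdiff_t>(p.start_row);
    auto const last  = first + static_cast<std::ptrdiff_t>(p.num_rows);
    sinks[i].insert(sinks[i].end(), first, last);
  }
}

// Example: a five-row table split into two partitions, one sink each.
std::vector<sink> demo_partitioned_write()
{
  std::vector<std::string> const table{"r0", "r1", "r2", "r3", "r4"};
  std::vector<partition_info> const partitions{{0, 2}, {2, 3}};
  std::vector<sink> sinks(partitions.size());
  write_partitioned(table, partitions, sinks);
  return sinks;
}
```

In this sketch the first sink receives rows r0 and r1 and the second receives r2 through r4. Passing the partitions as row ranges over a single table means the caller does not have to materialize each slice as a separate table before writing it.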