Add parameters to control row group size in Parquet writer #9677
…fea-parquet-rowgroup-size
Nice cleanup. Comments attached.
Co-authored-by: Bradley Dice <[email protected]>
The commentary about page size was a helpful improvement. Thanks!
LGTM. Just some coding-style comments for your reference.
/**
 * @brief Returns maximum stripe size, in bytes.
 */
- auto stripe_size_bytes() const { return _stripe_size_bytes; }
+ auto get_stripe_size_bytes() const { return _stripe_size_bytes; }
I remember a while back, @codereport and @jrhemstad had a long discussion on whether the `get` prefix should be used or not, and the outcome was that no prefix is more readable, IIRC.
That's true in general. However, we made a conscious decision to use the `get_` prefix for the `_options` classes. I'm unable to find the discussion that led to this (perhaps Jake had the discussion with Conor later on).
We can choose to remove the prefix, but it should be done for all options. Either way, I prefer to keep the API consistent, so I'm adding these options with the prefix.
Co-authored-by: Yunsong Wang <[email protected]>
Codecov Report
@@ Coverage Diff @@
## branch-22.02 #9677 +/- ##
===============================================
Coverage ? 10.51%
===============================================
Files ? 118
Lines ? 20249
Branches ? 0
===============================================
Hits ? 2130
Misses ? 18119
Partials ? 0
@gpucibot merge
This PR fixes #9615

Adds two more parameters to the Parquet writer options objects to control page sizes. One sets a target page size in bytes (defaults to 512 KiB), and the other sets a maximum number of rows per page (defaults to 20000; see [this](https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/) for the rationale behind this choice).

~~This also removes the validation logic (and unit tests) from the row_group_size setters, since the page size is no longer fixed. Perhaps validation could be moved to the build() function.~~

I've tried to follow the naming convention from #9677.

Still needs help on python bindings and unit tests.

Authors:
- https://github.com/etseidl

Approvers:
- Devavret Makkar (https://github.com/devavret)
- Vukasin Milovanovic (https://github.com/vuule)
- Bradley Dice (https://github.com/bdice)

URL: #10882
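To make the interaction between the two page limits concrete, here is a minimal sketch of how a byte target and a row cap might jointly bound the number of rows in a page. The helper `estimate_rows_per_page` and the constant names are hypothetical and are not part of the cuDF API; only the default values (512 KiB and 20000 rows) come from the description above. The assumption that a page closes as soon as either limit is reached is an illustration, not a statement about the actual writer.

```cpp
#include <algorithm>
#include <cstddef>

// Defaults quoted in the PR description; the constant names are hypothetical.
constexpr std::size_t target_page_size_bytes = 512 * 1024;  // 512 KiB
constexpr std::size_t max_page_size_rows     = 20000;

// Hypothetical helper: estimates how many rows land in one page, assuming
// the page is closed as soon as either the byte target or the row cap is hit.
inline std::size_t estimate_rows_per_page(std::size_t avg_row_bytes,
                                          std::size_t max_bytes = target_page_size_bytes,
                                          std::size_t max_rows  = max_page_size_rows)
{
  // Rows with no measurable size are limited only by the row cap.
  if (avg_row_bytes == 0) { return max_rows; }
  // A page always holds at least one row, even if that row exceeds the target.
  std::size_t const rows_by_bytes = std::max<std::size_t>(1, max_bytes / avg_row_bytes);
  return std::min(rows_by_bytes, max_rows);
}
```

Under these assumptions, narrow rows (for example 4 bytes each) are limited by the row cap, while wide rows (for example 1 KiB each) are limited by the byte target.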
Closes #9615
Adds the following API to the Parquet writer:
The API is more limited than its ORC equivalent because of limitations in Parquet page size control/estimation.
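As a rough illustration of row group size controls on an options object, the sketch below pairs `get_`-prefixed getters with setters that reject values below a minimum. The class name, the `set_` method names, the default values, and the minimum values are all assumptions made for this example; they are not taken from the PR and should not be read as the actual cuDF interface.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical lower bounds, chosen only for illustration.
constexpr std::size_t example_min_row_group_size_bytes = 512 * 1024;
constexpr std::size_t example_min_row_group_size_rows  = 5000;

// Hypothetical options class; not the actual cuDF type.
class example_parquet_writer_options {
  std::size_t _row_group_size_bytes = 128 * 1024 * 1024;  // illustrative default
  std::size_t _row_group_size_rows  = 1000000;            // illustrative default

 public:
  /**
   * @brief Returns maximum row group size, in bytes.
   */
  auto get_row_group_size_bytes() const { return _row_group_size_bytes; }

  /**
   * @brief Returns maximum row group size, in rows.
   */
  auto get_row_group_size_rows() const { return _row_group_size_rows; }

  /**
   * @brief Sets maximum row group size, in bytes; throws if below the minimum.
   */
  void set_row_group_size_bytes(std::size_t size_bytes)
  {
    if (size_bytes < example_min_row_group_size_bytes) {
      throw std::invalid_argument("row group size in bytes is below the minimum");
    }
    _row_group_size_bytes = size_bytes;
  }

  /**
   * @brief Sets maximum row group size, in rows; throws if below the minimum.
   */
  void set_row_group_size_rows(std::size_t size_rows)
  {
    if (size_rows < example_min_row_group_size_rows) {
      throw std::invalid_argument("row group size in rows is below the minimum");
    }
    _row_group_size_rows = size_rows;
  }
};
```

Validating in the setters keeps an options object from ever holding an out-of-range value; moving the checks to a single build step would be an alternative when the valid range depends on other settings.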
Other changes:
- Renamed `rowgroup` to `row_group` in APIs, since the Parquet specs refer to this as "row group", not "rowgroup".
- `uint32_t` use in Parquet writer.
- `target_page_size`.