[FEA] Control rowgroup and page size when writing Parquet files #9615
Comments
Closes #9615

Adds the following API to the Parquet writer:
- Set maximum row group size, in bytes (minimum of 512KB);
- Set maximum row group size, in rows (minimum of 5000).

The API is more limited than its ORC equivalent because of limitations in Parquet page size control/estimation.

Other changes:
- Fix naming in some ORC APIs to be consistent.
- Change `rowgroup` to `row_group` in APIs, since the Parquet specs refer to this as "row group", not "rowgroup".
- Replace some `uint32_t` use in the Parquet writer.
- Remove unused `target_page_size`.

Authors:
- Vukasin Milovanovic (https://github.com/vuule)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Yunsong Wang (https://github.com/PointKernel)
- Ashwin Srinath (https://github.com/shwina)

URL: #9677
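To make the interaction of the two row group limits concrete, here is a minimal Python sketch. It is not the cuDF implementation: `RowGroupLimits` and `plan_row_groups` are hypothetical names, the fixed `bytes_per_row` is a simplifying assumption, and reading "512KB" as 512 * 1024 bytes is also an assumption. It only illustrates the idea that a row group closes when either the byte limit or the row limit is reached, and that values below the stated minimums are rejected.

```python
from dataclasses import dataclass

# Minimums quoted in the PR description; "512KB" is assumed to mean 512 * 1024 bytes.
MIN_ROW_GROUP_BYTES = 512 * 1024
MIN_ROW_GROUP_ROWS = 5000


@dataclass(frozen=True)
class RowGroupLimits:
    """Hypothetical holder for the two limits (not a cuDF type)."""

    max_bytes: int
    max_rows: int

    def __post_init__(self):
        # Reject values below the minimums mentioned in the description.
        if self.max_bytes < MIN_ROW_GROUP_BYTES:
            raise ValueError("row group size in bytes is below the 512KB minimum")
        if self.max_rows < MIN_ROW_GROUP_ROWS:
            raise ValueError("row group size in rows is below the 5000-row minimum")


def plan_row_groups(num_rows, bytes_per_row, limits):
    """Return (start, end) row ranges that respect both limits.

    Assumes every row occupies exactly `bytes_per_row` bytes, which a real
    writer cannot assume; it has to estimate sizes per column.
    """
    if bytes_per_row <= 0:
        raise ValueError("bytes_per_row must be positive")
    rows_by_bytes = max(1, limits.max_bytes // bytes_per_row)
    # Whichever limit is hit first decides the row group length.
    rows_per_group = min(limits.max_rows, rows_by_bytes)
    return [
        (start, min(start + rows_per_group, num_rows))
        for start in range(0, num_rows, rows_per_group)
    ]
```

With `RowGroupLimits(max_bytes=1024 * 1024, max_rows=10000)`, narrow rows are bounded by the row limit and wide rows by the byte limit.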
Reopening - page size control has not been implemented.
This issue has been labeled
Hi @vuule. Controlling page size is a feature of great interest to my group. I've taken a stab at implementing this functionality. Would it be useful to submit a PR, or do you already have an implementation in mind and just need the time/priority to get it implemented?
@etseidl Sure, go ahead.
This PR fixes #9615

Adds two more parameters to the Parquet writer options objects to control page sizes. One sets a target page size in bytes (defaults to 512 KiB), and the other sets a maximum number of rows per page (defaults to 20000, see [this](https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/) for the rationale behind this choice).

~~This also removes the validation logic (and unit tests) from the row_group_size setters, since the page size is no longer fixed. Perhaps validation could be moved to the build() function.~~

I've tried to follow the naming convention from #9677.

Still needs help on Python bindings and unit tests.

Authors:
- https://github.com/etseidl

Approvers:
- Devavret Makkar (https://github.com/devavret)
- Vukasin Milovanovic (https://github.com/vuule)
- Bradley Dice (https://github.com/bdice)

URL: #10882
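As a rough illustration of how the two page parameters combine, the following Python sketch estimates how many rows land in a page and how many pages a column chunk produces. The function names and the fixed `avg_value_bytes` are hypothetical and not part of cuDF; only the two defaults (512 KiB and 20000 rows) come from the description above. A real writer treats the byte figure as a target and works from size estimates, so actual pages can differ.

```python
# Defaults quoted in the PR description.
DEFAULT_PAGE_BYTES = 512 * 1024
DEFAULT_PAGE_ROWS = 20000


def rows_per_page(avg_value_bytes,
                  target_page_bytes=DEFAULT_PAGE_BYTES,
                  max_page_rows=DEFAULT_PAGE_ROWS):
    """Estimate rows per page; whichever limit is reached first wins.

    `avg_value_bytes` is an assumed uniform value size for one column.
    """
    if avg_value_bytes <= 0:
        raise ValueError("avg_value_bytes must be positive")
    rows_by_bytes = max(1, target_page_bytes // avg_value_bytes)
    return min(max_page_rows, rows_by_bytes)


def count_pages(num_rows, avg_value_bytes, **limits):
    """Number of pages needed for `num_rows` values under the same assumptions."""
    per_page = rows_per_page(avg_value_bytes, **limits)
    return -(-num_rows // per_page)  # ceiling division
```

Under these assumptions, a column of 4-byte values is bounded by the row limit (20000 rows per page), while a column of 100-byte values is bounded by the byte target (about 5242 rows per page).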
Add a parameter to control the rowgroup size and the page size of the output Parquet file. The size can be specified in bytes or in the number of rows.