[FEA] Control rowgroup and page size when writing Parquet files #9615

Closed
vuule opened this issue Nov 5, 2021 · 4 comments · Fixed by #9677 or #10882
Labels: cuIO, feature request

Comments

@vuule
Contributor

vuule commented Nov 5, 2021

Add a parameter to control the rowgroup size and the page size of the output Parquet file. The size can be specified in bytes or in the number of rows.

@vuule added the feature request and cuIO labels on Nov 5, 2021
rapids-bot bot pushed a commit that referenced this issue Nov 17, 2021
Closes #9615

Adds the following API to the Parquet writer:

- Set maximum row group size, in bytes (minimum of 512KB);
- Set maximum row group size, in rows (minimum of 5000).

The API is more limited than its ORC equivalent because of limitations in Parquet page size control and estimation.
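As a rough illustration of the two limits described above, here is a minimal Python sketch of setters that enforce the stated minimums. `RowGroupLimits`, its method names, and the default values are hypothetical and are not the cuDF API; the sketch also assumes "512KB" means 512 * 1024 bytes.

```python
from dataclasses import dataclass

# Minimums taken from the commit message; "512KB" is assumed to be 512 * 1024.
MIN_ROW_GROUP_SIZE_BYTES = 512 * 1024
MIN_ROW_GROUP_SIZE_ROWS = 5000


@dataclass
class RowGroupLimits:
    """Hypothetical stand-in for the writer options; not the cuDF class."""

    # Placeholder defaults chosen for this sketch only.
    max_bytes: int = 128 * 1024 * 1024
    max_rows: int = 1_000_000

    def set_max_bytes(self, size_bytes: int) -> None:
        # Reject values below the documented minimum.
        if size_bytes < MIN_ROW_GROUP_SIZE_BYTES:
            raise ValueError("row group size must be at least 512KB")
        self.max_bytes = size_bytes

    def set_max_rows(self, size_rows: int) -> None:
        # Reject values below the documented minimum.
        if size_rows < MIN_ROW_GROUP_SIZE_ROWS:
            raise ValueError("row group size must be at least 5000 rows")
        self.max_rows = size_rows

    def rows_per_group(self, avg_row_bytes: int) -> int:
        """Rows that fit in one row group when both limits apply."""
        return max(1, min(self.max_rows, self.max_bytes // avg_row_bytes))
```

Whichever limit is reached first ends the row group, so with a 512 KiB byte limit and rows averaging 1 KiB, the byte limit applies long before the row limit does.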

Other changes: 

- Fix naming in some ORC APIs to be consistent. 
- Change `rowgroup` to `row_group` in APIs, since Parquet specs refer to this as "row group", not "rowgroup". 
- Replace some `uint32_t` use in Parquet writer.
- Remove unused `target_page_size`.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Yunsong Wang (https://github.com/PointKernel)
  - Ashwin Srinath (https://github.com/shwina)

URL: #9677
@vuule
Contributor Author

vuule commented Jan 21, 2022

Reopening - page size control has not been implemented.

@vuule vuule reopened this Jan 21, 2022
@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@etseidl
Contributor

etseidl commented May 17, 2022

Hi @vuule. Controlling page size is a feature of great interest to my group. I've taken a stab at implementing this functionality. Would it be useful to submit a PR, or do you already have an implementation in mind and just need the time/priority to get it implemented?

@devavret
Contributor

@etseidl Sure, go ahead.

rapids-bot bot pushed a commit that referenced this issue May 24, 2022
This PR fixes #9615 

Adds two more parameters to the Parquet writer options objects to control page sizes. One sets a target page size in bytes (defaults to 512 KiB), and the other sets a maximum number of rows per page (defaults to 20000; see [this](https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/) for the rationale behind this choice).
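To show how a byte target and a row cap can interact, here is a minimal Python sketch that splits a column's rows into pages. `split_into_pages` and its greedy strategy are assumptions made for illustration and are not the libcudf implementation; only the two default values come from the description above.

```python
from typing import Iterable, List

# Defaults quoted in the PR description.
TARGET_PAGE_SIZE_BYTES = 512 * 1024
MAX_PAGE_SIZE_ROWS = 20000


def split_into_pages(
    row_sizes: Iterable[int],
    target_bytes: int = TARGET_PAGE_SIZE_BYTES,
    max_rows: int = MAX_PAGE_SIZE_ROWS,
) -> List[int]:
    """Return the row count of each page (hypothetical greedy split)."""
    pages: List[int] = []
    rows_in_page = 0
    bytes_in_page = 0
    for size in row_sizes:
        # Close the current page if this row would break either limit.
        # A single oversized row still gets a page of its own.
        if rows_in_page > 0 and (
            rows_in_page >= max_rows or bytes_in_page + size > target_bytes
        ):
            pages.append(rows_in_page)
            rows_in_page = 0
            bytes_in_page = 0
        rows_in_page += 1
        bytes_in_page += size
    if rows_in_page > 0:
        pages.append(rows_in_page)
    return pages


# Narrow rows hit the row cap; wide rows hit the byte target.
print(split_into_pages([8] * 50000))    # [20000, 20000, 10000]
print(split_into_pages([1024] * 1000))  # [512, 488]
```

In this sketch the byte value is a target rather than a hard cap: a row larger than the target still gets a page to itself instead of being rejected.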

~~This also removes the validation logic (and unit tests) from the row_group_size setters, since the page size is no longer fixed.  Perhaps validation could be moved to the build() function.~~

I've tried to follow the naming convention from #9677

Still needs help with Python bindings and unit tests.

Authors:
  - https://github.com/etseidl

Approvers:
  - Devavret Makkar (https://github.com/devavret)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Bradley Dice (https://github.com/bdice)

URL: #10882
3 participants