-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Variable fragment sizes for Parquet writer #12685
Variable fragment sizes for Parquet writer #12685
Conversation
Pull requests from external contributors require approval from a |
This PR retains the performance gains seen in #12627 for LIST benchmarks. 23.04:
This PR:
|
For parquet_write_io_compression
This PR:
|
fa06424
to
b8bb1dd
Compare
LIST write benchmark got even better with the 4 fragments/page change, but this is due to allowing more parallelism in page encoding and compression due to simply adding more pages of smaller size. Kind of a cheat 😅
|
c674c23
to
138d5f4
Compare
Co-authored-by: Vukasin Milovanovic <[email protected]>
Co-authored-by: Vukasin Milovanovic <[email protected]>
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good, thank you for implementing this!
It's complicated, but not as much as it's beneficial :)
Thanks for all the help! Going to release it into the wild now :D |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like the use of std::optional
in here over a magic default value. Some comments, but mostly nits.
/ok to test |
/ok to test |
/merge |
Fixes #12867. Bug introduced in #12685. A calculation of total bytes in a column was returned in a 32-bit `size_type` rather than 64-bit `size_t` leading to overflow for tables with many millions of rows. Authors: - Ed Seidl (https://github.com/etseidl) - Vukasin Milovanovic (https://github.com/vuule) - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - Karthikeyan (https://github.com/karthikeyann) - GALI PREM SAGAR (https://github.com/galipremsagar) URL: #12870
…ragment size (#13806) #12685 introduced a bug in page calculation. If the `max_page_size_rows` parameter is set smaller than the page fragment size, the writer will produce a spurious empty page. This PR fixes this by only checking the fragment size if there are already rows in the page, and then returns the old check for number of rows exceeding the page limit. Interestingly, libcudf can read these files with empty pages just fine, but parquet-mr cannot. Authors: - Ed Seidl (https://github.com/etseidl) Approvers: - Nghia Truong (https://github.com/ttnghia) - Vukasin Milovanovic (https://github.com/vuule) URL: #13806
Description
Fixes #12613
This PR adds the ability for columns to have different fragment sizes. This allows a large fragment size for narrow columns, but allows for finer grained fragments for very wide columns. This change should make wide columns fit (approximately) within page size constraints, and should help with compressors that rely on pages being under a certain threshold (i.e. Zstandard).
Checklist