Fix maximum page size estimate in Parquet writer #11962
Conversation
-      page_size =
-        1 + 5 + ((values_in_page * ck_g.dict_rle_bits + 7) >> 3) + (values_in_page >> 8);
+      // Additional byte to store entry bit width
+      page_size = 1 + max_RLE_page_size(ck_g.dict_rle_bits, values_in_page);
Based on the Apache Parquet format docs:
Data page format: the bit width used to encode the entry ids is stored as 1 byte (max bit width = 32), followed by the values encoded using the RLE/Bit-packed hybrid described above (with the given bit width).
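A minimal sketch of that layout and the resulting size bound, assuming the hybrid RLE/bit-packed encoding from the spec. The names (`max_rle_payload_bytes`, `max_dict_data_page_bytes`) are hypothetical and this is not the cudf implementation of `max_RLE_page_size`; it only illustrates the terms the diff above combines (bit-packed payload, a header allowance, and one leading byte for the entry bit width):

```cpp
#include <cstdint>

// Upper bound on the RLE/bit-packed payload for num_values entries at
// value_bit_width bits each. Bit-packed groups hold a multiple of 8 values,
// so the value count is padded up before converting bits to bytes; the header
// allowance (up to a 5-byte run-length varint plus roughly one byte per 256
// values) mirrors the constants in the original formula above.
constexpr uint32_t max_rle_payload_bytes(uint8_t value_bit_width, uint32_t num_values)
{
  uint32_t const padded_values = (num_values + 7u) & ~7u;
  uint32_t const packed_bytes  = (padded_values * value_bit_width + 7u) / 8u;
  uint32_t const header_bytes  = 5u + (num_values >> 8);
  return packed_bytes + header_bytes;
}

// Per the spec quote, a dictionary-encoded data page also stores the entry bit
// width in one leading byte, hence the "1 +" in the updated estimate.
constexpr uint32_t max_dict_data_page_bytes(uint8_t value_bit_width, uint32_t num_values)
{
  return 1u + max_rle_payload_bytes(value_bit_width, num_values);
}
```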
This is the bug fix and the rest of the diff is refactoring, right? Is a C++ test or Python comparison to another reader/writer needed?
Technically there is already a gtest for this. The error was an OOB write and only manifested under memcheck without an RMM pool, which is run nightly.
Good call. Approving.
Codecov Report: Base: 88.09% // Head: 88.14% // Increases project coverage by +0.04%.
Additional details and impacted files

@@             Coverage Diff              @@
##           branch-22.12   #11962    +/-  ##
=============================================
+ Coverage        88.09%   88.14%   +0.04%
=============================================
  Files              133      133
  Lines            21905    21982      +77
=============================================
+ Hits             19298    19376      +78
+ Misses            2607     2606       -1
@@ -217,6 +217,14 @@ __global__ void __launch_bounds__(128)
   if (frag_id < num_fragments_per_column and lane_id == 0) groups[column_id][frag_id] = *g;
 }

+constexpr uint32_t max_RLE_page_size(uint8_t value_bit_width, uint32_t num_values)
Nice refactoring here.
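As a quick compile-time sanity check of the sketch above (the sizes are made up for illustration, not taken from the PR or its tests):

```cpp
// 20,000 indices at 12 bits occupy 30,000 payload bytes before headers,
// so the bound must cover at least that plus the leading bit-width byte.
static_assert(max_dict_data_page_bytes(12, 20'000) >= 1u + 30'000u,
              "bound must cover the bit-packed payload");
// 1,000 values at 1 bit pack into 125 bytes of payload.
static_assert(max_dict_data_page_bytes(1, 1'000) >= 1u + 125u,
              "bound must cover the bit-packed payload");
```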
@gpucibot merge
Description
Closes #11916
CUDA memcheck reports an OOB write in one of the tests. The root cause is an under-allocated buffer for encoded pages.
This PR fixes the computation of the maximum size of data pages (RLE encoded) when dictionary encoding is used.
Other changes:
Refactored max RLE page size computation to avoid code repetition.
Use the actual dictionary index width instead of the (outdated) worst case (a sketch of this computation follows below).
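A sketch of what computing the actual index width might look like; the helper name is hypothetical and this is not the cudf code, it only illustrates why using the real dictionary size tightens the estimate:

```cpp
#include <cstdint>

// Smallest bit width that can represent indices into a dictionary with
// num_entries distinct values (indices range over [0, num_entries - 1]).
constexpr uint8_t dict_index_bit_width(uint32_t num_entries)
{
  uint8_t bits = 0;
  while (num_entries > 1u && ((num_entries - 1u) >> bits) != 0u) { ++bits; }
  return bits;
}

// Example: 1,000 dictionary entries need only 10-bit indices, far below a
// fixed 16- or 32-bit worst case, so the page size estimate shrinks accordingly.
static_assert(dict_index_bit_width(1'000) == 10, "10 bits cover indices 0..999");
static_assert(dict_index_bit_width(2) == 1, "1 bit covers indices 0..1");
```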
Checklist