Fix page size estimation in Parquet writer #13364
Conversation
// a "worst-case" for page size estimation in the presence of a dictionary. have confirmed | ||
// that this fails for values over 16 in the final term of `max_RLE_page_size()`. | ||
auto elements0 = cudf::detail::make_counting_transform_iterator(0, [](auto i) { | ||
if ((i / 5) % 2 == 1) { |
why is the pattern changed every five elements?
It worked 🤣 This was totally empirical...I changed the divisor until I started getting overwrites in the page encoder. Printing out the run lengths, it seemed to give literal runs of 8 followed by repeated runs of 2. Pretty close to ideal for the purposes of the test.
it seemed to give literal runs of 8 followed by repeated runs of 2
Can you please add this info to the comment above?
Thinking seriously about it, the pattern is CCCCCRRRRRCCCCCRRRRR..., where C changes and R is repeating. The encoder will take the first CCCCCRRR and emit a literal run of 8, followed by a repeated run of RR. Then this repeats with another CCCCCRRR, RR. I'd like to say this was rationally designed, but I got lucky 😄
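For anyone who wants to see the pattern concretely, here is a minimal standalone sketch (not code from the PR; kRepeat is a placeholder, since the body of the test lambda is not shown in the excerpt above) that generates the same index pattern and prints the raw runs of identical values:

#include <cstddef>
#include <cstdio>
#include <vector>

int main()
{
  constexpr int kRepeat = -1;  // placeholder for the repeating "R" value
  std::vector<int> values;
  for (int i = 0; i < 30; ++i) {
    // same predicate as the test lambda: every other group of 5 indices repeats
    values.push_back(((i / 5) % 2 == 1) ? kRepeat : i);
  }

  // print maximal runs of identical values: expect 5 changing values, then a run of 5, ...
  for (std::size_t pos = 0; pos < values.size();) {
    std::size_t len = 1;
    while (pos + len < values.size() && values[pos + len] == values[pos]) { ++len; }
    std::printf("value %d repeats %zu time(s)\n", values[pos], len);
    pos += len;
  }
  return 0;
}

Per the comment above, the page encoder then splits this stream into literal runs of 8 followed by repeated runs of 2, which is close to the worst case for the header-overhead term.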
/ok to test
/ok to test
Great detective work, thank you for the fix!
I've been thinking about it some more, and for small dictionary bitwidths I think this will still have issues. The real size for this 8/2 case is going to be [...]. Going to switch to draft until I can pin this down... don't want to have to hunt this down again 😅
I did more testing, and when trying to get to low enough bitwidths to trigger problems, the encoder wound up picking different splits, so the current formula was still an overestimation. I'm going to leave [...]
Do you have any benchmark for this fix?
/ok to test
I'm curious too what the downstream effects are of this. Is it something along the lines of: [...]?
By this point the decision to use a dictionary has been made. This is just calculating the size of the buffer needed to hold the encoded data. For the given data, the size estimate is too small, so the encoder overruns the buffer and injects bad keys into a different page's buffer. This is just a more conservative estimate of the space needed to encode. I'd expect a small increase in peak memory, but will be surprised if it's major. More like a few hundred bytes per page.
Ah, I see. Thanks for the explanation.
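For a rough sense of scale (these numbers are illustrative, not from the PR): for a hypothetical page holding 5,000 dictionary indices, the old term reserves 5000 / 256 ≈ 20 overhead bytes while the new term reserves 5000 / 8 = 625, so the estimate grows by roughly 600 bytes per page, consistent with the "few hundred bytes per page" expectation above.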
Here's the peak mem from nvbench where there was a difference. The overhead was a bit more than I thought, but still just under 2% overall.
cpp approval.
/merge
Description
Fixes #13250. The page size estimation for dictionary-encoded pages adds a term to estimate overhead bytes for the bit-packed-header used when encoding bit-packed literal runs. This term originally used a value of 256, but it's hard to see where that value comes from. This PR changes the value to 8, with a possible justification being that the minimum length of a literal run is 8 values. The worst case would be multiple runs of 8, with the required overhead bytes then being num_values/8. This also adds a test that has been verified to fail for values larger than 16 in the problematic term.
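To make the reasoning concrete, here is a simplified sketch of the changed term; this is not the actual max_RLE_page_size() implementation in libcudf, only an illustration of the header-overhead assumption before and after the fix (the data-size term and the example page parameters below are assumptions for illustration):

#include <cstddef>
#include <cstdio>

// Rough page size for num_values dictionary indices of bit_width bits each,
// plus one bit-packed-header byte per values_per_header values.
std::size_t rle_size_estimate(std::size_t num_values,
                              std::size_t bit_width,
                              std::size_t values_per_header)
{
  std::size_t const data_bytes   = (num_values * bit_width + 7) / 8;
  std::size_t const header_bytes = (num_values + values_per_header - 1) / values_per_header;
  return data_bytes + header_bytes;
}

int main()
{
  std::size_t const num_values = 20'000;
  std::size_t const bit_width  = 12;
  // before the fix: one overhead byte assumed per 256 values
  std::printf("old estimate: %zu bytes\n", rle_size_estimate(num_values, bit_width, 256));
  // after the fix: literal runs can be as short as 8 values, so reserve one
  // overhead byte per 8 values (num_values / 8 extra bytes in the worst case)
  std::printf("new estimate: %zu bytes\n", rle_size_estimate(num_values, bit_width, 8));
  return 0;
}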