Fixed reading parquet binary dict page #791
Conversation
```diff
-let length = std::cmp::min(pack_size, pack_remaining);
-let additional = remaining.min(length);
+let additional = pack_size.min(remaining);
```
Using the full page length instead of the remaining length is one of the bugs.
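For context, here is a minimal sketch of the invariant the fix restores; all names (`decode_packs`, `num_values`, `pack_size`) are hypothetical and this is not arrow2's actual code. Each bitpacked pack carries a fixed number of slots, so the decoder has to clamp each pack to the values still owed, otherwise the final pack's padding leaks into the output:

```rust
// Hypothetical sketch: decode fixed-size packs into at most `num_values` items.
fn decode_packs(num_values: usize, pack_size: usize) -> Vec<usize> {
    let mut decoded = Vec::with_capacity(num_values);
    let mut remaining = num_values;
    while remaining > 0 {
        // A pack always holds `pack_size` slots, but only `additional`
        // of them are real values; the rest is padding.
        let additional = pack_size.min(remaining);
        decoded.extend(0..additional); // stand-in for unpacking indices
        remaining -= additional;
    }
    decoded
}

fn main() {
    // 10 values in packs of 8: the last pack contributes only 2 values.
    assert_eq!(decode_packs(10, 8).len(), 10);
}
```

Clamping against the full page length instead would let the final pack claim all `pack_size` slots and overshoot.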
```diff
 let values_iterator = values_iter(indices_buffer, dict, additional);

 let mut validity_iterator = hybrid_rle::Decoder::new(validity_buffer, 1);

 extend_from_decoder(
     validity,
     &mut validity_iterator,
-    length,
+    additional,
```
Parameter mixup: `extend_from_decoder` expects the additional length, but this was passing the total length.
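To make the contract concrete, here is a hedged sketch; `extend_validity`, the plain `Vec<bool>`, and the iterator are simplified stand-ins for arrow2's `extend_from_decoder`, `MutableBitmap`, and `hybrid_rle::Decoder`, not their real signatures. The helper must be handed the number of new bits to append, not the running total:

```rust
// Hypothetical helper: append exactly `additional` validity bits.
fn extend_validity(
    validity: &mut Vec<bool>,
    bits: &mut impl Iterator<Item = bool>,
    additional: usize,
) {
    validity.extend(bits.take(additional));
}

fn main() {
    let mut validity = vec![true; 6]; // bits decoded from earlier pages
    let mut decoder = std::iter::repeat(true); // stand-in for the RLE decoder
    let total = 10; // total length of the array after this page
    let additional = total - validity.len(); // only 4 new bits are owed

    extend_validity(&mut validity, &mut decoder, additional);
    // Passing `total` here instead would have appended 10 extra bits.
    assert_eq!(validity.len(), total);
}
```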
FWIW, on the surface this sounds similar to apache/arrow-rs#1111.
Great fix! Thanks!

My understanding is that for nested types the length passed to these functions refers to the number of rows, not the number of values, which overshoots the number of values on the primitive array. This has shown up some times before.

I suggest we move the `debug_assert!`s to the non-nested branch.
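A rough sketch of what scoping the checks to the flat branch could look like; every name here is invented for illustration and this is not the PR's actual code. The assert only holds where one decoded slot corresponds to one leaf value:

```rust
// Hypothetical: extend leaf values from a page, asserting only when flat.
fn extend_page(
    values: &mut Vec<u32>,
    page_values: &[u32],
    additional: usize,
    is_nested: bool,
) {
    let before = values.len();
    values.extend_from_slice(&page_values[..additional.min(page_values.len())]);
    if !is_nested {
        // Only valid for non-nested columns: one slot == one value.
        debug_assert_eq!(values.len(), before + additional);
    }
}

fn main() {
    let mut values = Vec::new();
    extend_page(&mut values, &[1, 2, 3, 4], 4, false);
    assert_eq!(values, vec![1, 2, 3, 4]);
}
```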
Codecov Report

```diff
@@            Coverage Diff             @@
##             main     #791      +/-   ##
==========================================
+ Coverage   71.05%   71.07%   +0.02%
==========================================
  Files         319      319
  Lines       16667    16672       +5
==========================================
+ Hits        11843    11850       +7
+ Misses       4824     4822       -2
```

Continue to review the full report at Codecov.
Got it, thanks for the explanation. I've gone ahead and moved the checks to within the non-nested branch.
@jorgecarleitao one thing that occurred to me after having some time to digest your explanation: if the length/capacity as used in those methods isn't accurate for nested columns, then the capacity reservations for the data and validity buffers are going to be too large. Obviously won't cause bugs or errors, but perhaps something that could be optimized.
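A toy illustration of that over-reservation, with made-up numbers (not measured from arrow2):

```rust
fn main() {
    let levels = 1_000usize; // length reported at the nested level
    let leaf_values = 600usize; // actual values on the primitive array
    // Reserving by the level count over-allocates the leaf buffer.
    let mut buf: Vec<u32> = Vec::with_capacity(levels);
    buf.extend(0..leaf_values as u32);
    assert!(buf.capacity() >= levels && buf.len() == leaf_values);
}
```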
And thanks for the quick help on this! 🙏
As described in #790, there is a bug in parquet binary dict page decoding whereby too many values are yielded due to a mixup in the length arguments passed to various decoding functions, ultimately leading to extra bitpacked values being included in the output array. This PR fixes the bug, and adds some `debug_assert!`s in key places to ensure that tests would catch the issue in the future. I went ahead and added similar debug asserts to the other page type read functions, which exposed a few places where validity buffers are being allocated for non-nullable columns.

Unfortunately, these new debug asserts are firing for the nested page type tests, but I don't currently understand how the encoding works for these page types (it looks significantly more complicated than RLE & bitpack). I suspect that these test failures are a true positive, but I can't be sure without understanding how nested encoding works.
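For illustration, here is a hedged sketch of the kind of checks described above; the function, its signature, and the buffer types are invented for this example, and arrow2's actual call sites differ:

```rust
// Hypothetical page reader: decode `additional` dictionary indices and
// check that values and validity stay in lockstep.
fn read_dict_page(
    indices: &[u32],
    additional: usize,
    values: &mut Vec<u32>,
    validity: &mut Vec<bool>,
) {
    let start = values.len();
    for &idx in indices.iter().take(additional) {
        values.push(idx); // stand-in for the dictionary lookup
        validity.push(true);
    }
    // The kind of invariants the new debug_assert!s enforce:
    debug_assert_eq!(values.len() - start, additional);
    debug_assert_eq!(values.len(), validity.len());
}

fn main() {
    let (mut values, mut validity) = (Vec::new(), Vec::new());
    read_dict_page(&[7, 8, 9], 2, &mut values, &mut validity);
    assert_eq!(values, vec![7, 8]);
}
```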