[BUG] read_orc reads incorrect data on one row #5324
Comments
From the expanded data in the Slack thread, it looks like this is perhaps a corner case in RLE patched-base or delta mode.
I can reproduce this. After extracting stripe #26 (only 331 rows), there is one stray value that is very different from the others (which points at RLE):
Weird, it seems to be the 2nd value of a 6-value RLEv2-mode1 sequence.
Alrighty, we found the reason for the bug. Arrow ORC actually behaves differently than its docs describe. 😡 In the case of patched RLE, the size of each patch + patch gap entry is not actually equal to patch gap width + patch width as defined in the header. That sum is rounded up to a bit size defined here: https://github.com/apache/orc/blob/9c65bc116a2208d5d729687b7744b2ffbd72fea0/c%2B%2B/src/RLEV2Util.cc#L30-L37. This mapping doesn't seem to be in the documentation at all.
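To make the rounding described in this comment concrete, here is a minimal Python sketch. The function names `closest_fixed_bits` and `patch_entry_bits` are hypothetical, and the thresholds are an assumption modeled on the linked ORC C++ helper rather than a verified copy of it, so treat the linked source as the authority. The header values in the example are made up purely to show a mismatch.

```python
def closest_fixed_bits(n: int) -> int:
    """Round a bit count up to the nearest width assumed to be packable.

    Assumption: widths up to 24 are kept as-is, larger ones snap to
    26, 28, 30, 32, 40, 48, 56 or 64 (modeled on the linked ORC helper).
    """
    if n <= 0:
        return 1
    if n <= 24:
        return n
    for width in (26, 28, 30, 32, 40, 48, 56):
        if n <= width:
            return width
    return 64


def patch_entry_bits(patch_gap_width: int, patch_width: int) -> tuple:
    """Return (naive, rounded) bit sizes of one patch-list entry."""
    naive = patch_gap_width + patch_width  # what the header fields suggest
    return naive, closest_fixed_bits(naive)  # what the rounding would yield


# Hypothetical header values, chosen only to show a mismatch.
naive, rounded = patch_entry_bits(patch_gap_width=3, patch_width=22)
print(naive, rounded)  # prints "25 26"
```

Under this sketch, a reader that unpacks patch entries at the naive width would misalign whenever the sum lands between the fixed widths, while sums of 24 bits or fewer decode the same either way. That may be why only a rare row was affected.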
Good job investigating. File an Apache issue?
Describe the bug
When reading data via cudf.read_orc, one of the rows seems to have incorrect data.
Steps/Code to reproduce bug
Expected behavior
Value of _col2 at index location 3874 should be 927000494.
Environment overview (please complete the following information)
Additional context
Data here: row_bug.orc.zip
Thanks @trstovall for pointing out the issue.
cc: @OlivierNV