[BUG] read_orc reads incorrect data on one row #5324

Closed
ayushdg opened this issue May 29, 2020 · 5 comments · Fixed by #5473
Labels
bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code.

Comments

@ayushdg
Member

ayushdg commented May 29, 2020

Describe the bug
When reading data via cudf.read_orc one of the rows seems to have incorrect data.

Steps/Code to reproduce bug

import cudf

df = cudf.read_orc("row_bug.orc")
df.iloc[3874]

_col0     10100163
_col1         4125
_col2    637000238
Name: 3874, dtype: int32

Expected behavior

import pyarrow.orc as orc
pdf = orc.ORCFile("row_bug.orc").read().to_pandas()
pdf.iloc[3874]

_col0     10100163
_col1         4125
_col2    927000494
Name: 3874, dtype: int32

Value of _col2 at index location 3874 should be 927000494
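A minimal sketch of how such a mismatch could be located programmatically rather than by eye. It assumes both readers' results have already been converted to plain per-column Python lists (for example from the pandas frames above). The helper name `find_mismatches` and the two-row sample are hypothetical; only the differing `_col2` values are taken from this report.

```python
def find_mismatches(actual, expected):
    """Return (row, column, actual_value, expected_value) for every differing cell.

    Both arguments are dicts mapping a column name to a list of values.
    """
    mismatches = []
    for col, expected_values in expected.items():
        for row, (a, e) in enumerate(zip(actual[col], expected_values)):
            if a != e:
                mismatches.append((row, col, a, e))
    return mismatches


# Toy two-row sample: row 1 reuses the values reported for row 3874.
gpu_result = {"_col1": [4125, 4125], "_col2": [347000041, 637000238]}
cpu_result = {"_col1": [4125, 4125], "_col2": [347000041, 927000494]}

for row, col, got, want in find_mismatches(gpu_result, cpu_result):
    print(f"row {row}, {col}: got {got}, expected {want}")
```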

Environment overview (please complete the following information)

  • Method of cuDF install: Conda
    • 0.14 nightly on May 29

Additional context
Data here: row_bug.orc.zip

Thanks @trstovall for pointing out the issue.

cc: @OlivierNV

@ayushdg ayushdg added bug Something isn't working Needs Triage Need team to review and classify labels May 29, 2020
@ayushdg ayushdg added the cuIO cuIO issue label May 29, 2020
@OlivierNV
Contributor

From the expanded data in the slack thread, it looks like perhaps a corner-case in RLE patched-base or delta mode.

@OlivierNV
Contributor

I can reproduce this. After extracting stripe #26 (only 331 rows), there is one stray value that is very different from the others, which points at RLE:

[324] 347000041 (0x14aecce9)
[325] 347000011 (0x14aecccb)
[326] 927000463 (0x3740e78f)
[327] 347000002 (0x14aeccc2)
[328] 347000003 (0x14aeccc3)
[329] 347000036 (0x14aecce4)
[330] 347000041 (0x14aecce9)

@OlivierNV
Contributor

OlivierNV commented May 29, 2020

Weird, it seems to be the 2nd value of a 6-value RLEv2-mode1 sequence.
[Edit] Weirder: I'm seeing the value as 927000463, which appears to be the correct value (not sure if the bug description matches the file). I also tracked this all the way into snappy to exclude decompression errors: the decompressor backreferences these 927000463 values from earlier occurrences in the file (they appear periodically), which seems to be consistent.
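For context on the "mode" numbering: in ORC's integer RLEv2 encoding, the top two bits of the first header byte of each run select the sub-encoding (0 = short repeat, 1 = direct, 2 = patched base, 3 = delta). A small sketch of that lookup follows. The function name `rle_v2_mode` is hypothetical and this is not the cuDF decoder's code.

```python
# Sub-encoding selected by the top two bits of an RLEv2 run's first header byte.
RLE_V2_MODES = {
    0: "SHORT_REPEAT",
    1: "DIRECT",
    2: "PATCHED_BASE",
    3: "DELTA",
}


def rle_v2_mode(first_header_byte: int) -> str:
    """Return the RLEv2 sub-encoding name for a run's first header byte."""
    return RLE_V2_MODES[(first_header_byte >> 6) & 0x3]


# 0x5e is 0b01011110, so the top two bits are 0b01 (mode 1).
print(rle_v2_mode(0x5E))
```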

@kkraus14 kkraus14 added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Jun 5, 2020
@devavret
Contributor

Alrighty, we found the reason for the bug. Arrow ORC actually behaves differently from its docs. 😡 In patched-base RLE, the bit size of each patch list entry (patch gap + patch) is not simply the patch gap width + patch width defined in the header. It's rounded up to one of a fixed set of bit sizes defined here: https://github.com/apache/orc/blob/9c65bc116a2208d5d729687b7744b2ffbd72fea0/c%2B%2B/src/RLEV2Util.cc#L30-L37. This mapping doesn't seem to be in the documentation at all.
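A rough sketch of the rounding described above. The list of supported widths is an assumption based on my recollection of the ORC reference reader's "closest fixed bits" helper and should be verified against the linked source. The names `closest_fixed_bits` and `patch_entry_bits` are illustrative, not cuDF or ORC API.

```python
# Assumed set of bit widths the ORC reference reader packs values into:
# every width from 1 to 24, then 26, 28, 30, 32, 40, 48, 56 and 64.
SUPPORTED_WIDTHS = list(range(1, 25)) + [26, 28, 30, 32, 40, 48, 56, 64]


def closest_fixed_bits(n: int) -> int:
    """Round a bit count up to the nearest supported width (capped at 64)."""
    for width in SUPPORTED_WIDTHS:
        if n <= width:
            return width
    return 64


def patch_entry_bits(patch_width: int, patch_gap_width: int) -> int:
    """Bits occupied by one patch list entry (gap + patch) after rounding."""
    return closest_fixed_bits(patch_width + patch_gap_width)


# A 22-bit patch with a 3-bit gap is 25 bits by the header arithmetic,
# but each entry would be stored in 26 bits under this mapping.
print(patch_entry_bits(22, 3))
```

A decoder that steps through the patch list using the unrounded sum would drift out of alignment after the first entry whenever the sum is not itself a supported width, which is consistent with a single wrong value appearing in an otherwise correct run.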

@harrism
Member

harrism commented Jun 11, 2020

Good job investigating. File an Apache issue?
