[BUG] cudf.read_orc reads incorrect data for one row #5440

Closed
trstovall opened this issue Jun 10, 2020 · 4 comments · Fixed by #5473
Labels: bug (Something isn't working), cuIO (cuIO issue)

Comments

@trstovall
Contributor

Describe the bug
After a DataFrame is written with to_orc and read back with cudf.read_orc, one row of the upc_nbr column comes back with the wrong value.

Steps/Code to reproduce bug

>>> import cudf
>>> df = cudf.read_orc('to_orc_bug.orc')
>>> df.upc_nbr[(df.visit_nbr == 14600028) & (df.store_nbr == 47)] = 681131184420
>>> df.to_orc('to_orc_bug.orc', compression='snappy')
>>> df2 = cudf.read_orc('to_orc_bug.orc')
>>> df2.upc_nbr[(df2.visit_nbr == 14600028) & (df2.store_nbr == 47)]
999786    2526351652

Expected behavior
Returned value should be 681131184420, not 2526351652.

Environment overview

  • Method of cuDF install: Conda
  • cuDF version: 0.14 nightly from around May 29

to_orc_bug.orc.zip

@trstovall added the "Needs Triage" and "bug" labels on Jun 10, 2020
@kkraus14 added the "cuIO" label and removed the "Needs Triage" label on Jun 10, 2020
@kkraus14
Collaborator

cc @devavret in case #5324 is related

@devavret
Contributor

devavret commented Jun 11, 2020

Tried to read with pyarrow and it works.

import cudf
import pyarrow.orc as orc

df = cudf.read_orc("to_orc_bug.orc")
df.upc_nbr[(df.visit_nbr == 14600028) & (df.store_nbr == 47)] = 681131184420
df.to_orc("to_orc_bug2.orc", compression="snappy")

pdf = orc.ORCFile("to_orc_bug2.orc").read().to_pandas()
# Select the upc_nbr column so the printed Series matches the output below
print(pdf.upc_nbr[(pdf.visit_nbr == 14600028) & (pdf.store_nbr == 47)])
999786    6.811312e+11
Name: upc_nbr, dtype: float64

Seems to be a reader issue.

@kkraus14 changed the title from "[BUG] cudf.to_orc writes incorrect data for one row" to "[BUG] cudf.read_orc reads incorrect data for one row" on Jun 11, 2020
@kkraus14
Collaborator

Relabeled issue as such

@devavret
Contributor

This was an easy fix but I'm still trying to figure out how to properly add tests for this, or in general, anything in cuIO.

devavret added a commit to devavret/cudf that referenced this issue Jun 15, 2020
Fixes the narrowing conversion in bytestream reading in patched RLE
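
For anyone landing here later: the two values in the report are consistent with a 64-bit to 32-bit narrowing, because the incorrect value is exactly the expected value with everything above the low 32 bits dropped. The sketch below only checks that arithmetic. It is not cuDF's reader code; the helper names, the big-endian byte order, and the accumulator widths are assumptions made for illustration.

```python
# Illustration only: this is NOT the cuDF ORC reader, just the arithmetic
# behind the two values reported in this issue.

EXPECTED = 681131184420  # value written to the file
OBSERVED = 2526351652    # value returned by cudf.read_orc before the fix


def narrow_to_uint32(value: int) -> int:
    """Keep only the low 32 bits, as a conversion to a 32-bit unsigned type would."""
    return value & 0xFFFFFFFF


def read_big_endian(data: bytes, accumulator_bits: int) -> int:
    """Assemble bytes into an integer using an accumulator of a fixed width.

    Hypothetical helper: the byte order and widths are assumptions,
    not taken from the cuDF source.
    """
    mask = (1 << accumulator_bits) - 1
    result = 0
    for byte in data:
        # Bits shifted past the accumulator width are silently lost.
        result = ((result << 8) | byte) & mask
    return result


raw = EXPECTED.to_bytes(5, "big")  # the expected value needs 40 bits

print(narrow_to_uint32(EXPECTED) == OBSERVED)  # direct truncation
print(read_big_endian(raw, 32) == OBSERVED)    # too-narrow accumulator
print(read_big_endian(raw, 64) == EXPECTED)    # wide-enough accumulator
```

All three comparisons print True: the value needs 40 bits, so a 32-bit intermediate loses the top byte and yields 2526351652, while a 64-bit intermediate preserves 681131184420.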