[BUG] cudf.read_orc reads incorrect data for one row #5440

trstovall · 2020-06-10T20:25:03Z

Describe the bug
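After a DataFrame is written with to_orc (snappy compression) and read back with cudf.read_orc, one row of the upc_nbr column comes back as 2526351652 instead of the 681131184420 that was written.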

Steps/Code to reproduce bug

>>> import cudf
>>> df = cudf.read_orc('to_orc_bug.orc')
>>> df.upc_nbr[(df.visit_nbr == 14600028) & (df.store_nbr == 47)] = 681131184420
>>> df.to_orc('to_orc_bug.orc', compression='snappy')
>>> df2 = cudf.read_orc('to_orc_bug.orc')
>>> df2.upc_nbr[(df2.visit_nbr == 14600028) & (df2.store_nbr == 47)]
999786    2526351652

Expected behavior
Returned value should be 681131184420, not 2526351652.

Environment overview (please complete the following information)

Method of cuDF install: Conda
cuDF version: 0.14 nightly from around May 29

to_orc_bug.orc.zip


kkraus14 · 2020-06-10T20:28:44Z

cc @devavret in case #5324 is related

devavret · 2020-06-11T00:05:01Z

Tried to read with pyarrow and it works.

import cudf
import pyarrow.orc as orc

df = cudf.read_orc("to_orc_bug.orc")
df.upc_nbr[(df.visit_nbr == 14600028) & (df.store_nbr == 47)] = 681131184420
df.to_orc("to_orc_bug2.orc", compression="snappy")

pdf = orc.ORCFile("to_orc_bug2.orc").read().to_pandas()
print(pdf.upc_nbr[(pdf.visit_nbr == 14600028) & (pdf.store_nbr == 47)])

999786    6.811312e+11
Name: upc_nbr, dtype: float64

Seems to be a reader issue.

kkraus14 · 2020-06-11T01:07:07Z

Relabeled issue as such

devavret · 2020-06-11T19:22:43Z

This was an easy fix but I'm still trying to figure out how to properly add tests for this, or in general, anything in cuIO.

Fixes the narrowing conversion in bytestream reading in patched RLE
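
A quick arithmetic check ties the reported numbers to the narrowing conversion mentioned above. This is only a sketch in plain Python, not the cuIO reader code: it assumes the narrowing drops everything above the low 32 bits, which the thread does not state explicitly. The two constants come from the bug report; the variable names are made up for illustration.

```python
# Values taken from the bug report.
expected = 681131184420   # value written to the ORC file
observed = 2526351652     # value returned by cudf.read_orc

# Assumption: the narrowing conversion keeps only the low 32 bits.
truncated = expected & 0xFFFFFFFF

print(expected.bit_length())  # 40, so the value does not fit in 32 bits
print(truncated)              # 2526351652
print(truncated == observed)  # True
```

Under that assumption the incorrect value is exactly the expected value modulo 2**32, which is consistent with a 64-bit value being narrowed to 32 bits somewhere in the read path.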

trstovall added the bug and Needs Triage labels Jun 10, 2020

kkraus14 added the cuIO label and removed the Needs Triage label Jun 10, 2020

kkraus14 changed the title from "[BUG] cudf.to_orc writes incorrect data for one row" to "[BUG] cudf.read_orc reads incorrect data for one row" Jun 11, 2020

harrism assigned devavret Jun 11, 2020

devavret added a commit to devavret/cudf that referenced this issue Jun 15, 2020

Fix issue rapidsai#5440

1e30d8a

Fixes the narrowing conversion in bytestream reading in patched RLE

devavret mentioned this issue Jun 15, 2020

[WIP] Fix orc reader RLEv2 reader #5473

Merged

devavret closed this as completed in #5473 Jun 22, 2020
