[BUG] read_orc_statistics incorrectly reads missing statistics values as 0 #9182

Closed
ayushdg opened this issue Sep 7, 2021 · 0 comments · Fixed by #9183
Assignees
Labels
bug Something isn't working cuIO cuIO issue

Comments

@ayushdg
Member

ayushdg commented Sep 7, 2021

Describe the bug
When reading ORC statistics with cudf.io.orc.read_orc_statistics, attributes that are missing from a column's statistics (such as sum) are incorrectly reported as 0 instead of being marked as absent.

Steps/Code to reproduce bug

import cudf
import numpy as np
import pyorc

# Largest int64 value; adding 1 to it overflows an int64 sum.
maxint = np.iinfo(np.int64).max

# Write two rows so that the sum of column "a" (maxint + 1) overflows int64.
with open("numeric_overflow.orc", "wb") as f:
    with pyorc.Writer(f, "struct<a:int,b:int>") as orc_writer:
        orc_writer.write((maxint, 1))
        orc_writer.write((1, 1))

# Read back the file-level and stripe-level statistics.
file_stats, stripe_stats = cudf.io.orc.read_orc_statistics(["numeric_overflow.orc"])

print(file_stats)
print(stripe_stats)

Output:
[{'col0': {'number_of_values': 2, 'has_null': False},
  'a': {'number_of_values': 2,
   'has_null': False,
   'minimum': 1,
   'maximum': 9223372036854775807,
   'sum': 0},
  'b': {'number_of_values': 2,
   'has_null': False,
   'minimum': 1,
   'maximum': 1,
   'sum': 2}}]


[{'col0': {'number_of_values': 2, 'has_null': False},
  'a': {'number_of_values': 2,
   'has_null': False,
   'minimum': 1,
   'maximum': 9223372036854775807,
   'sum': 0},
  'b': {'number_of_values': 2,
   'has_null': False,
   'minimum': 1,
   'maximum': 1,
   'sum': 2}}]

Expected behavior
Missing statistics values should be read as None:

 'a': {'number_of_values': 2,
   'has_null': False,
   'minimum': 1,
   'maximum': 9223372036854775807,
   'sum': None},

Environment overview (please complete the following information)

  • Environment location: bare-metal
  • Method of cuDF install: conda nightly (21.10)

Environment details
Please run and paste the output of the cudf/print_env.sh script here, to gather any other relevant environment details

Additional context
This seems to be related to protobuf default values: reading a field that was never set returns its default (0) unless field presence is checked first.

@ayushdg added the bug (Something isn't working) and Needs Triage (Need team to review and classify) labels Sep 7, 2021
@ayushdg added the cuIO (cuIO issue) label and removed the Needs Triage (Need team to review and classify) label Sep 7, 2021
@ayushdg self-assigned this Sep 7, 2021
rapids-bot bot pushed a commit that referenced this issue Sep 23, 2021
…9183)

Fixes #9182.


In cases where the `sum` statistic was not present in the ORC file for int and float columns, the value was incorrectly interpreted as 0 because protobuf returns [default](https://developers.google.com/protocol-buffers/docs/proto#optional) values for fields that are missing.

This PR adds a check for field presence before assignment.
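
A rough illustration of the difference between reading a field directly and checking presence first. This is only a sketch, not the code from the PR: `FakeIntStatistics`, `parse_naive`, and `parse_checked` are hypothetical names, and the class is a pure-Python stand-in that imitates how a proto2 message returns a default for an unset optional field and exposes `HasField` for presence checks.

```python
class FakeIntStatistics:
    """Hypothetical stand-in for a proto2 message with optional int fields."""

    _FIELDS = ("minimum", "maximum", "sum")

    def __init__(self, **present):
        unknown = set(present) - set(self._FIELDS)
        if unknown:
            raise ValueError(f"unknown fields: {sorted(unknown)}")
        self._present = dict(present)

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        # Like protobuf, return the type default (0) for an unset field.
        if name in type(self)._FIELDS:
            return self._present.get(name, 0)
        raise AttributeError(name)

    def HasField(self, name):
        # Report whether the field was explicitly set.
        if name not in self._FIELDS:
            raise ValueError(name)
        return name in self._present


def parse_naive(stats):
    # Buggy pattern: an unset field silently becomes 0.
    return {name: getattr(stats, name) for name in FakeIntStatistics._FIELDS}


def parse_checked(stats):
    # Fixed pattern: check presence before assignment, else report None.
    return {
        name: getattr(stats, name) if stats.HasField(name) else None
        for name in FakeIntStatistics._FIELDS
    }


# Column "a" from the report: minimum and maximum are set, sum is missing.
col_a = FakeIntStatistics(minimum=1, maximum=9223372036854775807)
print(parse_naive(col_a)["sum"])    # 0
print(parse_checked(col_a)["sum"])  # None
```

With the presence check, a statistic that is genuinely 0 and one that is absent can be told apart, which matches the expected behavior in the issue.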

Authors:
  - Ayush Dattagupta (https://github.com/ayushdg)

Approvers:
  - Sheilah Kirui (https://github.com/skirui-source)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Marlene  (https://github.com/marlenezw)

URL: #9183