Describe the bug
While reading ORC statistics with cuDF, attributes missing from a column's statistics (such as `sum`) are incorrectly read as 0.
Steps/Code to reproduce bug
import cudf
import pyorc
from io import BytesIO
import numpy as np

maxint = np.iinfo(np.int64).max

# Summing column a overflows int64, so its sum statistic is expected to be absent
with open("numeric_overflow.orc", "wb") as f:
    with pyorc.Writer(f, "struct<a:int,b:int>") as orc_writer:
        orc_writer.write((maxint, 1))
        orc_writer.write((1, 1))

# Read file-level and stripe-level statistics back with cuDF
file_stats, stripe_stats = cudf.io.orc.read_orc_statistics(["numeric_overflow.orc"])
print(file_stats)
print(stripe_stats)
[{'col0': {'number_of_values': 2, 'has_null': False}, 'a': {'number_of_values': 2, 'has_null': False, 'minimum': 1, 'maximum': 9223372036854775807, 'sum': 0}, 'b': {'number_of_values': 2, 'has_null': False, 'minimum': 1, 'maximum': 1, 'sum': 2}}]
[{'col0': {'number_of_values': 2, 'has_null': False}, 'a': {'number_of_values': 2, 'has_null': False, 'minimum': 1, 'maximum': 9223372036854775807, 'sum': 0}, 'b': {'number_of_values': 2, 'has_null': False, 'minimum': 1, 'maximum': 1, 'sum': 2}}]
Expected behavior
Statistics values that are missing from the file should be read as `None`, for example:
'a': {'number_of_values': 2, 'has_null': False, 'minimum': 1, 'maximum': 9223372036854775807, 'sum': None},
Environment overview (please complete the following information)
Environment details
Please run and paste the output of the `cudf/print_env.sh` script here, to gather any other relevant environment details
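Additional context
This seems to be related to protobuf default values: when a field is missing from the file, reading it without first checking for its presence returns the default value (0).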
Fix logic while parsing the sum statistic for numerical orc columns (#9183)
817c3fa

Fixes #9182. In cases where the `sum` statistic was not present in the orc file for int and float columns, the values would be incorrectly interpreted as 0 because of protobuf's [default](https://developers.google.com/protocol-buffers/docs/proto#optional) values when fields are missing. This PR adds a check for field presence before assignment.

Authors:
- Ayush Dattagupta (https://github.com/ayushdg)

Approvers:
- Sheilah Kirui (https://github.com/skirui-source)
- Vukasin Milovanovic (https://github.com/vuule)
- Marlene (https://github.com/marlenezw)

URL: #9183
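The difference between reading a field directly and checking for its presence first can be illustrated with a small sketch. This is not the code from the PR: `FakeIntStatistics`, `parse_unchecked`, and `parse_checked` are hypothetical names, and the class is a pure-Python stand-in that only mimics how a protobuf message returns a default for an unset optional field and reports presence through `HasField`.

```python
class FakeIntStatistics:
    """Hypothetical stand-in for a protobuf message with optional integer fields."""

    _FIELDS = ("minimum", "maximum", "sum")

    def __init__(self, **present):
        unknown = set(present) - set(self._FIELDS)
        if unknown:
            raise ValueError(f"unknown fields: {sorted(unknown)}")
        # Only the fields that were actually written are stored here.
        self._present = dict(present)

    def HasField(self, name):
        # Mimics protobuf's presence check for optional fields.
        return name in self._present

    def __getattr__(self, name):
        # Mimics protobuf's behavior of returning the default (0) for unset fields.
        if name in FakeIntStatistics._FIELDS:
            return self._present.get(name, 0)
        raise AttributeError(name)


def parse_unchecked(stats):
    # Buggy pattern: a missing field silently becomes 0.
    return {f: getattr(stats, f) for f in FakeIntStatistics._FIELDS}


def parse_checked(stats):
    # Fixed pattern: check presence before assignment, report None otherwise.
    return {
        f: getattr(stats, f) if stats.HasField(f) else None
        for f in FakeIntStatistics._FIELDS
    }


MAX_INT64 = 2**63 - 1
# Assumed shape of column "a" from the report: minimum and maximum set, sum absent.
overflowed = FakeIntStatistics(minimum=1, maximum=MAX_INT64)

print(parse_unchecked(overflowed))
print(parse_checked(overflowed))
```

With the presence check, a statistic that was never written stays distinguishable from one whose value is genuinely 0, which is what the expected output in the issue asks for.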