[BUG] to_orc writes incorrect sum statistics when there's an overflow #9136

Closed
ayushdg opened this issue Aug 27, 2021 · 1 comment · Fixed by #9163
Labels
bug Something isn't working cuIO cuIO issue Python Affects Python cuDF API.

Comments

@ayushdg
Member

ayushdg commented Aug 27, 2021

Describe the bug
cudf.read_orc with certain predicate filters fails in cases where the sum of column values in the column being filtered exceeds int64 limits.

Steps/Code to reproduce bug

import cudf
import numpy as np

np.iinfo(np.int64).max
df = cudf.DataFrame()
df['key'] = [np.iinfo(np.int64).max - 1, 2]
df['val'] = [1,2]
df.to_orc("predicate_overflow.orc")
df = cudf.read_orc("predicate_overflow.orc",filters=[("key", ">", 0)]) 
print(len(df)) # 0

Expected behavior
The returned dataframe doesn't lose valid rows (2 rows in this case).

Environment overview (please complete the following information)

  • Environment location: bare-metal
  • Method of cuDF install: conda

Environment details
21.10 nightly from today (Aug 27)

Additional context
Some minor debugging indicated that the logic here fails when the col_sum returned from gathering metadata is a negative value because of overflow.
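To illustrate how a sum over this column can come out negative, here is a minimal sketch using only NumPy. It does not touch cuDF or the ORC writer; it only shows that accumulating the reproducer's `key` values in a fixed-width int64 wraps around, which is presumably what produces the negative col_sum.

```python
import numpy as np

INT64_MAX = np.iinfo(np.int64).max

# Same values as the 'key' column in the reproducer above.
keys = np.array([INT64_MAX - 1, 2], dtype=np.int64)

# Fixed-width int64 accumulation wraps around silently.
wrapped_sum = int(keys.sum())

# Python ints have arbitrary precision, so this is the true sum.
exact_sum = sum(int(k) for k in keys)

print(wrapped_sum < 0)       # the int64 sum is negative
print(exact_sum > INT64_MAX) # the true sum does not fit in int64
```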

cc:@randerzander

@ayushdg ayushdg added bug Something isn't working Needs Triage Need team to review and classify labels Aug 27, 2021
@shwina shwina added Python Affects Python cuDF API. cuIO cuIO issue and removed Needs Triage Need team to review and classify labels Aug 27, 2021
@ayushdg
Member Author

ayushdg commented Aug 31, 2021

Quick update here: Looks like the issue might be with cudf.to_orc and how orc_statistics are written.
I verified for the example above that the raw orc statistics value for sum is a negative number (indicating an overflow).

From the ORC specification: "if the sum overflows long at any point during the calculation, no sum is recorded", so this seems to be a case where we are incorrectly including the sum statistic within the ORC metadata.
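As a rough illustration of the rule quoted from the specification, the sketch below computes a sum in plain Python and returns None as soon as the running total leaves the int64 range. The function name `checked_int64_sum` is hypothetical and is not part of cuDF or any ORC library; it only models the expected behavior.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def checked_int64_sum(values):
    """Return the sum of values, or None if the running total ever
    leaves the int64 range (hypothetical helper, not a cuDF API)."""
    total = 0
    for v in values:
        total += v  # Python ints never wrap, so the range check is exact
        if not INT64_MIN <= total <= INT64_MAX:
            return None  # overflow at any point: no sum is recorded
    return total


print(checked_int64_sum([1, 2]))              # fits in int64
print(checked_int64_sum([INT64_MAX - 1, 2]))  # overflows, so no sum
```

With the reproducer's `key` column the helper returns None, which corresponds to omitting the sum statistic rather than writing a wrapped value.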

@ayushdg ayushdg changed the title [BUG] read_orc predicate filter returns incorrect result when metadata sum overflows [BUG] to_orc writes incorrect sum statistics when there's an overflow Aug 31, 2021
@rapids-bot rapids-bot bot closed this as completed in #9163 Sep 2, 2021
rapids-bot bot pushed a commit that referenced this issue Sep 2, 2021
Closes #9136

When converting statistics chunks, has_sum is conditioned on the result of overflow detection. Detection is very pessimistic, so sum is not included in all cases where there's a chance of overflow based on min/max values in the column.
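One way such a pessimistic check could work is sketched below. This is an assumption for illustration only, not the code merged in #9163: the function `sum_may_overflow` and its `count` parameter are hypothetical. The idea is to bound the sum by assuming every value equals the column minimum or the column maximum.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def sum_may_overflow(min_val, max_val, count):
    """Pessimistic check (hypothetical): True if a column of `count`
    values within [min_val, max_val] could have a sum outside int64."""
    # Worst cases: every value is the maximum, or every value is the minimum.
    return max_val * count > INT64_MAX or min_val * count < INT64_MIN


# 'key' column from the reproducer: min=2, max=INT64_MAX - 1, two rows.
print(sum_may_overflow(2, INT64_MAX - 1, 2))
# 'val' column from the reproducer: min=1, max=2, two rows.
print(sum_may_overflow(1, 2, 2))
```

Under this sketch the check also flags columns whose actual sum fits, for example values 0 and INT64_MAX - 1, which is the kind of false positive a min/max based check accepts in exchange for never writing a wrapped sum.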

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Jake Hemstad (https://github.com/jrhemstad)
  - Devavret Makkar (https://github.com/devavret)

URL: #9163