Fix logic while parsing the sum statistic for numerical orc columns #9183

Merged 4 commits into rapidsai:branch-21.10 on Sep 23, 2021


ayushdg
Member

@ayushdg ayushdg commented Sep 7, 2021

Fixes #9182.

When the sum statistic was not present in the ORC file for int and float columns, it was incorrectly read as 0, because protobuf returns a field's default value when the field is missing.

This PR adds a check for field presence before assignment.
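
For readers unfamiliar with the failure mode, here is a minimal sketch of it. It does not use the real ORC protobuf classes or the cuDF reader: `FakeIntegerStatistics`, `parse_naive`, and `parse_checked` are hypothetical names invented for illustration. The stand-in class only imitates two behaviours of a proto2 message with optional fields: reading an unset field returns the default value, and `HasField` reports whether the field was actually set.

```python
class FakeIntegerStatistics:
    """Hypothetical stand-in for a proto2 message with optional fields.

    Not the real ORC protobuf class. It only imitates default-on-read
    and HasField-style presence checks.
    """

    _DEFAULTS = {"minimum": 0, "maximum": 0, "sum": 0}

    def __init__(self, **fields):
        unknown = set(fields) - set(self._DEFAULTS)
        if unknown:
            raise ValueError(f"unknown fields: {sorted(unknown)}")
        # Only fields passed explicitly count as "present".
        self._set_fields = dict(fields)

    def HasField(self, name):
        if name not in self._DEFAULTS:
            raise ValueError(f"unknown field: {name}")
        return name in self._set_fields

    def __getattr__(self, name):
        # Reached only when normal attribute lookup fails.
        defaults = type(self)._DEFAULTS
        if name in defaults:
            set_fields = self.__dict__.get("_set_fields", {})
            return set_fields.get(name, defaults[name])
        raise AttributeError(name)


def parse_naive(stats):
    # Reads every field directly, so a missing sum silently becomes 0.
    return {
        "minimum": stats.minimum,
        "maximum": stats.maximum,
        "sum": stats.sum,
    }


def parse_checked(stats):
    # Checks presence first and reports missing fields as None.
    return {
        name: getattr(stats, name) if stats.HasField(name) else None
        for name in ("minimum", "maximum", "sum")
    }


# A column whose file metadata carries min and max but no sum.
stats = FakeIntegerStatistics(minimum=1, maximum=5)
naive = parse_naive(stats)
checked = parse_checked(stats)
```

With the naive parser a caller cannot tell a missing sum from a column whose values really sum to 0. The checked parser keeps the two cases apart by returning `None` for the absent field.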

@github-actions github-actions bot added the Python Affects Python cuDF API. label Sep 7, 2021
@codecov

codecov bot commented Sep 7, 2021

Codecov Report

Merging #9183 (f078caa) into branch-21.10 (3ee3ecf) will decrease coverage by 0.05%.
The diff coverage is 11.44%.

❗ Current head f078caa differs from pull request most recent head f7f8808. Consider uploading reports for the commit f7f8808 to get more accurate results

@@               Coverage Diff                @@
##           branch-21.10    #9183      +/-   ##
================================================
- Coverage         10.85%   10.80%   -0.06%     
================================================
  Files               115      116       +1     
  Lines             19158    19318     +160     
================================================
+ Hits               2080     2087       +7     
- Misses            17078    17231     +153     
Impacted Files                            Coverage Δ
python/cudf/cudf/__init__.py              0.00% <ø> (ø)
python/cudf/cudf/_lib/__init__.py         0.00% <ø> (ø)
python/cudf/cudf/core/column/column.py    0.00% <0.00%> (ø)
python/cudf/cudf/core/dataframe.py        0.00% <0.00%> (ø)
python/cudf/cudf/core/multiindex.py       0.00% <0.00%> (ø)
python/cudf/cudf/io/__init__.py           0.00% <0.00%> (ø)
python/cudf/cudf/io/csv.py                0.00% <0.00%> (ø)
python/cudf/cudf/io/orc.py                0.00% <0.00%> (ø)
python/cudf/cudf/io/parquet.py            0.00% <0.00%> (ø)
python/cudf/cudf/io/text.py               0.00% <0.00%> (ø)
... and 7 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c431650...f7f8808.

@ayushdg ayushdg marked this pull request as ready for review September 14, 2021 12:42
@ayushdg ayushdg requested a review from a team as a code owner September 14, 2021 12:42
@ayushdg ayushdg self-assigned this Sep 14, 2021
@ayushdg
Member Author

ayushdg commented Sep 15, 2021

rerun tests

@ayushdg ayushdg added the non-breaking Non-breaking change label Sep 15, 2021
@ayushdg
Member Author

ayushdg commented Sep 15, 2021

Seeing unrelated build errors:

fatal error: An error occurred (404) when calling the HeadObject operation: Key "rapidsai/cudf/pull-request/9183/cpu/flash-cudf-2e3fa15f7bc6e7105d18a7f602302b7fc9994855-11.0-x86_64.tgz" does not exist

rerun tests

@ayushdg ayushdg added the bug Something isn't working label Sep 15, 2021
@galipremsagar
Contributor

rerun tests

@ayushdg
Member Author

ayushdg commented Sep 17, 2021

rerun tests


assert stripe_stats[0]["c"].get("sum") == minint64 + 1


def test_empty_statistics():
Contributor

Should there be one more test that checks all the statistics values for a proper table containing every column type, with values?

Member Author

Agreed. Right now the other read_orc_statistics test only checks int and bool types. I'll add a test, or modify the existing one, to cover the other column types.

Member Author

In the interest of time, I'll add that test in a follow-up PR.

@rgsl888prabhu
Contributor

Rest looks good.

@ayushdg ayushdg added 0 - Waiting on Author Waiting for author to respond to review and removed 3 - Ready for Review Ready for review by team 4 - Needs cuDF (Python) Reviewer labels Sep 20, 2021
Contributor

@vuule vuule left a comment

LGTM

Contributor

@marlenezw marlenezw left a comment

Looks good on my end!

@ayushdg ayushdg removed the 0 - Waiting on Author Waiting for author to respond to review label Sep 23, 2021
@ayushdg ayushdg added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Sep 23, 2021
@galipremsagar
Contributor

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 817c3fa into rapidsai:branch-21.10 Sep 23, 2021
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge bug Something isn't working non-breaking Non-breaking change Python Affects Python cuDF API.
Development

Successfully merging this pull request may close these issues.

[BUG] read_orc_statistics incorrectly reads missing statistics values as 0