Fix ORC reading of files with struct columns that have null values #9005

Merged
merged 5 commits into from
Aug 17, 2021

Conversation

vuule
Contributor

@vuule vuule commented Aug 9, 2021

Fixes #8910

The number of values in the null stream of a child column depends on the number of valid elements in the parent column.

This PR changes the reading logic to account for the number of parent null values when parsing child null streams.
Specifically, the output row of a child column is offset by the number of null values in the parent column across all previous stripes. To keep parsing efficient, the null counts are inclusive_scan'd before the columns in the level are parsed.
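
For illustration only, the offset arithmetic described above can be sketched in plain Python. This is not the libcudf code: `child_stripe_layout`, its parameters, and the sample numbers are hypothetical, and the sketch assumes that child values are packed densely, one per valid parent row.

```python
from itertools import accumulate


def child_stripe_layout(stripe_rows, parent_nulls):
    """Return (child_start_row, child_value_count) for each stripe.

    stripe_rows[i]  -- hypothetical number of parent rows in stripe i
    parent_nulls[i] -- hypothetical number of null parent rows in stripe i
    """
    if len(stripe_rows) != len(parent_nulls):
        raise ValueError("need one null count per stripe")

    # Inclusive scans: totals up to and including each stripe.
    row_ends = list(accumulate(stripe_rows))
    null_ends = list(accumulate(parent_nulls))

    layout = []
    for i, (rows, nulls) in enumerate(zip(stripe_rows, parent_nulls)):
        rows_before = row_ends[i] - rows      # parent rows in previous stripes
        nulls_before = null_ends[i] - nulls   # parent nulls in previous stripes
        # The child start row is shifted back by the nulls in previous stripes.
        layout.append((rows_before - nulls_before, rows - nulls))
    return layout


layout = child_stripe_layout([4, 3, 5], [1, 0, 2])
print(layout)
```

Because the scan is computed once up front, each stripe's offset can be looked up independently, which is presumably what makes it suitable for parsing stripes in parallel.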

@vuule vuule self-assigned this Aug 9, 2021
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Aug 9, 2021
@vuule vuule added bug Something isn't working non-breaking Non-breaking change cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. and removed libcudf Affects libcudf (C++/CUDA) code. labels Aug 9, 2021
@github-actions github-actions bot added the Python Affects Python cuDF API. label Aug 10, 2021
@codecov

codecov bot commented Aug 10, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.10@115f3b6).
The diff coverage is n/a.

❗ Current head db318b7 differs from pull request most recent head 9a854be. Consider uploading reports for the commit 9a854be to get more accurate results

@@               Coverage Diff               @@
##             branch-21.10    #9005   +/-   ##
===============================================
  Coverage                ?   10.65%           
===============================================
  Files                   ?      114           
  Lines                   ?    19077           
  Branches                ?        0           
===============================================
  Hits                    ?     2033           
  Misses                  ?    17044           
  Partials                ?        0           

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 115f3b6...9a854be.

@vuule vuule marked this pull request as ready for review August 10, 2021 03:07
@vuule vuule requested review from a team as code owners August 10, 2021 03:07
python/cudf/cudf/tests/test_orc.py (review comment, outdated and resolved)
@vuule vuule added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Aug 16, 2021
@vuule
Contributor Author

vuule commented Aug 17, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 62c9312 into rapidsai:branch-21.10 Aug 17, 2021
@vuule vuule deleted the bug-orc-reader-struct-stripes branch August 17, 2021 04:56
Labels
5 - Ready to Merge: Testing and reviews complete, ready to merge
bug: Something isn't working
cuIO: cuIO issue
libcudf: Affects libcudf (C++/CUDA) code.
non-breaking: Non-breaking change
Python: Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] ORC reader produces different output than Pandas ORC reader reading a specific ORC file.
4 participants