Fix data corruption when reading ORC files with empty stripes #12160

Merged

Conversation

vuule
Contributor

@vuule vuule commented Nov 16, 2022

Description

Closes #12155
Closes #12156

Two fixes:

  • Fixes the reader logic that used to mark an entire (nesting) level of columns as having no data when one or more stripes had no data.
  • Removes the assert that failed when a non-struct column has no data in a stripe. There are several corner cases where this is valid input, some of which are not cheap to check.
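
To illustrate the first fix, here is a toy sketch in plain Python. It is not the libcudf reader code: the `Stripe` model, the function names, and the byte sizes are all hypothetical, chosen only to contrast a single level-wide "has data" flag with a per-stripe decision.

```python
from typing import Dict, List

# Hypothetical model: each stripe maps a column name to its encoded data size in bytes.
Stripe = Dict[str, int]


def level_has_data_buggy(stripes: List[Stripe], columns: List[str]) -> bool:
    """Level-wide flag: a single stripe without data marks the whole level as empty."""
    return all(stripe.get(col, 0) > 0 for stripe in stripes for col in columns)


def stripes_with_data(stripes: List[Stripe], columns: List[str]) -> List[bool]:
    """Per-stripe flags: a stripe is treated as empty only if it has no data itself."""
    return [any(stripe.get(col, 0) > 0 for col in columns) for stripe in stripes]


# Three stripes for the child columns of a map; the middle stripe is empty.
stripes = [
    {"m.key": 128, "m.value": 256},
    {"m.key": 0, "m.value": 0},
    {"m.key": 64, "m.value": 96},
]
columns = ["m.key", "m.value"]

level_flag = level_has_data_buggy(stripes, columns)
per_stripe = stripes_with_data(stripes, columns)
print(level_flag)   # False: the whole level would be treated as empty
print(per_stripe)   # [True, False, True]: only the empty stripe is skipped
```

In this model the level-wide flag would drop the data in the first and third stripes, while the per-stripe flags skip only the stripe that is actually empty.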

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@vuule vuule added bug Something isn't working cuIO cuIO issue labels Nov 16, 2022
@vuule vuule self-assigned this Nov 16, 2022
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Nov 16, 2022
@vuule vuule added non-breaking Non-breaking change and removed libcudf Affects libcudf (C++/CUDA) code. labels Nov 16, 2022
@codecov

codecov bot commented Nov 16, 2022

Codecov Report

Base: 88.25% // Head: 88.25% // No change to project coverage 👍

Coverage data is based on head (f15080f) compared to base (08c0c5a).
Patch has no changes to coverable lines.

Additional details and impacted files
@@              Coverage Diff              @@
##           branch-22.12   #12160   +/-   ##
=============================================
  Coverage         88.25%   88.25%           
=============================================
  Files               137      137           
  Lines             22571    22571           
=============================================
  Hits              19921    19921           
  Misses             2650     2650           
Impacted Files Coverage Δ
python/dask_cudf/dask_cudf/backends.py 85.17% <ø> (ø)

@revans2
Contributor

revans2 commented Nov 16, 2022

This fixes the one file I uploaded, but it does not work when there are other stripes in the file that do have data. I'll see if I can come up with a synthetic file that shows this same issue.

@revans2
Contributor

revans2 commented Nov 16, 2022

combined_error.zip holds an ORC file with more rows that still shows the problem we had, and it more closely matches what our customer is actually seeing.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Nov 19, 2022
@vuule vuule changed the title Fix a crash when reading an ORC file with an empty map column Fix data corruption when reading ORC files with empty stripes Nov 21, 2022
@vuule vuule marked this pull request as ready for review November 22, 2022 01:24
@vuule vuule requested a review from a team as a code owner November 22, 2022 01:24
@vuule vuule requested review from harrism and PointKernel and removed request for a team November 22, 2022 01:24
@vuule vuule requested a review from a team as a code owner November 22, 2022 04:50
@vuule vuule requested review from galipremsagar and brandon-b-miller and removed request for a team November 22, 2022 04:50
@github-actions github-actions bot added the Python Affects Python cuDF API. label Nov 22, 2022
Labels
bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
6 participants