Fix issues with parquet chunked reader #12488
- The num_nesting_levels field in the gpu::PageNestingInfo struct wasn't using the right value in all cases. This was benign previously, but the chunked reader fell victim to it.
- Fixed an issue with an optimization included in the chunked reader: we weren't properly determining which pages could be ignored during the preprocess step in some cases.
Codecov Report
Base: 86.58% // Head: 85.70% // Decreases project coverage by
Additional details and impacted files
@@ Coverage Diff @@
## branch-23.02 #12488 +/- ##
================================================
- Coverage 86.58% 85.70% -0.88%
================================================
Files 155 155
Lines 24368 24865 +497
================================================
+ Hits 21098 21311 +213
- Misses 3270 3554 +284
Thanks for working on this. In addition to the code, do we have any unit test to catch the bug?
At the end of the day, it hasn't been possible to get cudf or Spark to generate a useful test file here. Should I just check the "New or existing tests cover this change" box anyway? I'm not sure what else to do.
I think it should be left unchecked. The reason for the missing test is documented in the description.
LGTM
As expected, this one's really in the weeds. It would be best if the other C++ review comes from @ttnghia or @PointKernel.
LGTM
/merge
Fixes an issue with a particular arrangement of page data related to lists. Specifically, it is possible for page `N` to contain "0" rows because the values for the row it is a part of start on page `N-1` and end on page `N+1`. This defeated logic in the decode kernel, which erroneously caused these values to be skipped.

Similar to #12488, this is only reproducible with data out in the wild. In this case, we have a file that we could in theory check in to create a test with, but it is 16 MB, so it's fairly large. Looking for feedback on whether this is too big.

Authors:
- https://github.com/nvdbaranec
- Vukasin Milovanovic (https://github.com/vuule)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Vukasin Milovanovic (https://github.com/vuule)

URL: #12698
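To make the "page `N` contains 0 rows" arrangement concrete, here is a small Python sketch. It relies only on the Parquet convention that a repetition level of 0 marks the start of a new row. The page contents and helper names are hypothetical, and this is not the cudf decode kernel, just an illustration of why skipping a zero-row page drops values.

```python
def rows_started_per_page(pages):
    """Count how many rows begin in each page (repetition level 0 = new row)."""
    return [sum(1 for level in page if level == 0) for page in pages]


def row_lengths(pages):
    """Number of values in each row, reading the pages in order.

    Assumes the first value of the first page starts a row.
    """
    lengths = []
    for page in pages:
        for level in page:
            if level == 0:
                lengths.append(1)   # a new row begins here
            else:
                lengths[-1] += 1    # value continues the current row
    return lengths


# Hypothetical repetition levels for three consecutive pages of a list column.
pages = [
    [0, 1, 1, 0, 1],  # page N-1: two rows start; the second is left unfinished
    [1, 1, 1, 1],     # page N: continuation values only, so zero row starts
    [1, 1, 0, 1],     # page N+1: finishes that row, then starts another
]

counts = rows_started_per_page(pages)
full = row_lengths(pages)
without_page_n = row_lengths([pages[0], pages[2]])

print(counts)          # rows started per page
print(full)            # row lengths when every page is decoded
print(without_page_n)  # row lengths if page N is wrongly skipped
```

In this toy data the middle row loses four of its eight values when page N is treated as empty, which is the kind of silent data loss the fix addresses.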
Fixes two issues:
- A very old issue where we were incorrectly setting the output nesting depth in the PageNestingInfo struct in some cases. Previously, this was benign, but the chunked reader broke because of it. I fixed this and renamed some variables to make things clearer.
- An issue with an optimization that was added for the chunked reader: we were incorrectly determining when we could early out for a given page during the preprocessing step.
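The early-out decision in the second item can be illustrated with a simplified Python sketch. It is not the cudf preprocessing code: the page layout, the helper names, and the overlap rule are assumptions chosen to show why a page must be judged by the rows its values belong to, not only by the rows that begin in it.

```python
def page_row_spans(pages):
    """Return (first_row, last_row), inclusive, touched by each page's values.

    Each page is a list of repetition levels; level 0 starts a new row.
    Empty pages yield (None, None).
    """
    spans = []
    current = -1  # index of the row the most recent value belongs to
    for page in pages:
        first = None
        for level in page:
            if level == 0:
                current += 1
            if first is None:
                first = current
        spans.append((first, current if first is not None else None))
    return spans


def pages_needed(pages, skip_rows, num_rows):
    """Indices of pages holding values for rows in [skip_rows, skip_rows + num_rows)."""
    end = skip_rows + num_rows
    return [
        i
        for i, (first, last) in enumerate(page_row_spans(pages))
        if first is not None and last >= skip_rows and first < end
    ]


# Hypothetical pages: row 1 starts on page 0, continues through page 1,
# and ends on page 2.
pages = [
    [0, 1, 1, 0, 1],
    [1, 1, 1, 1],
    [1, 1, 0, 1],
]

spans = page_row_spans(pages)
only_row_1 = pages_needed(pages, skip_rows=1, num_rows=1)
only_row_2 = pages_needed(pages, skip_rows=2, num_rows=1)
only_row_0 = pages_needed(pages, skip_rows=0, num_rows=1)
```

Reading only row 1 requires all three pages, including the middle page in which no row begins. A check based on row starts alone would skip that page.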
There is an issue with generating a test for this PR. The conditions under which the second bug occurs require that values from a given row span multiple pages. I couldn't get the cudf writer to make this happen. The alternative is to use a pre-created file and the Python tests, but the issue there is that we don't expose skip_rows/num_rows through that API. So I have no good way of building a test.
This fix has been vetted against a known real-world failure case.
Fixes #12376