Parquet reader: bug fix for a num_rows/skip_rows corner case, w/optimization for nested preprocessing #11752
…taining a mix of nested and non-nested types would result in incorrect row counts for the non-nested types. Also optimizes the preprocess path so that non-nested types do not end up getting visited by the kernel.
…ists. Fixed an additional issue in the decoding where flat column types underneath structs could end up ignoring skip_rows/num_rows.
Wrongly switched branch. You may need to revert the changes from branch 22.12 and upmerge with 22.10 instead.
rerun tests
I like how this is more understandable now. Thanks!
// it would be better/safer to be checking (schema.max_repetition_level > 0) here, but there's
// no easy way to get at that info here. we'd have to move this function into reader_impl.cu
Should there be an issue for this?
You were right. This ended up being a bug :) If you have a struct at the top of a nested hierarchy, this logic fails. The max_repetition_level check is the correct one.
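To make the point above concrete, here is a minimal sketch in plain Python (not libcudf code) of why a check based on `max_repetition_level` classifies a struct-rooted hierarchy correctly. It assumes the usual Parquet convention that a leaf's max repetition level is the number of REPEATED fields on the path from the root to that leaf. `SchemaNode`, `leaf_max_repetition_levels`, and `root_is_repeated` are hypothetical names; `root_is_repeated` is only a stand-in for "a check that looks at the root alone", not the check that was actually in the reader.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SchemaNode:
    """Hypothetical, simplified Parquet schema node."""
    name: str
    repeated: bool = False  # True for a REPEATED field (the list level)
    children: List["SchemaNode"] = field(default_factory=list)


def leaf_max_repetition_levels(node: SchemaNode, inherited: int = 0,
                               path: str = "") -> Dict[str, int]:
    """Map each leaf path to the count of REPEATED fields above (and including) it."""
    level = inherited + (1 if node.repeated else 0)
    full = f"{path}.{node.name}" if path else node.name
    if not node.children:
        return {full: level}
    out: Dict[str, int] = {}
    for child in node.children:
        out.update(leaf_max_repetition_levels(child, level, full))
    return out


def has_repetition_info(max_repetition_level: int) -> bool:
    """The check the review thread settles on."""
    return max_repetition_level > 0


def root_is_repeated(node: SchemaNode) -> bool:
    """Hypothetical weaker check that inspects only the root of the hierarchy."""
    return node.repeated


# A struct at the root holding a list column and a flat column:
#   s: struct<items: list<element>, id>
root = SchemaNode("s", children=[
    SchemaNode("items", children=[
        SchemaNode("list", repeated=True, children=[SchemaNode("element")]),
    ]),
    SchemaNode("id"),
])

levels = leaf_max_repetition_levels(root)
```

In this model the root-only check reports no repetition for `s`, while the per-leaf levels show that `s.items.list.element` does carry repetition information and `s.id` does not.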
Codecov Report
Base: 87.40% // Head: 87.50% // Increases project coverage by +0.09%
Additional details and impacted files
@@ Coverage Diff @@
## branch-22.12 #11752 +/- ##
================================================
+ Coverage 87.40% 87.50% +0.09%
================================================
Files 133 133
Lines 21833 21826 -7
================================================
+ Hits 19084 19099 +15
+ Misses 2749 2727 -22
@gpucibot merge
Fixes NVIDIA/spark-rapids#6718

There was a bug recently introduced in #11752 where an insufficient check for whether an input column contained repetition information could cause incorrect results for column hierarchies with structs at the root.

Authors:
- https://github.com/nvdbaranec

Approvers:
- Jim Brennan (https://github.com/jbrennan333)
- Nghia Truong (https://github.com/ttnghia)
- Mike Wilson (https://github.com/hyperbolic2346)

URL: #11910
Fixes an issue where using user-specified row bounds (skip_rows/num_rows) with parquet files containing both nested and non-nested types could result in incorrect row counts for the non-nested columns. Originally reported by @etseidl.
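The invariant behind this fix can be sketched in plain Python: whatever the bounds are, every column of the result, flat or nested, should end up with the same number of rows. This is only a model under stated assumptions, not the reader's implementation. `clamp_row_bounds` and `read_with_bounds` are hypothetical helpers, and a table is modeled as a dict of equal-length Python lists.

```python
from typing import Dict, List, Optional, Tuple


def clamp_row_bounds(total_rows: int, skip_rows: int = 0,
                     num_rows: Optional[int] = None) -> Tuple[int, int]:
    """Return (start, count) after clamping the user bounds to the rows available."""
    if skip_rows < 0 or (num_rows is not None and num_rows < 0):
        raise ValueError("skip_rows and num_rows must be non-negative")
    start = min(skip_rows, total_rows)
    available = total_rows - start
    count = available if num_rows is None else min(num_rows, available)
    return start, count


def read_with_bounds(table: Dict[str, List], skip_rows: int = 0,
                     num_rows: Optional[int] = None) -> Dict[str, List]:
    """Apply the same row window to every column, nested or not."""
    total_rows = len(next(iter(table.values())))
    start, count = clamp_row_bounds(total_rows, skip_rows, num_rows)
    return {name: col[start:start + count] for name, col in table.items()}


# One flat column and one list column, ten rows each.
table = {
    "id": list(range(10)),
    "tags": [[i] * (i % 3) for i in range(10)],
}
out = read_with_bounds(table, skip_rows=8, num_rows=5)
```

Here the request runs past the end of the file, so both columns come back with the two remaining rows. The bug described above corresponds to the flat column disagreeing with the nested one on that count.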
The fix also implements a long-desired optimization: when running the preprocess step for nested types, ignore pages for non-nested hierarchies. This can result in significant speedups for files containing only a few nested columns.
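As a rough illustration of that optimization, the sketch below filters pages before the preprocess pass so that only pages with repetition information are visited. It is a plain-Python model, not the CUDA kernel. The `Page` record and `pages_needing_preprocess` are hypothetical, and using `max_repetition_level > 0` as the test is an assumption taken from the review discussion in this thread.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    """Hypothetical page record; the real reader tracks far more state."""
    column: str
    max_repetition_level: int
    num_values: int


def pages_needing_preprocess(pages: List[Page]) -> List[Page]:
    """Keep only pages whose column carries repetition information."""
    return [p for p in pages if p.max_repetition_level > 0]


# A file with three flat columns and a single list column.
pages = [
    Page("a", 0, 1000),
    Page("b", 0, 1000),
    Page("c", 0, 1000),
    Page("tags.list.element", 1, 2500),
]
to_visit = pages_needing_preprocess(pages)
```

With mostly flat columns, the preprocess pass in this model touches one page out of four, which is the kind of saving the description refers to.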
The tests added for this PR seem to trigger a bug in the parquet writer (#11748), so I will leave this as a draft until that issue is resolved.