-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix an issue with multiple short list rowgroups using the Parquet chunked reader. #15342
Fix an issue with multiple short list rowgroups using the Parquet chunked reader. #15342
Conversation
First, row groups of lists being loaded together were not getting their end row counts computed correctly on a per-rowgroup basis. The main consequence there was that we could potentially have generated splits that were larger than we would have liked. But downstream from this was a second bug where we were generating incorrect page indices to be decoded, causing corruption in the decode kernels.
Leaving as a draft until running it through all the Spark tests. |
Adding do-not-merge until pending changes run through the spark integration tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
few small suggestions, looks good otherwise
tricky stuff
…oid unnecessarily looping in some cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for addressing the suggestions!
/merge |
Fixes #15306
The core issue here was that under certain conditions, the chunked reader could generate invalid page indices for list columns when using the chunked reader. This led to corruption in the decode kernels. The fix is fairly simple, but there's a decent amount of delta in this PR that is just name changes for clarity and some more comments/docs.
This affected the number of chunks generated in some of the very (unrealistically) constrained tests.
Checklist