[BUG] ORC reader produces different struct rows than Pandas ORC reader when there are null rows. #8704
Comments
I'm able to reproduce this issue with the attached minimal reproducing file.
@rgsl888prabhu identified the root cause. The cuDF ORC reader expects an element in the null stream for each element in the nested columns, regardless of the validity of the parent column. In the file, however, elements under a null parent are excluded from the null stream (not just from the data streams). As a result, the cuDF reader ends up skipping some rows in the nested columns, as the repro shows. The fix is non-trivial because we need to take the parent columns' validity into account when parsing nested null streams.
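To make the root cause concrete, here is a toy sketch in plain Python. It is not cuDF's decoder: the function names `decode_child` and `decode_child_naive` and the list-based "streams" are made up for illustration. It assumes a struct column with one integer child, where the child's null stream holds one entry per non-null parent row only.

```python
from typing import List, Optional


def decode_child(parent_valid: List[bool],
                 child_present: List[bool],
                 child_data: List[int]) -> List[Optional[int]]:
    """Decode a child column, consuming a null-stream entry only for valid parents."""
    present = iter(child_present)
    data = iter(child_data)
    out: List[Optional[int]] = []
    for parent_ok in parent_valid:
        if not parent_ok:
            # Null parent: nothing was written for the child, so consume nothing.
            out.append(None)
        elif next(present):
            out.append(next(data))
        else:
            out.append(None)
    return out


def decode_child_naive(n_rows: int,
                       child_present: List[bool],
                       child_data: List[int]) -> List[Optional[int]]:
    """Hypothetical buggy variant: assumes one null-stream entry per parent row."""
    data = iter(child_data)
    out: List[Optional[int]] = []
    for i in range(n_rows):
        bit = child_present[i] if i < len(child_present) else False
        out.append(next(data) if bit else None)
    return out


# Four struct rows; row 1 is a null struct, row 2 is a valid struct with a null child.
parent_valid = [True, False, True, True]
child_present = [True, False, True]   # only three entries: one per non-null parent
child_data = [10, 30]                 # only the non-null child values

print(decode_child(parent_valid, child_present, child_data))
# [10, None, None, 30]
print(decode_child_naive(len(parent_valid), child_present, child_data))
# [10, None, 30, None]
```

The naive variant shifts the value 30 onto the wrong row, which is the kind of misalignment described above.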
Thanks.
I will try to get the fix in within 21.08, but since the fix is non-trivial, I am not 100% sure.
Update:
Ready for review #8819
In the case of liborc, pyarrow, and pyorc: if the parent has a null element, that element is skipped when writing child data, and the same goes for the mask. So you have to keep track of the null count and null mask in the parent column so that you can merge the parent and child null masks.

In the case of pyspark and Spark: if the parent has a null element and the child column also has a null element, the explanation above holds. But if all the child rows are valid, you need to copy the mask from the parent.

These scenarios have been taken care of in the code changes. Previously, a struct column and its child columns were at the same level of nesting, but since we need the parent null mask before decoding the child, child columns are now moved one level down for all types of nested columns.

closes #8704

Authors:
- Ram (Ramakrishna Prabhu) (https://github.com/rgsl888prabhu)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Devavret Makkar (https://github.com/devavret)
- Vukasin Milovanovic (https://github.com/vuule)

URL: #8819
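The two writer behaviors described in this PR can be sketched at the mask level as follows. This is only an illustration under stated assumptions, not the code from #8819: `merge_child_mask` is a hypothetical helper, masks are plain Python lists of booleans, and a child written with no null stream at all (the Spark case where every child row is valid) is modeled as `child_present=None`.

```python
from typing import List, Optional


def merge_child_mask(parent_valid: List[bool],
                     child_present: Optional[List[bool]]) -> List[bool]:
    """Return the child's validity aligned to parent rows.

    parent_valid  : one entry per row of the parent (struct) column.
    child_present : one entry per non-null parent row, or None if the
                    writer emitted no null stream for the child.
    """
    if child_present is None:
        # All written child rows are valid, so the child simply
        # inherits (copies) the parent's mask.
        return list(parent_valid)

    if len(child_present) != sum(parent_valid):
        raise ValueError("child null stream must have one entry per non-null parent row")

    bits = iter(child_present)
    # Consume a child bit only where the parent is valid; rows under a
    # null parent are null in the child as well.
    return [next(bits) if parent_ok else False for parent_ok in parent_valid]


# Child has its own nulls: merge the compact child stream with the parent mask.
print(merge_child_mask([True, False, True, True], [True, False, True]))
# [True, False, False, True]

# Child has no null stream: copy the parent's mask.
print(merge_child_mask([True, False, True], None))
# [True, False, True]
```

Either way the parent mask has to be known before the child can be decoded, which matches the motivation given above for moving child columns one nesting level down.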
The bug can be reproduced by reading the attached ORC file (test.log).
The output of the Pandas ORC reader.
The output of the cuDF ORC reader.