[BUG] ORC reader produces different output than Pandas ORC reader reading a specific ORC file. #8910
Comments
What's more, the nested column names seem to be dropped in the output; "0" is used as the name for every nested column instead. I am not sure this is correct.
Some observations that might be useful:
Yeah, we are using atomic OR, but not sure where we are messing up.
…9005)

Fixes #8910

Number of values in the null stream of a child column depends on the number of valid elements in the parent column. This PR changes the reading logic to account for the number of parent null values when parsing child null streams. Namely, the output row is offset by the number of null values in the parent column, in all previous stripes. To allow efficient parsing, null counts are inclusive_scan'd before the columns in the level are parsed.

Authors:
- Vukasin Milovanovic (https://github.com/vuule)

Approvers:
- Mike Wilson (https://github.com/hyperbolic2346)
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Ram (Ramakrishna Prabhu) (https://github.com/rgsl888prabhu)

URL: #9005
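The idea in the commit message above can be illustrated with a small host-side sketch. This is not the cuDF implementation (which runs on the GPU); the function names `scan_parent_nulls`, `nulls_before_stripe`, and `expand_child_validity` are hypothetical and exist only for this example. It assumes a child's null stream holds one entry per valid parent row, as the commit message describes.

```python
from itertools import accumulate


def scan_parent_nulls(parent_null_counts):
    """Inclusive scan of per-stripe parent null counts.

    Element i is the number of parent nulls in stripes 0..i.
    """
    return list(accumulate(parent_null_counts))


def nulls_before_stripe(scanned, stripe):
    """Parent nulls in all stripes before `stripe` (0 for the first stripe)."""
    return scanned[stripe - 1] if stripe > 0 else 0


def expand_child_validity(parent_valid, child_present):
    """Map a child's null stream onto output rows.

    `child_present` has one entry per valid parent row (assumption of this
    sketch); rows where the parent is null are marked invalid in the child.
    """
    stream = iter(child_present)
    return [next(stream) if valid else False for valid in parent_valid]


# Three stripes with 2, 0, and 3 null parent rows.
scanned = scan_parent_nulls([2, 0, 3])
offset_for_last_stripe = nulls_before_stripe(scanned, 2)

# Four parent rows, one of them null, so the child stream has three entries.
child_rows = expand_child_validity(
    [True, False, True, True], [True, False, True]
)
```

Precomputing the scan means each stripe can look up its row offset in constant time instead of re-summing the null counts of every earlier stripe.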
This can be reproduced by reading the attached ORC file (struct_orc.log).
Starting at line 995, the data in the nested column s1.sa is different. Similar to the file mentioned in issue #8878, this file, "struct_orc.log", was also generated by Spark RAPIDS. Both Pandas and Spark can read it correctly.