[BUG] ORC reader produces different output than the Pandas ORC reader when reading a specific ORC file. #8910

Closed
firestarman opened this issue Jul 30, 2021 · 3 comments · Fixed by #9005
Labels: bug, cuIO

firestarman commented Jul 30, 2021

This can be reproduced by reading the attached ORC file (struct_orc.log).

Starting at row 995, you can see that the data in the nested column s1.sa differs.

>>> pandas.read_orc("/data/struct_845472261.orc")
               a             b                     s1                             s2
0    326215455.0  9.646704e+08   {'sa': 2056831253.0}  {'sa': {'ssa': 1140062029.0}}
1    326215455.0 -1.123178e+09   {'sa': 1735783038.0}                           None
2    326215455.0           NaN   {'sa': -335532064.0}  {'sa': {'ssa': 1998862860.0}}
3    326215455.0 -6.566856e+08  {'sa': -1463792165.0}                   {'sa': None}
4    326215455.0 -1.943089e+09   {'sa': 1762505979.0}  {'sa': {'ssa': -521321651.0}}
..           ...           ...                    ...                            ...
995  289956431.0  1.954978e+09    {'sa': 281568135.0}  {'sa': {'ssa': -873056332.0}}
996  289956431.0 -1.028878e+09   {'sa': -899673285.0}  {'sa': {'ssa': 1353167471.0}}
997  289956431.0  1.827136e+09   {'sa': -950538642.0}                           None
998  289956431.0  2.147484e+09  {'sa': -1024796596.0}  {'sa': {'ssa': 1126348031.0}}
999  289956431.0 -1.223061e+08                   None  {'sa': {'ssa': 1008923253.0}}

[1000 rows x 4 columns]
>>> 
>>> import cudf
>>> cudf.read_orc("/data/struct_845472261.orc")
             a            b                    s1                          s2
0    326215455    964670395   {'0': 2056831253.0}  {'0': {'0': 1140062029.0}}
1    326215455  -1123177776   {'0': 1735783038.0}                        None
2    326215455         <NA>   {'0': -335532064.0}  {'0': {'0': 1998862860.0}}
3    326215455   -656685553  {'0': -1463792165.0}                 {'0': None}
4    326215455  -1943088595   {'0': 1762505979.0}  {'0': {'0': -521321651.0}}
..         ...          ...                   ...                         ...
995  289956431   1954977719  {'0': -1219859255.0}  {'0': {'0': 2013006809.0}}
996  289956431  -1028878225   {'0': 1139661239.0}   {'0': {'0': 908256279.0}}
997  289956431   1827136073    {'0': 871691692.0}                        None
998  289956431   2147483647    {'0': 720591207.0}  {'0': {'0': 1773585916.0}}
999  289956431   -122306082                  None  {'0': {'0': 1801980339.0}}

[1000 rows x 4 columns]
>>> 
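To surface the mismatch programmatically rather than by eyeballing the two frames, something along these lines should work (a sketch only; it assumes the struct columns come back as Python dicts on both sides, e.g. via to_pandas() for cuDF):

>>> import pandas, cudf
>>> pdf = pandas.read_orc("/data/struct_845472261.orc")
>>> gdf = cudf.read_orc("/data/struct_845472261.orc").to_pandas()
>>> # The nested field names differ ('sa' vs '0'), so compare the struct
>>> # values positionally instead of by key.
>>> lhs = pdf["s1"].map(lambda d: None if d is None else list(d.values()))
>>> rhs = gdf["s1"].map(lambda d: None if d is None else list(d.values()))
>>> [i for i in range(len(lhs)) if lhs[i] != rhs[i]]  # row indices where s1 disagrees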

Similar to the file mentioned in issue #8878, this file ("struct_orc.log") was also generated by Spark RAPIDS, and both Pandas and Spark can read it correctly.

firestarman commented Jul 30, 2021

What's more, judging by the output, the nested column names seem to be dropped; "0" is used as the name for all the nested columns instead. I am not sure this is correct.
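One way to double-check the field names actually stored in the file is to read its schema with pyarrow (a sketch, assuming pyarrow is available):

>>> from pyarrow import orc
>>> orc.ORCFile("/data/struct_845472261.orc").schema  # should list 'sa'/'ssa' if the names are stored in the file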

vuule commented Aug 6, 2021

Some observations that might be useful:
In the s1 column, cuDF turns the element in row 90 into a null, and moves the null element from row 99 to row 97. The data stream seems to be parsed correctly; only the null stream appears to be incorrect.

@rgsl888prabhu, there was an issue with the null masks in the writer when the row group size stopped being divisible by 8. Some assumptions about byte alignment broke down and I had to adjust the indexing. Could something similar be happening here?
Edit: Just realized that this case is accounted for in the code.
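For context on the alignment concern: null masks are packed eight rows per byte, so a row-group boundary that is not a multiple of 8 rows lands in the middle of a byte, and the reader needs both a byte offset and a bit offset. A toy illustration (not cuIO code):

>>> def mask_position(row):
...     return row // 8, row % 8  # (byte index, bit index) into a packed null mask
...
>>> mask_position(995)
(124, 3)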

rgsl888prabhu commented Aug 6, 2021

Yeah, we are using an atomic OR, but I'm not sure where we are messing up.

rapids-bot pushed a commit that referenced this issue on Aug 17, 2021:
…9005)

Fixes #8910

The number of values in the null stream of a child column depends on the number of valid elements in the parent column.

This PR changes the reading logic to account for the number of parent null values when parsing child null streams. Namely, the output row is offset by the number of null values that the parent column contains in all previous stripes. To allow efficient parsing, the null counts are inclusive_scan'd before the columns in the level are parsed.
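As an illustration of the offsetting described above (a sketch with made-up numbers, not the cuIO implementation): the offset for a given stripe is the total parent null count of all earlier stripes, which can be read directly off the inclusive scan:

>>> from itertools import accumulate
>>> parent_null_counts = [3, 0, 5]                  # hypothetical per-stripe null counts in the parent column
>>> scanned = list(accumulate(parent_null_counts))  # inclusive scan: [3, 3, 8]
>>> def child_row_offset(stripe):
...     # parent nulls seen in all previous stripes, i.e. the output-row offset for this stripe
...     return scanned[stripe - 1] if stripe > 0 else 0
...
>>> [child_row_offset(i) for i in range(3)]
[0, 3, 3]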

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Mike Wilson (https://github.com/hyperbolic2346)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Ram (Ramakrishna Prabhu) (https://github.com/rgsl888prabhu)

URL: #9005