[BUG] ORC reader produces different output than the Pandas ORC reader when reading a specific ORC file. #8910

Closed
firestarman opened this issue Jul 30, 2021 · 3 comments · Fixed by #9005
Labels: bug, cuIO

firestarman commented Jul 30, 2021

This can be reproduced by reading the attached ORC file (struct_orc.log).

Starting at row 995, you can see that the data in the nested column s1.sa differs.

>>> pandas.read_orc("/data/struct_845472261.orc")
               a             b                     s1                             s2
0    326215455.0  9.646704e+08   {'sa': 2056831253.0}  {'sa': {'ssa': 1140062029.0}}
1    326215455.0 -1.123178e+09   {'sa': 1735783038.0}                           None
2    326215455.0           NaN   {'sa': -335532064.0}  {'sa': {'ssa': 1998862860.0}}
3    326215455.0 -6.566856e+08  {'sa': -1463792165.0}                   {'sa': None}
4    326215455.0 -1.943089e+09   {'sa': 1762505979.0}  {'sa': {'ssa': -521321651.0}}
..           ...           ...                    ...                            ...
995  289956431.0  1.954978e+09    {'sa': 281568135.0}  {'sa': {'ssa': -873056332.0}}
996  289956431.0 -1.028878e+09   {'sa': -899673285.0}  {'sa': {'ssa': 1353167471.0}}
997  289956431.0  1.827136e+09   {'sa': -950538642.0}                           None
998  289956431.0  2.147484e+09  {'sa': -1024796596.0}  {'sa': {'ssa': 1126348031.0}}
999  289956431.0 -1.223061e+08                   None  {'sa': {'ssa': 1008923253.0}}

[1000 rows x 4 columns]
>>> 
>>> import cudf
>>> cudf.read_orc("/data/struct_845472261.orc")
             a            b                    s1                          s2
0    326215455    964670395   {'0': 2056831253.0}  {'0': {'0': 1140062029.0}}
1    326215455  -1123177776   {'0': 1735783038.0}                        None
2    326215455         <NA>   {'0': -335532064.0}  {'0': {'0': 1998862860.0}}
3    326215455   -656685553  {'0': -1463792165.0}                 {'0': None}
4    326215455  -1943088595   {'0': 1762505979.0}  {'0': {'0': -521321651.0}}
..         ...          ...                   ...                         ...
995  289956431   1954977719  {'0': -1219859255.0}  {'0': {'0': 2013006809.0}}
996  289956431  -1028878225   {'0': 1139661239.0}   {'0': {'0': 908256279.0}}
997  289956431   1827136073    {'0': 871691692.0}                        None
998  289956431   2147483647    {'0': 720591207.0}  {'0': {'0': 1773585916.0}}
999  289956431   -122306082                  None  {'0': {'0': 1801980339.0}}

[1000 rows x 4 columns]
>>> 
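To surface the mismatch programmatically rather than by eyeballing the two frames, something along these lines should work (a sketch only; it assumes the struct columns come back as Python dicts on both sides, e.g. via to_pandas() for cuDF):

>>> import pandas, cudf
>>> pdf = pandas.read_orc("/data/struct_845472261.orc")
>>> gdf = cudf.read_orc("/data/struct_845472261.orc").to_pandas()
>>> # The nested field names differ ('sa' vs '0'), so compare the struct
>>> # values positionally instead of by key.
>>> lhs = pdf["s1"].map(lambda d: None if d is None else list(d.values()))
>>> rhs = gdf["s1"].map(lambda d: None if d is None else list(d.values()))
>>> [i for i in range(len(lhs)) if lhs[i] != rhs[i]]  # row indices where s1 disagrees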

Similar to the file mentioned in issue #8878, this file ("struct_orc.log") was also generated by Spark RAPIDS, and both Pandas and Spark can read it correctly.

firestarman commented Jul 30, 2021

What's more, judging by the output, the nested column names seem to be dropped; "0" is used as the name for all the nested columns instead. I am not sure this is correct.
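One way to double-check the field names actually stored in the file is to read its schema with pyarrow (a sketch, assuming pyarrow is available):

>>> from pyarrow import orc
>>> orc.ORCFile("/data/struct_845472261.orc").schema  # should list 'sa'/'ssa' if the names are stored in the file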

vuule commented Aug 6, 2021

Some observations that might be useful:
In the s1 column, cuDF turns the element in row 90 into a null, and moves the null element from row 99 to row 97. The data stream seems to be parsed correctly; only the null stream appears to be incorrect.

@rgsl888prabhu, there was an issue with the null masks in the writer when the row group size stopped being divisible by 8. Some assumptions about byte alignment broke down and I had to adjust the indexing. Could something similar be happening here?
Edit: Just realized that this case is accounted for in the code.
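For context on the alignment concern: null masks are packed eight rows per byte, so a row-group boundary that is not a multiple of 8 rows lands in the middle of a byte, and the reader needs both a byte offset and a bit offset. A toy illustration (not cuIO code):

>>> def mask_position(row):
...     return row // 8, row % 8  # (byte index, bit index) into a packed null mask
...
>>> mask_position(995)
(124, 3)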

rgsl888prabhu commented Aug 6, 2021

Yeah, we are using an atomic OR, but I'm not sure where we are messing up.

rapids-bot pushed a commit that referenced this issue on Aug 17, 2021:
…9005)

Fixes #8910

The number of values in the null stream of a child column depends on the number of valid elements in the parent column.

This PR changes the reading logic to account for the number of parent null values when parsing child null streams. Namely, the output row is offset by the number of null values that the parent column contains in all previous stripes. To allow efficient parsing, the null counts are inclusive_scan'd before the columns in the level are parsed.
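As an illustration of the offsetting described above (a sketch with made-up numbers, not the cuIO implementation): the offset for a given stripe is the total parent null count of all earlier stripes, which can be read directly off the inclusive scan:

>>> from itertools import accumulate
>>> parent_null_counts = [3, 0, 5]                  # hypothetical per-stripe null counts in the parent column
>>> scanned = list(accumulate(parent_null_counts))  # inclusive scan: [3, 3, 8]
>>> def child_row_offset(stripe):
...     # parent nulls seen in all previous stripes, i.e. the output-row offset for this stripe
...     return scanned[stripe - 1] if stripe > 0 else 0
...
>>> [child_row_offset(i) for i in range(3)]
[0, 3, 3]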

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Mike Wilson (https://github.com/hyperbolic2346)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Ram (Ramakrishna Prabhu) (https://github.com/rgsl888prabhu)

URL: #9005