Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] ORC reader produces different struct rows than Pandas ORC reader when there are null rows. #8704

Closed
firestarman opened this issue Jul 9, 2021 · 6 comments · Fixed by #8819
Assignees
Labels
bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.

Comments

@firestarman
Copy link
Contributor

firestarman commented Jul 9, 2021

test.log

The bug can be reproduced by reading the attached orc file. (test.log)

The output of Pandas ORC reader.

>>> cd = pd.read_orc("/data/tmp/test.log")
>>> cd
                                                                                                                                                                                       _c0
0                                                                                                                                                                                     None
1       {'child0': -21.0, 'child1': 770290356.0, 'child2': -7.932020604150343e+18, 'child3': -9.540890298415513e+123, 'child4': True, 'child5': 8819-10-21, 'child6': -544668600592309696}
2       {'child0': 112.0, 'child1': 741485586.0, 'child2': 9.223372036854776e+18, 'child3': -2.1696969883753554e-292, 'child4': True, 'child5': 7810-07-27, 'child6': 8297071309727241920}
3   {'child0': -50.0, 'child1': -2094070070.0, 'child2': 4.4090692282843546e+18, 'child3': -1.4882040107841963e+286, 'child4': True, 'child5': 9247-09-19, 'child6': -1488240148641067776}
4       {'child0': -49.0, 'child1': -228841867.0, 'child2': 8.526868855009764e+18, 'child3': -5.015697809755203e+126, 'child4': True, 'child5': 5964-11-22, 'child6': 9168135723176380608}
5    {'child0': -55.0, 'child1': -2116626173.0, 'child2': 6.625620584231136e+18, 'child3': -6.980585330052296e+279, 'child4': False, 'child5': 4359-12-22, 'child6': -8819903668114758080}
6      {'child0': None, 'child1': 2117211837.0, 'child2': -3.546155898819671e+18, 'child3': 3.997339500801487e+241, 'child4': False, 'child5': 4687-09-06, 'child6': -5843357623079309696}
7    {'child0': -61.0, 'child1': -456120487.0, 'child2': -1.4714653771146145e+18, 'child3': 6.705561243361668e+217, 'child4': False, 'child5': 2941-01-02, 'child6': -6743833081248309696}
8                                                                                                                                                                                     None
9       {'child0': 126.0, 'child1': -850819620.0, 'child2': 2.561856582238531e+18, 'child3': -3.010452936419635e-233, 'child4': False, 'child5': 5298-03-10, 'child6': 523475195910380608}
10    {'child0': 88.0, 'child1': 2143646677.0, 'child2': -1.2687481690450127e+18, 'child3': 1.4029358106730946e-59, 'child4': False, 'child5': 4286-02-08, 'child6': -1595456785955206464}
11      {'child0': -18.0, 'child1': -1698102546.0, 'child2': 8.909578400170195e+18, 'child3': 2.187766760184779e+306, 'child4': True, 'child5': 8972-05-15, 'child6': 3526912144283241920}
12   {'child0': -121.0, 'child1': 1460680657.0, 'child2': 2.0641369428828419e+18, 'child3': -2.646144697462316e-35, 'child4': False, 'child5': 7114-04-24, 'child6': -2142515480837758080}
>>> 

The output of the cudf ORC reader

>>> gd = cu.read_orc("/data/tmp/test.log")
>>> gd
                                                                                                                                                                                                _c0
0                                                                                                                                                                                              None
1       {'child0': 112.0, 'child1': 741485586.0, 'child2': 9.223372036854776e+18, 'child3': -2.1696969883753554e-292, 'child4': True, 'child5': 7810-07-27 00:00:00, 'child6': 8297071309727241920}
2   {'child0': -50.0, 'child1': -2094070070.0, 'child2': 4.4090692282843546e+18, 'child3': -1.4882040107841963e+286, 'child4': True, 'child5': 9247-09-19 00:00:00, 'child6': -1488240148641067776}
3       {'child0': -49.0, 'child1': -228841867.0, 'child2': 8.526868855009764e+18, 'child3': -5.015697809755203e+126, 'child4': True, 'child5': 5964-11-22 00:00:00, 'child6': 9168135723176380608}
4    {'child0': -55.0, 'child1': -2116626173.0, 'child2': 6.625620584231136e+18, 'child3': -6.980585330052296e+279, 'child4': False, 'child5': 4359-12-22 00:00:00, 'child6': -8819903668114758080}
5      {'child0': None, 'child1': 2117211837.0, 'child2': -3.546155898819671e+18, 'child3': 3.997339500801487e+241, 'child4': False, 'child5': 4687-09-06 00:00:00, 'child6': -5843357623079309696}
6    {'child0': -61.0, 'child1': -456120487.0, 'child2': -1.4714653771146145e+18, 'child3': 6.705561243361668e+217, 'child4': False, 'child5': 2941-01-02 00:00:00, 'child6': -6743833081248309696}
7       {'child0': 126.0, 'child1': -850819620.0, 'child2': 2.561856582238531e+18, 'child3': -3.010452936419635e-233, 'child4': False, 'child5': 5298-03-10 00:00:00, 'child6': 523475195910380608}
8                                                                                                                                                                                              None
9       {'child0': -18.0, 'child1': -1698102546.0, 'child2': 8.909578400170195e+18, 'child3': 2.187766760184779e+306, 'child4': True, 'child5': 8972-05-15 00:00:00, 'child6': 3526912144283241920}
10   {'child0': -121.0, 'child1': 1460680657.0, 'child2': 2.0641369428828419e+18, 'child3': -2.646144697462316e-35, 'child4': False, 'child5': 7114-04-24 00:00:00, 'child6': -2142515480837758080}
11                                    {'child0': None, 'child1': 49.0, 'child2': 91.0, 'child3': 2.21338557185e-313, 'child4': False, 'child5': 1970-02-11 00:00:00, 'child6': 1420070458000000000}
12                           {'child0': None, 'child1': 741485586.0, 'child2': 9.223372036854776e+18, 'child3': 0.0, 'child4': False, 'child5': 7810-07-27 00:00:00, 'child6': 8297071309452241920}
@firestarman firestarman added bug Something isn't working Needs Triage Need team to review and classify labels Jul 9, 2021
@beckernick
Copy link
Member

I'm able to reproduce this issue with the attached minimal reproducing file

@beckernick beckernick added Python Affects Python cuDF API. cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Jul 12, 2021
@rgsl888prabhu rgsl888prabhu self-assigned this Jul 12, 2021
@vuule
Copy link
Contributor

vuule commented Jul 15, 2021

@rgsl888prabhu identified the root cause. ORC reader expects an element in the null stream for each element in the nested columns, regardless of validity of the parent column. Such elements are actually excluded from the null stream (not just from the data streams). The cuDF reader ends up skipping some rows in the nested columns, as the repro shows.

The fix is non-trivial as we need to take parent columns' validity into account when parsing nested null streams.
There is a simple patch we can apply to throw an exception instead of corrupting data.

@firestarman
Copy link
Contributor Author

Thanks.
Can the final fix be made and included in the v21.08 Release ?

@rgsl888prabhu
Copy link
Contributor

I will try to get the fix with-in 21.08, but since the fix is non-trivial, I am not 100% sure.

@beckernick beckernick added this to the IO Data Type Expansion milestone Jul 16, 2021
@rgsl888prabhu
Copy link
Contributor

rgsl888prabhu commented Jul 20, 2021

Update:

In case of Pyorc, Pyarrow and liborc:

If the parent has a null element, that element is skipped while writing child data, and same goes with mask
So, you would have to keep track of null count and null mask in parent element, so that you can merge both the parent and child null masks.

As of now I have made changes that consider above expectation, and exploring how to tackle below expectation.


In case of pyspark, spark:

If the parent has a null element, and if child column also has null element, then upper explanation holds.
But if all the child column is valid, then it will have exact same mask as parent, so you wouldn't require to merge or do anything for that

But pyorc is able to read both variants properly, so currently diving through liborc to find exact approach being used. (edited) 

@rgsl888prabhu
Copy link
Contributor

Ready for review #8819

rapids-bot bot pushed a commit that referenced this issue Jul 22, 2021
In case of liborc, pyarrow and pyorc:
If the parent has a null element, that element is skipped while writing child data, and same goes with mask
So, you would have to keep track of null count and null mask in parent column, so that you can merge both the parent and child null masks.

In case of pyspark, spark:

If the parent has a null element, and if child column also has null element, then upper explanation holds.
But if all the child rows are valid, then you need to copy the mask from parent.

These scenarios have been take care in the code changes.

Earlier struct  column and its child columns used to be in the same level of nesting, but since we need parent null mask before decoding child,  changes have been made so that child columns will be moved one level down for all types of nested columns. 

closes #8704

Authors:
  - Ram (Ramakrishna Prabhu) (https://github.com/rgsl888prabhu)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Devavret Makkar (https://github.com/devavret)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #8819
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants