[BUG] ORC writer writes corrupted data. #8514

Closed
rgsl888prabhu opened this issue Jun 14, 2021 · 0 comments · Fixed by #8515
Labels: bug (Something isn't working), cuIO (cuIO issue)

@rgsl888prabhu (Contributor)

Describe the bug
The ORC writer writes corrupted data: when the file is read back through cudf, the string column contains junk values.

Steps/Code to reproduce bug

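import decimal

import cudf

size = 9920
# Column "0" holds strings; columns "1" through "4" hold decimals.
val = {str(x): [decimal.Decimal(0)] * size if x != 0 else ["0"] * size for x in range(0, 5)}
df = cudf.DataFrame(val)
df.to_orc("sample.orc")
# Read back the file that was just written.
gdf = cudf.read_orc("sample.orc")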

df

      0  1  2  3  4
0     0  0  0  0  0
1     0  0  0  0  0
2     0  0  0  0  0
3     0  0  0  0  0
4     0  0  0  0  0
...  .. .. .. .. ..
9915  0  0  0  0  0
9916  0  0  0  0  0
9917  0  0  0  0  0
9918  0  0  0  0  0
9919  0  0  0  0  0

[9920 rows x 5 columns]

gdf

     0   1  2  3  4
0        0  0  0  0
1        0  0  0  0
2        0  0  0  0
3        0  0  0  0
4        0  0  0  0
...  .. .. .. .. ..
9915     0  0  0  0
9916     0  0  0  0
9917     0  0  0  0
9918     0  0  0  0
9919     0  0  0  0

[9920 rows x 5 columns]

Expected behavior
The DataFrame read back from the ORC file should match the original DataFrame.

Environment overview (please complete the following information)

  • Method of cuDF install: [conda, Docker, or from source]
rgsl888prabhu added the bug and cuIO labels on Jun 14, 2021
rapids-bot bot pushed a commit that referenced this issue Jun 16, 2021
Fixes #8514

String dictionary length is RLE encoded, and `rle_data_size` and `non_rle_data_size` take this into account. However, when computing chunk stream offsets, these streams were treated as non-RLE and `non_rle_data_size` was not added. This caused a discrepancy between non-RLE stream sizes and available space, leading to overlap between chunk streams.

Applied `non_rle_data_size` to the offset to correct the discrepancy, and added a test that uses decimal columns to increase the size of non-RLE encoded data and trigger the overflow.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Ram (Ramakrishna Prabhu) (https://github.com/rgsl888prabhu)
  - Charles Blackmon-Luca (https://github.com/charlesbluca)
  - David Wendt (https://github.com/davidwendt)

URL: #8515
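
The overlap described in the commit message above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not cuDF's actual writer code: the `Stream` class, the `layout` and `overlaps` helpers, the stream names, and all sizes are hypothetical. It assumes streams are packed back to back in one buffer and shows how leaving `non_rle_data_size` out of the offset computation makes a later stream start inside an earlier one.

```python
from dataclasses import dataclass


@dataclass
class Stream:
    # Hypothetical stand-in for one chunk stream; sizes are in bytes.
    name: str
    rle_data_size: int
    non_rle_data_size: int


def layout(streams, include_non_rle):
    """Return (name, start, end) byte ranges for streams packed back to back."""
    ranges = []
    offset = 0
    for s in streams:
        # Bytes the stream actually occupies once written.
        written = s.rle_data_size + s.non_rle_data_size
        ranges.append((s.name, offset, offset + written))
        # Advance to where the next stream starts.
        offset += s.rle_data_size
        if include_non_rle:
            offset += s.non_rle_data_size
    return ranges


def overlaps(ranges):
    """Return pairs of adjacent stream names whose byte ranges overlap."""
    return [(a[0], b[0]) for a, b in zip(ranges, ranges[1:]) if b[1] < a[2]]


# Made-up sizes, chosen only to make the effect visible.
streams = [
    Stream("stream_0", rle_data_size=64, non_rle_data_size=0),
    Stream("stream_1", rle_data_size=0, non_rle_data_size=256),
    Stream("stream_2", rle_data_size=0, non_rle_data_size=128),
]

buggy = layout(streams, include_non_rle=False)  # offset ignores non-RLE bytes
fixed = layout(streams, include_non_rle=True)   # offset counts both sizes

print(overlaps(buggy))
print(overlaps(fixed))
```

In this toy model the second and third streams collide when the non-RLE size is skipped, and no streams collide once both sizes are counted. It also suggests why larger non-RLE data, such as the decimal columns in the added test, would make the overlap easier to hit.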
ajschmidt8 pushed a commit that referenced this issue Jun 17, 2021
Fixes #8514

String dictionary length is RLE encoded, and rle_data_size and non_rle_data_size take this into account. However, when computing chunk stream offsets, these streams were treated as non-RLE and non_rle_data_size was not added. This caused a discrepancy between non-RLE stream sizes and available space, leading to overlap between chunk streams.

Applied non_rle_data_size to the offset to correct the discrepancy, and added a test that uses decimal columns to increase the size of non-RLE encoded data and trigger the overflow.

Author:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Conor Hoekstra (https://github.com/codereport)
  - David Wendt (https://github.com/davidwendt)
  - https://github.com/brandon-b-miller

URL: #8538