[BUG] ORC writer writes corrupted data. #8514

rgsl888prabhu · 2021-06-14T23:57:59Z

Describe the bug
The ORC writer writes corrupted data: when the file is read back through cudf, the string column contains junk values.

Steps/Code to reproduce bug

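import decimal

import cudf

size = 9920
# Column "0" holds strings; columns "1" through "4" hold decimal zeros.
val = {str(x): [decimal.Decimal(0)] * size if x != 0 else ["0"] * size for x in range(0, 5)}
df = cudf.DataFrame(val)
df.to_orc("sample.orc")
# Read back the same file that was just written.
gdf = cudf.read_orc("sample.orc")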

df

      0  1  2  3  4
0     0  0  0  0  0
1     0  0  0  0  0
2     0  0  0  0  0
3     0  0  0  0  0
4     0  0  0  0  0
...  .. .. .. .. ..
9915  0  0  0  0  0
9916  0  0  0  0  0
9917  0  0  0  0  0
9918  0  0  0  0  0
9919  0  0  0  0  0

[9920 rows x 5 columns]

gdf

     0   1  2  3  4
0        0  0  0  0
1        0  0  0  0
2        0  0  0  0
3        0  0  0  0
4        0  0  0  0
...  .. .. .. .. ..
9915     0  0  0  0
9916     0  0  0  0
9917     0  0  0  0
9918     0  0  0  0
9919     0  0  0  0

[9920 rows x 5 columns]

Expected behavior
The dataframe read back from the ORC file should match the original dataframe.

Environment overview (please complete the following information)

Method of cuDF install: [conda, Docker, or from source]


Fixes #8514

String dictionary length is RLE encoded, and `rle_data_size` and `non_rle_data_size` take this into account. However, when computing chunk stream offsets, these streams were treated as non-RLE and `non_rle_data_size` was not added. This caused a discrepancy between non-RLE stream sizes and available space, leading to overlap between chunk streams.

Applied the `non_rle_data_size` to the offset to correct the discrepancy and added a test that uses decimal columns to increase the size of non-RLE encoded data and enable the overflow.

Authors:
- Vukasin Milovanovic (https://github.com/vuule)

Approvers:
- Ram (Ramakrishna Prabhu) (https://github.com/rgsl888prabhu)
- Charles Blackmon-Luca (https://github.com/charlesbluca)
- David Wendt (https://github.com/davidwendt)

URL: #8515

Fixes #8514

String dictionary length is RLE encoded, and `rle_data_size` and `non_rle_data_size` take this into account. However, when computing chunk stream offsets, these streams were treated as non-RLE and `non_rle_data_size` was not added. This caused a discrepancy between non-RLE stream sizes and available space, leading to overlap between chunk streams.

Applied the `non_rle_data_size` to the offset to correct the discrepancy and added a test that uses decimal columns to increase the size of non-RLE encoded data and enable the overflow.

Author:
- Vukasin Milovanovic (https://github.com/vuule)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Conor Hoekstra (https://github.com/codereport)
- David Wendt (https://github.com/davidwendt)
- https://github.com/brandon-b-miller

URL: #8538
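
For readers unfamiliar with the failure mode described in the fix, here is a toy Python model of it. This is only a sketch and not the libcudf code: it assumes a single shared buffer in which non-RLE streams are packed first and RLE streams are placed after them, and the names `compute_offsets`, `overlapping_pairs`, `add_base`, and the stream sizes are all hypothetical. It shows how leaving the non-RLE region size out of an RLE stream's offset makes that stream land on top of non-RLE data.

```python
def compute_offsets(streams, add_base):
    """Assign a (start, end) byte range to every stream in one shared buffer.

    Assumed toy layout: non-RLE streams are packed from offset 0, and RLE
    streams are packed after the whole non-RLE region. With add_base=False
    the size of the non-RLE region is left out of the RLE stream offsets,
    which mimics the kind of mistake described in the fix.
    """
    non_rle_data_size = sum(s["size"] for s in streams if not s["rle"])
    non_rle_cursor = 0
    rle_cursor = 0
    ranges = {}
    for s in streams:
        if s["rle"]:
            base = non_rle_data_size if add_base else 0
            start = base + rle_cursor
            rle_cursor += s["size"]
        else:
            start = non_rle_cursor
            non_rle_cursor += s["size"]
        ranges[s["name"]] = (start, start + s["size"])
    return ranges


def overlapping_pairs(ranges):
    """Return pairs of stream names whose byte ranges overlap."""
    items = sorted(ranges.items(), key=lambda kv: kv[1])
    return [
        (a, b)
        for (a, (_, a_end)), (b, (b_start, _)) in zip(items, items[1:])
        if b_start < a_end
    ]


# Hypothetical stream sizes in bytes; not taken from a real ORC file.
streams = [
    {"name": "data", "size": 64, "rle": False},
    {"name": "dict_length", "size": 16, "rle": True},
]

buggy = overlapping_pairs(compute_offsets(streams, add_base=False))
fixed = overlapping_pairs(compute_offsets(streams, add_base=True))
print("without base:", buggy)
print("with base:", fixed)
```

In this model the overlap only matters once the non-RLE region is non-empty, which is consistent with the fix's note that decimal columns were used in the test to increase the amount of non-RLE data.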

rgsl888prabhu added the bug (Something isn't working) and cuIO (cuIO issue) labels Jun 14, 2021

rgsl888prabhu assigned vuule Jun 14, 2021

vuule mentioned this issue Jun 15, 2021

Fix offset of the string dictionary length stream #8515

Merged

rapids-bot bot closed this as completed in #8515 Jun 16, 2021

vuule mentioned this issue Jun 16, 2021

Fix offset of the string dictionary length stream #8538

Merged
