Describe the bug
The ORC writer writes corrupted data: when the file is read back through cudf, the string column contains junk values.
Steps/Code to reproduce bug
import cudf
import decimal
size = 9920
val = {str(x): [decimal.Decimal(0)]*size if x != 0 else ["0"]*size for x in range(0, 5)}
df = cudf.DataFrame(val)
df.to_orc("sample.orc")
gdf = cudf.read_orc("sample.orc")
Fixes #8514
String dictionary length is RLE encoded, and `rle_data_size` and `non_rle_data_size` take this into account. However, when computing chunk stream offsets, these streams were treated as non-RLE and `non_rle_data_size` was not added. This caused a discrepancy between the non-RLE stream sizes and the available space, leading to overlap between chunk streams.
Applied the `non_rle_data_size` to the offset to correct the discrepancy and added a test that uses decimal columns to increase the size of non-RLE encoded data and enable the overflow.
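The overlap described above can be illustrated with a small, self-contained sketch. This is not the cudf implementation: the `Stream` type, the `assign_offsets` and `overlapping` helpers, the stream names, the byte sizes, and the assumed buffer layout (non-RLE data first, RLE data after it) are all hypothetical and chosen only to show why omitting `non_rle_data_size` from an RLE stream's offset makes it land on top of non-RLE data.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Stream:
    name: str      # hypothetical stream label
    size: int      # encoded size in bytes (made-up numbers)
    is_rle: bool   # whether the stream is RLE encoded


def assign_offsets(streams, non_rle_data_size, shift_rle=True):
    """Return {name: (start, end)} for an assumed buffer layout [non-RLE | RLE].

    With shift_rle=False the RLE streams are placed as if the RLE region
    started at offset 0, which mimics the missing `non_rle_data_size` term.
    """
    non_rle_cursor = 0
    rle_cursor = 0
    ranges = {}
    for s in streams:
        if s.is_rle:
            base = non_rle_data_size if shift_rle else 0
            start = base + rle_cursor
            rle_cursor += s.size
        else:
            start = non_rle_cursor
            non_rle_cursor += s.size
        ranges[s.name] = (start, start + s.size)
    return ranges


def overlapping(ranges):
    """Return every pair of stream names whose byte ranges intersect."""
    return [
        (a, b)
        for (a, ra), (b, rb) in combinations(ranges.items(), 2)
        if ra[0] < rb[1] and rb[0] < ra[1]
    ]


streams = [
    Stream("decimal_data", 100, is_rle=False),
    Stream("string_chars", 40, is_rle=False),
    Stream("dict_length", 16, is_rle=True),
]
non_rle_data_size = sum(s.size for s in streams if not s.is_rle)

fixed = assign_offsets(streams, non_rle_data_size, shift_rle=True)
buggy = assign_offsets(streams, non_rle_data_size, shift_rle=False)

print("fixed overlaps:", overlapping(fixed))
print("buggy overlaps:", overlapping(buggy))
```

In this sketch, larger non-RLE streams (such as the decimal columns used in the added test) make the collision more likely, because more non-RLE bytes sit in the range where the unshifted RLE stream is written.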
Authors:
- Vukasin Milovanovic (https://github.com/vuule)
Approvers:
- Ram (Ramakrishna Prabhu) (https://github.com/rgsl888prabhu)
- Charles Blackmon-Luca (https://github.com/charlesbluca)
- David Wendt (https://github.com/davidwendt)
URL: #8515
Fixes #8514
String dictionary length is RLE encoded, and `rle_data_size` and `non_rle_data_size` take this into account. However, when computing chunk stream offsets, these streams were treated as non-RLE and `non_rle_data_size` was not added. This caused a discrepancy between the non-RLE stream sizes and the available space, leading to overlap between chunk streams.
Applied the `non_rle_data_size` to the offset to correct the discrepancy and added a test that uses decimal columns to increase the size of non-RLE encoded data and enable the overflow.
Author:
- Vukasin Milovanovic (https://github.com/vuule)
Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Conor Hoekstra (https://github.com/codereport)
- David Wendt (https://github.com/davidwendt)
- https://github.com/brandon-b-miller
URL: #8538
Expected behavior
The data read back should match the contents of the original DataFrame.
Environment overview (please complete the following information)