[BUG] ORC string dictionary encoding corrupts some characters in the column #7741

vuule · 2021-03-26T19:19:19Z

An error in #7324 caused the ORC writer to prefer direct encoding over dictionary encoding for string columns. PR #7737 fixes this issue but uncovers a regression that had been hidden by the wrong encoding type.
From local testing, the regression appears to originate in #7676.
Sample test output:

first difference: lhs[7] = Funday, rhs[7] = ��nday
first difference: lhs[0] = a long string to make sure overflow affects the output, rhs[0] = a long��tring to make sure overflow affects the output

FWIW, these strings should be the last entries in the dictionary.


@kaatish

In PR #7676 the length of the current string being referred to while building stripe dictionaries was always set to 0 while incrementing the dictionary character count of a StripeDictionary. This led to corrupted strings when the dictionary encoding was used as noted in issue #7741. This has been fixed in this PR.

Fixes #7741

Authors:
- Kumar Aatish (@kaatish)

Approvers:
- Vukasin Milovanovic (@vuule)
- Nghia Truong (@ttnghia)

URL: #7744
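To make the described failure mode concrete, here is a minimal Python sketch of a character count being accumulated over the unique strings of a toy dictionary. It is not the cuDF implementation: the function `dictionary_char_count` and its `use_real_length` flag are hypothetical names invented for illustration, and the real code runs on the GPU with different data structures. The sketch only shows how adding a length of 0 for every string leaves the dictionary's character count wrong.

```python
def dictionary_char_count(strings, use_real_length=True):
    """Sum the UTF-8 byte lengths of the unique strings in a toy dictionary.

    `use_real_length=False` is a hypothetical switch that mimics the bug
    described above: every string contributes a length of 0.
    """
    seen = set()
    char_count = 0
    for s in strings:
        if s in seen:
            continue  # only the first occurrence adds a dictionary entry
        seen.add(s)
        length = len(s.encode("utf-8")) if use_real_length else 0
        char_count += length
    return char_count


column = ["Monday", "Funday", "Monday", "Funday"]
correct = dictionary_char_count(column)
buggy = dictionary_char_count(column, use_real_length=False)
```

In this toy setup `correct` counts the bytes of both unique strings, while `buggy` stays at zero. Any downstream step that relies on the character count to size or locate dictionary data would then work from a wrong value.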

vuule added the bug and cuIO labels Mar 26, 2021

vuule assigned kaatish Mar 26, 2021

kaatish mentioned this issue Mar 26, 2021

Fix string length in stripe dictionary building #7744

Merged

rapids-bot bot closed this as completed in #7744 Mar 27, 2021
