[BUG] ORC string dictionary encoding corrupts some characters in the column #7741

Closed
vuule opened this issue Mar 26, 2021 · 0 comments · Fixed by #7744
Labels: bug (Something isn't working), cuIO (cuIO issue)

Comments

Contributor

vuule commented Mar 26, 2021

An error in #7324 caused the ORC writer to prefer direct encoding over dictionary encoding for string columns. PR #7737 fixes that, but it uncovers a regression that had been hidden because the wrong encoding type was being used.
From local testing, the regression appears to originate in #7676.
Sample test output:

first difference: lhs[7] = Funday, rhs[7] = ��nday
first difference: lhs[0] = a long string to make sure overflow affects the output, rhs[0] = a long��tring to make sure overflow affects the output

FWIW, these strings should be the last entries in the dictionary.
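
For readers unfamiliar with the two encodings mentioned above, here is a minimal Python sketch of the general idea. It is a toy model, not the cuDF ORC writer and not the actual ORC stream layout. The function names (`encode_direct`, `encode_dictionary`, `decode_dictionary`) are made up for illustration, and dictionary entries are kept in first-seen order purely for simplicity.

```python
def encode_direct(values):
    """Toy direct encoding: every row's bytes are stored, plus one length per row."""
    encoded = [v.encode("utf-8") for v in values]
    data = b"".join(encoded)
    lengths = [len(e) for e in encoded]
    return data, lengths


def encode_dictionary(values):
    """Toy dictionary encoding: each distinct string is stored once,
    and every row stores only an index into the dictionary."""
    dictionary = []   # distinct strings, first-seen order (an assumption of this sketch)
    index_of = {}     # string -> position in `dictionary`
    indices = []      # one dictionary index per row
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    encoded = [v.encode("utf-8") for v in dictionary]
    data = b"".join(encoded)
    lengths = [len(e) for e in encoded]
    return data, lengths, indices


def decode_dictionary(data, lengths, indices):
    """Rebuild the rows: slice the dictionary bytes by length, then look up each index."""
    entries, pos = [], 0
    for n in lengths:
        entries.append(data[pos:pos + n].decode("utf-8"))
        pos += n
    return [entries[i] for i in indices]
```

In this model, decoding depends on the dictionary lengths and byte positions being right. A wrong length or character count on the writer side would shift or truncate the slices, which is consistent with the kind of damage shown in the test output, although the thread does not spell out the exact mechanism.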

vuule added the bug (Something isn't working) and cuIO (cuIO issue) labels Mar 26, 2021
rapids-bot bot pushed a commit that referenced this issue Mar 27, 2021
In PR #7676, while building stripe dictionaries, the length of the current string was always set to 0 when incrementing the dictionary character count of a StripeDictionary. This led to corrupted strings when dictionary encoding was used, as noted in issue #7741. This PR fixes that.

Fixes #7741

Authors:
  - Kumar Aatish (@kaatish)

Approvers:
  - Vukasin Milovanovic (@vuule)
  - Nghia Truong (@ttnghia)

URL: #7744
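
The bug described in the commit message can be pictured with a small Python sketch. This is a simplified illustration under stated assumptions, not the cuDF code: `dictionary_char_count` and its `buggy` flag are hypothetical names, and how the real writer consumes the character count is not shown in this thread.

```python
def dictionary_char_count(unique_strings, buggy=False):
    """Sum the byte lengths of a stripe dictionary's entries.

    With buggy=True the per-string length is always taken as 0,
    mimicking the behavior the commit message attributes to PR #7676.
    """
    total = 0
    for s in unique_strings:
        # Hypothetical stand-in for "length of the current string".
        length = 0 if buggy else len(s.encode("utf-8"))
        total += length
    return total


entries = ["Monday", "Funday"]
correct_count = dictionary_char_count(entries)             # counts every byte
buggy_count = dictionary_char_count(entries, buggy=True)   # every length seen as 0
```

With every length taken as 0, the accumulated character count no longer matches the bytes the dictionary actually holds. Any later step that relies on that count would be working from a wrong total.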