An error in #7324 caused the ORC writer to prefer direct encoding over dictionary encoding for string columns. PR #7737 fixes this issue but uncovers a regression that had been hidden because the wrong encoding type was in use.
From local testing it looks like the issue originates in #7676.
Sample test output:
first difference: lhs[7] = Funday, rhs[7] = ��nday
first difference: lhs[0] = a long string to make sure overflow affects the output, rhs[0] = a long��tring to make sure overflow affects the output
FWIW, these strings should be the last entries in the dictionary.
In PR #7676, while building stripe dictionaries, the length of the current string was always set to 0 when incrementing the dictionary character count of a StripeDictionary. This led to corrupted strings whenever dictionary encoding was used, as noted in issue #7741. This PR fixes that.
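To illustrate why a zero length in the character count can corrupt dictionary strings, here is a minimal Python sketch. It is not the cuDF code (which is CUDA/C++); `pack_dictionary`, `unpack_dictionary`, and the `buggy` flag are hypothetical names, and the layout (one byte blob plus per-entry offsets derived from a running character count) is an assumed simplification of how a string dictionary might be stored.

```python
def pack_dictionary(entries, buggy=False):
    """Pack byte strings into one blob; offsets come from a running char count.

    Hypothetical model: with buggy=True the current string's length is taken
    as 0 when the character count is incremented, mimicking the described bug.
    """
    offsets = []
    char_count = 0
    for entry in entries:
        offsets.append(char_count)
        cur_len = 0 if buggy else len(entry)
        char_count += cur_len  # with cur_len == 0, the next offset never advances

    blob = bytearray()
    for entry, off in zip(entries, offsets):
        end = off + len(entry)
        if len(blob) < end:
            blob.extend(b"\0" * (end - len(blob)))
        blob[off:end] = entry  # overlapping offsets overwrite earlier entries
    return bytes(blob), offsets


def unpack_dictionary(blob, offsets, lengths):
    """Read each entry back from its offset using its true length."""
    return [blob[o:o + n] for o, n in zip(offsets, lengths)]


entries = [b"Monday", b"Funday", b"a long string"]
lengths = [len(e) for e in entries]

good_blob, good_offsets = pack_dictionary(entries)
bad_blob, bad_offsets = pack_dictionary(entries, buggy=True)

good = unpack_dictionary(good_blob, good_offsets, lengths)
bad = unpack_dictionary(bad_blob, bad_offsets, lengths)
```

In this model the correct path round-trips every entry, while the buggy path gives every entry the same offset, so entries overwrite each other and read back wrong. The exact corruption pattern in the real writer may differ from this sketch.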
Fixes #7741
Authors:
- Kumar Aatish (@kaatish)
Approvers:
- Vukasin Milovanovic (@vuule)
- Nghia Truong (@ttnghia)
URL: #7744