[BUG] String case modification alignment #3132
Comments
NVStrings only supports UTF-8 encoded char arrays. Case conversion is done using a lookup table that spans only the first 65K of the Unicode code-point space. I think the ʼn character mentioned in the link should work. I'm not following what the contents of the attached files are supposed to mean.
Maybe someone can explain this file? There is no support for accepting or producing UTF-16 in NVStrings.
So it appears from http://www.fileformat.info/info/unicode/char/0149/index.htm that the single UTF-8 character ʼn should be converted to two individual UTF-8 characters, ʼ and N. Converting single characters to multiple characters is not supported by NVStrings.
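For reference, Python's built-in Unicode case mapping (used here purely as a CPU-side stand-in for what Spark produces, not the NVStrings API) shows the expected one-to-two expansion:

```python
# CPython implements full Unicode case mapping (including SpecialCasing.txt),
# so it demonstrates the expansion that the GPU path currently cannot produce.
s = "\u0149"                      # ʼn LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
print(s.upper())                  # ʼN -> "\u02bc" + "N" (one code point becomes two)
print(len(s), len(s.upper()))     # 1 2

t = "\u00df"                      # ß LATIN SMALL LETTER SHARP S
print(t.upper())                  # SS (again, one code point becomes two)
```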
utf8diff.txt -- that's the utf-8 characters (0-256). The files I diff'ed weren't 100% clean, so a few extra things were included (mainly
The S and SS is the
Will case conversions that map a single character to multiple characters ever be supported? It's an issue if we're claiming to support UTF-8.
That and https://rapidsai.github.io/projects/nvstrings/en/0.9.0/unicode.html contradict each other. I'm a little confused.
The page is probably poorly worded. Essentially, for case conversion of UTF-8 characters, only the characters in the first 65K of the Unicode code-point space are supported.
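For what it's worth, case pairs also exist above that first 65K (i.e. outside the basic multilingual plane), e.g. the Deseret block; a quick check with Python's built-in mapping:

```python
# U+10428 DESERET SMALL LETTER LONG I is above 0xFFFF but still has an
# uppercase counterpart, U+10400, per UnicodeData.txt.
c = "\U00010428"
print(hex(ord(c)), hex(ord(c.upper())))   # 0x10428 0x10400
```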
Will string case conversion from single to multiple characters ever be supported? We'll work around it by documenting the limitation on the Spark side or disabling the GPU operation by default.
There is no plan to support this right now. Unfortunately there is no simple fix. These types of characters are not organized in any efficient manner, so we could only add if-checks for each individual special character, which would slow down the entire function for all characters.
Correctness trumps a lot of performance. 😄 Without a correct implementation there are two choices: either document the limitation or avoid using these methods. The former relies on users reading the docs and noticing the limitation to avoid silent data corruption. The latter kills performance, since the data needs to be pulled off the GPU, converted from columns to rows, processed on the CPU, converted back to columnar, and put back on the GPU for further processing. The performance of a correct GPU method would have to be pretty slow before that process is faster.
Typically I've seen this implemented either as a hash table or a sorted table with binary search lookup. The contents of the lookup table are known at build time, so a perfect hash table can be generated.
Agreed. This will necessarily be slower than the current implementation because, in addition to the table lookup, I assume it would be implemented in two passes. The first pass would calculate the necessary output size, and the second pass would perform the conversion. One way to mitigate the performance concern relative to the current implementation is to preserve the current function and add the slow-but-correct form as a separate method. That lets the caller decide if they need the correctness or instead know enough about the input data to leverage the fast version.
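A rough sketch of what that slow-but-correct variant could look like (plain Python rather than the actual NVStrings/CUDA code; the SPECIAL_* tables below are an illustrative subset that would in practice be generated at build time from the Unicode data files):

```python
import bisect

# Hypothetical one-to-many uppercase mappings, sorted by code point so a
# binary search works; a perfect hash is the other option mentioned above.
SPECIAL_CODEPOINTS = [0x00DF, 0x0149]        # ß, ʼn (illustrative subset only)
SPECIAL_UPPER      = ["SS", "\u02BCN"]

def upper_special(cp):
    """Return the multi-character uppercase form, or None if cp is not special."""
    i = bisect.bisect_left(SPECIAL_CODEPOINTS, cp)
    if i < len(SPECIAL_CODEPOINTS) and SPECIAL_CODEPOINTS[i] == cp:
        return SPECIAL_UPPER[i]
    return None

def to_upper(s):
    # Pass 1: compute the output size (on the GPU this would size the output buffer).
    out_len = sum(len(upper_special(ord(ch)) or ch.upper()) for ch in s)
    # Pass 2: perform the conversion into the pre-sized output.
    out = "".join(upper_special(ord(ch)) or ch.upper() for ch in s)
    assert len(out) == out_len
    return out

print(to_upper("straße und ʼn"))             # STRASSE UND ʼN
```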
Integer values for all characters that are currently inconsistent for the upper case operator
Integer values for all characters that are currently inconsistent for the lower case operator
Dredging this up. How far do we want to go with this? Sounds like the gold standard for this is a library called ICU, and they say:
That makes this considerably trickier than a single lookup table. I think we'd have to at least handle bullet 2, which I think translates loosely to locale.
unicode.org provides a couple of handy text files that specify all the possible mappings (https://unicode.org/faq/casemap_charprop.html); maybe we could leverage this. Looking at the internals more closely, maybe we could just ignore the language- and context-sensitive parts. There aren't many of them, and they are limited to Lithuanian, Turkish, and Azeri.
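A hedged sketch of how that filtering could work: SpecialCasing.txt lines have the form `code; lower; title; upper; [condition list;] # comment`, and the language- and context-sensitive rules are exactly the entries carrying a condition, so they can simply be skipped while building the table (the function name and local path here are just illustrative):

```python
def load_special_upper(path="SpecialCasing.txt"):
    """Build {code point -> uppercase string} from unicode.org's SpecialCasing.txt,
    skipping any entry with a condition (the Lithuanian/Turkish/Azeri and
    context-sensitive rules)."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.split("#", 1)[0].strip()          # drop comments
            if not line:
                continue
            fields = [x.strip() for x in line.split(";")]
            code, _lower, _title, upper = fields[:4]
            condition = fields[4] if len(fields) > 4 else ""
            if condition:                                  # language/context-sensitive
                continue
            table[int(code, 16)] = "".join(chr(int(c, 16)) for c in upper.split())
    return table
```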
My intention was to attack this in smaller steps. The current implementation has a simple lookup table that accounts for a large percentage of the Unicode code-point space, and being Unicode it is mostly locale-independent. There are some holes in the current table, which this issue highlights.
NVStrings case modification (upper and lower) does not align with Spark SQL CPU functionality. Attached are the diffs for UTF-8 characters and UTF-16 characters. The alignment is inconsistent: some case modifications work on the CPU and some work on the GPU, so neither method is a superset of the other's functionality. The GPU specifically seems to struggle with compound characters (like http://www.fileformat.info/info/unicode/char/0149/index.htm).
Recreate by running the NVStrings upper/lower functions on all UTF-8 and UTF-16 characters, then comparing to Spark SQL's upper and lower results on the same characters as the ground truth.
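A minimal repro sketch along those lines, using Python's built-in case mapping as a stand-in for the Spark SQL CPU result and assuming the legacy nvstrings Python bindings expose `to_device`/`upper`/`to_host` (adjust to the actual API):

```python
import nvstrings  # legacy RAPIDS strings bindings; API names assumed, see note above

# Every Unicode code point as a single-character string (surrogates excluded).
chars = [chr(cp) for cp in range(0x110000) if not 0xD800 <= cp <= 0xDFFF]

cpu_upper = [c.upper() for c in chars]                      # CPU "ground truth" stand-in
gpu_upper = nvstrings.to_device(chars).upper().to_host()    # GPU result

diffs = [(hex(ord(c)), cpu, gpu)
         for c, cpu, gpu in zip(chars, cpu_upper, gpu_upper)
         if cpu != gpu]
print(len(diffs), "upper-case differences")
```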
Can we clarify whether this discrepancy is an issue for the Spark use case? Is the one UTF-8 difference more concerning than the few hundred UTF-16 differences?
Tagging to start the discussion @sameerz @jlowe @revans2 @harrism
utf8diff.txt
utf16diff.txt