[BUG] String case modification alignment #3132
Comments
NVStrings only supports UTF-8 encoded char arrays. Case conversion is done using a lookup table that spans only the first 65K of the Unicode code-point space. I think the ʼn character mentioned in the link should work. I'm not following what the contents of the attached files are supposed to mean.
Maybe someone can explain this file? There is no support for accepting or producing UTF-16 in NVStrings.
So it appears from http://www.fileformat.info/info/unicode/char/0149/index.htm that the single UTF-8 character ʼn should be converted to two individual UTF-8 characters, ʼ and N. Converting single characters to multiple characters is not supported by NVStrings.
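For reference, Python's built-in Unicode case mapping (used here purely as a CPU-side stand-in for what Spark produces, not the NVStrings API) shows the expected one-to-two expansion:

```python
# CPython implements full Unicode case mapping (including SpecialCasing.txt),
# so it demonstrates the expansion that the GPU path currently cannot produce.
s = "\u0149"                      # ʼn LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
print(s.upper())                  # ʼN -> "\u02bc" + "N" (one code point becomes two)
print(len(s), len(s.upper()))     # 1 2

t = "\u00df"                      # ß LATIN SMALL LETTER SHARP S
print(t.upper())                  # SS (again, one code point becomes two)
```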
utf8diff.txt -- that's the utf-8 characters (0-256). The files I diff'ed weren't 100% clean, so a few extra things were included (mainly
The S and SS is the
Will case conversions that map a single character to multiple characters ever be supported? It's an issue if we're claiming to support UTF-8.
That and https://rapidsai.github.io/projects/nvstrings/en/0.9.0/unicode.html contradict each other. I'm a little confused.
The page is probably poorly worded. Essentially, for case conversion of UTF-8 characters, only the characters in the first 65K of the Unicode code-point space are supported.
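For what it's worth, case pairs also exist above that first 65K (i.e. outside the basic multilingual plane), e.g. the Deseret block; a quick check with Python's built-in mapping:

```python
# U+10428 DESERET SMALL LETTER LONG I is above 0xFFFF but still has an
# uppercase counterpart, U+10400, per UnicodeData.txt.
c = "\U00010428"
print(hex(ord(c)), hex(ord(c.upper())))   # 0x10428 0x10400
```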
Will string case conversion from single to multiple characters ever be supported? We'll work around it by documenting the limitation on the Spark side or disabling the GPU operation by default.
There is no plan to support this right now. Unfortunately there is no simple fix. These types of characters are not organized in any efficient manner, so we could only add if-checks for each individual special character, which would slow down the entire function for all characters.
Correctness trumps a lot of performance. 😄 Without a correct implementation there are two choices: either document the limitation or avoid using these methods. The former relies on users reading the docs and noticing the limitation to avoid silent data corruption. The latter kills performance, since the data needs to be pulled off the GPU, converted from columns to rows, processed on the CPU, converted back to columnar, and put back on the GPU for further processing. The performance of a correct GPU method would have to be pretty slow before that process is faster.
Typically I've seen this implemented either as a hash table or a sorted table with binary search lookup. The contents of the lookup table are known at build time, so a perfect hash table can be generated.
Agreed. This will necessarily be slower than the current implementation because, in addition to the table lookup, I assume it would be implemented in two passes. The first pass would calculate the necessary output size, and the second pass would perform the conversion. One way to mitigate the performance concern relative to the current implementation is to preserve the current function and add the slow-but-correct form as a separate method. That lets the caller decide if they need the correctness or instead know enough about the input data to leverage the fast version.
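A rough sketch of what that slow-but-correct variant could look like (plain Python rather than the actual NVStrings/CUDA code; the SPECIAL_* tables below are an illustrative subset that would in practice be generated at build time from the Unicode data files):

```python
import bisect

# Hypothetical one-to-many uppercase mappings, sorted by code point so a
# binary search works; a perfect hash is the other option mentioned above.
SPECIAL_CODEPOINTS = [0x00DF, 0x0149]        # ß, ʼn (illustrative subset only)
SPECIAL_UPPER      = ["SS", "\u02BCN"]

def upper_special(cp):
    """Return the multi-character uppercase form, or None if cp is not special."""
    i = bisect.bisect_left(SPECIAL_CODEPOINTS, cp)
    if i < len(SPECIAL_CODEPOINTS) and SPECIAL_CODEPOINTS[i] == cp:
        return SPECIAL_UPPER[i]
    return None

def to_upper(s):
    # Pass 1: compute the output size (on the GPU this would size the output buffer).
    out_len = sum(len(upper_special(ord(ch)) or ch.upper()) for ch in s)
    # Pass 2: perform the conversion into the pre-sized output.
    out = "".join(upper_special(ord(ch)) or ch.upper() for ch in s)
    assert len(out) == out_len
    return out

print(to_upper("straße und ʼn"))             # STRASSE UND ʼN
```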
Integer values for all characters that are currently inconsistent for the upper case operator
Integer values for all characters that are currently inconsistent for the lower case operator
Dredging this up. How far do we want to go with this? Sounds like the gold standard for this is a library called ICU, and they say:
That makes this considerably trickier than a single lookup table. I think we'd have to at least handle bullet 2, which I think translates loosely to locale.
unicode.org provides a couple of handy text files that specify all the possible mappings (https://unicode.org/faq/casemap_charprop.html); maybe we could leverage this. Looking at the internals more closely, maybe we could just ignore the language- and context-sensitive parts. There aren't many of them, and they are limited to Lithuanian, Turkish, and Azeri.
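A hedged sketch of how that filtering could work: SpecialCasing.txt lines have the form `code; lower; title; upper; [condition list;] # comment`, and the language- and context-sensitive rules are exactly the entries carrying a condition, so they can simply be skipped while building the table (the function name and local path here are just illustrative):

```python
def load_special_upper(path="SpecialCasing.txt"):
    """Build {code point -> uppercase string} from unicode.org's SpecialCasing.txt,
    skipping any entry with a condition (the Lithuanian/Turkish/Azeri and
    context-sensitive rules)."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.split("#", 1)[0].strip()          # drop comments
            if not line:
                continue
            fields = [x.strip() for x in line.split(";")]
            code, _lower, _title, upper = fields[:4]
            condition = fields[4] if len(fields) > 4 else ""
            if condition:                                  # language/context-sensitive
                continue
            table[int(code, 16)] = "".join(chr(int(c, 16)) for c in upper.split())
    return table
```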
My intention was to attack this in smaller steps. The current implementation has a simple lookup table that accounts for a large percentage of the Unicode code-point space, and being Unicode it is mostly locale-independent. There are some holes in the current table, which this issue highlights.
NVStrings case modification (upper and lower) does not align with Spark SQL CPU functionality. Attached are the diffs for UTF-8 characters and UTF-16 characters. The alignment is inconsistent: some case modifications work on the CPU and some work on the GPU, so neither method is a superset of the other's functionality. The GPU specifically seems to struggle with compound characters (like http://www.fileformat.info/info/unicode/char/0149/index.htm).
Recreate by running the NVStrings upper/lower functions on all UTF-8 and UTF-16 characters, then comparing to Spark SQL's upper and lower results on the same characters as the ground truth.
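A minimal repro sketch along those lines, using Python's built-in case mapping as a stand-in for the Spark SQL CPU result and assuming the legacy nvstrings Python bindings expose `to_device`/`upper`/`to_host` (adjust to the actual API):

```python
import nvstrings  # legacy RAPIDS strings bindings; API names assumed, see note above

# Every Unicode code point as a single-character string (surrogates excluded).
chars = [chr(cp) for cp in range(0x110000) if not 0xD800 <= cp <= 0xDFFF]

cpu_upper = [c.upper() for c in chars]                      # CPU "ground truth" stand-in
gpu_upper = nvstrings.to_device(chars).upper().to_host()    # GPU result

diffs = [(hex(ord(c)), cpu, gpu)
         for c, cpu, gpu in zip(chars, cpu_upper, gpu_upper)
         if cpu != gpu]
print(len(diffs), "upper-case differences")
```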
Can we clarify whether this discrepancy is an issue for the Spark use case? Is the one UTF-8 difference more concerning than the few hundred UTF-16 differences?
Tagging to start the discussion @sameerz @jlowe @revans2 @harrism
utf8diff.txt
utf16diff.txt