[BUG] String case modification alignment #3132

Closed
rwlee opened this issue Oct 16, 2019 · 12 comments · Fixed by #4520
Labels: bug (Something isn't working) · Spark (Functionality that helps Spark RAPIDS) · strings (strings issues, C++ and Python)

@rwlee
Contributor

rwlee commented Oct 16, 2019

NVStrings case modification (upper and lower) does not align with Spark SQL CPU functionality. Attached are the diffs for UTF-8 and UTF-16 characters. The alignment is inconsistent: some case modifications work on the CPU and some work on the GPU, so neither method is a superset of the other's functionality. The GPU specifically seems to struggle with compound characters (like http://www.fileformat.info/info/unicode/char/0149/index.htm).

Recreate by running the NVStrings upper/lower functions on all UTF-8 and UTF-16 characters, then compare against Spark SQL's upper and lower results for the same characters as the ground truth.

Can we clarify whether this discrepancy is an issue for the Spark use case? Is the single UTF-8 difference more concerning than the few hundred UTF-16 differences?

Tagging to start the discussion @sameerz @jlowe @revans2 @harrism

utf8diff.txt
utf16diff.txt

rwlee added the bug, Needs Triage, Spark, and strings labels on Oct 16, 2019
@davidwendt
Contributor

davidwendt commented Oct 16, 2019

NVStrings only supports UTF-8 encoded char arrays. Case conversion is done using a lookup table that spans only the first 65K code points of the Unicode space. I think the ʼn character mentioned in the link should work.

I'm not following what the contents of the attached files are supposed to mean.
Here is utf8diff.txt in its entirety:

3c3
< GPU:
---
> CPU:
239c239
< [S],
---
> [SS],
271,272c271
< [Ÿ])
< (SparkQueryCompareTestSuite.scala:370)^[[0m
---
> [Ÿ])^[[0m

Maybe someone can explain this file?
Are there problems with other UTF-8 characters shown in this file?

There is no support for accepting or producing UTF-16 in NVStrings.

A simple example using the nvstrings Python interface could be helpful. For example, I ran this on my machine:

>>> import nvstrings
>>> s = nvstrings.to_device(['ʼn','ʼN'])
>>> print(s)
['ʼn', 'ʼN']
>>> print(s.lower())
['ʼn', 'ʼn']
>>> print(s.upper())
['ʼ', 'ʼN']

There is a print() method on the NVStrings C++ class as well if you don't want to use Python.

@davidwendt
Contributor

So it appears from http://www.fileformat.info/info/unicode/char/0149/index.htm that the single UTF-8 character ʼn should be converted to two individual UTF-8 characters, ʼ and N.

Converting single characters to multiple characters is not supported by NVStrings.
This is documented here: https://rapidsai.github.io/projects/nvstrings/en/0.9.0/unicode.html

@rwlee
Contributor Author

rwlee commented Oct 16, 2019

utf8diff.txt -- that covers the UTF-8 characters (0-256). The files I diffed weren't 100% clean, so a few extra things were included (mainly [Ÿ] showing up as a diff even though it actually matches). The baseline file is the GPU output; the lines prefixed with < are what's present in the GPU file.

< GPU:
---
> CPU:
239c239
< [S],
---
> [SS],
271,272c271
< [Ÿ])
< (SparkQueryCompareTestSuite.scala:370)^[[0m
---
> [Ÿ])^[[0m

The S and SS come from the ß (U+00DF) character: the GPU converts it to S while the CPU converts it to SS -- http://www.fileformat.info/info/unicode/char/00df/index.htm. Since this is a 1 --> 2 letter conversion, it is not supported per the doc you linked.
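For reference, Python's built-in str methods implement the full one-to-many Unicode mappings, so they show the behavior the CPU side expects:

>>> 'ß'.upper()   # U+00DF expands to two characters
'SS'
>>> 'ʼn'.upper()   # U+0149, the character from the link above
'ʼN'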

@rwlee
Contributor Author

rwlee commented Oct 16, 2019

Will case conversion that maps one character to multiple characters ever be supported? It's an issue if we're claiming to support UTF-8.

There is no support for accepting or producing UTF-16 in NVStrings.

and https://rapidsai.github.io/projects/nvstrings/en/0.9.0/unicode.html contradict each other. I'm a little confused.

@davidwendt
Contributor

The page is probably poorly worded. Essentially, for case conversion of UTF-8 characters, only the characters in the first 65K of the Unicode space are supported.
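As an aside, case mappings do exist above that 65K boundary as well, e.g. in the Deseret block; an illustration using Python's str methods (not NVStrings):

>>> chr(0x10400)          # DESERET CAPITAL LETTER LONG I
'𐐀'
>>> chr(0x10400).lower() == chr(0x10428)
True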

@rwlee
Contributor Author

rwlee commented Oct 16, 2019

Will string case conversion from single to multiple characters ever be supported? If not, we'll work around it by documenting the limitation on the Spark side or disabling the GPU operation by default.

@davidwendt
Contributor

There is no plan to support this right now. Unfortunately, there is no simple fix. These types of characters are not organized in any efficient manner, so we could only add if-checks for each individual special character, which would slow down the entire function for all characters.

@jlowe
Member

jlowe commented Oct 18, 2019

Correctness trumps a lot of performance. 😄

Without a correct implementation there are two choices: either document the limitation or avoid using these methods. The former relies on users reading the docs and noticing the limitation to avoid silent data corruption. The latter kills performance, since the data needs to be pulled off the GPU, converted from columns to rows, processed on the CPU, converted back to columnar, and put back on the GPU for further processing. The performance of a correct GPU method would have to be pretty slow before that process is faster.

we can only add if-checks for each individual special character

Typically I've seen this implemented either as a hash table or a sorted table with binary search lookup. The contents of the lookup table are known at build time, so a perfect hash table can be generated.
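A minimal Python sketch of that sorted-table idea, with a toy table of only a few illustrative entries (a real implementation would be a device-side table in C++/CUDA built from the full Unicode data):

import bisect

# Sorted table of (code point, replacement) pairs for the special
# one-to-many uppercase mappings; contents are known at build time.
SPECIAL_UPPER = [
    (0x00DF, 'SS'),        # ß -> SS
    (0x0149, '\u02BCN'),   # ʼn -> ʼN
    (0xFB00, 'FF'),        # ﬀ -> FF
]
KEYS = [cp for cp, _ in SPECIAL_UPPER]

def special_upper(cp):
    """Return the multi-character uppercase form, or None if not special."""
    i = bisect.bisect_left(KEYS, cp)
    if i < len(KEYS) and KEYS[i] == cp:
        return SPECIAL_UPPER[i][1]
    return None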

which would slow down the entire function for all characters

Agreed. This will necessarily be slower than the current implementation because, in addition to the table lookup, I assume it would be implemented in two passes: the first pass calculates the necessary output size, and the second pass performs the conversion. One way to mitigate the performance concern with respect to the current implementation is to preserve the current function and add the slow-but-correct form as a separate method. That lets the caller decide whether they need the correctness or instead know enough about the input data to leverage the fast version.
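Sketching that two-pass shape in Python over a single string, assuming a special_upper lookup like the one above (the real kernels would operate over a whole column on the device):

def upper_two_pass(s):
    def mapped(ch):
        # Special one-to-many mapping first, then the simple mapping.
        return special_upper(ord(ch)) or ch.upper()
    # Pass 1: compute the required output size before allocating.
    out_size = sum(len(mapped(ch)) for ch in s)
    out = [None] * out_size   # stand-in for a preallocated buffer
    # Pass 2: perform the conversion into the preallocated buffer.
    pos = 0
    for ch in s:
        for c in mapped(ch):
            out[pos] = c
            pos += 1
    return ''.join(out)

# e.g. upper_two_pass('straße') -> 'STRASSE'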

kkraus14 removed the Needs Triage (Need team to review and classify) label on Oct 18, 2019
@rwlee
Contributor Author

rwlee commented Nov 14, 2019

Integer code-point values for all characters that are currently inconsistent for the uppercase operator:

223, 329, 453, 456, 459, 496, 498, 604, 609, 618, 620, 647, 669, 670, 912, 944, 1011, 1321, 1323, 1325, 1327, 1415, 5112, 5113, 5114, 5115, 5116, 5117, 7296, 7297, 7298, 7299, 7300, 7301, 7302, 7303, 7304, 7830, 7831, 7832, 7833, 7834, 8016, 8018, 8020, 8022, 8064, 8065, 8066, 8067, 8068, 8069, 8070, 8071, 8072, 8073, 8074, 8075, 8076, 8077, 8078, 8079, 8080, 8081, 8082, 8083, 8084, 8085, 8086, 8087, 8088, 8089, 8090, 8091, 8092, 8093, 8094, 8095, 8096, 8097, 8098, 8099, 8100, 8101, 8102, 8103, 8104, 8105, 8106, 8107, 8108, 8109, 8110, 8111, 8114, 8115, 8116, 8118, 8119, 8124, 8130, 8131, 8132, 8134, 8135, 8140, 8146, 8147, 8150, 8151, 8162, 8163, 8164, 8166, 8167, 8178, 8179, 8180, 8182, 8183, 8188, 42649, 42651, 42903, 42905, 42907, 42909, 42911, 42933, 42935, 43859, 43888, 43889, 43890, 43891, 43892, 43893, 43894, 43895, 43896, 43897, 43898, 43899, 43900, 43901, 43902, 43903, 43904, 43905, 43906, 43907, 43908, 43909, 43910, 43911, 43912, 43913, 43914, 43915, 43916, 43917, 43918, 43919, 43920, 43921, 43922, 43923, 43924, 43925, 43926, 43927, 43928, 43929, 43930, 43931, 43932, 43933, 43934, 43935, 43936, 43937, 43938, 43939, 43940, 43941, 43942, 43943, 43944, 43945, 43946, 43947, 43948, 43949, 43950, 43951, 43952, 43953, 43954, 43955, 43956, 43957, 43958, 43959, 43960, 43961, 43962, 43963, 43964, 43965, 43966, 43967, 64256, 64257, 64258, 64259, 64260, 64261, 64262, 64275, 64276, 64277, 64278, 64279

Integer code-point values for all characters that are currently inconsistent for the lowercase operator:

304, 453, 456, 459, 498, 895, 1320, 1322, 1324, 1326, 5024, 5025, 5026, 5027, 5028, 5029, 5030, 5031, 5032, 5033, 5034, 5035, 5036, 5037, 5038, 5039, 5040, 5041, 5042, 5043, 5044, 5045, 5046, 5047, 5048, 5049, 5050, 5051, 5052, 5053, 5054, 5055, 5056, 5057, 5058, 5059, 5060, 5061, 5062, 5063, 5064, 5065, 5066, 5067, 5068, 5069, 5070, 5071, 5072, 5073, 5074, 5075, 5076, 5077, 5078, 5079, 5080, 5081, 5082, 5083, 5084, 5085, 5086, 5087, 5088, 5089, 5090, 5091, 5092, 5093, 5094, 5095, 5096, 5097, 5098, 5099, 5100, 5101, 5102, 5103, 5104, 5105, 5106, 5107, 5108, 5109, 8072, 8073, 8074, 8075, 8076, 8077, 8078, 8079, 8088, 8089, 8090, 8091, 8092, 8093, 8094, 8095, 8104, 8105, 8106, 8107, 8108, 8109, 8110, 8111, 8124, 8140, 8188, 42648, 42650, 42902, 42904, 42906, 42908, 42910, 42923, 42924, 42925, 42926, 42928, 42929, 42930, 42931, 42932, 42934

@nvdbaranec
Contributor

Dredging this up. How far do we want to go with this? It sounds like the gold standard here is a library called ICU, whose docs say that case mapping:

  • can change the number of code points and/or code units of a string,
  • is language-sensitive (results may differ depending on the language), and
  • is context-sensitive (a character in the input string may map differently depending on the surrounding characters).

That makes this considerably trickier than a single lookup table. I think we'd have to at least handle bullet 2, which I think translates loosely to locale.
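Python's str.lower happens to implement the final-sigma context rule, which makes bullet 3 easy to demonstrate:

>>> 'Σ'.lower()          # capital sigma in isolation
'σ'
>>> 'ΟΔΥΣΣΕΥΣ'.lower()   # the trailing sigma becomes final sigma ς
'οδυσσευς'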

@nvdbaranec
Contributor

nvdbaranec commented Feb 26, 2020

unicode.org provides a couple of handy text files that specify all the possible mappings. Maybe we could leverage these.

https://unicode.org/faq/casemap_charprop.html

Looking at the internals more closely, maybe we could just ignore the language- and context-sensitive parts. There aren't many of them, and they are limited to Lithuanian, Turkish, and Azeri.
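A rough sketch of harvesting just the unconditional one-to-many uppercase mappings from SpecialCasing.txt; the field layout (code; lower; title; upper; [condition;] # comment) is from the file's own header, and this parser is illustrative, not production code:

def load_special_upper(path='SpecialCasing.txt'):
    table = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.split('#', 1)[0].strip()   # drop comments
            if not line:
                continue
            fields = [x.strip() for x in line.split(';')]
            if len(fields) > 4 and fields[4]:
                continue   # skip language/context-sensitive entries
            upper = ''.join(chr(int(cp, 16)) for cp in fields[3].split())
            if len(upper) > 1:   # keep only one-to-many expansions
                table[int(fields[0], 16)] = upper
    return dict(sorted(table.items()))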

@davidwendt
Contributor

My intention was to attack this in smaller steps. The current implementation has a simple lookup table that accounts for a large percentage of the Unicode code-point space, and being Unicode, it is mostly locale-independent. There are some holes in the current table, which this issue highlights.
So I think a good approach is to handle some of these holes and stay locale-independent as long as possible, while continuing to find better ways to handle the special cases.
Meanwhile, there is also the work on libcu++, which may include functions we can use for this in the future.
