Match fails for Turkish Locale due to Turkish "i" problem #33

mdahamiwal · 2017-08-18T11:09:45Z

It's a common problem in programming when comparing strings in Turkish, e.g.:

The Turkish language has four i's once case is counted:
"ı" (dotless lowercase) & "I" (dotless uppercase)
"i" (dotted lowercase) & "İ" (dotted uppercase)

The library internally uses toLowerCase() and toUpperCase(). This results in a mismatch if the query
or the candidates contain a dotless "I", because
"I".toLowerCase() = "i"

What we need is
"I".toLocaleLowerCase() = "ı"

Fix: Use toLocaleLowerCase() to ensure that the Turkish locale is honored during lowercase/uppercase conversions.
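
The mismatch described above can be reproduced in a few lines. This is a minimal sketch, not the library's code. It passes the "tr" locale tag explicitly (an assumption made here so the result does not depend on the machine's settings), whereas toLocaleLowerCase() with no argument uses the host's default locale.

```javascript
// Locale-independent lowercasing maps dotless uppercase "I" to dotted "i".
const plain = "I".toLowerCase();

// With the Turkish locale tag, "I" maps to dotless "ı" (U+0131).
const turkish = "I".toLocaleLowerCase("tr");

// A query typed as "ı" therefore only equals the Turkish-lowercased form.
const query = "ı";
console.log(plain === query);   // false
console.log(turkish === query); // true
```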

jeancroy · 2017-08-18T14:08:58Z

Question
Will a Turkish user never try to use "i" (dotted lowercase) as a lazy selector for "I" (dotless uppercase)?

I ask because (at least in the case of Atom) the library is focused on programming languages & paths.
Someone with a Turkish locale may still need to deal with an InstrumentationManager class & the like.
I'm a bit hesitant to have a fix that would break the English use case.

In case of doubt, I prefer to match slightly too often. In this case lowercase "i" could match both uppercase "I" and "İ". That would support mixed-language use.
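
One way to picture the "match slightly too often" idea is to fold all four i variants to a single letter before comparing. The sketch below is only an illustration: foldI and lenientIncludes are hypothetical names, and plain substring matching stands in for the library's fuzzy scoring.

```javascript
// Hypothetical helper: collapse "I", "İ", "ı" and "i" to plain "i".
function foldI(text) {
  return text
    .toLowerCase()            // "I" -> "i"; "İ" -> "i" + U+0307 (combining dot)
    .replace(/i\u0307/g, "i") // drop the combining dot left over from "İ"
    .replace(/\u0131/g, "i"); // dotless "ı" -> "i"
}

// Hypothetical matcher: substring test on the folded strings.
function lenientIncludes(candidate, query) {
  return foldI(candidate).includes(foldI(query));
}

console.log(lenientIncludes("InstrumentationManager", "inst")); // true
console.log(lenientIncludes("IŞIK", "ışık"));                   // true
console.log(lenientIncludes("İstanbul", "istanbul"));           // true
```

The trade-off is that a Turkish user who types "ı" also gets candidates containing "i", which is the over-matching this comment accepts.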

I would be OK with having a user-defined "lowercaseFunc(x) => x.toLocaleLowerCase()" and a similar "uppercaseFunc" if you are certain your users won't mix languages.
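
A user-supplied lowercase function could be wired in roughly as follows. This is a sketch under assumptions: makeMatcher is a hypothetical name, the matching is a plain substring test rather than the library's scoring, and the "tr" tag is passed explicitly so the example behaves the same on any machine.

```javascript
// Hypothetical factory: the caller decides how strings are lowercased.
function makeMatcher(lowercaseFunc = (s) => s.toLowerCase()) {
  return (candidate, query) =>
    lowercaseFunc(candidate).includes(lowercaseFunc(query));
}

const defaultMatch = makeMatcher();
const turkishMatch = makeMatcher((s) => s.toLocaleLowerCase("tr"));

// English identifier: matches by default, but not with Turkish casing,
// because "I" becomes dotless "ı" under the Turkish mapping.
console.log(defaultMatch("InstrumentationManager", "inst")); // true
console.log(turkishMatch("InstrumentationManager", "inst")); // false

// Turkish word: matches only with Turkish casing.
console.log(defaultMatch("Işık", "ışık")); // false
console.log(turkishMatch("Işık", "ışık")); // true
```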

mdahamiwal · 2017-08-19T05:22:11Z

Nope, if the current locale is Turkish, the two i's are considered different. So the user won't expect "i" to match "I". If a user is looking for results matching "I", they would use locale-specific characters. Otherwise, there is no way to handle these scenarios well in code. I can re-check with internal testers on my team, but I'm pretty sure of the behavior. Thanks

jeancroy · 2017-08-19T14:01:25Z

Basically, the question is whether Turkish users have different expectations for Turkish text and English text.
It's very possible your suggestion is the way to go. If so, I'll probably merge and cut a version soon.

mdahamiwal · 2017-08-21T08:44:52Z

I confirmed with an actual Turkish user, and he mentioned that you should only take care of case insensitive match, as we can never know if search term is Turkish or English. So, as long as we are using toLocaleLowerCase(), we should be covered.

jeancroy · 2017-08-21T13:49:48Z

> you should only take care of case insensitive match, as we can never know if search term is Turkish or English

I'm not sure I understand that sentence. toLocaleLowerCase will both enable the case-insensitive Turkish matches i<>İ, ı<>I and break the case-insensitive English one i<>I.

Do you mean case-sensitive instead? (If one never knows the language, one does not apply a language-specific transform. I think this works out of the box now.)

Are you OK with this being an option?
Something like: fuzzaldrin.filter(candidates, query, {localeCase: true})

I feel like the reasonable thing to do is to let more users interact with it and see if they like it.
I may have locale enabled by default. Do you know if it's much slower? Or is it all a pre-computed table with similar speed?
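
The opt-in proposed above might look like the sketch below. Only the localeCase option name comes from this thread. The filter function here is a hypothetical stand-in that does plain substring matching instead of fuzzy scoring, and the extra locale option is an assumption added so the example does not depend on the host's settings.

```javascript
// Hypothetical stand-in for the library's filter (substring match only).
function filter(candidates, query, options = {}) {
  // `localeCase` is the option proposed above; `locale` is an assumption.
  const { localeCase = false, locale } = options;
  const lower = localeCase
    ? (s) => s.toLocaleLowerCase(locale) // undefined locale = host default
    : (s) => s.toLowerCase();
  const q = lower(query);
  return candidates.filter((c) => lower(c).includes(q));
}

// Default behavior is unchanged for English identifiers.
console.log(filter(["Index.js", "README.md"], "index")); // ["Index.js"]

// Opting in with Turkish casing distinguishes "ı" from "i" ...
const tr = { localeCase: true, locale: "tr" };
console.log(filter(["Işık.txt", "Isik.txt"], "ış", tr)); // ["Işık.txt"]

// ... and, as discussed above, breaks the English i<>I match.
console.log(filter(["Index.js"], "index", tr)); // []
```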

mdahamiwal · 2017-08-22T06:02:44Z

> that you should only take care of case insensitive match, as we can never know if search term is Turkish or English.

That means we should take care of matching for the current locale (i<>I needn't match on a Turkish locale, but i<>İ and ı<>I should). So, using toLocaleLowerCase() or toLocaleUpperCase() should cover it.

> Are you OK with this being an option ?
> Something like : fuzzaldrin.filter: (candidates, query, {localeCase:true})

Sounds good to me.

> do you know if it's much slower ? Or it's all pre-computed table and similar speed ?

That brings up a good point. I did a benchmark run on Chrome and Edge, and toLocaleLowerCase turned out to be 50-60% slower for full strings. https://jsperf.com/localeowercase-vs-lowercase

However, as per our logic, the only places we do a full-string toLowerCase are once on the query and before scoring the matched strings. So that doesn't seem to regress perf in most cases. Thoughts?
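
A comparison like the one described above can be approximated locally. The following is a rough sketch, not the jsperf test linked in this comment: timeIt, the sample subject string, and the iteration count are all assumptions, and the timings vary by engine and machine, so no expected numbers are shown.

```javascript
// Hypothetical micro-benchmark helper: runs fn repeatedly, returns elapsed ms.
function timeIt(label, fn, iterations) {
  let sink = 0; // accumulate a result so the calls are not optimized away
  const start = Date.now();
  for (let i = 0; i < iterations; i++) {
    sink += fn().length;
  }
  const elapsed = Date.now() - start;
  console.log(`${label}: ${elapsed} ms (checksum ${sink})`);
  return elapsed;
}

// Sample path-like subject; the content is arbitrary.
const subject = "Lib/Instrumentation/InstrumentationManager.js".repeat(20);
const iterations = 20000;

const plainMs = timeIt("toLowerCase", () => subject.toLowerCase(), iterations);
const localeMs = timeIt("toLocaleLowerCase", () => subject.toLocaleLowerCase(), iterations);
```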

jeancroy · 2017-08-22T12:42:08Z

> So, that doesn't seem to be regressing the perf in most of the cases. thoughts ?

That's the spirit. Some users will type a "nondescript" query, such as part of a word, or type the name of a folder with a large sub-hierarchy. In the past I've toyed with the idea of writing my own case-insensitive "indexOf" and getting rid of transforming the subject. Alternatively, some cache of lowercase subjects may make the problem disappear over multiple searches.
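
The cache of lowercase subjects mentioned above could be as simple as a Map keyed by the original string. This is a sketch of the idea only, not code from the library: cachedLower and lowerCache are hypothetical names, and a real cache would also need an eviction policy.

```javascript
// Hypothetical cache: original subject -> lowercased subject.
const lowerCache = new Map();

function cachedLower(subject) {
  let lowered = lowerCache.get(subject);
  if (lowered === undefined) {
    // First time this subject is seen: pay the conversion cost once.
    lowered = subject.toLowerCase();
    lowerCache.set(subject, lowered);
  }
  return lowered;
}

// Repeated searches over the same candidates reuse the stored result.
console.log(cachedLower("Lib/InstrumentationManager.js"));
console.log(cachedLower("Lib/InstrumentationManager.js"));
console.log(lowerCache.size); // 1
```

If locale-aware lowercasing were enabled, the same structure would apply with toLocaleLowerCase() in place of toLowerCase(), provided the locale does not change while the cache is alive.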

Per this file of exceptions in Unicode case folding:
ftp://ftp.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt
Composite uppercase mappings are handled by the truncated uppercase routine.
Conditional case folding would be handled by toLocaleLowerCase() etc.

Since this issue is related to only a few languages (Turkish, Azeri, Lithuanian) and is data-dependent,
I think I'll disable locale-lowercase by default and let users opt in.

--

I'll be working on making your changes option-dependent later today or tomorrow.

mdahamiwal mentioned this issue Aug 18, 2017

Fixing matching for various locales (Turkish specifically) #34

Open

jeancroy mentioned this issue Sep 11, 2017

Need a new package incorporating PR #32 #35

Closed
