Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Match fails for Turkish Locale due to Turkish "i" problem #33

Open
mdahamiwal opened this issue Aug 18, 2017 · 7 comments
Open

Match fails for Turkish Locale due to Turkish "i" problem #33

mdahamiwal opened this issue Aug 18, 2017 · 7 comments

Comments

@mdahamiwal
Copy link
Collaborator

Its a common problem in programming while comparing string in Turkish language eg.

Turkish language has four i's including case:
"ı" (dotless lowercase) & "I" (dotless uppercase)
"i" (dotted lowercase) & "İ" (dotted uppercase)

The library internally uses toLowerCase() and toUpperCase() , this results in mismatch if query
or candidates contains dotless is as
"I".toLowerCase() = "i"

What we need is
"I".toLocaleLowerCase() = "ı"

Fix: Use toLocaleLowerCase() to ensure that turkish locale is honored while lowercase/uppercase conversions.

@jeancroy
Copy link
Owner

jeancroy commented Aug 18, 2017

Question
Turkish user will never try to user "i" (dotted lowercase) as a lazy selector for "I" (dot less uppercase) ?

I ask because (at least in the case of atom) the library is focused on programming language & path.
Someone with Turkish locale may still need to deal with a InstrumentationManager class & the like.
I'm a bit hesitant to having a fix that would break English use case.

In case of doubt, I prefer to match sightly too often. In this case lowercase "i" could match both uppercase "I". That would support mixed language use.

I would be ok with having user defined "lowercaseFunc(x) => x.toLocaleLowercase()" and similar "uppercaseFunc" if you are certain your user won't mix languages.

@mdahamiwal
Copy link
Collaborator Author

Nope, if the current locale is Turkish, the two i s are considered different. That said, user won't expect "i" to match "I". If user is looking for results matching "I", they would use locale specific characters. Otherwise, there is no way to handle these scenarios well in code. I can re-check with internal testers in my team but I pretty sure of the behavior. Thanks

@jeancroy
Copy link
Owner

Basically the question is do Turkish people have different expectation for Turkish text and English text.
It's very possible your suggestion is the way to go. If so I'll probably merge and cut a version soon.

@mdahamiwal
Copy link
Collaborator Author

I confirmed with an actual Turkish user and he mentioned that you should only take care of case insensitive match, as we can never know if search term is Turkish or English. So, as far as we are using toLocaleLowerCase() we should be covered.

@jeancroy
Copy link
Owner

jeancroy commented Aug 21, 2017

you should only take care of case insensitive match, as we can never know if search term is Turkish or English

I'm not sure I understand that sentence. toLocaleLowerCase will both enable case-intensive Turkish match i<>İ, ı<>I and break case-insensitive English ones i<>I.

Do you mean case sensitive instead ? (If one never know the language, one does not apply language specific transform, I think this one works out of the box now)

Are you OK with this being an option ?
Something like : fuzzaldrin.filter: (candidates, query, {localeCase:true})

I feel like the reasonable thing to do it letting more user interact with it and see if they like it.
I may have locale enabled by default, do you know if it's much slower ? Or it's all pre-computed table and similar speed ?

@mdahamiwal
Copy link
Collaborator Author

that you should only take care of case insensitive match, as we can never know if search term is Turkish or English.

That means we should take care of matching for current locale (i<>I needn't match on an Turkish locale but i<>İ, ı<>I should). So, using toLocaleLowerCase() or toLocaleUpperCase() should cover it.

Are you OK with this being an option ?
Something like : fuzzaldrin.filter: (candidates, query, {localeCase:true})

Sounds good to me.

do you know if it's much slower ? Or it's all pre-computed table and similar speed ?

That brings out a good point, I did a benchmark run on chrome and Edge and toLocaleLowerCase turned out to be 50-60% slower for full strings. https://jsperf.com/localeowercase-vs-lowercase

However, as per our logic, the only place we do full string toLowerCase is one time in query and before scoring the matched strings. So, that doesn't seem to be regressing the perf in most of the cases. thoughts ?

@jeancroy
Copy link
Owner

So, that doesn't seem to be regressing the perf in most of the cases. thoughts ?

That's the spirit. Some user will type "nondescript" query, as part of word, or by taping the name of a folder with large sub-hierarchy. In the past I've toyed with the idea of writing my own case-insensitive "indexOf" and getting rid of transforming the subject. Alternatively some cache of lowercase subjects may make the problem disappear over multiple search.


Per this file of exception in unicode case folding
ftp://ftp.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt
Composite uppercase are handled by the truncated uppercase routine.
Conditional case folding would be handling by "toLocaleLower" etc.

Since this issue is related to a few language Turkish, Azeri, Lithuanian and is data dependent
I think I'll disable locale-lowercase by default, and enable user to opt in.

--

I'll be working on making your changes option-dependent later today or tomorrow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants