Convert search terms to NFC #1244

longrunningprocess · 2021-11-11T14:50:41Z

There may be languages or scenarios where a user enters a search term and wrongly matches or misses entries because the term and the data use different Unicode normalization forms, e.g., NFC vs. NFD. I'm not 100% sure whether this problem exists in LF, but I thought it would be good to research the possibility in case there are users working in languages where searching is difficult because LF does not "normalize" inputs against the data being searched.

This thought and a possible situation demonstrating a failing scenario first came up in discussions here: #1243 (comment)

alex-larkin · 2021-11-11T17:31:08Z

Understanding how different ways of encoding characters and representing words affect language projects

I found through this post a link to the Character Identifier tool that Marc Durdin made. I think we should connect with him to understand this issue better.

alex-larkin · 2022-01-06T21:05:18Z

Talked with Billy over Meet and here are some thoughts:

- Without failing use cases, this is still somewhat theoretical.
- We need to know how FLEx, Language Depot, and LF encode the data. Is NFC or NFD used?

Before going forward, we need to:

- Talk with @megahirt about how data is encoded in LF (and LD and FLEx)
- Find some failing use cases

One option for search purposes is to normalize both the search query and the data into a cache and then check for matches, although constantly normalizing the entire lexicon may tax resources heavily.
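
A minimal sketch of that caching idea, assuming the lexicon is a plain array of strings. The names `nfcCache`, `toNfcCached`, and `findExactMatches` are hypothetical and not taken from the LF codebase; the point is only that each data string is normalized once, so repeated searches do not re-normalize the whole lexicon.

```javascript
// Hypothetical cache: maps each raw lexicon string to its NFC form,
// so normalization happens once per distinct string.
const nfcCache = new Map();

function toNfcCached(text) {
  let normalized = nfcCache.get(text);
  if (normalized === undefined) {
    normalized = text.normalize('NFC');
    nfcCache.set(text, normalized);
  }
  return normalized;
}

// Exact match against an assumed array of lexicon strings.
function findExactMatches(lexicon, query) {
  const normalizedQuery = query.normalize('NFC');
  return lexicon.filter((item) => toNfcCached(item) === normalizedQuery);
}

// '\u00E2ne' is precomposed (NFC); 'a\u0302ne' is decomposed (NFD).
const sampleLexicon = ['\u00E2ne', 'a\u0302ne', 'ane'];
console.log(findExactMatches(sampleLexicon, 'a\u0302ne').length); // 2
```

The cache is keyed by the raw string, so it would need to be invalidated or bounded in a real application; that is left out here.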

megahirt · 2022-01-07T05:28:02Z

Some samples of NFD/NFC equivalent strings are in the comment this issue references. There are several examples. When doing a string comparison (byte comparison) you will not get matches for strings that look identical to the user because the underlying codepoints are not the same.

What is curious to me is that our code seems to treat an NFC and NFD string as equivalent so that an exact match search does the right thing. I am digging in further to try and understand.

~~LF does not normalize data on storage~~, however my experience tells me that most data as a whole is coming in as NFC. ~~We store whatever the user types using their keyboard~~. Many keyboards produce NFC, some produce NFD.

@jasonleenaylor does FLEx normalize data when it is stored on disk? I have a feeling it normalizes to NFD, but I'm not certain.

Here is a test character for searching in both NFC and NFD form:

â (NFC)
â (NFD)

These characters are canonically equivalent, look identical, yet contain different code points underneath and will not match when one is typed and the other is present in the data.

(screenshot from the Character Identifier tool)
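
The same information can be reproduced in code. This is a small sketch using the standard `String.prototype.normalize()` and `codePointAt()`; the helper name `listCodePoints` is made up for illustration.

```javascript
// Lists the code points of a string as U+XXXX labels.
function listCodePoints(text) {
  return Array.from(text, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));
}

const precomposed = '\u00E2';  // NFC: LATIN SMALL LETTER A WITH CIRCUMFLEX
const decomposed = 'a\u0302';  // NFD: "a" + COMBINING CIRCUMFLEX ACCENT

console.log(listCodePoints(precomposed)); // [ 'U+00E2' ]
console.log(listCodePoints(decomposed));  // [ 'U+0061', 'U+0302' ]

// Canonically equivalent, but not equal as strings until normalized.
console.log(precomposed === decomposed);                  // false
console.log(precomposed.normalize('NFD') === decomposed); // true
console.log(decomposed.normalize('NFC') === precomposed); // true
```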

The actual comparison happens on L478 of entryMeetsFilterCriteria()

Consider the following Regex which demonstrates that an NFD and NFC sequence do not match without special treatment:

// Is "â" (NFC) equal to "â" (NFD) ?
/\u00E2/.test('\u0061\u0302');
// false: no, it is not

I am still investigating why the filter shows a match when the data and search term are not normalized. Update: figured it out. See next comment.

megahirt · 2022-01-07T13:50:33Z

> LF does not normalize data on storage, however my experience tells me that most data as a whole is coming in as NFC. We store whatever the user types using their keyboard. Many keyboards produce NFC, some produce NFD.

I stand corrected. LF does normalize all strings to NFC on import or storage. Here is where it gets converted to NFC. This is why the strings match when I enter in NFD as a data field and then search using an NFC string. This becomes an NFC = NFC comparison.
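
To illustrate what normalize-on-write means, here is a rough JavaScript sketch. It is not the LF conversion code linked above; the function `normalizeStringsToNfc` and the entry shape are assumptions made up for this example.

```javascript
// Hypothetical helper: returns a copy of `value` with every string value
// converted to NFC. Object keys are left untouched.
function normalizeStringsToNfc(value) {
  if (typeof value === 'string') return value.normalize('NFC');
  if (Array.isArray(value)) return value.map(normalizeStringsToNfc);
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([key, inner]) => [key, normalizeStringsToNfc(inner)])
    );
  }
  return value; // numbers, booleans, null, undefined pass through
}

// Assumed entry shape, typed on an NFD-producing keyboard.
const entryToSave = { lexeme: 'a\u0302ne', senses: [{ gloss: 'cafe\u0301' }] };
const savedEntry = normalizeStringsToNfc(entryToSave);

console.log(savedEntry.lexeme === '\u00E2ne');           // true
console.log(savedEntry.senses[0].gloss === 'caf\u00E9'); // true
```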

If I enter an NFD string as the search term, the exact match fails, which is expected now that I understand what is going on.

This issue should now be defined as: "convert search term to NFC" when searching for exact match. That should do it.
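
A minimal sketch of that definition, assuming the stored value is already NFC. The function name `isExactMatch` is hypothetical and is not the code in entryMeetsFilterCriteria().

```javascript
// Hypothetical exact-match check. Assumes `storedValue` was normalized to
// NFC when it was written, so only the search term needs converting.
function isExactMatch(storedValue, searchTerm) {
  return storedValue === searchTerm.normalize('NFC');
}

const storedLexeme = '\u00E2'; // NFC, as stored

console.log(isExactMatch(storedLexeme, '\u00E2'));  // true  (NFC search term)
console.log(isExactMatch(storedLexeme, 'a\u0302')); // true  (NFD search term)
console.log(storedLexeme === 'a\u0302');            // false (no normalization)
```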

Normalize the search string to NFC since all data in LF is normalized to NFC on disk. This allows for exact match queries to work regardless of form. Attempt to fix a bug where the default behavior of ignoring diacritics would cause missing search results for complex scripts with combining characters that are not diacritics (e.g. Japanese or Korean). Fixes #1244

longrunningprocess · 2022-01-07T18:10:06Z

excellent findings, this is exactly the concrete clarity we needed, thank you @megahirt !

@longrunningprocess

Normalize the search string to NFC since all data in LF is normalized to NFC on disk. This allows for exact match or ignore diacritic queries to work regardless of form or language, e.g. Korean.

A note about this fix:

- All data is normalized to NFC in the database on write. It's been this way for years.
- @longrunningprocess's addition in #1243 normalized the query to NFD for the purposes of removing diacritics from the data and query. This is a fine approach.
- This PR could have chosen to normalize all data to NFD for comparison under all circumstances given the second point above, however I chose to stick with NFC since that is what the data is underneath. Either way works.

Fixes #1244
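
For illustration only, here is one way an ignore-diacritics comparison can be made independent of normalization form. This is a sketch under stated assumptions, not the code from #1243 or #1272: the helper names are hypothetical, and stripping only the Combining Diacritical Marks block (U+0300 to U+036F) is a choice made for this example so that Hangul jamo and the Japanese combining voiced sound mark (U+3099) are left alone.

```javascript
// Hypothetical helper: decompose, drop marks in U+0300..U+036F only,
// then recompose so the result is back in NFC.
function stripDiacritics(text) {
  return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '').normalize('NFC');
}

// Both sides go through the same pipeline, so the input form does not matter.
function matchesIgnoringDiacritics(storedValue, searchTerm) {
  return stripDiacritics(storedValue) === stripDiacritics(searchTerm);
}

// Latin: "âne" (NFC) matches "ane".
console.log(matchesIgnoringDiacritics('\u00E2ne', 'ane')); // true

// Korean: precomposed syllable U+D55C matches its decomposed jamo sequence.
console.log(matchesIgnoringDiacritics('\uD55C', '\u1112\u1161\u11AB')); // true

// Japanese: U+304C decomposes to U+304B + U+3099, which is outside the
// stripped range, so the character survives unchanged.
console.log(stripDiacritics('\u304C') === '\u304C'); // true
```

Recomposing to NFC at the end keeps the comparison in the same form as the stored data, which matches the choice described in the note above.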

megahirt · 2022-01-13T09:47:06Z

I have tested this on QA and it is working as expected.

longrunningprocess added the triage label Nov 11, 2021

longrunningprocess assigned longrunningprocess, megahirt and alex-larkin Nov 11, 2021

longrunningprocess added the research label ("A Research-based task that is not expected to generate code or a PR") and removed the triage label Nov 11, 2021

megahirt changed the title ~~Should search terms be "normalized" before attempting to match entries?~~ Convert search terms to NFC Jan 7, 2022

megahirt mentioned this issue Jan 7, 2022

normalize search string to NFC before comparison #1272

Merged

4 tasks

megahirt added this to the 1.12 milestone Jan 11, 2022

megahirt closed this as completed in #1272 Jan 11, 2022

megahirt added the bug label ("An existing problem with our app in production") Jan 13, 2022

megahirt added this to Language Forge Classic Jan 13, 2022

josephmyers moved this to To Do in Language Forge Classic Feb 11, 2022

josephmyers removed this from Language Forge Classic Feb 11, 2022
