Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert search terms to NFC #1244

Closed
longrunningprocess opened this issue Nov 11, 2021 · 6 comments · Fixed by #1272
Closed

Convert search terms to NFC #1244

longrunningprocess opened this issue Nov 11, 2021 · 6 comments · Fixed by #1272
Assignees
Labels
bug An existing problem with our app in production research A Research-based task that is not expected to generate code or a PR

Comments

@longrunningprocess
Copy link
Contributor

There may be languages or scenarios where a user will enter a search term and either match/miss entries because of incompatible UNICODE encodings, e.g., NFC or NFD. I'm not 100% sure whether this problem exists in LF or not but thought it would be good to research this possibility more in case there are users out there working in languages and having a lot of difficulty as a result of LF not "normalizing" inputs against the data being searched against.

This thought and a possible situation demonstrating a failing scenario first came up in discussions here: #1243 (comment)

@longrunningprocess longrunningprocess added research A Research-based task that is not expected to generate code or a PR and removed triage labels Nov 11, 2021
@alex-larkin
Copy link

Understanding how different ways of encoding characters and representing words affect language projects

I found through this post a link to the Character Identifier tool that Marc Durdin made. I think we should connect with him to understand this issue better.

@alex-larkin
Copy link

Talked with Billy over Meet and here are some thoughts:

  • Without failing use cases, this is still somewhat theoretical
  • We need to know how FLEx and Language Depot, LF encodes the data. Is NFC or NFD used?

Before going forward, we need

  • To talk with @megahirt about how data is encoded in LF (and LD and FLEx)
  • Find some failing use cases

One option for search purposes is to normalize both the search query and the data into a cache and then checking for matches. Although constantly normalizing the entire lexicon may tax the resources heavily.

@megahirt
Copy link
Collaborator

megahirt commented Jan 7, 2022

Some samples of NFD/NFC equivalent strings are in the comment this issue references. There are several examples. When doing a string comparison (byte comparison) you will not get matches for strings that look identical to the user because the underlying codepoints are not the same.

What is curious to me is that our code seems to treat an NFC and NFD string as equivalent so that an exact match search does the right thing. I am digging in further to try and understand.

LF does not normalize data on storage, however my experience tells me that most data as a whole is coming in as NFC. We store whatever the user types using their keyboard. Many keyboards produce NFC, some produce NFD.

@jasonleenaylor does FLEx normalize data when it is stored on disk? I have a feeling it normalizes to NFD, but I'm not certain.

Here is a test character for searching in both NFC and NFD form:

â (NFC)
â (NFD)

These characters are canonically equivalent, look identical, yet contain different code points underneath and will not match when one is typed and the other is present in the data.
image
(screenshot from the Character Identifier tool)

The actual comparison happens on L478 of entryMeetsFilterCriteria()

Consider the following Regex which demonstrates that an NFD and NFC sequence do not match without special treatment:

// Is "â" (NFC) equal to "â" (NFD) ?
/\u00E2/.test('\u0061\u0302');
// no it is not

image

I am still investigating why the filter shows a match when the data and search term are not normalized. Update: figured it out. See next comment.

@megahirt
Copy link
Collaborator

megahirt commented Jan 7, 2022

LF does not normalize data on storage, however my experience tells me that most data as a whole is coming in as NFC. We store whatever the user types using their keyboard. Many keyboards produce NFC, some produce NFD.

I stand corrected. LF does normalize all strings to NFC on import or storage. Here is where it gets converted to NFC. This is why the strings match when I enter in NFD as a data field and then search using an NFC string. This becomes an NFC = NFC comparison.

If I enter an NFD string into the search criteria, the exact match doesn't work as expected, since I now understand what is going on.

This issue should now be defined as: "convert search term to NFC" when searching for exact match. That should do it.

megahirt added a commit that referenced this issue Jan 7, 2022
Normalize the search string to NFC since all data in LF is normalized to NFC on disk.  This allows for exact match queries to work regardless of form.

Attempt to fix a bug where the default behavior of ignoring diacritics would cause missing search results for complex scripts with combining characters that are not diacritics (e.g. Japanese or Korean)

fixes #1244
@megahirt megahirt changed the title Should search terms be "normalized" before attempting to match entries? Convert search terms to NFC Jan 7, 2022
@longrunningprocess
Copy link
Contributor Author

excellent findings, this is exactly the concrete clarity we needed, thank you @megahirt !

@megahirt megahirt added this to the 1.12 milestone Jan 11, 2022
megahirt added a commit that referenced this issue Jan 11, 2022
Normalize the search string to NFC since all data in LF is normalized to NFC on disk.  This allows for exact match or ignore diacritic queries to work regardless of form or language, e.g. Korean.

A note about this fix:
- All data is normalized to NFC in the database on write.  It's been this way for years.
- @longrunningprocess 's addition in #1243 normalized the query to NFD for the purposes of removing diacritics from the data and query.  This a fine approach.
- This PR could have chosen to normalize all data to NFD for comparison under all circumstances given the second point above, however I chose to stick with NFC since that is what the data is underneath.  Either way works.

Fixes #1244
@megahirt megahirt added the bug An existing problem with our app in production label Jan 13, 2022
@megahirt
Copy link
Collaborator

I have tested this on QA and it is working as expected.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug An existing problem with our app in production research A Research-based task that is not expected to generate code or a PR
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants