Convert search terms to NFC #1244
Comments
Through this post I found a link to the Character Identifier tool that Marc Durdin made. I think we should connect with him to understand this issue better.
Talked with Billy over Meet; here are some thoughts:
Before going forward, we need
One option for search purposes is to normalize both the search query and the data into a cache and then check for matches, although constantly normalizing the entire lexicon could tax resources heavily.
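A minimal sketch of that idea in TypeScript; the Entry shape and field names are illustrative assumptions, not LF's actual code:

```ts
// Illustrative sketch: normalize each entry to NFC once into a cache,
// then compare the cached form against the normalized query, so the
// lexicon is not re-normalized on every search.
interface Entry { lexeme: string; }
interface CachedEntry extends Entry { lexemeNfc: string; }

function buildNormalizedCache(entries: Entry[]): CachedEntry[] {
  return entries.map(e => ({ ...e, lexemeNfc: e.lexeme.normalize('NFC') }));
}

function exactSearch(cache: CachedEntry[], query: string): CachedEntry[] {
  const q = query.normalize('NFC');
  return cache.filter(e => e.lexemeNfc === q);
}
```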
Some samples of NFD/NFC equivalent strings are in the comment this issue references; there are several examples there. When doing a string comparison (a byte comparison), you will not get matches for strings that look identical to the user, because the underlying code points are not the same. What is curious to me is that our code seems to treat an NFC and an NFD string as equivalent, so an exact-match search does the right thing. I am digging in further to try to understand.
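For example, in JavaScript/TypeScript the two canonically equivalent spellings of â compare as unequal until both sides are normalized to the same form:

```ts
const nfc = '\u00E2';   // "â" as one precomposed code point
const nfd = 'a\u0302';  // "a" followed by U+0302 COMBINING CIRCUMFLEX ACCENT

console.log(nfc === nfd);                                   // false: plain code point comparison
console.log(nfc.normalize('NFC') === nfd.normalize('NFC')); // true: canonically equivalent
```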
@jasonleenaylor does FLEx normalize data when it is stored on disk? I have a feeling it normalizes to NFD, but I'm not certain. Here is a test character for searching, in both NFC and NFD form: â (NFC, the single code point U+00E2) versus â (NFD, U+0061 followed by the combining circumflex U+0302). These characters are canonically equivalent and look identical, yet the underlying code points differ, so one will not match when it is typed and the other is present in the data. The actual comparison happens on L478 of entryMeetsFilterCriteria(). Consider the following regex, which demonstrates that an NFD and an NFC sequence do not match without special treatment:
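(The snippet itself isn't shown above; a minimal TypeScript sketch of the kind of regex test being described:)

```ts
const needle = new RegExp('\u00E2');  // pattern built from the NFC form of "â"
const haystack = 'a\u0302';           // data containing the NFD form

console.log(needle.test(haystack));                  // false: no match without normalization
console.log(needle.test(haystack.normalize('NFC'))); // true: matches after normalizing to NFC
```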
I am still investigating why the filter shows a match when the data and search term are not normalized. Update: figured it out. See next comment.
I stand corrected. LF does normalize all strings to NFC on import or storage. Here is where it gets converted to NFC. This is why the strings match when I enter an NFD string in a data field and then search using an NFC string: it becomes an NFC = NFC comparison. If I enter an NFD string into the search criteria, the exact match doesn't work, which is expected now that I understand what is going on. This issue should now be defined as: "convert the search term to NFC" when searching for an exact match. That should do it.
Normalize the search string to NFC, since all data in LF is normalized to NFC on disk. This allows exact-match queries to work regardless of form. Also attempts to fix a bug where the default behavior of ignoring diacritics would cause missing search results for complex scripts with combining characters that are not diacritics (e.g. Japanese or Korean). Fixes #1244
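A sketch of the essence of that change, hedged because the real code lives in entryMeetsFilterCriteria() and its actual signature isn't shown here:

```ts
// Sketch only: since LF stores all data as NFC, normalizing the search
// term to NFC before comparing makes exact match form-insensitive.
function exactMatch(storedValue: string, searchTerm: string): boolean {
  return storedValue === searchTerm.normalize('NFC');
}

exactMatch('\u00E2', 'a\u0302'); // true: an NFD query still finds the NFC-stored "â"
```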
Excellent findings, this is exactly the concrete clarity we needed, thank you @megahirt!
Normalize the search string to NFC, since all data in LF is normalized to NFC on disk. This allows exact-match or ignore-diacritics queries to work regardless of form or language, e.g. Korean. A note about this fix:
- All data is normalized to NFC in the database on write. It's been this way for years.
- @longrunningprocess's addition in #1243 normalized the query to NFD for the purpose of removing diacritics from the data and the query. This is a fine approach.
- Given the point above, this PR could have normalized all data to NFD for comparison under all circumstances; however, I chose to stick with NFC since that is the form of the underlying data. Either way works.

Fixes #1244
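To illustrate the Korean point, here is a sketch (not the PR's exact code) of an ignore-diacritics fold that strips only combining diacritical marks, so Hangul survives decomposition and recomposition:

```ts
// Decompose to NFD, drop only combining diacritical marks (U+0300-U+036F),
// then recompose to NFC. Hangul jamo fall outside that range, so Korean
// text round-trips unchanged while Latin diacritics are removed.
function foldDiacritics(s: string): string {
  return s.normalize('NFD').replace(/[\u0300-\u036F]/g, '').normalize('NFC');
}

console.log(foldDiacritics('café'));    // "cafe"
console.log(foldDiacritics('한국어'));  // "한국어" (unchanged)
```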
I have tested this on QA and it is working as expected.
There may be languages or scenarios where a user enters a search term and either matches or misses entries because of mismatched Unicode normalization forms, e.g. NFC vs. NFD. I'm not 100% sure whether this problem exists in LF or not, but it seemed worth researching in case there are users out there working in such languages and having a lot of difficulty because LF does not normalize inputs against the data being searched.
This thought and a possible situation demonstrating a failing scenario first came up in discussions here: #1243 (comment)