-
-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ignore diacritics by default when searching #1243
Ignore diacritics by default when searching #1243
Conversation
Reviewer can check this for browser support: https://caniuse.com/?search=unicode%20property%20escapes. Please ensure these are ok. I saw something about another property called |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved with a ⭐ !
I'm impressed with your independent work to take advantage of research resources (C Hubbard) and to design and implement the solution. You also took care to make sure that the UX is consistent across products and consulted Alex as power user to make sure we are getting it right.
This is great work and will be immediately helpful to users when it is shipped.
Your PR description and test scenarios are first class - well done.
// https://stackoverflow.com/a/37511463/10818013 | ||
|
||
|
||
return input.normalize('NFD').replace(/\p{Diacritic}/gu, '') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💥 This is it! Well done, and elegant too. I didn't know about Regex Unicode Property Escapes so my own solution probably would have been multiple lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great, I'm glad you're pleased with it, thanks for reviewing. Just to confirm, you're good with the browser support and you think \p{Diacritic}
is good, no need to further investigate InCombiningDiacriticalMarks
?
I would like to add an additional test case that demonstrates diacritics keyed in both NFC and NFD form. The test should demonstrate that diacritics are ignored regardless of NFD/NFC input. I snagged some images from the Unicode docs on normalization forms to show some sample characters to search for with their codepoints, where these diacritics are safely ignored in search. How you key in these characters depends on your OS and keyboard software. As I write this, I realize that "exact match" might not work in certain edge cases where the input query is not in the same normalized form (NFC/NFD) as the data being searched. The fix is of course to normalize the query and data before comparison. This would be a separate issue to look into, not part of this PR. |
Co-authored-by: Christopher Hirt <[email protected]>
Co-authored-by: Christopher Hirt <[email protected]>
Co-authored-by: Christopher Hirt <[email protected]>
ok, thanks for thinking of some additional use cases, I've felt like I was on shaking ground with my limited testing. I'll see if I can figure out if I can doing something in the browser console to test this. for my ref:
|
FYI @megahirt a brief NFD/NFC test: |
I'm going to go ahead and merge this under the assumption you did check the browser support and were ok as well as anticipating |
@longrunningprocess in my testing this, I realized that we didn't address the highlighting that matches a string. I'm pretty sure you are already aware of it, and that's fine. I'm mostly just commenting about it explicitly for future reference. The highlightMatches() method would need to be rewritten or modified to take into a account the part about ignoring diacritics. When I search for "chu" in the default mode, only the exact matches for "chu" are selected, even though the results show those that include diacritics, see below.
What I would expect to see is that all headwords that contain a form of "chu" are highlighted, including:
|
I wasn't even looking at it, I'm glad you put fresh eyes on it. I'll convert your finding to a new issue. |
Normalize the search string to NFC since all data in LF is normalized to NFC on disk. This allows for exact match or ignore diacritic queries to work regardless of form or language, e.g. Korean. A note about this fix: - All data is normalized to NFC in the database on write. It's been this way for years. - @longrunningprocess 's addition in #1243 normalized the query to NFD for the purposes of removing diacritics from the data and query. This a fine approach. - This PR could have chosen to normalize all data to NFD for comparison under all circumstances given the second point above, however I chose to stick with NFC since that is what the data is underneath. Either way works. Fixes #1244
Description
This feature will change the default search to include "diacritic-agnostic" results. It will also add a new advanced option allowing users to narrow results based on diacritics.
Fixes #1114
Type of Change
Screenshots
Provided with test cases below
How Has This Been Tested?
Project used for testing (5491 entries): Alex's project.zip
Functional testing (should work the same in both list and entry view)
chaca
orCHACA
should result in 124 matches with no advanced options selectedchaca
orCHACA
should result in 90 matches when "matching diacritics"chacä
orCHACÄ
should result in 124 matches with no advanced options selectedchacä
orCHACÄ
should result in 9 matches when "matching diacritics"chacå
orCHACÅ
should result in 124 matches with no advanced options selectedchacå
orCHACÅ
should result in 0 matches when "matching diacritics"Checklist: