-
-
Notifications
You must be signed in to change notification settings - Fork 5.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
All: Resolves #4613: Improve search with Asian scripts #5018
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the test units, can't we do better than copying and pasting the complete test units? Your implementation is smart as you add the feature while making only a few changes. So the logic is that it's possible to do something similar with the tests.
For example, can't you run the test unit once with useFts on, then a second time with useFts off? Using Chinese characters is not necessary to check the new search type, is it? So maybe you could run them with the English strings that are in the original tests but with useFts = false.
In synchronizer_MigrationHandler.test.ts, the same tests run for each sync target migration. Maybe something similar can be done here?
packages/lib/services/searchengine/SearchFilterNonLatin.test.js
Outdated
Show resolved
Hide resolved
Actually what would be the ranking logic for this new search type? |
Thanks for the feedback! I hope it's looking better now.
Well now it's the same as basic, which is based on just a few simple metrics defined here joplin/packages/lib/services/searchengine/SearchEngine.ts Lines 356 to 373 in 89bc181
I'm not sure how it could be improved without the extra data that FTS provides, I haven't really looked into how the BM25 algorithm works. Maybe a future PR :) |
Yes this is good enough for now. Actually could you please update the documentation and mention how this mode works, and to which languages it applies? Regarding the tests, are you confident that the existing filter tests will also cover your new search type? Is there anything specific to it that could be covered in an additional test? |
I hope I added it in the right place.
No, I couldn't think of any specific cases. The main logic is pretty much left intact, mostly only the table names are changed. The filter that required the most change is the text filter, but that actually is simpler in non-FTS mode, than in FTS and the existing tests cover all the things that can go wrong, in my opinion. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update @mablin7. I've just left one comment and then I think it's ready to merge.
README.md
Outdated
@@ -407,6 +407,12 @@ For more information see [Plugins](https://github.com/laurent22/joplin/blob/dev/ | |||
|
|||
Joplin implements the SQLite Full Text Search (FTS4) extension. It means the content of all the notes is indexed in real time and search queries return results very fast. Both [Simple FTS Queries](https://www.sqlite.org/fts3.html#simple_fts_queries) and [Full-Text Index Queries](https://www.sqlite.org/fts3.html#full_text_index_queries) are supported. See below for the list of supported queries: | |||
|
|||
One caveat of SQLite FTS is that it does not support languages which do not use Latin word boundaries (spaces, tabs, punctuation). To solve this issue, Joplin has a custom search mode, that does not use FTS, but still has all of its features (multi term search, filters, etc.). Its only drawback is that it can get slow on larger note collections. This search mode is currently enabled if one of the following languages are detected: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably should be "Its only drawbacks" => "One of its drawback". Also please mention that sorting is less accurate (since for FTS we can use BM25). And how about prefix queries? Do they work with this new mode?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Its only drawbacks" => "One of its drawback"
Yeah, that was probably an overstatement 😃
And how about prefix queries? Do they work with this new mode?
Yes! Well technically, in this mode every query is a prefix (and also a suffix) query, because it's really just substring matching. In fact, unlike in FTS, here these would work too: *swim
, ast*rix
Great, thanks for the update @mablin7! |
Fixes #4613
As discussed on this forum thread, I've created a new search mode besides basic and FTS, called
SEARCH_TYPE_NONLATIN_SCRIPT
, which has all of the features of FTS, but works for languages which the latter does not support. This new mode is automatically selected when a character from either Chinese, Japanese, Korean or Taiwanese is detected in the search string.Most of this is done by refactoring functions in
queryBuilder
to be more generic and work without FTS, plus a few changes toSearchEngine
to create and use the new search mode. I also added tests in a new file, by copying the existingSearchFilter
tests, changing the strings to Chinese text and removing some irrelevant cases.I tested this myself with randomly generated Chinese texts and to me it seemed to work fine, but I don't speak Chinese, so it could be that I missed something. It could be great if someone who does speak one of the above mentioned languages could try this fix on their collection (after backing it up of course 😉 )! If needed I can provide pre-built binaries for windows or linux.
Supported features
*
as a wildcard-
negated termstype
,any
,iscompleted
, etc.Caveats
*On an equal number of English notes