All: Resolves #4613: Improve search with Asian scripts #5018

mablin7 · 2021-05-27T21:46:14Z

As discussed on this forum thread, I've created a new search mode alongside basic and FTS, called SEARCH_TYPE_NONLATIN_SCRIPT, which has all of the features of FTS but works for languages that FTS does not support. This new mode is selected automatically when a Chinese, Japanese, Korean or Thai character is detected in the search string.
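A minimal sketch of how such automatic selection could look, assuming detection is a simple character-range test on the query string. The helper name `pickSearchType`, the `SearchType` union and the specific code point ranges are illustrative assumptions, not the PR's actual code, and the ranges cover only a subset of the relevant scripts.

```typescript
// Illustrative subset of ranges (assumption, not the PR's list):
// Hiragana + Katakana, CJK Unified Ideographs, Hangul Syllables.
const NON_LATIN_RANGES = /[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af]/;

type SearchType = 'fts' | 'nonlatin';

// Hypothetical helper: choose the search mode from the raw query string.
// A single matching character is enough to leave FTS mode.
function pickSearchType(query: string): SearchType {
	return NON_LATIN_RANGES.test(query) ? 'nonlatin' : 'fts';
}
```

With this sketch, a mixed query such as `notes 日本語` would be routed to the non-Latin mode, while a purely Latin query stays on FTS.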

Most of this is done by refactoring functions in queryBuilder to be more generic and work without FTS, plus a few changes to SearchEngine to create and use the new search mode. I also added tests in a new file, by copying the existing SearchFilter tests, changing the strings to Chinese text and removing some irrelevant cases.

I tested this myself with randomly generated Chinese texts and it seemed to work fine, but I don't speak Chinese, so I may have missed something. It would be great if someone who speaks one of the languages mentioned above could try this fix on their collection (after backing it up of course 😉 )! If needed, I can provide pre-built binaries for Windows or Linux.

Supported features

- Multi-term search - match notes that contain all of the space-separated terms
- Quoted terms - match an exact sequence of terms
- Wildcards - use * as a wildcard
- Negated terms - prefix a term with - to exclude it
- All filters that FTS supports: type, any, iscompleted, etc.
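The multi-term, quoted and negated behaviour listed above can be sketched as an in-memory substring matcher. This is only an illustration under the assumption that each term is a plain substring test; the PR itself builds SQL in queryBuilder, and the names `Term`, `parseQuery` and `matches` are hypothetical. Wildcards and filters are left out to keep the sketch short.

```typescript
interface Term { text: string; negated: boolean; }

// Split a query into terms. A quoted phrase stays one term,
// and a leading "-" marks the term as negated.
function parseQuery(query: string): Term[] {
	const terms: Term[] = [];
	const re = /(-?)(?:"([^"]*)"|(\S+))/g;
	let m: RegExpExecArray | null;
	while ((m = re.exec(query)) !== null) {
		const text = m[2] !== undefined ? m[2] : m[3];
		if (text) terms.push({ text, negated: m[1] === '-' });
	}
	return terms;
}

// A note matches when every positive term occurs in it as a substring
// and no negated term does. No word boundaries are needed.
function matches(note: string, query: string): boolean {
	const haystack = note.toLowerCase();
	return parseQuery(query).every(t => {
		const found = haystack.indexOf(t.text.toLowerCase()) !== -1;
		return found !== t.negated;
	});
}
```

Because the test is a substring check, it works on text without spaces between words, which is the case FTS tokenization cannot handle.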

Caveats

Matches are not ranked smartly - the ranking algorithm relies on extra metadata returned by FTS
It is probably slower on very large note collections or on mobile devices. However, I was not able to measure a significant slowdown on my fairly outdated desktop, with test sets of 1000 notes of 5000 characters each:

| Search mode | Time |
| --- | --- |
| FTS search * | ~800ms |
| Basic search | ~800ms |
| New search mode | ~900ms |

*On an equal number of English notes

laurent22

For the unit tests, can't we do better than copying and pasting the complete test file? Your implementation is smart in that it adds the feature while making only a few changes, so it should be possible to do something similar with the tests.

For example, can't you run the unit tests once with useFts on, then a second time with useFts off? Using Chinese characters is not necessary to check the new search type, is it? So maybe you could run them with the English strings that are in the original tests but with useFts = false.

In synchronizer_MigrationHandler.test.ts, the same tests run for each sync target migration. Maybe something similar can be done here?
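The suggestion above, running one set of cases against both modes, could be sketched as follows. The two search functions are hypothetical stand-ins (a whole-word match and a substring match), not Joplin's engine; in the real suite they would be the same engine called with useFts on and off.

```typescript
type SearchFn = (notes: string[], query: string) => string[];

// Stand-ins for the two code paths (assumptions for illustration only).
const searchWithFts: SearchFn = (notes, query) =>
	notes.filter(n => n.split(/\s+/).indexOf(query) !== -1);
const searchWithoutFts: SearchFn = (notes, query) =>
	notes.filter(n => n.indexOf(query) !== -1);

const modes = [
	{ name: 'useFts = true', search: searchWithFts },
	{ name: 'useFts = false', search: searchWithoutFts },
];

const notes = ['abcd efgh', 'ijkl mnop'];
const cases = [
	{ query: 'abcd', expected: ['abcd efgh'] },
	{ query: 'mnop', expected: ['ijkl mnop'] },
	{ query: 'zzzz', expected: [] as string[] },
];

// Run every case once per mode and collect a description of each failure.
function runSharedCases(): string[] {
	const failures: string[] = [];
	for (const mode of modes) {
		for (const c of cases) {
			const actual = mode.search(notes, c.query);
			if (actual.join('|') !== c.expected.join('|')) {
				failures.push(`${mode.name}: "${c.query}" returned [${actual.join(', ')}]`);
			}
		}
	}
	return failures;
}
```

In a Jest suite the same shape is usually expressed with `describe.each([true, false])`, so the case list is written once and only the mode flag varies.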

packages/lib/services/searchengine/SearchEngine.ts

packages/lib/services/searchengine/SearchFilterNonLatin.test.js

packages/lib/services/searchengine/queryBuilder.ts

laurent22 · 2021-05-27T22:51:47Z

> Matches are not ranked smartly - the ranking algorithm relies on extra metadata returned by FTS

Actually what would be the ranking logic for this new search type?

mablin7 · 2021-05-29T14:20:42Z

Thanks for the feedback! I hope it's looking better now.

> Actually what would be the ranking logic for this new search type?

Well, now it's the same as basic, which is based on just a few simple metrics defined here:

joplin/packages/lib/services/searchengine/SearchEngine.ts

Lines 356 to 373 in 89bc181

    processBasicSearchResults_(rows: any[], parsedQuery: any) {
    	const valueRegexs = parsedQuery.keys.includes('_') ? parsedQuery.terms['_'].map((term: any) => term.valueRegex || term.value) : [];
    	const isTitleSearch = parsedQuery.keys.includes('title');
    	const isOnlyTitle = parsedQuery.keys.length === 1 && isTitleSearch;
    	for (let i = 0; i < rows.length; i++) {
    		const row = rows[i];
    		const testTitle = (regex: any) => new RegExp(regex, 'ig').test(row.title);
    		const matchedFields: any = {
    			title: isTitleSearch || valueRegexs.some(testTitle),
    			body: !isOnlyTitle,
    		};
    		row.fields = Object.keys(matchedFields).filter((key: any) => matchedFields[key]);
    		row.weight = 0;
    		row.fuzziness = 0;
    	}
    }

I'm not sure how it could be improved without the extra data that FTS provides; I haven't really looked into how the BM25 algorithm works. Maybe a future PR :)

laurent22 · 2021-05-29T14:35:33Z

> I'm not sure how it could be improved without the extra data that FTS provides; I haven't really looked into how the BM25 algorithm works. Maybe a future PR :)

Yes, this is good enough for now. Actually, could you please update the documentation and mention how this mode works, and to which languages it applies?

Regarding the tests, are you confident that the existing filter tests will also cover your new search type? Is there anything specific to it that could be covered in an additional test?

mablin7 · 2021-05-29T22:22:49Z

> Actually could you please update the documentation and mention how this mode works, and to which languages it applies?

I hope I added it in the right place.

> Regarding the tests, are you confident that the existing filter tests will also cover your new search type? Is there anything specific to it that could be covered in an additional test?

No, I couldn't think of any specific cases. The main logic is pretty much left intact; mostly only the table names are changed. The filter that required the most change is the text filter, but it is actually simpler in non-FTS mode than in FTS mode, and in my opinion the existing tests cover all the things that can go wrong.

laurent22

Thanks for the update @mablin7. I've just left one comment and then I think it's ready to merge.

laurent22 · 2021-06-06T22:57:21Z

README.md

@@ -407,6 +407,12 @@ For more information see [Plugins](https://github.com/laurent22/joplin/blob/dev/

 Joplin implements the SQLite Full Text Search (FTS4) extension. It means the content of all the notes is indexed in real time and search queries return results very fast. Both [Simple FTS Queries](https://www.sqlite.org/fts3.html#simple_fts_queries) and [Full-Text Index Queries](https://www.sqlite.org/fts3.html#full_text_index_queries) are supported. See below for the list of supported queries:

+One caveat of SQLite FTS is that it does not support languages which do not use Latin word boundaries (spaces, tabs, punctuation). To solve this issue, Joplin has a custom search mode, that does not use FTS, but still has all of its features (multi term search, filters, etc.). Its only drawback is that it can get slow on larger note collections. This search mode is currently enabled if one of the following languages are detected:


Probably should be "Its only drawback" => "One of its drawbacks". Also, please mention that sorting is less accurate (since for FTS we can use BM25). And how about prefix queries? Do they work with this new mode?

mablin7 · 2021-06-07

> "Its only drawback" => "One of its drawbacks"

Yeah, that was probably an overstatement 😃

> And how about prefix queries? Do they work with this new mode?

Yes! Well technically, in this mode every query is a prefix (and also a suffix) query, because it's really just substring matching. In fact, unlike in FTS, here these would work too: *swim, ast*rix
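To illustrate why `*swim` and `ast*rix` work in this mode, here is a small sketch that turns a search term into a SQL LIKE pattern. It assumes the substring match is expressed with LIKE and a backslash ESCAPE character, which this thread does not confirm; `termToLikePattern` is a hypothetical helper, not a function from queryBuilder.ts.

```typescript
// Hypothetical helper: convert a user term into a LIKE pattern.
// Intended for use as: ... WHERE body LIKE ? ESCAPE '\'
function termToLikePattern(term: string): string {
	// Escape characters that LIKE treats specially, plus the escape char itself.
	const escaped = term.replace(/[\\%_]/g, '\\$&');
	// "*" may appear anywhere in the term; the outer "%" make it a substring match.
	return `%${escaped.replace(/\*/g, '%')}%`;
}
```

Under this assumption, `swim` becomes `%swim%`, so it is already both a prefix and a suffix query, and `ast*rix` becomes `%ast%rix%`, which places no restriction on where the wildcard sits.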

laurent22 · 2021-06-07T14:15:00Z

Great, thanks for the update @mablin7!

mablin7 added 4 commits May 27, 2021 19:26

(wip) Improve search with non-latin scripts

083371d

Fix handling of *

39dde73

Add tests

909bb99

Cleanup

c2a242d

laurent22 reviewed May 27, 2021


mablin7 added 4 commits May 29, 2021 15:17

Refactor SearchEngine.ts

5348587

Rename useFTS to useFts

a52eca7

Refactor queryBuilder

3b5b271

Update tests

067dc24

mablin7 added 2 commits May 29, 2021 23:15

Add test case for checking if new mode is enabled

61511d1

Update README

048ac57

laurent22 reviewed Jun 6, 2021


Update README

a2908aa

laurent22 merged commit 62a371b into laurent22:dev Jun 7, 2021
