Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

All: Resolves #4613: Improve search with Asian scripts #5018

Merged
merged 11 commits into from
Jun 7, 2021

Conversation

mablin7
Copy link
Contributor

@mablin7 mablin7 commented May 27, 2021

Fixes #4613

As discussed on this forum thread, I've created a new search mode besides basic and FTS, called SEARCH_TYPE_NONLATIN_SCRIPT, which has all of the features of FTS, but works for languages which the latter does not support. This new mode is automatically selected when a character from either Chinese, Japanese, Korean or Taiwanese is detected in the search string.

Most of this is done by refactoring functions in queryBuilder to be more generic and work without FTS, plus a few changes to SearchEngine to create and use the new search mode. I also added tests in a new file, by copying the existing SearchFilter tests, changing the strings to Chinese text and removing some irrelevant cases.

I tested this myself with randomly generated Chinese texts and to me it seemed to work fine, but I don't speak Chinese, so it could be that I missed something. It could be great if someone who does speak one of the above mentioned languages could try this fix on their collection (after backing it up of course 😉 )! If needed I can provide pre-built binaries for windows or linux.

Supported features

  • Multi-term search - match notes which contain all of the space separated terms
  • Quoted terms - match an exact sequence of terms
  • Wildcards - use * as a wildcard
  • - negated terms
  • All filters that FTS supports: type, any, iscompleted, etc.

Caveats

  • Matches are not ranked smartly - the ranking algorithm relies on extra metadata returned by FTS
  • On very large collections of notes or mobile devices it's probably slower, however I was not able to measure a significant slowdown on my fairly outdated desktop, with test sets of 1000 notes, 5000 characters each:
Time
FTS search * ~800ms
Basic search ~800ms
New search mode ~900ms

*On an equal number of English notes

Copy link
Owner

@laurent22 laurent22 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the test units, can't we do better than copying and pasting the complete test units? Your implementation is smart as you add the feature while making only a few changes. So the logic is that it's possible to do something similar with the tests.

For example, can't you run the test unit once with useFts on, then a second time with useFts off? Using Chinese characters is not necessary to check the new search type, is it? So maybe you could run them with the English strings that are in the original tests but with useFts = false.

In synchronizer_MigrationHandler.test.ts, the same tests run for each sync target migration. Maybe something similar can be done here?

packages/lib/services/searchengine/SearchEngine.ts Outdated Show resolved Hide resolved
packages/lib/services/searchengine/SearchEngine.ts Outdated Show resolved Hide resolved
packages/lib/services/searchengine/queryBuilder.ts Outdated Show resolved Hide resolved
packages/lib/services/searchengine/queryBuilder.ts Outdated Show resolved Hide resolved
packages/lib/services/searchengine/queryBuilder.ts Outdated Show resolved Hide resolved
@laurent22
Copy link
Owner

Matches are not ranked smartly - the ranking algorithm relies on extra metadata returned by FTS

Actually what would be the ranking logic for this new search type?

@mablin7
Copy link
Contributor Author

mablin7 commented May 29, 2021

Thanks for the feedback! I hope it's looking better now.

Actually what would be the ranking logic for this new search type?

Well now it's the same as basic, which is based on just a few simple metrics defined here

processBasicSearchResults_(rows: any[], parsedQuery: any) {
const valueRegexs = parsedQuery.keys.includes('_') ? parsedQuery.terms['_'].map((term: any) => term.valueRegex || term.value) : [];
const isTitleSearch = parsedQuery.keys.includes('title');
const isOnlyTitle = parsedQuery.keys.length === 1 && isTitleSearch;
for (let i = 0; i < rows.length; i++) {
const row = rows[i];
const testTitle = (regex: any) => new RegExp(regex, 'ig').test(row.title);
const matchedFields: any = {
title: isTitleSearch || valueRegexs.some(testTitle),
body: !isOnlyTitle,
};
row.fields = Object.keys(matchedFields).filter((key: any) => matchedFields[key]);
row.weight = 0;
row.fuzziness = 0;
}
}

I'm not sure how it could be improved without the extra data that FTS provides, I haven't really looked into how the BM25 algorithm works. Maybe a future PR :)

@laurent22
Copy link
Owner

I'm not sure how it could be improved without the extra data that FTS provides, I haven't really looked into how the BM25 algorithm works. Maybe a future PR :)

Yes this is good enough for now. Actually could you please update the documentation and mention how this mode works, and to which languages it applies?

Regarding the tests, are you confident that the existing filter tests will also cover your new search type? Is there anything specific to it that could be covered in an additional test?

@mablin7
Copy link
Contributor Author

mablin7 commented May 29, 2021

Actually could you please update the documentation and mention how this mode works, and to which languages it applies?

I hope I added it in the right place.

Regarding the tests, are you confident that the existing filter tests will also cover your new search type? Is there anything specific to it that could be covered in an additional test?

No, I couldn't think of any specific cases. The main logic is pretty much left intact, mostly only the table names are changed. The filter that required the most change is the text filter, but that actually is simpler in non-FTS mode, than in FTS and the existing tests cover all the things that can go wrong, in my opinion.

Copy link
Owner

@laurent22 laurent22 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update @mablin7. I've just left one comment and then I think it's ready to merge.

README.md Outdated
@@ -407,6 +407,12 @@ For more information see [Plugins](https://github.com/laurent22/joplin/blob/dev/

Joplin implements the SQLite Full Text Search (FTS4) extension. It means the content of all the notes is indexed in real time and search queries return results very fast. Both [Simple FTS Queries](https://www.sqlite.org/fts3.html#simple_fts_queries) and [Full-Text Index Queries](https://www.sqlite.org/fts3.html#full_text_index_queries) are supported. See below for the list of supported queries:

One caveat of SQLite FTS is that it does not support languages which do not use Latin word boundaries (spaces, tabs, punctuation). To solve this issue, Joplin has a custom search mode, that does not use FTS, but still has all of its features (multi term search, filters, etc.). Its only drawback is that it can get slow on larger note collections. This search mode is currently enabled if one of the following languages are detected:
Copy link
Owner

@laurent22 laurent22 Jun 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably should be "Its only drawbacks" => "One of its drawback". Also please mention that sorting is less accurate (since for FTS we can use BM25). And how about prefix queries? Do they work with this new mode?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Its only drawbacks" => "One of its drawback"

Yeah, that was probably an overstatement 😃

And how about prefix queries? Do they work with this new mode?

Yes! Well technically, in this mode every query is a prefix (and also a suffix) query, because it's really just substring matching. In fact, unlike in FTS, here these would work too: *swim, ast*rix

@laurent22
Copy link
Owner

Great, thanks for the update @mablin7!

@laurent22 laurent22 merged commit 62a371b into laurent22:dev Jun 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

The combined search returns no results for Chinese
2 participants