The combined search returns no results for Chinese #4613

Closed
novelx opened this issue Mar 1, 2021 · 7 comments · Fixed by #5018
Labels
bug It's a bug

Comments

@novelx

novelx commented Mar 1, 2021

The fantastic search feature is a big reason I'm using Joplin. When I search for one Chinese word it works pretty well: it's fast and accurate. I recently found that when I type two Chinese words separated by a space (' ') in the search box, no results are ever returned, regardless of which words I search for and how they are arranged in the original notes. I used to think this was by design, but I recently found that when I type two English words it works pretty well and returns the notes that contain both words.

Environment

Joplin version:
Platform: macOS
OS specifics: Catalina 10.15.7

Steps to reproduce

  1. create a note with content "目录 文件" (two spaces between the Chinese words) or "文件目录"
  2. in the search bar, input "目录 文件" (one space between the Chinese words, which implies "AND")
  3. No notes were found.

Describe what you expected to happen

The search results contain all of the notes that contain both the word "目录" and the word "文件".

Logfile

@novelx novelx added the bug It's a bug label Mar 1, 2021
@StinkyBenji

Do you still have this issue? I just tried to reproduce your problem and didn't run into what you described.

@mablin7
Contributor

mablin7 commented Mar 16, 2021

I think I can reproduce it, but I don't speak Chinese, so @novelx please correct me if I misunderstood something.

Latin script, search finds notes that contain all words:
[screenshot]

Chinese test note:
[screenshot]

Chinese test query (two non-consecutive characters taken from the text):
[screenshot]

EDIT: I'm sorry, I overlooked this line while reading through the search engine code. Joplin seems to handle non-Latin scripts by switching to basic search. In basic mode, the search looks for exact matches, so "目录 文件" only finds notes in which these "words" are next to each other, separated by a space. This also explains why the any:1 filter doesn't work on Chinese text.
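
To make the difference concrete, here is a minimal Python sketch of the two behaviours described above. It is not Joplin's code; the function names are hypothetical and the notes are just in-memory strings. It only illustrates why an exact-match search misses notes that an AND-of-terms search would find.

```python
def exact_phrase_match(query: str, body: str) -> bool:
    """Match only if the whole query string appears verbatim in the note."""
    return query in body


def all_terms_match(query: str, body: str) -> bool:
    """Match if every whitespace-separated term appears somewhere in the note."""
    return all(term in body for term in query.split())


# Test notes modelled on the steps to reproduce in the original report.
notes = [
    "目录" + " " * 2 + "文件",  # two spaces between the words
    "文件目录",                 # words adjacent, in the other order
    "目录 文件",                # exactly the query string
]
query = "目录 文件"

exact_hits = [n for n in notes if exact_phrase_match(query, n)]
term_hits = [n for n in notes if all_terms_match(query, n)]
```

Under these assumptions, only the third note is an exact hit, while all three notes contain both terms.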

I still think my second suggestion (use substring matching - LIKE '%文%') should be considered, or at least this problem should be documented in the search section of the website, because at the moment users have no reason to think the search behaves differently for other languages.

What follows is my original comment.

I believe the problem comes from SQLite's FTS tokenizer and the fact that Chinese script does not require spaces between words. According to this comment:

...the built-in tokenizers (unicode61 and others) are latin-oriented. They split strings into tokens according to latin word boundaries (spaces, tabs, newlines, etc.). They don't handle Chinese and generally languages that do not use latin word boundaries. A Chinese sentence such as 你會說中文嗎? will give a single token: 你會說中文嗎.
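
A rough way to see the effect described in that quote is the Python sketch below. It is a stand-in, not SQLite's unicode61 tokenizer: `naive_tokenize` is a hypothetical helper that treats every run of word characters as one token and lets whitespace and punctuation act as separators.

```python
import re
from typing import List


def naive_tokenize(text: str) -> List[str]:
    # Each run of word characters becomes one token.
    # Whitespace and punctuation act as separators.
    return re.findall(r"\w+", text)


latin_tokens = naive_tokenize("Do you speak Chinese?")
chinese_tokens = naive_tokenize("你會說中文嗎?")
```

The Latin sentence splits into four tokens, while the Chinese sentence stays a single token because it contains no separators. A token-based index built this way can only match that whole run, not the words inside it.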

One way to get around this is using a custom tokenizer, like this one implementing fast substring search for FTS, but it would require compiling a C module for all supported platforms, which would complicate Joplin's build process.

A far simpler solution would be to not use FTS for Chinese text at all and use substring matching (LIKE '%文%') instead. This would of course be much slower, but it requires minimal changes to the codebase.
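
As a sketch of what the substring approach could look like, the Python snippet below builds one LIKE condition per search term and joins them with AND. The `notes(id, body)` table and the helper names are assumptions made for illustration, not Joplin's actual schema or code.

```python
import sqlite3
from typing import List


def escape_like(term: str) -> str:
    # Escape LIKE wildcards so the term is matched literally.
    return term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")


def search_all_terms(conn: sqlite3.Connection, query: str) -> List[int]:
    """Return ids of notes whose body contains every space-separated term."""
    terms = query.split()
    if not terms:
        return []
    # One "body LIKE ?" clause per term, combined with AND.
    where = " AND ".join("body LIKE ? ESCAPE '\\'" for _ in terms)
    params = ["%" + escape_like(t) + "%" for t in terms]
    rows = conn.execute(f"SELECT id FROM notes WHERE {where} ORDER BY id", params)
    return [row[0] for row in rows]


# Hypothetical in-memory table standing in for the notes database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO notes (id, body) VALUES (?, ?)",
    [(1, "目录" + " " * 2 + "文件"), (2, "文件目录"), (3, "只有目录")],
)

matching_ids = search_all_terms(conn, "目录 文件")
```

Here notes 1 and 2 match because both terms occur somewhere in the body, and note 3 is excluded because it lacks "文件". Each LIKE with a leading wildcard scans every row, which is the performance cost mentioned above.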

PS. I discovered that the any:1 search filter doesn't seem to work for Chinese texts.

[screenshot]
As you can see, the characters match individually:
[screenshot]
[screenshot]

It might be a completely unrelated bug though.

@StinkyBenji

Ahh, you are right @mablin7. I didn't understand the issue correctly: I simply searched for two words next to each other, which works fine, as you described.
And the substring matching works well.

@novelx
Author

novelx commented Mar 17, 2021

> I think I can reproduce it, but I don't speak Chinese, so @novelx please correct me if I misunderstood something.

@mablin7 You are right! You did exactly what I mentioned in the original post to reproduce the issue.

@stale

stale bot commented Apr 18, 2021

Hey there, it looks like there has been no activity on this issue recently. Has the issue been fixed, or does it still require the community's attention? This issue may be closed if no further activity occurs. You may comment on the issue and I will leave it open. Thank you for your contributions.

@stale stale bot added the stale An issue that hasn't been active for a while... label Apr 18, 2021
@novelx
Author

novelx commented Apr 19, 2021

Anyway, I want to leave it open and hope it will be fixed soon.

@stale stale bot removed the stale An issue that hasn't been active for a while... label Apr 19, 2021
@leon0625

Hope this problem can be fixed soon.
Searching for multiple Chinese keywords is so important.


4 participants