
How can I use FTS5 Tokenizers to search Chinese? #413

Closed
qiulang opened this issue Sep 21, 2018 · 10 comments

qiulang commented Sep 21, 2018

What did you do?

I have carefully read your FTS5 Tokenizers document and think maybe I can use a good Chinese word segmentation library, e.g. https://github.com/yanyiwu/cppjieba/blob/master/README_EN.md, to do better FTS in SQLite. But I have run into some problems.

First of all, have you noticed that the FTS5 unicode61 tokenizer seems not to support CJK at all? I asked the question on SO and on the SQLite mailing list. No answer yet.

So if the FTS5 unicode61 tokenizer does indeed not support Chinese, I doubt that your FTS5WrapperTokenizer with unicode61 as the wrappedTokenizer would work for Chinese either?

I tried to build an ICU-enabled SQLite but failed. All the documents I googled seem outdated. So I did not try your FTS3 tokenizer. I don't know whether FTS can still work for CJK when SQLite is built without ICU. I have limited knowledge about SQLite in this area, so correct me if I am wrong.

Second, say I find an FTS5 tokenizer that tokenizes Chinese characters one by one; then I can use https://github.com/yanyiwu/cppjieba/ to further segment Chinese words. To use your FTS5WrapperTokenizer, I have to implement the accept(token:flags:for:tokenCallback:) method and call cppjieba from it, right? And since cppjieba is C++, I have to wrap it for Swift.

Or I could implement FTS5CustomTokenizer and call cppjieba directly. Which one is easier, from your point of view?

Thanks!

Qiulang

Environment

GRDB flavor(s): GRDB
GRDB version: latest
Xcode version: 10
Swift version: 4
Platform(s) running GRDB: macOS
macOS version running Xcode: 10.13

groue (Owner) commented Sep 21, 2018

Hello @qiulang,

> I have carefully read your FTS5 Tokenizers document and think maybe I can use a good Chinese word segmentation library, e.g. https://github.com/yanyiwu/cppjieba/blob/master/README_EN.md, to do better FTS in SQLite. But I have run into some problems.

All right, let's see how we can help.

> First of all, have you noticed that the FTS5 unicode61 tokenizer seems not to support CJK at all? I asked the question on SO and on the SQLite mailing list. No answer yet.

I can answer you: the built-in tokenizers (unicode61 and others) are Latin-oriented. They split strings into tokens at Latin word boundaries (spaces, tabs, newlines, etc.). They don't handle Chinese, or more generally languages that do not use Latin word boundaries. A Chinese sentence such as 你會說中文嗎? yields a single token: 你會說中文嗎.
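The behavior above can be sketched with GRDB. This is a minimal illustration, not code from the thread: the `document` table, its `content` column, and the sample sentence are hypothetical, and the GRDB calls follow its documented FTS5 API.

```swift
import GRDB

// Hypothetical in-memory database, just to illustrate the point.
let dbQueue = try DatabaseQueue()

try dbQueue.write { db in
    // An FTS5 table using the built-in unicode61 tokenizer.
    try db.create(virtualTable: "document", using: FTS5()) { t in
        t.tokenizer = .unicode61()
        t.column("content")
    }
    try db.execute(
        sql: "INSERT INTO document (content) VALUES (?)",
        arguments: ["你會說中文嗎?"])

    // unicode61 indexes the whole sentence as one token (你會說中文嗎),
    // so a full-text query for the word 中文 matches nothing:
    let count = try Int.fetchOne(db, sql: """
        SELECT COUNT(*) FROM document WHERE document MATCH ?
        """, arguments: ["中文"])
    // count is 0: FTS5 matches whole tokens, and no token equals 中文.
}
```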

> So if the FTS5 unicode61 tokenizer does indeed not support Chinese, I doubt that your FTS5WrapperTokenizer with unicode61 as the wrappedTokenizer would work for Chinese either?

Exactly. FTS5WrapperTokenizer helps adapt an existing tokenizer. If no existing tokenizer handles Chinese well, FTS5WrapperTokenizer won't do a good job either. More details below:

> Second, say I find an FTS5 tokenizer that tokenizes Chinese characters one by one; then I can use https://github.com/yanyiwu/cppjieba/ to further segment Chinese words. To use your FTS5WrapperTokenizer, I have to implement the accept(token:flags:for:tokenCallback:) method and call cppjieba from it, right? And since cppjieba is C++, I have to wrap it for Swift.
>
> Or I could implement FTS5CustomTokenizer and call cppjieba directly. Which one is easier, from your point of view?

In FTS5WrapperTokenizer.accept(token:flags:for:tokenCallback:), you can attempt to split a token such as 你會說中文嗎 into sub-tokens that you register with tokenCallback.

Search will probably work, but FTS5 won't be able to locate the matched tokens in the original text, so SQL functions like snippet() won't give good results.
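The wrapper approach can be sketched as follows, assuming GRDB's documented FTS5WrapperTokenizer protocol. The SingleCharCJKTokenizer class and its naive one-character segmentation are hypothetical; a real implementation would call a segmentation library such as cppjieba instead.

```swift
import GRDB

// A sketch: wrap unicode61 and split every token it hands over
// into one-character sub-tokens, so each Chinese character becomes
// individually searchable. (Naive; a real tokenizer would segment
// into actual words.)
final class SingleCharCJKTokenizer: FTS5WrapperTokenizer {
    static let name = "singlechar_cjk"
    let wrappedTokenizer: FTS5Tokenizer

    init(db: Database, arguments: [String]) throws {
        wrappedTokenizer = try db.makeTokenizer(.unicode61())
    }

    func accept(
        token: String,
        flags: FTS5TokenFlags,
        for tokenization: FTS5Tokenization,
        tokenCallback: (String, FTS5TokenFlags) throws -> Void
    ) throws {
        // Emit each character of the wrapped token as its own token.
        for character in token {
            try tokenCallback(String(character), flags)
        }
    }
}
```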

Another solution is the FTS5CustomTokenizer protocol, with which you can build a genuine full-featured FTS5 tokenizer.

You'd have to adapt cppjieba (assuming it can do the job) so that it can feed FTS5CustomTokenizer.tokenize(context:tokenization:pText:nText:tokenCallback:). This method exactly matches the xTokenize function described at https://www.sqlite.org/fts5.html#custom_tokenizers. It will be all C, C++, and raw pointers everywhere.
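The shape of that low-level protocol can be sketched like this, assuming GRDB's FTS5CustomTokenizer API as documented. The JiebaTokenizer class name and the bridging comments are hypothetical; the body is left as a stub because the actual segmentation would live in C++.

```swift
import GRDB

// A sketch of the low-level protocol. tokenize(...) mirrors SQLite's
// xTokenize: it receives the raw UTF-8 buffer (pText/nText) and must
// invoke tokenCallback once per token, with byte offsets into that
// buffer, so that snippet() and friends keep working.
final class JiebaTokenizer: FTS5CustomTokenizer {
    static let name = "jieba"  // hypothetical tokenizer name

    init(db: Database, arguments: [String]) throws { }

    func tokenize(
        context: UnsafeMutableRawPointer?,
        tokenization: FTS5Tokenization,
        pText: UnsafePointer<Int8>?,
        nText: Int32,
        tokenCallback: @escaping FTS5TokenCallback
    ) -> Int32 {
        // Feed pText/nText to a C++ segmenter (e.g. cppjieba through a
        // C bridge), then call tokenCallback for each segmented word
        // with its UTF-8 bytes, length, and start/end byte offsets.
        return 0 // SQLITE_OK
    }
}
```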

A third solution is to register an existing Chinese FTS5 tokenizer, if any exists. You'll likely use the low-level SQLite connection handle (see Raw SQLite Pointers).

I frankly don't know which solution is the easiest, because all of them look non-trivial to me 😅

groue added the question label Sep 21, 2018
groue (Owner) commented Sep 21, 2018

Reading GRDB code in order to refresh my memory of FTS5, I realize that custom tokenizers may not work well with Database Pools (reads will not know about the custom tokenizer, and will fail). During your experiments, use a Database Queue. The fix for Database Pools will come eventually.
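Registering a custom tokenizer on a Database Queue can be sketched as follows, per GRDB's documented API. The JiebaTokenizer type stands for any hypothetical class conforming to FTS5CustomTokenizer, and the database path and table name are illustrative.

```swift
import GRDB

let dbQueue = try DatabaseQueue(path: "/path/to/db.sqlite")

// Register the custom tokenizer class (hypothetical JiebaTokenizer,
// conforming to FTS5CustomTokenizer) before using it in a table.
dbQueue.add(tokenizer: JiebaTokenizer.self)

try dbQueue.write { db in
    // Create an FTS5 table indexed with the custom tokenizer.
    try db.create(virtualTable: "document", using: FTS5()) { t in
        t.tokenizer = JiebaTokenizer.tokenizerDescriptor()
        t.column("content")
    }
}
```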

qiulang (Author) commented Sep 21, 2018

Thanks for answering my question and sharing your experience (and confirming my doubts)!
To my knowledge, no Chinese FTS5 tokenizer exists :(
Tencent's wcdb has an FTS3 tokenizer: https://github.com/Tencent/wcdb/blob/master/fts/mm_tokenizer.c

May I ask another question? I failed to build an ICU-enabled SQLite. All the documents I googled seem outdated. Do you have experience with that? And would an ICU version help here?

groue (Owner) commented Sep 21, 2018

GRDB does not (yet) support custom FTS3 tokenizers. Would it be a good idea to bring this feature in?

And no, I don't know much about ICU.

I'm sorry I can't help more right now. I admit you're entering uncharted territory. Whatever solution you choose, please share your experience: it will be invaluable for other users who have demanding tokenization needs!!!

qiulang (Author) commented Sep 22, 2018

Yes, I will share whatever I find!

groue (Owner) commented Sep 23, 2018

> [...] custom tokenizers may not work well with Database Pools [...]

The fix is out in GRDB 3.3.1 (see PR #414).

qiulang (Author) commented Sep 24, 2018

What I have found so far is that the ICU version can "partially" support Chinese, but FTS5 does not support Chinese. Please refer to what I discussed with others on the SQLite mailing list, here and here

groue (Owner) commented Sep 24, 2018

All right, @qiulang. Thanks for the links! I'll close this issue now. If you think GRDB needs any enhancement on this front, will you please open a new issue? Meanwhile, happy indexing!

dadiorchen commented:

@qiulang Hi, did you make any progress on this? Is there any possible solution for FTS with Chinese on Android?

wjkoh commented Aug 4, 2024

Signal messenger recently released Signal-FTS5-Extension, claiming it "enables applications to support CJK symbols in full-text search when used as a custom FTS5 tokenizer." The tokenizer likely implements ICU tokenization. Consider testing it.
