
How can I use FTS5 Tokenizers to search Chinese? #413

Closed
qiulang opened this issue Sep 21, 2018 · 10 comments

qiulang commented Sep 21, 2018

What did you do?

I have carefully read your FTS5 Tokenizers document and think maybe I can use a good Chinese word segmentation library, e.g. https://github.com/yanyiwu/cppjieba/blob/master/README_EN.md, to do better FTS in SQLite. But I have run into some problems.

First of all, have you noticed that the FTS5 unicode61 tokenizer seems not to support CJK at all? I asked the question on SO and on the SQLite mailing list. No answer yet.

So if the FTS5 unicode61 tokenizer does indeed not support Chinese, I doubt that your FTS5WrapperTokenizer with unicode61 as the wrappedTokenizer would work for Chinese either?

I tried to build an ICU-enabled SQLite but failed. All the documents I googled seem outdated. So I did not try your FTS3 tokenizer. I don't know whether FTS can still work for CJK when SQLite is built without ICU. I have limited knowledge about SQLite in this area, so correct me if I am wrong.

Second, say I find an FTS5 tokenizer that tokenizes Chinese characters one by one; then I can use https://github.com/yanyiwu/cppjieba/ to further segment Chinese words. To use your FTS5WrapperTokenizer, I have to implement the accept(token:flags:for:tokenCallback:) method and call cppjieba from it, right? And since cppjieba is C++, I have to wrap it for Swift.

Or I could implement FTS5CustomTokenizer and call cppjieba directly. Which one is easier, from your point of view?

Thanks!

Qiulang

Environment

GRDB flavor(s): GRDB
GRDB version: latest
Xcode version: 10
Swift version: 4
Platform(s) running GRDB: macOS
macOS version running Xcode: 10.13

groue (Owner) commented Sep 21, 2018

Hello @qiulang,

> I have carefully read your FTS5 Tokenizers document and think maybe I can use a good Chinese word segmentation library, e.g. https://github.com/yanyiwu/cppjieba/blob/master/README_EN.md, to do better FTS in SQLite. But I have run into some problems.

All right, let's see how we can help.

> First of all, have you noticed that the FTS5 unicode61 tokenizer seems not to support CJK at all? I asked the question on SO and on the SQLite mailing list. No answer yet.

I can answer you: the built-in tokenizers (unicode61 and others) are Latin-oriented. They split strings into tokens at Latin word boundaries (spaces, tabs, newlines, etc.). They don't handle Chinese, or more generally languages that do not use Latin word boundaries. A Chinese sentence such as 你會說中文嗎? yields a single token: 你會說中文嗎.
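The behavior above can be sketched with GRDB. This is a minimal illustration, not code from the thread: the `document` table, its `content` column, and the sample sentence are hypothetical, and the GRDB calls follow its documented FTS5 API.

```swift
import GRDB

// Hypothetical in-memory database, just to illustrate the point.
let dbQueue = try DatabaseQueue()

try dbQueue.write { db in
    // An FTS5 table using the built-in unicode61 tokenizer.
    try db.create(virtualTable: "document", using: FTS5()) { t in
        t.tokenizer = .unicode61()
        t.column("content")
    }
    try db.execute(
        sql: "INSERT INTO document (content) VALUES (?)",
        arguments: ["你會說中文嗎?"])

    // unicode61 indexes the whole sentence as one token (你會說中文嗎),
    // so a full-text query for the word 中文 matches nothing:
    let count = try Int.fetchOne(db, sql: """
        SELECT COUNT(*) FROM document WHERE document MATCH ?
        """, arguments: ["中文"])
    // count is 0: FTS5 matches whole tokens, and no token equals 中文.
}
```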

> So if the FTS5 unicode61 tokenizer does indeed not support Chinese, I doubt that your FTS5WrapperTokenizer with unicode61 as the wrappedTokenizer would work for Chinese either?

Exactly. FTS5WrapperTokenizer helps adapt an existing tokenizer. If no existing tokenizer handles Chinese well, FTS5WrapperTokenizer won't do a good job either. More details below:

> Second, say I find an FTS5 tokenizer that tokenizes Chinese characters one by one; then I can use https://github.com/yanyiwu/cppjieba/ to further segment Chinese words. To use your FTS5WrapperTokenizer, I have to implement the accept(token:flags:for:tokenCallback:) method and call cppjieba from it, right? And since cppjieba is C++, I have to wrap it for Swift.
>
> Or I could implement FTS5CustomTokenizer and call cppjieba directly. Which one is easier, from your point of view?

In FTS5WrapperTokenizer.accept(token:flags:for:tokenCallback:), you can attempt to split a token such as 你會說中文嗎 into sub-tokens that you register with tokenCallback.

Search will probably work, but FTS5 won't be able to locate the matched tokens in the original text, so SQL functions like snippet() won't give good results.
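The wrapper approach can be sketched as follows, assuming GRDB's documented FTS5WrapperTokenizer protocol. The SingleCharCJKTokenizer class and its naive one-character segmentation are hypothetical; a real implementation would call a segmentation library such as cppjieba instead.

```swift
import GRDB

// A sketch: wrap unicode61 and split every token it hands over
// into one-character sub-tokens, so each Chinese character becomes
// individually searchable. (Naive; a real tokenizer would segment
// into actual words.)
final class SingleCharCJKTokenizer: FTS5WrapperTokenizer {
    static let name = "singlechar_cjk"
    let wrappedTokenizer: FTS5Tokenizer

    init(db: Database, arguments: [String]) throws {
        wrappedTokenizer = try db.makeTokenizer(.unicode61())
    }

    func accept(
        token: String,
        flags: FTS5TokenFlags,
        for tokenization: FTS5Tokenization,
        tokenCallback: (String, FTS5TokenFlags) throws -> Void
    ) throws {
        // Emit each character of the wrapped token as its own token.
        for character in token {
            try tokenCallback(String(character), flags)
        }
    }
}
```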

Another solution is the FTS5CustomTokenizer protocol, with which you can build a genuine full-featured FTS5 tokenizer.

You'd have to adapt cppjieba (assuming it can do the job) so that it can feed FTS5CustomTokenizer.tokenize(context:tokenization:pText:nText:tokenCallback:). This method exactly matches the xTokenize function described at https://www.sqlite.org/fts5.html#custom_tokenizers. It will be all C, C++, and raw pointers everywhere.
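The shape of that low-level protocol can be sketched like this, assuming GRDB's FTS5CustomTokenizer API as documented. The JiebaTokenizer class name and the bridging comments are hypothetical; the body is left as a stub because the actual segmentation would live in C++.

```swift
import GRDB

// A sketch of the low-level protocol. tokenize(...) mirrors SQLite's
// xTokenize: it receives the raw UTF-8 buffer (pText/nText) and must
// invoke tokenCallback once per token, with byte offsets into that
// buffer, so that snippet() and friends keep working.
final class JiebaTokenizer: FTS5CustomTokenizer {
    static let name = "jieba"  // hypothetical tokenizer name

    init(db: Database, arguments: [String]) throws { }

    func tokenize(
        context: UnsafeMutableRawPointer?,
        tokenization: FTS5Tokenization,
        pText: UnsafePointer<Int8>?,
        nText: Int32,
        tokenCallback: @escaping FTS5TokenCallback
    ) -> Int32 {
        // Feed pText/nText to a C++ segmenter (e.g. cppjieba through a
        // C bridge), then call tokenCallback for each segmented word
        // with its UTF-8 bytes, length, and start/end byte offsets.
        return 0 // SQLITE_OK
    }
}
```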

A third solution is to register an existing Chinese FTS5 tokenizer, if any exists. You'll likely use the low-level SQLite connection handle (see Raw SQLite Pointers).

I frankly don't know which solution is the easiest, because all of them look non-trivial to me 😅

groue added the question label Sep 21, 2018
groue (Owner) commented Sep 21, 2018

Reading GRDB code in order to refresh my memory of FTS5, I realize that custom tokenizers may not work well with Database Pools (reads will not know about the custom tokenizer, and will fail). During your experiments, use a Database Queue. The fix for Database Pools will come eventually.
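Registering a custom tokenizer on a Database Queue can be sketched as follows, per GRDB's documented API. The JiebaTokenizer type stands for any hypothetical class conforming to FTS5CustomTokenizer, and the database path and table name are illustrative.

```swift
import GRDB

let dbQueue = try DatabaseQueue(path: "/path/to/db.sqlite")

// Register the custom tokenizer class (hypothetical JiebaTokenizer,
// conforming to FTS5CustomTokenizer) before using it in a table.
dbQueue.add(tokenizer: JiebaTokenizer.self)

try dbQueue.write { db in
    // Create an FTS5 table indexed with the custom tokenizer.
    try db.create(virtualTable: "document", using: FTS5()) { t in
        t.tokenizer = JiebaTokenizer.tokenizerDescriptor()
        t.column("content")
    }
}
```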

qiulang (Author) commented Sep 21, 2018

Thanks for answering my question and sharing your experience (and confirming my doubts)!
To my knowledge, no Chinese FTS5 tokenizer exists :(
Tencent's wcdb has an FTS3 tokenizer: https://github.com/Tencent/wcdb/blob/master/fts/mm_tokenizer.c

May I ask another question? I failed to build an ICU-enabled SQLite. All the documents I googled seem outdated. Do you have experience with that? And would an ICU version help here?

groue (Owner) commented Sep 21, 2018

GRDB does not (yet) support custom FTS3 tokenizers. Would it be a good idea to bring this feature in?

And no, I don't know much about ICU.

I'm sorry I can't help more right now. I admit you're entering uncharted territory. Whatever solution you choose, please share your experience: it will be invaluable for other users who have demanding tokenization needs!!!

qiulang (Author) commented Sep 22, 2018

Yes, I will share whatever I find!

groue (Owner) commented Sep 23, 2018

> [...] custom tokenizers may not work well with Database Pools [...]

The fix is out in GRDB 3.3.1 (see PR #414).

qiulang (Author) commented Sep 24, 2018

What I have found so far is that the ICU version can "partially" support Chinese, but FTS5 does not support Chinese. Please refer to what I discussed with others on the SQLite mailing list, here and here

groue (Owner) commented Sep 24, 2018

All right, @qiulang. Thanks for the links! I'll close this issue now. If you think GRDB needs any enhancement on this front, will you please open a new issue? Meanwhile, happy indexing!

dadiorchen commented:

@qiulang Hi, did you make any progress on this? Is there any possible solution for FTS with Chinese on Android?

wjkoh commented Aug 4, 2024

Signal messenger recently released Signal-FTS5-Extension, claiming it "enables applications to support CJK symbols in full-text search when used as a custom FTS5 tokenizer." The tokenizer likely implements ICU tokenization. Consider testing it.
