How can I use FTS5 Tokenizers to search Chinese? #413
Comments
Hello @qiulang,
All right, let's see how we can help.
I can answer you: the built-in tokenizers (unicode61 and others) are Latin-oriented. They split strings into tokens according to Latin word boundaries (spaces, tabs, newlines, etc.). They don't handle Chinese, or generally languages that do not use Latin word boundaries: a Chinese sentence contains no such separators, so each unbroken run of characters is indexed as a single long token.
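To make this concrete, here is a minimal sketch using GRDB's FTS5 API (it assumes a recent GRDB version for the `db.create(virtualTable:using:)` and SQL-execution syntax, and an FTS5-enabled SQLite; the Chinese sample sentence is just an illustration):

```swift
import GRDB

// In-memory database, just for this demonstration.
let dbQueue = try DatabaseQueue(path: ":memory:")

try dbQueue.write { db in
    // Full-text table indexed with the built-in unicode61 tokenizer.
    try db.create(virtualTable: "document", using: FTS5()) { t in
        t.tokenizer = .unicode61()
        t.column("content")
    }
    try db.execute(
        sql: "INSERT INTO document (content) VALUES (?)",
        arguments: ["GRDB是一个数据库工具包"])

    // unicode61 finds no word boundary inside the run of CJK characters,
    // so the whole run is indexed as one long token, and a search for an
    // individual word matches nothing:
    let count = try Int.fetchOne(
        db,
        sql: "SELECT COUNT(*) FROM document WHERE document MATCH ?",
        arguments: ["数据库"])!
    print(count) // 0
}
```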
Exactly. FTS5WrapperTokenizer helps adapt an existing tokenizer. If no existing tokenizer can handle Chinese well, FTS5WrapperTokenizer won't do a good job. More details below:
Search will probably work. But FTS5 won't be able to locate the matched tokens in the original text, and SQL functions like snippet won't give good results.

Another solution is the FTS5CustomTokenizer protocol, with which you can build a genuine full-featured FTS5 tokenizer. You'd have to adapt cppjieba (assuming it can do the job) so that it can feed FTS5 with tokens.

A third solution is to register an existing Chinese FTS5 tokenizer, if any exists. You'll likely use the low-level SQLite connection handle (see Raw SQLite Pointers).

I frankly don't know which solution is the easiest, because all of them look non-trivial to me 😅
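To make the wrapper route more concrete, here is a minimal sketch built on GRDB's FTS5WrapperTokenizer protocol, assuming a GRDB build with FTS5 support. The `segment(_:)` function is hypothetical: it stands in for whatever Swift bridge to cppjieba (or another segmenter) ends up being written:

```swift
import GRDB

// Hypothetical: a Swift bridge to a Chinese word segmenter such as
// cppjieba. A real implementation would call into C++ through a shim.
func segment(_ text: String) -> [String] {
    return [text] // placeholder: pass the token through unchanged
}

final class JiebaTokenizer: FTS5WrapperTokenizer {
    static let name = "jieba"
    let wrappedTokenizer: FTS5Tokenizer

    init(db: Database, arguments: [String]) throws {
        // Let unicode61 do the basic splitting, then refine its tokens.
        wrappedTokenizer = try db.makeTokenizer(.unicode61())
    }

    func accept(
        token: String,
        flags: FTS5TokenFlags,
        for tokenization: FTS5Tokenization,
        tokenCallback: FTS5WrapperTokenCallback) throws
    {
        // Re-split each token produced by unicode61 into Chinese words,
        // and notify FTS5 of each of them.
        for word in segment(token) {
            try tokenCallback(word, flags)
        }
    }
}
```

Note the limitation described above: since the wrapper emits synthesized tokens, FTS5 still cannot locate them in the original text, so snippet output remains degraded.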
Reading GRDB code in order to refresh my memory of FTS5, I realize that custom tokenizers may not work well with Database Pools (reads will not know about the custom tokenizer, and fail). During your experiments, use a Database Queue. The fix for Database Pools will come eventually.
Thanks for answering my question and sharing your experience (and confirming my doubt)! May I ask another question? I failed to build an ICU-enabled SQLite. All the documents I googled seem outdated. Do you have any experience with that? And would an ICU build help here?
GRDB does not support (yet) custom FTS3 tokenizers. It may be a good idea to bring this feature in. And no, I don't know much about ICU. I'm sorry I can't help more right now. I admit you're entering uncharted territory. Whatever solution you choose, please share your experience: it will be invaluable for other users who have demanding tokenization needs!
Yes, I will share whatever I find!
The fix is out, in GRDB 3.3.1 (see PR #414).
All right, @qiulang. Thanks for the links! I'll close this issue now. If you think GRDB needs any enhancement on this front, will you please open a new issue? Meanwhile, happy indexing!
Hi @qiulang, did you make any progress on this? Is there any workable solution for Chinese FTS on Android?
Signal messenger recently released Signal-FTS5-Extension, claiming it "enables applications to support CJK symbols in full-text search when used as a custom FTS5 tokenizer." The tokenizer likely implements ICU tokenization. Consider testing it. |
What did you do?
I have carefully read your FTS5 Tokenizers document, and I think I might be able to use a good Chinese word segmentation library, e.g. https://github.com/yanyiwu/cppjieba/blob/master/README_EN.md, to get better FTS results in SQLite. But I have run into some problems.
First of all, have you noticed that the FTS5 unicode61 tokenizer does not seem to support CJK at all?
I asked the question on Stack Overflow and on the SQLite mailing list.
No answer yet.
So if the FTS5 unicode61 tokenizer does indeed not support Chinese, then I doubt that using your FTS5WrapperTokenizer with unicode61 as the wrappedTokenizer will work for Chinese.

I tried to build an ICU-enabled SQLite but failed. All the documents I googled seem outdated. So I did not try your FTS3 tokenizer. If SQLite is built without ICU support, can FTS still work for CJK? I have limited knowledge about SQLite in this area, so correct me if I am wrong.
Second, say I find an FTS5 tokenizer that tokenizes Chinese characters one by one; then I can use https://github.com/yanyiwu/cppjieba/ to further segment Chinese words. To use your FTS5WrapperTokenizer, I have to implement the accept(token:flags:for:tokenCallback:) method so that it calls cppjieba, right? And since cppjieba is C++, I would have to wrap it for Swift. Or I could implement FTS5CustomTokenizer and call cppjieba directly. Which one is easier from your point of view?
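For reference, here is roughly how a custom tokenizer such as the JiebaTokenizer sketched in the comments above could be wired up once implemented. The database path is a placeholder, and the `dbQueue.add(tokenizer:)` registration assumes a GRDB version that includes the Database Pool fix mentioned in this thread:

```swift
import GRDB

// Register the custom tokenizer, then index and query with it.
let dbQueue = try DatabaseQueue(path: "/path/to/db.sqlite")
dbQueue.add(tokenizer: JiebaTokenizer.self)

try dbQueue.write { db in
    try db.create(virtualTable: "document", using: FTS5()) { t in
        t.tokenizer = JiebaTokenizer.tokenizerDescriptor()
        t.column("content")
    }
    try db.execute(
        sql: "INSERT INTO document (content) VALUES (?)",
        arguments: ["GRDB是一个数据库工具包"])

    // With proper word segmentation, a search for an individual
    // word now has a chance to match:
    let count = try Int.fetchOne(
        db,
        sql: "SELECT COUNT(*) FROM document WHERE document MATCH ?",
        arguments: ["数据库"])!
    print(count) // 1, assuming the segmenter emits "数据库" as a token
}
```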
Thanks!
Qiulang
Environment
GRDB flavor(s): GRDB
GRDB version: latest
Xcode version: 10
Swift version: 4
Platform(s) running GRDB: macOS
macOS version running Xcode: 10.13