feat: support lindera for japanese and korea tokenization #3218
base: main
Conversation
@wjones127 Meilisearch uses whatlang-rs to detect the language and calls a different tokenizer for each language. Could we implement a similar function? It would be very useful!
Codecov Report — Attention: Patch coverage is

```
@@            Coverage Diff             @@
##              main    #3218     +/-  ##
=========================================
- Coverage    78.55%   78.41%   -0.14%
=========================================
  Files          244      244
  Lines        84021    84611     +590
  Branches     84021    84611     +590
=========================================
+ Hits         66002    66350     +348
- Misses       15225    15457     +232
- Partials      2794     2804      +10
```
I think I'm okay with having this as an experimental feature. I think long term I want a plugin mechanism that could be shared with other tokenizers. See: #3222
```diff
@@ -141,9 +145,72 @@ fn build_base_tokenizer_builder(name: &str) -> Result<tantivy::tokenizer::TextAn
             tantivy::tokenizer::RawTokenizer::default(),
         )
         .dynamic()),
+        #[cfg(feature = "tokenizer-lindera")]
+        s if s.starts_with("lindera/") => {
```
Above, the tokenizer names start with `lindera-`, not `lindera/`.
Sorry, I haven't updated my unit tests yet. My idea is to use a directory structure to manage the different language models:

```
$LANCE_HOME/
    tokenizers/
        lindera/
            japanese-dic1/config.json
            japanese-dic2/config.json
            korea/config.json
        jieba/
            chinese/config.json
```
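Under that layout, resolving a tokenizer name to its config file is just a path join. The following is a minimal sketch of that lookup, not the PR's actual code; `resolve_tokenizer_config` and its arguments are hypothetical names, and the layout is the one sketched above:

```rust
use std::path::PathBuf;

/// Hypothetical helper: map a tokenizer name such as "lindera/japanese-dic1"
/// to its config.json under $LANCE_HOME/tokenizers/, following the
/// directory tree described in the comment above.
fn resolve_tokenizer_config(home: &str, name: &str) -> PathBuf {
    PathBuf::from(home)
        .join("tokenizers")
        .join(name) // the name itself encodes tokenizer kind + model, e.g. "jieba/chinese"
        .join("config.json")
}

fn main() {
    let p = resolve_tokenizer_config("/home/user/.lance", "lindera/japanese-dic1");
    println!("{}", p.display());
}
```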
```diff
+        }
+
+        #[cfg(feature = "tokenizer-lindera")]
+        fn build_lindera_tokenizer_builder(dic: &str) -> Result<tantivy::tokenizer::TextAnalyzerBuilder> {
```
What does `dic` stand for?
This `dic` is used as a relative model path. It seems the tokenizer name gets written into the final index file, so we cannot use an absolute path; instead I use a path relative to `$LANCE_HOME`.
```rust
        Some(p) => {
            let dic_dir = p.join(dic);
            let main_dir = dic_dir.join("main");
            let user_config_path = dic_dir.join("user_config.json");
```
What's the schema of this JSON file?
I have updated the implementation with a clear configuration schema.
rust/lance-core/src/lib.rs (Outdated)

```rust
pub const LANCE_HOME_ENV_KEY: &str = "LANCE_HOME";
pub const LANCE_HOME_DEFAULT_DIRECTORY: &str = "lance";
```
Let's make this specific to lance + lindera:

```diff
-pub const LANCE_HOME_ENV_KEY: &str = "LANCE_HOME";
-pub const LANCE_HOME_DEFAULT_DIRECTORY: &str = "lance";
+pub const LANCE_HOME_ENV_KEY: &str = "LANCE_LINDERA_HOME";
+pub const LANCE_HOME_DEFAULT_DIRECTORY: &str = "lance_lindera";
```
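A common way to consume such a constant pair is "env var wins, otherwise fall back to a default under the user's home". This is only a sketch of that pattern, assuming the suggested constant values; `resolve_home` is a hypothetical helper, not code from the PR:

```rust
use std::path::PathBuf;

const LANCE_HOME_ENV_KEY: &str = "LANCE_LINDERA_HOME";
const LANCE_HOME_DEFAULT_DIRECTORY: &str = "lance_lindera";

/// Hypothetical resolution: prefer the env var's value; otherwise fall
/// back to `<user_home>/lance_lindera`. Takes the env value as a
/// parameter so the logic stays a pure, testable function.
fn resolve_home(env_value: Option<&str>, user_home: &str) -> PathBuf {
    match env_value {
        Some(p) if !p.is_empty() => PathBuf::from(p),
        _ => PathBuf::from(user_home).join(LANCE_HOME_DEFAULT_DIRECTORY),
    }
}

fn main() {
    // In real code the first argument would come from the environment.
    let env_value = std::env::var(LANCE_HOME_ENV_KEY).ok();
    let home = resolve_home(env_value.as_deref(), "/home/user");
    println!("{}", home.display());
}
```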
Yes, a global home path may not be a good idea, so I created a draft. But I think tokenizers can share the same home. The different tokenizers are quite similar; they all contain:

1. a trie dictionary,
2. an n-gram model (for Chinese a 1-gram model is enough, so this part is sometimes absent),

and they all use the Viterbi algorithm. That's why I think different tokenizers can share the same home. BTW, the code size of the tokenizers is quite small and has little effect on the final wheel package size.
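To make the shared structure concrete: the dictionary-plus-Viterbi idea boils down to building a lattice of known words over the input and picking the lowest-cost path by dynamic programming. The toy sketch below uses a `HashMap` with per-word costs instead of a real trie and connection-cost matrix (which Lindera and Jieba use), so it only illustrates the algorithm's shape:

```rust
use std::collections::HashMap;

/// Toy Viterbi segmentation: `best[i]` holds the cost of the cheapest
/// segmentation of the first `i` characters and where its last word
/// starts. Real tokenizers add bigram connection costs between words;
/// here each word carries only its own cost.
fn viterbi_segment(text: &str, dict: &HashMap<&str, f64>) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let n = chars.len();
    let mut best: Vec<Option<(f64, usize)>> = vec![None; n + 1];
    best[0] = Some((0.0, 0));
    for i in 0..n {
        if let Some((cost_i, _)) = best[i] {
            // Try every dictionary word starting at position i.
            for j in (i + 1)..=n {
                let word: String = chars[i..j].iter().collect();
                if let Some(&w_cost) = dict.get(word.as_str()) {
                    let cand = cost_i + w_cost;
                    if best[j].map_or(true, |(c, _)| cand < c) {
                        best[j] = Some((cand, i));
                    }
                }
            }
        }
    }
    // Walk back from the end to recover the chosen words.
    let mut out = Vec::new();
    let mut i = n;
    while i > 0 {
        let (_, start) = best[i].expect("input not covered by dictionary");
        out.push(chars[start..i].iter().collect());
        i = start;
    }
    out.reverse();
    out
}

fn main() {
    let mut dict = HashMap::new();
    dict.insert("東京", 1.0);
    dict.insert("都", 0.5);
    dict.insert("東京都", 2.0);
    // "東京" + "都" (cost 1.5) beats the single word "東京都" (cost 2.0).
    println!("{:?}", viterbi_segment("東京都", &dict));
}
```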
I have moved the env variable into the lance-index package.
Lindera is the successor of MeCab. It supports multiple languages; currently CJK (Chinese, Japanese, Korean) are supported, and users can also build their own models for other languages. Quickwit, Meilisearch, Qdrant and ParadeDB also integrate with it.

Lindera supports two model loading mechanisms:

I integrated with the second one, because the default language model is very old (by the way, Jieba's default model is also very old) and we have to update language models frequently if we want to do serious work, so bundling the language model may not be a good idea.

In this PR I defined a LANCE_TOKENIZERS_HOME env variable; users can put their models into this folder.