
feat: support lindera for japanese and korean tokenization #3218

Draft · wants to merge 6 commits into main

Conversation

@chenkovsky (Contributor) commented Dec 9, 2024

Lindera is the successor of MeCab. It supports multiple languages; currently CJK (Chinese, Japanese, Korean) are supported, and users can build their own models for other languages. Quickwit, Meilisearch, Qdrant, and ParadeDB also integrate with it.

Lindera supports two model-loading mechanisms:

  1. include the language model in the compiled binary.
  2. load the model from a given path.

I integrated it using the second one, because the default language model is very old (by the way, Jieba's default model is also very old). We have to update the language model frequently if we want to do serious work, so bundling the language model may not be a good idea.

In this PR, I defined a LANCE_TOKENIZERS_HOME environment variable; users can put their models into this folder.
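A minimal sketch of the lookup this describes. The variable name LANCE_TOKENIZERS_HOME comes from the PR text; the helper names and the fallback directory are assumptions for illustration, not the PR's actual code:

```rust
use std::env;
use std::path::PathBuf;

const TOKENIZERS_HOME_ENV: &str = "LANCE_TOKENIZERS_HOME";

// Pure helper so the fallback logic is testable without touching process env.
fn tokenizers_home_from(value: Option<String>) -> PathBuf {
    value
        .map(PathBuf::from)
        // Assumed fallback directory, purely illustrative.
        .unwrap_or_else(|| PathBuf::from("lance/tokenizers"))
}

fn tokenizers_home() -> PathBuf {
    tokenizers_home_from(env::var(TOKENIZERS_HOME_ENV).ok())
}
```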

@github-actions bot added the enhancement (New feature or request) and python labels Dec 9, 2024
@chenkovsky (Contributor, Author)

@wjones127 Meilisearch uses whatlang-rs to detect the language and calls different tokenizers for different languages. Can we implement a similar function? It's very useful!
But I think it's not a good idea to couple in language detection. Modern language detection uses fastText or neural networks, which are hard to bundle into a compiled binary, and we would also want to update the language detection model frequently. Maybe we can instead pass an extra column that indicates the language of the text when we create the index.

for example:

create_index(...., language_column="column_name")
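On the Rust side, a hedged sketch of what such per-language dispatch could look like. All names here are hypothetical, not part of this PR: the caller supplies the language tag per row, and the index builder looks up the matching tokenizer instead of detecting the language itself:

```rust
use std::collections::HashMap;

type TokenizeFn = fn(&str) -> Vec<String>;

// Dispatch on the caller-provided language tag from the language column.
fn tokenize_with_language(
    tokenizers: &HashMap<&str, TokenizeFn>,
    text: &str,
    lang: &str,
) -> Vec<String> {
    match tokenizers.get(lang) {
        Some(tok) => tok(text),
        // Fall back to whitespace splitting when no tokenizer is registered.
        None => text.split_whitespace().map(str::to_string).collect(),
    }
}
```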

@codecov-commenter commented Dec 9, 2024

Codecov Report

Attention: Patch coverage is 0% with 57 lines in your changes missing coverage. Please review.

Project coverage is 78.41%. Comparing base (84c6fc0) to head (b8c778e).
Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
rust/lance-index/src/scalar/inverted/tokenizer.rs 0.00% 57 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3218      +/-   ##
==========================================
- Coverage   78.55%   78.41%   -0.14%     
==========================================
  Files         244      244              
  Lines       84021    84611     +590     
  Branches    84021    84611     +590     
==========================================
+ Hits        66002    66350     +348     
- Misses      15225    15457     +232     
- Partials     2794     2804      +10     
Flag Coverage Δ
unittests 78.41% <0.00%> (-0.14%) ⬇️


@wjones127 (Contributor) left a comment

I think I'm okay with having this as an experimental feature. I think long term I want a plugin mechanism that could be shared with other tokenizers. See: #3222

@@ -141,9 +145,72 @@ fn build_base_tokenizer_builder(name: &str) -> Result<tantivy::tokenizer::TextAn
tantivy::tokenizer::RawTokenizer::default(),
)
.dynamic()),
#[cfg(feature = "tokenizer-lindera")]
s if s.starts_with("lindera/") => {
Contributor:

Above, the tokenizer names start with lindera- not lindera/

@chenkovsky (Contributor, Author) Dec 10, 2024:

Sorry, I haven't updated my unit tests yet. My idea is to use a directory structure to manage the different language models.

$LANCE_HOME/
    tokenizers/
        lindera/
            japanese-dic1/config.json
            japanese-dic2/config.json
            korea/config.json
        jieba/
            chinese/config.json
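Under that layout, resolving a tokenizer name to its model directory could look like the following sketch; the function and argument names are hypothetical, and the directory names come from the tree above:

```rust
use std::path::{Path, PathBuf};

// Treat the tokenizer name (e.g. "lindera/japanese-dic1") as a path
// relative to $LANCE_HOME/tokenizers.
fn resolve_model_dir(home: &Path, tokenizer_name: &str) -> PathBuf {
    home.join("tokenizers").join(tokenizer_name)
}
```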

}

#[cfg(feature = "tokenizer-lindera")]
fn build_lindera_tokenizer_builder(dic: &str) -> Result<tantivy::tokenizer::TextAnalyzerBuilder> {
Contributor:

What does dic stand for?

@chenkovsky (Contributor, Author):

dic is used as a relative model path. It seems the tokenizer name gets written into the final index file, so we cannot use an absolute path; instead I use a path relative to $LANCE_HOME.

Some(p) => {
let dic_dir = p.join(dic);
let main_dir = dic_dir.join("main");
let user_config_path = dic_dir.join("user_config.json");
Contributor:
What's the schema of this JSON file?

@chenkovsky (Contributor, Author):

I have updated the implementation with a clear configuration schema.

Comment on lines 21 to 22
pub const LANCE_HOME_ENV_KEY: &str = "LANCE_HOME";
pub const LANCE_HOME_DEFAULT_DIRECTORY: &str = "lance";
Contributor:

Let's make this specific to lance + lindera

Suggested change
pub const LANCE_HOME_ENV_KEY: &str = "LANCE_HOME";
pub const LANCE_HOME_DEFAULT_DIRECTORY: &str = "lance";
pub const LANCE_HOME_ENV_KEY: &str = "LANCE_LINDERA_HOME";
pub const LANCE_HOME_DEFAULT_DIRECTORY: &str = "lance_lindera";

@chenkovsky (Contributor, Author) Dec 10, 2024:

Yes, a global home path may not be a good idea, so I created a draft. But I think the tokenizers can share the same home. Different tokenizers are quite similar; they all contain
1. a trie dictionary,
2. an n-gram model (for Chinese a 1-gram model is enough, so sometimes this part is absent),
and they all use the Viterbi algorithm.
That's why I think different tokenizers can share the same home. BTW, the code size of the tokenizers is quite small and has little effect on the final wheel package size.
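To illustrate the shared structure this comment describes, here is a toy dictionary-driven segmenter that picks the lowest-cost path through the text with dynamic programming, a simplified stand-in for the Viterbi decoding real tokenizers like Lindera and Jieba perform over their dictionaries. The cost values and penalty are made up for the example:

```rust
use std::collections::HashMap;

fn segment(text: &str, dict: &HashMap<&str, f64>) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let n = chars.len();
    // best[i] = (total cost, start index of the last word) for the prefix of length i.
    let mut best: Vec<(f64, usize)> = vec![(f64::INFINITY, 0); n + 1];
    best[0] = (0.0, 0);
    for i in 0..n {
        if best[i].0.is_infinite() {
            continue;
        }
        for j in (i + 1)..=n {
            let word: String = chars[i..j].iter().collect();
            let cost = match dict.get(word.as_str()) {
                Some(c) => Some(*c),
                // Unknown single characters get a high penalty cost.
                None if j == i + 1 => Some(10.0),
                None => None,
            };
            if let Some(c) = cost {
                if best[i].0 + c < best[j].0 {
                    best[j] = (best[i].0 + c, i);
                }
            }
        }
    }
    // Backtrack to recover the lowest-cost segmentation.
    let mut words = Vec::new();
    let mut i = n;
    while i > 0 {
        let start = best[i].1;
        words.push(chars[start..i].iter().collect::<String>());
        i = start;
    }
    words.reverse();
    words
}
```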

@chenkovsky (Contributor, Author):

I have moved the env variable into the lance-index package.

Labels: enhancement (New feature or request), python
3 participants