Multilanguage tokenizer with language detection #3055

Closed
fmassot opened this issue Mar 20, 2023 · 2 comments · Fixed by #3608

Labels
enhancement New feature or request

Comments

@fmassot
Contributor

fmassot commented Mar 20, 2023

We want to support search over documents written in different languages, such as languages that use the Latin script (Eng, Fra, Deu, ...) and Asian languages (Jpn, Cmn, ...).

To reach this goal, we need the following:

  • a fast language detection algorithm, as we don't want the detection phase to limit the indexing throughput. See the whichlang repo.
  • specific tokenizers for each language: for Latin-script languages, we could keep the current default tokenizer and add dedicated tokenizers for languages that are not Latin-based (Chinese, Japanese, ...). There is jieba for Chinese and lindera for Japanese.
  • either one text field per language or a single text field for all of them to store the tokens in the inverted index. A single text field for all languages may be a good first step, as managing several text fields adds extra complexity.
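The three points above can be sketched as a detect-then-dispatch step that feeds a single token stream. This is only an illustration under loud assumptions: `detect_language` is a toy stand-in based on Unicode script ranges, not whichlang, and `char_tokenizer` is a placeholder for where jieba or lindera would be plugged in. All function names here are hypothetical.

```python
from typing import Callable, Dict, List


def detect_language(text: str) -> str:
    """Toy detector (assumption): guess the language from Unicode script ranges."""
    has_han = False
    for ch in text:
        cp = ord(ch)
        # Hiragana (U+3040-U+309F) or Katakana (U+30A0-U+30FF) implies Japanese.
        if 0x3040 <= cp <= 0x30FF:
            return "jpn"
        # CJK Unified Ideographs are shared by Chinese and Japanese.
        if 0x4E00 <= cp <= 0x9FFF:
            has_han = True
    return "cmn" if has_han else "eng"


def whitespace_tokenizer(text: str) -> List[str]:
    """Stand-in for the default tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def char_tokenizer(text: str) -> List[str]:
    """Placeholder for a real CJK tokenizer: one token per letter or digit."""
    return [ch for ch in text if ch.isalnum()]


# One tokenizer per detected language, with a default fallback.
TOKENIZERS: Dict[str, Callable[[str], List[str]]] = {
    "default": whitespace_tokenizer,
    "cmn": char_tokenizer,
    "jpn": char_tokenizer,
}


def tokenize(text: str) -> List[str]:
    """Detect the language, then emit tokens destined for a single text field."""
    lang = detect_language(text)
    tokenizer = TOKENIZERS.get(lang, TOKENIZERS["default"])
    return tokenizer(text)
```

Because every tokenizer returns the same kind of token list, the output can go into one text field regardless of the detected language, which matches the "good first step" suggested above.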

Last but not least, we should specify how to declare this multilanguage tokenizer in the index config before jumping into the code.
For example, a user should be able to define a custom tokenizer like this:

# index_config.yaml
tokenizers:
  multilanguage_jpn_cmn:
    default: default    # default tokenizer used
    cmn: jieba          # tokenizer used if cmn is detected
    jpn: lindera        # tokenizer used if jpn is detected
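One way to read the proposed `multilanguage_jpn_cmn` entry is as a mapping from a detected language code to a tokenizer name, with `default` as the fallback. The sketch below assumes the YAML has already been parsed into a plain dict; `validate_config` and `resolve_tokenizer_name` are hypothetical helpers, not part of any existing Quickwit API.

```python
from typing import Dict

# Dict mirroring the proposed `multilanguage_jpn_cmn` entry (assumed already parsed).
MULTILANGUAGE_JPN_CMN: Dict[str, str] = {
    "default": "default",
    "cmn": "jieba",
    "jpn": "lindera",
}


def validate_config(config: Dict[str, str]) -> None:
    """Reject a config without a fallback, since detection can return any language."""
    if "default" not in config:
        raise ValueError("multilanguage tokenizer config needs a 'default' entry")


def resolve_tokenizer_name(config: Dict[str, str], detected_lang: str) -> str:
    """Return the tokenizer name for the detected language, else the default one."""
    validate_config(config)
    return config.get(detected_lang, config["default"])
```

With this reading, a document detected as `fra` would fall back to the `default` tokenizer, while `cmn` and `jpn` would resolve to `jieba` and `lindera` respectively.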
@fmassot fmassot added the enhancement New feature or request label Mar 20, 2023
@fmassot fmassot self-assigned this Mar 27, 2023
@fmassot fmassot moved this to 🏗 In progress in Quickwit 0.6 - End of May 2023 Mar 27, 2023
@fulmicoton
Contributor

@fmassot Removing from 0.6. It would be nice to land this dangling PR though.

@fmassot
Contributor Author

fmassot commented May 16, 2023

Yes, it is important to keep this for 0.6. I will resume the work soon, since the end user will test it very soon.
