Multilanguage tokenizer with language detection #3055

fmassot · 2023-03-20T08:19:52Z

We want to support search on documents written in different languages, such as common Latin-script languages (Eng, Fra, Deu, ...) and Asian languages (Jpn, Cmn, ...).

To reach this goal, we need the following:

- A fast language detection algorithm, as we don't want the detection phase to limit indexing throughput. See the whichlang repository.
- Specific tokenizers for each language: for Latin languages, we could keep the current default tokenizer and add dedicated tokenizers for languages that are not Latin based (Chinese, Japanese, ...). There is jieba for Chinese and lindera for Japanese.
- A decision on how to store the tokens in the inverted index: one text field per language, or one text field for all of them. Having one text field for all languages may be a good first step, as managing several text fields adds extra complexity.
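The detect-then-dispatch flow described in the list above can be sketched as follows. This is only an illustration under stated assumptions: `detect_language` is a crude Unicode-range check standing in for a real detector such as whichlang, and `char_tokenizer` is a placeholder for jieba or lindera, not a call into either library. All function names are hypothetical. Every token ends up in one list, mirroring the "one text field for all languages" option.

```python
import re
from typing import Callable, Dict, List, Tuple


def detect_language(text: str) -> str:
    """Crude script-based stand-in for a real language detector (assumption)."""
    # Hiragana or Katakana strongly suggests Japanese.
    if re.search(r"[\u3040-\u30ff]", text):
        return "jpn"
    # CJK ideographs without any kana: treat as Chinese.
    if re.search(r"[\u4e00-\u9fff]", text):
        return "cmn"
    return "default"


def default_tokenizer(text: str) -> List[str]:
    # Lowercased runs of word characters.
    return [token.lower() for token in re.findall(r"\w+", text)]


def char_tokenizer(text: str) -> List[str]:
    # Placeholder for jieba/lindera: one token per word character.
    return re.findall(r"\w", text)


# Hypothetical mapping from detected language to tokenizer.
TOKENIZERS: Dict[str, Callable[[str], List[str]]] = {
    "default": default_tokenizer,
    "cmn": char_tokenizer,
    "jpn": char_tokenizer,
}


def tokenize(text: str) -> Tuple[str, List[str]]:
    """Detect the language, then dispatch to the matching tokenizer."""
    lang = detect_language(text)
    tokenizer = TOKENIZERS.get(lang, default_tokenizer)
    return lang, tokenizer(text)
```

Detection runs once per text before tokenization, which is why its cost matters for indexing throughput.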

Last but not least, we should specify how to declare this multilanguage tokenizer in the index config before jumping into the code.
For example, a user should be able to define a custom tokenizer like this:

# index_config.yaml
tokenizers:
  multilanguage_jpn_cmn:
    default: default    # default tokenizer used
    cmn: jieba          # tokenizer used if cmn is detected
    jpn: lindera        # tokenizer used if jpn is detected
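To make the intended semantics of this proposed config concrete, here is a minimal sketch of how a detected language could be resolved to a tokenizer name. It assumes the entry is already parsed into a plain dict, that a `default` key is mandatory, and that any language without its own key falls back to it. None of this is confirmed by the issue, and `resolve_tokenizer_name` is a hypothetical helper.

```python
from typing import Dict

# Mirrors the proposed `multilanguage_jpn_cmn` entry as a plain dict.
MULTILANGUAGE_JPN_CMN: Dict[str, str] = {
    "default": "default",
    "cmn": "jieba",
    "jpn": "lindera",
}


def resolve_tokenizer_name(config: Dict[str, str], detected_lang: str) -> str:
    """Return the tokenizer name for a detected language code.

    Assumption: languages without a dedicated entry use `default`.
    """
    if "default" not in config:
        raise ValueError("a multilanguage tokenizer needs a `default` entry")
    return config.get(detected_lang, config["default"])
```

With this mapping, `resolve_tokenizer_name(MULTILANGUAGE_JPN_CMN, "fra")` returns `"default"`, since French has no dedicated entry.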


fulmicoton · 2023-05-16T01:39:18Z

@fmassot Removing from 0.6. It would be nice to land this dangling PR though.

fmassot · 2023-05-16T02:26:00Z

Yes, it is important to keep this for 0.6. I will resume the work soon, as the end user will test it very soon.

fmassot added the enhancement (New feature or request) label Mar 20, 2023

fmassot added this to Quickwit 0.6 - End of May 2023 Mar 20, 2023

fmassot self-assigned this Mar 27, 2023

fmassot moved this to 🏗 In progress in Quickwit 0.6 - End of May 2023 Mar 27, 2023

fmassot mentioned this issue Apr 7, 2023

Add multi language tokenizer #3145

Closed

9 tasks

fulmicoton removed this from Quickwit 0.6 - End of May 2023 May 16, 2023

fmassot mentioned this issue Jul 5, 2023

Add multilang tokenizer #3608

Merged

2 tasks

fmassot closed this as completed in #3608 Jul 17, 2023
