We want to support search on documents in different languages: common Latin-script languages (Eng, Fra, Deu, ...), Asian languages (Jpn, Cmn, ...), and so on.
To reach this goal, we need the following:
- A fast language detection algorithm, as we don't want the detection phase to limit indexing throughput. Cf. the whichlang repo (see the detection sketch after this list).
- Specific tokenizers for each language: for Latin-script languages, we could keep the current default tokenizer, and have dedicated tokenizers for languages that are not Latin based (Chinese, Japanese, ...). There is jieba for Chinese and lindera for Japanese (see the dispatch sketch after this list).
- One text field per language, or one text field for all of them, to store the tokens in the inverted index. Having one text field for all languages may be a good first step, as managing several text fields adds extra complexity.
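To make the first bullet concrete, here is a minimal sketch of the detection step, assuming the `detect_language` entry point shown in whichlang's README (it returns a `Lang` enum variant named after ISO 639-3 style codes); the sample strings are illustrative:

```rust
use whichlang::detect_language;

fn main() {
    // whichlang returns its best guess as a `Lang` enum variant
    // (e.g. Lang::Eng, Lang::Jpn, Lang::Cmn).
    for text in [
        "Quickwit is a search engine for logs.",
        "東京タワーの高さは333メートルです。",
        "我们每天都在学习新的东西。",
    ] {
        println!("{:?}: {}", detect_language(text), text);
    }
}
```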
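And a sketch of how the per-language dispatch from the second bullet could look. The `LangTokenizer` trait and the struct names are hypothetical (not Quickwit's or tantivy's actual API); jieba-rs backs the Chinese path, and the Japanese path is left as a whitespace placeholder because lindera's API differs across versions:

```rust
use jieba_rs::Jieba;
use whichlang::{detect_language, Lang};

/// Hypothetical trait unifying per-language tokenizers (illustrative only).
trait LangTokenizer {
    fn tokenize(&self, text: &str) -> Vec<String>;
}

/// Stand-in for the current default tokenizer: naive whitespace split.
struct WhitespaceTokenizer;
impl LangTokenizer for WhitespaceTokenizer {
    fn tokenize(&self, text: &str) -> Vec<String> {
        text.split_whitespace().map(str::to_string).collect()
    }
}

/// Chinese tokenizer backed by jieba-rs.
struct JiebaTokenizer(Jieba);
impl LangTokenizer for JiebaTokenizer {
    fn tokenize(&self, text: &str) -> Vec<String> {
        // `cut` with hmm=false performs plain dictionary-based segmentation.
        self.0.cut(text, false).into_iter().map(str::to_string).collect()
    }
}

/// Detect the language once per text value, then route to the matching tokenizer.
struct MultiLanguageTokenizer {
    default: Box<dyn LangTokenizer>,
    cmn: Box<dyn LangTokenizer>,
    jpn: Box<dyn LangTokenizer>, // a lindera-backed tokenizer would plug in here
}

impl MultiLanguageTokenizer {
    fn tokenize(&self, text: &str) -> Vec<String> {
        match detect_language(text) {
            Lang::Cmn => self.cmn.tokenize(text),
            Lang::Jpn => self.jpn.tokenize(text),
            _ => self.default.tokenize(text),
        }
    }
}

fn main() {
    let tokenizer = MultiLanguageTokenizer {
        default: Box::new(WhitespaceTokenizer),
        cmn: Box::new(JiebaTokenizer(Jieba::new())),
        jpn: Box::new(WhitespaceTokenizer), // placeholder for lindera
    };
    println!("{:?}", tokenizer.tokenize("我们每天都在学习新的东西"));
    println!("{:?}", tokenizer.tokenize("Quickwit is a search engine for logs."));
}
```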
Last but not least, we should specify how to declare this multilanguage tokenizer in the index config before jumping into the code.
For example, a user should be able to define their custom tokenizer like this:
```yaml
# index_config.yaml
tokenizers:
  multilanguage_jpn_cmn:
    default: default # default tokenizer used
    cmn: jieba       # tokenizer used if cmn is detected
    jpn: lindera     # tokenizer used if jpn is detected
```
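For reference, one way such an entry could map onto code: a sketch using serde with a flattened map, where the struct and field names are illustrative rather than Quickwit's actual config types:

```rust
use std::collections::HashMap;

use serde::Deserialize;

/// Illustrative model of one `tokenizers` entry: a default tokenizer name
/// plus per-language overrides keyed by detected language code.
#[derive(Debug, Deserialize)]
struct MultiLanguageTokenizerConfig {
    default: String,
    #[serde(flatten)]
    per_language: HashMap<String, String>, // e.g. "cmn" -> "jieba", "jpn" -> "lindera"
}

fn main() -> Result<(), serde_yaml::Error> {
    let yaml = "default: default\ncmn: jieba\njpn: lindera\n";
    let config: MultiLanguageTokenizerConfig = serde_yaml::from_str(yaml)?;
    assert_eq!(config.per_language["jpn"], "lindera");
    println!("{config:?}");
    Ok(())
}
```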