
Add multi language tokenizer #3145

Closed
fmassot wants to merge 12 commits

Conversation

@fmassot (Contributor) commented Apr 7, 2023

Add a simple multi-language tokenizer with language detection.

Fix #3055
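
The overall shape of the tokenizer is: detect the dominant language of the input text, then route it to a language-specific token stream. The following is a minimal, self-contained Rust sketch of that dispatch, not the PR's actual code: the Language enum, detect_language, and tokenize names are hypothetical, crude Unicode script ranges stand in for a real statistical detector, and character/whitespace splits stand in for the Lindera and CJK tokenizers.

#[derive(Debug, PartialEq)]
enum Language {
    Japanese,
    Mandarin,
    Korean,
    Other,
}

// Very rough script-range detection, only to illustrate the dispatch shape.
fn detect_language(text: &str) -> Language {
    for c in text.chars() {
        match c as u32 {
            0x3040..=0x30FF => return Language::Japanese, // Hiragana / Katakana
            0xAC00..=0xD7AF => return Language::Korean,   // Hangul syllables
            0x4E00..=0x9FFF => return Language::Mandarin, // CJK ideographs
            _ => {}
        }
    }
    Language::Other
}

// Route to a per-language tokenizer. Real branches would wrap Lindera
// (Japanese/Korean) and a CJK-aware tokenizer (Mandarin); here we simply
// split CJK text into characters and everything else on whitespace.
fn tokenize(text: &str) -> Vec<String> {
    match detect_language(text) {
        Language::Japanese | Language::Mandarin | Language::Korean => text
            .chars()
            .filter(|c| !c.is_whitespace())
            .map(|c| c.to_string())
            .collect(),
        Language::Other => text.split_whitespace().map(str::to_string).collect(),
    }
}

fn main() {
    assert_eq!(detect_language("すもももももももものうち"), Language::Japanese);
    assert_eq!(tokenize("the quick brown fox"), vec!["the", "quick", "brown", "fox"]);
}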

TODO

  • Check the size of dictionaries. Should we include them in the binary? => yes, 22 MB seems fine.
  • Add the possibility to bypass the language autodetection with a prefix of the form {language-identifier}: (see the sketch after this list)
  • Test on a Wikipedia dataset
  • Benchmark the tokenizer
  • Choose the tokenizer name: multi_language or multilanguage
  • Expose SimpleTokenStream in the tantivy main branch
  • Add Korean support
  • Add docs (I will open a dedicated issue)
  • Add a multilanguage feature flag?
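
For the prefix bypass item above, a hypothetical helper could look like the sketch below. The function name and the set of recognized ISO 639-3 codes are illustrative, not the PR's actual code: if the text starts with a known code followed by ':', the prefix is stripped and auto-detection is skipped.

// Hypothetical helper for the "{language-identifier}:" bypass.
fn split_language_prefix(text: &str) -> (Option<&str>, &str) {
    // ISO 639-3 codes the tokenizer is assumed to understand.
    const KNOWN: [&str; 3] = ["jpn", "cmn", "kor"];
    if let Some((prefix, rest)) = text.split_once(':') {
        if KNOWN.contains(&prefix) {
            return (Some(prefix), rest);
        }
    }
    (None, text)
}

fn main() {
    // "jpn:" forces the Japanese tokenizer, skipping auto-detection.
    assert_eq!(split_language_prefix("jpn:東京"), (Some("jpn"), "東京"));
    // No recognized prefix: fall back to language detection on the full text.
    assert_eq!(split_language_prefix("plain english text"), (None, "plain english text"));
}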

Binary size

On macOS, using Lindera with the compress feature adds 44 MB to the binary (97 MB -> 141 MB).

Benchmarks

Tokenizer               Input   Throughput
Default                 short   64 MiB/s
Default                 long    100 MiB/s
Multilanguage default   short   41 MiB/s
Multilanguage default   long    55 MiB/s
Multilanguage jpn       short   6 MiB/s
Multilanguage jpn       long    7.2 MiB/s
Multilanguage cmn       long    8 MiB/s
Multilanguage cmn       long    12 MiB/s
chinese-compatible      short   31 MiB/s
chinese-compatible      long    36 MiB/s
jpn (lindera)           short   6.8 MiB/s
jpn (lindera)           medium  8.5 MiB/s

Mapping example:

version: 0.5
index_id: multilang
doc_mapping:
  field_mappings:
    - name: body
      type: text
      tokenizer: multilanguage
      record: position
@fmassot force-pushed the fmassot/multilanguage-tokenizer branch from b5bd1b1 to ad66f22 on April 7, 2023 11:56
@fmassot changed the title from "Add multilanguage tokenizer (WIP)" to "Add multi language tokenizer (WIP)" on Apr 7, 2023
@fmassot force-pushed the fmassot/multilanguage-tokenizer branch 2 times, most recently from 78e96aa to 190efed, on April 13, 2023 16:23
@fmassot changed the title from "Add multi language tokenizer (WIP)" to "Add multi language tokenizer" on Apr 13, 2023
# This is actually not used directly; the goal is to pin the version
# used by reqwest. 0.8.30 has an unclear license.
encoding_rs = "=0.8.29"
# used by reqwest.
@fmassot (Contributor, Author) commented:

The license is clear: (Apache-2.0 OR MIT) AND BSD-3-Clause, so no problem here.

@fmassot force-pushed the fmassot/multilanguage-tokenizer branch from 190efed to 5405a08 on April 13, 2023 16:34
@fmassot marked this pull request as ready for review April 14, 2023 11:48
@fmassot force-pushed the fmassot/multilanguage-tokenizer branch from c347ff1 to 56d44c3 on April 14, 2023 14:40
@fmassot force-pushed the fmassot/multilanguage-tokenizer branch from 8fb2664 to a757c3b on April 14, 2023 22:32
@fmassot (Contributor, Author) commented Jun 26, 2023

Closing, will open a new PR soon.

@fmassot closed this Jun 26, 2023
@guilload deleted the fmassot/multilanguage-tokenizer branch November 28, 2023 23:17
Successfully merging this pull request may close this issue: Multilanguage tokenizer with language detection
1 participant