
Add multi language tokenizer #3145

Closed
fmassot wants to merge 12 commits

Conversation

@fmassot (Contributor) commented Apr 7, 2023

Add a simple multi-language tokenizer with language detection.

Fix #3055
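
The overall shape of the tokenizer is: detect the dominant language of the input text, then route it to a language-specific token stream. The following is a minimal, self-contained Rust sketch of that dispatch, not the PR's actual code: the Language enum, detect_language, and tokenize names are hypothetical, crude Unicode script ranges stand in for a real statistical detector, and character/whitespace splits stand in for the Lindera and CJK tokenizers.

#[derive(Debug, PartialEq)]
enum Language {
    Japanese,
    Mandarin,
    Korean,
    Other,
}

// Very rough script-range detection, only to illustrate the dispatch shape.
fn detect_language(text: &str) -> Language {
    for c in text.chars() {
        match c as u32 {
            0x3040..=0x30FF => return Language::Japanese, // Hiragana / Katakana
            0xAC00..=0xD7AF => return Language::Korean,   // Hangul syllables
            0x4E00..=0x9FFF => return Language::Mandarin, // CJK ideographs
            _ => {}
        }
    }
    Language::Other
}

// Route to a per-language tokenizer. Real branches would wrap Lindera
// (Japanese/Korean) and a CJK-aware tokenizer (Mandarin); here we simply
// split CJK text into characters and everything else on whitespace.
fn tokenize(text: &str) -> Vec<String> {
    match detect_language(text) {
        Language::Japanese | Language::Mandarin | Language::Korean => text
            .chars()
            .filter(|c| !c.is_whitespace())
            .map(|c| c.to_string())
            .collect(),
        Language::Other => text.split_whitespace().map(str::to_string).collect(),
    }
}

fn main() {
    assert_eq!(detect_language("すもももももももものうち"), Language::Japanese);
    assert_eq!(tokenize("the quick brown fox"), vec!["the", "quick", "brown", "fox"]);
}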

TODO

  • Check the size of dictionaries. Should we include them in the binary? => yes, 22 MB seems fine.
  • Add the possibility to bypass the language autodetection with a prefix of the form {language-identifier}: (see the sketch after this list)
  • Test on a Wikipedia dataset
  • Benchmark the tokenizer
  • Choose the tokenizer name: multi_language or multilanguage
  • Expose SimpleTokenStream in the tantivy main branch
  • Add Korean support
  • Add docs (I will open a dedicated issue)
  • Add a multilanguage feature flag?
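
For the prefix bypass item above, a hypothetical helper could look like the sketch below. The function name and the set of recognized ISO 639-3 codes are illustrative, not the PR's actual code: if the text starts with a known code followed by ':', the prefix is stripped and auto-detection is skipped.

// Hypothetical helper for the "{language-identifier}:" bypass.
fn split_language_prefix(text: &str) -> (Option<&str>, &str) {
    // ISO 639-3 codes the tokenizer is assumed to understand.
    const KNOWN: [&str; 3] = ["jpn", "cmn", "kor"];
    if let Some((prefix, rest)) = text.split_once(':') {
        if KNOWN.contains(&prefix) {
            return (Some(prefix), rest);
        }
    }
    (None, text)
}

fn main() {
    // "jpn:" forces the Japanese tokenizer, skipping auto-detection.
    assert_eq!(split_language_prefix("jpn:東京"), (Some("jpn"), "東京"));
    // No recognized prefix: fall back to language detection on the full text.
    assert_eq!(split_language_prefix("plain english text"), (None, "plain english text"));
}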

Binary size

On macOS, using Lindera with the compress feature adds 44 MB to the binary (97 MB -> 141 MB).

Benchmarks

Tokenizer               Input   Throughput
Default                 short   64 MiB/s
Default                 long    100 MiB/s
Multilanguage default   short   41 MiB/s
Multilanguage default   long    55 MiB/s
Multilanguage jpn       short   6 MiB/s
Multilanguage jpn       long    7.2 MiB/s
Multilanguage cmn       long    8 MiB/s
Multilanguage cmn       long    12 MiB/s
chinese-compatible      short   31 MiB/s
chinese-compatible      long    36 MiB/s
jpn (lindera)           short   6.8 MiB/s
jpn (lindera)           medium  8.5 MiB/s

Mapping example:

version: 0.5
index_id: multilang
doc_mapping:
  field_mappings:
    - name: body
      type: text
      tokenizer: multilanguage
      record: position
@fmassot force-pushed the fmassot/multilanguage-tokenizer branch from b5bd1b1 to ad66f22 on April 7, 2023 11:56
@fmassot changed the title from "Add multilanguage tokenizer (WIP)" to "Add multi language tokenizer (WIP)" on Apr 7, 2023
@fmassot force-pushed the fmassot/multilanguage-tokenizer branch 2 times, most recently from 78e96aa to 190efed, on April 13, 2023 16:23
@fmassot changed the title from "Add multi language tokenizer (WIP)" to "Add multi language tokenizer" on Apr 13, 2023
# This is actually not used directly; the goal is to pin the version
# used by reqwest. 0.8.30 has an unclear license.
encoding_rs = "=0.8.29"
# used by reqwest.
@fmassot (Contributor, Author) commented:

The license is clear: (Apache-2.0 OR MIT) AND BSD-3-Clause, so no problem here.

@fmassot force-pushed the fmassot/multilanguage-tokenizer branch from 190efed to 5405a08 on April 13, 2023 16:34
@fmassot marked this pull request as ready for review April 14, 2023 11:48
@fmassot force-pushed the fmassot/multilanguage-tokenizer branch from c347ff1 to 56d44c3 on April 14, 2023 14:40
@fmassot force-pushed the fmassot/multilanguage-tokenizer branch from 8fb2664 to a757c3b on April 14, 2023 22:32
@fmassot (Contributor, Author) commented Jun 26, 2023

Closing, will open a new PR soon.

@fmassot closed this Jun 26, 2023
@guilload deleted the fmassot/multilanguage-tokenizer branch November 28, 2023 23:17
Successfully merging this pull request may close this issue: Multilanguage tokenizer with language detection
1 participant