Add multilang tokenizer #3608

Merged
fmassot merged 12 commits into main from fmassot/multilang-v2 on Jul 17, 2023

Conversation

fmassot
Contributor

@fmassot fmassot commented Jul 5, 2023

Fix #3055

Notes

  • Dependencies: tantivy-tokenizer-api is patched to a revision compatible with quickwit-oss/lindera-tantivy.
  • Binary size (macOS): 122MB before, 167MB after. I would say this is acceptable.

Config

version: 0.5

index_id: multilang

doc_mapping:
  tokenizers:
    - name: multilang
      type: multilang
  field_mappings:
    - name: body
      type: text
      tokenizer: multilang

Bench results

multilang/default-tokenize-short
                        time:   [414.03 ns 422.72 ns 432.76 ns]
                        thrpt:  [63.908 MiB/s 65.426 MiB/s 66.799 MiB/s]
multilang/default-tokenize-long
                        time:   [5.9750 µs 6.0257 µs 6.0751 µs]
                        thrpt:  [100.94 MiB/s 101.77 MiB/s 102.63 MiB/s]
multilang/multilang-eng-tokenize-short
                        time:   [629.81 ns 636.43 ns 643.32 ns]
                        thrpt:  [42.990 MiB/s 43.456 MiB/s 43.913 MiB/s]
multilang/multilang-eng-tokenize-long
                        time:   [10.289 µs 10.411 µs 10.532 µs]
                        thrpt:  [58.224 MiB/s 58.899 MiB/s 59.598 MiB/s]
multilang/multilang-tokenize-short-with-prefix
                        time:   [440.60 ns 444.99 ns 449.44 ns]
                        thrpt:  [61.535 MiB/s 62.152 MiB/s 62.770 MiB/s]
multilang/multilang-tokenize-long-with-prefix
                        time:   [6.6794 µs 6.8555 µs 7.1222 µs]
                        thrpt:  [86.099 MiB/s 89.449 MiB/s 91.807 MiB/s]
multilang/multilang-tokenize-jpn-short
                        time:   [8.4424 µs 8.5504 µs 8.6566 µs]
                        thrpt:  [5.9490 MiB/s 6.0229 MiB/s 6.1000 MiB/s]
multilang/multilang-tokenize-jpn-long
                        time:   [105.99 µs 107.25 µs 108.52 µs]
                        thrpt:  [7.2766 MiB/s 7.3624 MiB/s 7.4500 MiB/s]
multilang/multilang-tokenize-cmn-short
                        time:   [4.9545 µs 5.0165 µs 5.0768 µs]
                        thrpt:  [8.4533 MiB/s 8.5549 MiB/s 8.6618 MiB/s]
multilang/multilang-tokenize-cmn-long
                        time:   [45.837 µs 46.306 µs 46.753 µs]
                        thrpt:  [10.179 MiB/s 10.277 MiB/s 10.382 MiB/s]
multilang/multilang-tokenize-kor-short
                        time:   [10.886 µs 11.041 µs 11.190 µs]
                        thrpt:  [2.8125 MiB/s 2.8503 MiB/s 2.8911 MiB/s]
multilang/multilang-tokenize-kor-long
                        time:   [134.09 µs 135.71 µs 137.49 µs]
                        thrpt:  [3.1491 MiB/s 3.1903 MiB/s 3.2289 MiB/s]
multilang/chinese-compatible-tokenize-cmn-short
                        time:   [1.3359 µs 1.3575 µs 1.3776 µs]
                        thrpt:  [31.152 MiB/s 31.613 MiB/s 32.124 MiB/s]
multilang/chinese-compatible-tokenize-cmn-long
                        time:   [10.389 µs 10.522 µs 10.668 µs]
                        thrpt:  [44.608 MiB/s 45.227 MiB/s 45.808 MiB/s]
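
To make the numbers above easier to compare, here is a small sketch that turns the median throughputs of the "long" benches into slowdown factors relative to the default tokenizer. The constants are copied from the bench output; the helper name `slowdown` is only illustrative. The ratios are rough, since the bench inputs presumably differ from one language to another.

```rust
/// Median throughputs in MiB/s, copied from the "long" bench results above.
const DEFAULT_LONG: f64 = 101.77;
const MULTILANG_ENG_LONG: f64 = 58.899;
const MULTILANG_PREFIX_LONG: f64 = 89.449;
const MULTILANG_JPN_LONG: f64 = 7.3624;
const MULTILANG_CMN_LONG: f64 = 10.277;
const MULTILANG_KOR_LONG: f64 = 3.1903;
const CHINESE_COMPATIBLE_CMN_LONG: f64 = 45.227;

/// How many times slower `candidate` is than `baseline`,
/// given both throughputs in the same unit.
fn slowdown(baseline: f64, candidate: f64) -> f64 {
    baseline / candidate
}

fn main() {
    let rows = [
        ("multilang-eng-tokenize-long", MULTILANG_ENG_LONG),
        ("multilang-tokenize-long-with-prefix", MULTILANG_PREFIX_LONG),
        ("multilang-tokenize-jpn-long", MULTILANG_JPN_LONG),
        ("multilang-tokenize-cmn-long", MULTILANG_CMN_LONG),
        ("multilang-tokenize-kor-long", MULTILANG_KOR_LONG),
    ];
    for (name, thrpt) in rows.iter() {
        let ratio = slowdown(DEFAULT_LONG, *thrpt);
        println!("{name:<38} {ratio:>5.1}x slower than default");
    }

    // Chinese text: multilang compared with the chinese-compatible tokenizer.
    let cmn = slowdown(CHINESE_COMPATIBLE_CMN_LONG, MULTILANG_CMN_LONG);
    println!("multilang vs chinese-compatible (cmn-long): {cmn:.1}x slower");
}
```

The English and with-prefix ratios stay under 2x, while the Japanese, Chinese, and Korean paths are roughly one order of magnitude slower than the default tokenizer on these inputs.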

TODO

  • add documentation
  • add bench

@fmassot fmassot mentioned this pull request Jul 5, 2023
@fmassot fmassot requested a review from fulmicoton July 5, 2023 22:02
@fmassot fmassot force-pushed the fmassot/multilang-v2 branch from f0dee91 to 234e47f Compare July 6, 2023 07:56
@fmassot fmassot force-pushed the fmassot/multilang-v2 branch from 234e47f to ef773de Compare July 9, 2023 13:44
@fmassot
Contributor Author

fmassot commented Jul 10, 2023

Note that the first time the multilang tokenizer is used, we will see:

  • a surge of 250MB in RAM
  • latency increased by 100ms (dictionary loading)
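
A one-time cost on first use is what a lazily initialized dictionary would produce. The following is a minimal sketch of that pattern using `std::sync::OnceLock`, not the code from this PR: `Dictionary`, `load_dictionary`, `dictionary`, and `LOAD_COUNT` are hypothetical names, and the tiny map stands in for the real dictionaries.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Counts how many times the (hypothetical) dictionary has been loaded.
static LOAD_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for a large tokenizer dictionary.
struct Dictionary {
    entries: HashMap<&'static str, u32>,
}

/// Simulates the expensive load; a real dictionary would be read and
/// decoded here, which is where the memory and latency cost would occur.
fn load_dictionary() -> Dictionary {
    LOAD_COUNT.fetch_add(1, Ordering::SeqCst);
    let mut entries = HashMap::new();
    entries.insert("東京", 1);
    entries.insert("서울", 2);
    Dictionary { entries }
}

/// Returns the shared dictionary, loading it on the first call only.
fn dictionary() -> &'static Dictionary {
    static DICT: OnceLock<Dictionary> = OnceLock::new();
    DICT.get_or_init(load_dictionary)
}

fn main() {
    // The first call pays the loading cost; later calls reuse the result.
    let first = dictionary();
    let second = dictionary();
    println!("entries: {}", first.entries.len());
    println!("same instance: {}", std::ptr::eq(first, second));
    println!("loads: {}", LOAD_COUNT.load(Ordering::SeqCst));
}
```

With this pattern, a process that never uses the multilang tokenizer never pays the cost, at the price of a slower first request for the one that does.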

@fmassot fmassot force-pushed the fmassot/multilang-v2 branch 2 times, most recently from 8802c5e to e5da501 Compare July 11, 2023 04:55
@fmassot fmassot force-pushed the fmassot/multilang-v2 branch from e5da501 to c7475e0 Compare July 17, 2023 15:39
@fmassot fmassot merged commit 32a02a8 into main Jul 17, 2023
@fmassot fmassot deleted the fmassot/multilang-v2 branch July 17, 2023 16:52
Successfully merging this pull request may close these issues.

Multilanguage tokenizer with language detection
2 participants