Speed Comparison
This wiki shows the tokenization speed of Vibrato and other tokenizers and morphological analyzers.
We compare Vibrato 0.5.0 with MeCab and its reimplementations:
- MeCab (2020-09-14)
- Lindera (v0.23.0)
- sudachi.rs (v0.6.4-a1)
For Vibrato and MeCab, we evaluate two system dictionaries: IPADIC 2.7.0 and UniDic 3.1.1. For Lindera, we evaluate two versions: IPADIC and UniDic. sudachi.rs is evaluated with SudachiDict-core.
We also evaluate two compact versions of the Vibrato UniDic models (distributed on our release page):
- raw-connector: unidic-cwj-3_1_1+compact
- dual-connector: unidic-cwj-3_1_1+compact-dual
We also compare pointwise prediction-based tokenizers:
- KyTea (2020-04-03)
- Vaporetto (v0.6.1)
- rust-tinysegmenter (v0.1.1)
For Vaporetto and KyTea, we use the compact SVM model based on BCCWJ and UniDic, downloaded from the KyTea Models page.
We tokenize all sentences in I Am a Cat (by Soseki Natsume), which is available at Aozora Bunko, and report the elapsed time averaged over 100 runs.
- Number of sentences: 2,346
- Average number of characters per sentence: 158.8
The benchmark code can be found here.
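The measurement procedure described above can be sketched as follows. This is a minimal illustration and not the actual benchmark code: the `benchmark` helper, the whitespace-splitting stand-in tokenizer, and the use of the sample standard deviation are all assumptions made for the example.

```python
import statistics
import time


def benchmark(tokenize, sentences, runs=100):
    """Time `runs` full passes over `sentences`.

    Returns (mean, sample standard deviation) of the elapsed time in ms.
    Hypothetical helper; the real benchmark may measure differently.
    """
    elapsed_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        for sentence in sentences:
            tokenize(sentence)
        elapsed_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(elapsed_ms), statistics.stdev(elapsed_ms)


# Stand-in tokenizer (whitespace split); a real run would call the
# tokenizer under test on the 2,346 sentences of the corpus.
sentences = ["吾輩 は 猫 で ある", "名前 は まだ 無い"]
mean_ms, std_ms = benchmark(str.split, sentences, runs=5)
print(f"{mean_ms:.3f} ms (STD {std_ms:.3f})")
```

Timing a whole pass over the corpus, rather than individual sentences, keeps timer overhead out of the measurement.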
The benchmark machine has the following specifications:
- CPU: Intel Core i9-12900K (30MB L3 cache, 16 cores, 3.2GHz-5.2GHz)
- RAM: 64GB (2×32GB, DDR5)
- OS: Ubuntu 22.04
Library (dict) | Elapsed time [ms] | STD [ms] |
---|---|---|
Vibrato 0.5.0 (ipadic-mecab 2.7.0) | 42 | 1.24 |
Vibrato 0.5.0 (unidic-cwj 3.1.1) | 75 | 1.71 |
Vibrato 0.5.0 (unidic-cwj 3.1.1, raw-connector) | 1364 | 5.14 |
Vibrato 0.5.0 (unidic-cwj 3.1.1, dual-connector) | 170 | 2.50 |
MeCab 2020-09-14 (ipadic-mecab 2.7.0) | 87 | 1.24 |
MeCab 2020-09-14 (unidic-cwj 3.1.1) | 179 | 2.88 |
Lindera 0.23.0 (ipadic) | 97 | 1.13 |
Lindera 0.23.0 (unidic) | 156 | 2.11 |
sudachi.rs 0.6.4-a1 (core, 20210802) | 220 | 4.74 |
KyTea 2020-04-03 (jp-0.4.7-5) | 169 | 2.83 |
Vaporetto 0.6.1 (jp-0.4.7-5) | 21 | 0.51 |
rust-tinysegmenter 0.1.1 | 166 | 1.69 |
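To make the table easier to compare, the following sketch derives relative speedups and an approximate throughput from the reported numbers. The figures are copied from the table and the corpus statistics. The derived values are only illustrative, since the total character count is estimated from the per-sentence average.

```python
# Elapsed times [ms] copied from the table above.
elapsed_ms = {
    "Vibrato (ipadic)": 42,
    "MeCab (ipadic)": 87,
    "Vibrato (unidic)": 75,
    "MeCab (unidic)": 179,
}

# Approximate corpus size: sentences x average characters per sentence.
total_chars = 2346 * 158.8

# How many times faster Vibrato is than MeCab with the same dictionary.
speedup_ipadic = elapsed_ms["MeCab (ipadic)"] / elapsed_ms["Vibrato (ipadic)"]
speedup_unidic = elapsed_ms["MeCab (unidic)"] / elapsed_ms["Vibrato (unidic)"]

# Throughput in characters per second for Vibrato with IPADIC.
chars_per_sec = total_chars / (elapsed_ms["Vibrato (ipadic)"] / 1000.0)

print(f"IPADIC speedup: {speedup_ipadic:.2f}x")
print(f"UniDic speedup: {speedup_unidic:.2f}x")
print(f"Vibrato (ipadic) throughput: {chars_per_sec / 1e6:.2f} M chars/s")
```

With these numbers, Vibrato is roughly 2.1 times faster than MeCab with IPADIC and roughly 2.4 times faster with UniDic.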
Note that the Vibrato UniDic models differ in size, as shown below, so you can choose the model whose time-space tradeoff best suits your needs.
Library (dict) | Model size [MB] |
---|---|
Vibrato 0.5.0 (unidic-cwj 3.1.1) | 717 |
Vibrato 0.5.0 (unidic-cwj 3.1.1, raw-connector) | 252 |
Vibrato 0.5.0 (unidic-cwj 3.1.1, dual-connector) | 300 |
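One way to act on this tradeoff is to pick the fastest model that fits a given size budget. The sketch below uses the elapsed times and model sizes reported on this page. The `Model` type and the `fastest_within` helper are hypothetical names introduced for illustration and are not part of Vibrato.

```python
from typing import NamedTuple, Optional


class Model(NamedTuple):
    name: str
    elapsed_ms: float  # from the speed table
    size_mb: float     # from the model size table


MODELS = [
    Model("unidic-cwj 3.1.1", 75, 717),
    Model("unidic-cwj 3.1.1, raw-connector", 1364, 252),
    Model("unidic-cwj 3.1.1, dual-connector", 170, 300),
]


def fastest_within(budget_mb: float) -> Optional[Model]:
    """Return the fastest model whose size fits the budget, or None."""
    candidates = [m for m in MODELS if m.size_mb <= budget_mb]
    if not candidates:
        return None
    return min(candidates, key=lambda m: m.elapsed_ms)


for budget in (1024, 512, 256):
    choice = fastest_within(budget)
    print(budget, "MB ->", choice.name if choice else "no model fits")
```

In this data, the dual-connector model is less than half the size of the standard model while being about 2.3 times slower, whereas the raw-connector model saves a further 48 MB at a much larger cost in speed.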