-
Notifications
You must be signed in to change notification settings - Fork 14
Performance Comparison
This wiki shows the experimental results of time and memory for daachorse and other pattern-matching libraries.
We used the following two types of datasets for our experiments:
-
Word100, 5K, 15K
- Patterns: English words used in aho-corasick crate for benchmarking.
- Text: The Adventures of Sherlock Holmes (by Arthur Conan Doyle) available at Project Gutenberg.
-
Word100K
- Patterns: 100K English words used in fst crate for benchmarking.
- Text: The Adventures of Sherlock Holmes (by Arthur Conan Doyle) available at Project Gutenberg.
-
UniDic
- Patterns: 675K Japanese words from UniDic.
- Text: I Am a Cat (by Soseki Natsume) available at Aozora Bunko.
We compared the following Aho-Corasick implementations:
-
aho-corasick (v0.7.18)
- NFA
- DFA
- daachorse (v0.4.3, this library)
- bytewise
- charwise
In addition, we compared the following pattern matching automaton implementations:
The following is the specification of the used machine:
- CPU: Intel(R) Core(TM) i7-8086K CPU @ 4.00GHz
- Memory: 64GiB
- OS: CentOS Linux release 7.5.1804 (Core)
The benchmark code can be found here.
We measured the execution time [ms] of the find_overlapping()
function provided by each Aho-Corasick implementation. For fst and yada, we performed common prefix searches from all positions in the target text.
Library | Words100 | Words5K | Words15K | Words100K | UniDic |
---|---|---|---|---|---|
aho-corasick (NFA) | 4.22 | 6.50 | 7.36 | 11.28 | 28.66 |
aho-corasick (DFA) | 0.62 | 1.12 | 1.42 | 2.54 | 16.31 |
daachorse (bytewise) | 3.09 | 3.16 | 3.07 | 3.30 | 9.52 |
daachorse (charwise) | 2.60 | 2.51 | 2.23 | 2.10 | 5.50 |
fst | 12.63 | 17.51 | 19.65 | 26.27 | 44.13 |
yada | 3.10 | 4.21 | 4.28 | 4.68 | 9.48 |
We measured the execution time [ms] of the find_iter()
function provided by each Aho-Corasick implementation.
Library | Words100 | Words5K | Words15K | Words100K | UniDic |
---|---|---|---|---|---|
aho-corasick (NFA) | 4.28 | 6.49 | 7.33 | 10.12 | 10.26 |
aho-corasick (DFA) | 0.61 | 1.09 | 1.35 | 2.04 | 2.39 |
daachorse (bytewise) | 3.09 | 3.16 | 3.05 | 2.97 | 0.98 |
daachorse (charwise) | 2.72 | 2.60 | 2.30 | 2.02 | 0.91 |
We measured the execution time [ms] of the leftmost_find_iter()
function (resp. find_iter()
) with the MatchKind::LeftmostLongest
specification provided by daachorse (resp. aho-corasick).
We measured the execution time [ms] of the leftmost_find_iter()
function (resp. find_iter()
) with the MatchKind:: LeftmostFirst
specification provided by daachorse (resp. aho-corasick).
Library | Words100 | Words5K | Words15K | Words100K | UniDic |
---|---|---|---|---|---|
aho-corasick (NFA) | 4.08 | 6.42 | 7.15 | 10.94 | 22.86 |
aho-corasick (DFA) | 0.54 | 1.03 | 1.31 | 2.60 | 10.66 |
daachorse (bytewise) | 3.16 | 3.26 | 3.17 | 3.51 | 6.51 |
daachorse (charwise) | 2.79 | 2.68 | 2.39 | 2.31 | 4.22 |
We measured the memory usage [MiB] reported by heap_bytes()
provided by the Aho-Corasick implementations. For fst and yada, we measured the data buffer size obtained by their constructors.
Library | Word100 | Word5K | Word15K | Word100K | UniDic |
---|---|---|---|---|---|
aho-corasick (NFA) | 0.071 | 0.667 | 1.5 | 12.8 | 63.5 |
aho-corasick (DFA) | 0.276 | 17.268 | 44.0 | 763.4 | 2664.2 |
daachorse (bytewise) | 0.013 | 0.426 | 1.0 | 8.6 | 28.2 |
daachorse (charwise) | 0.016 | 0.549 | 1.3 | 11.2 | 25.3 |
fst | 0.001 | 0.109 | 0.1 | 1.0 | 2.8 |
yada | 0.005 | 0.346 | 0.3 | 2.9 | 9.4 |