-
Notifications
You must be signed in to change notification settings - Fork 14
Performance Comparison
This wiki shows the experimental results of time and memory for daachorse and other pattern matching libraries.
We used the following two types of datasets for our experiments:
-
Word100, 5K, 15K
- Patterns: English words used in aho-corasick crate for benchmarking.
- Text: The Adventures of Sherlock Holmes (by Arthur Conan Doyle) available at Project Gutenberg.
-
Word100K
- Patterns: 100K English words used in fst crate for benchmarking.
- Text: The Adventures of Sherlock Holmes (by Arthur Conan Doyle) available at Project Gutenberg.
-
UniDic
- Patterns: 675K Japanese words from UniDic.
- Text: I Am a Cat (by Soseki Natsume) available at Aozora Bunko.
We compared the following Aho-Corasick implementations:
-
aho-corasick (v0.7.18)
- NFA
- DFA
- daachorse (v0.4.3, this library)
- bytewise
- charwise
In addition, we compared the following pattern matching automaton implementations:
The following is the specification of the used machine:
- CPU: Intel(R) Core(TM) i7-8086K CPU @ 4.00GHz
- Memory: 64GiB
- OS: CentOS Linux release 7.5.1804 (Core)
The benchmark code can be found here.
We measured the elapsed time [ms] to perform find_overlapping()
provided by the Aho-Corasick implementations. For fst and yada, we performed common prefix searches from all positions in the target text.
Library | Word100 | Word5K | Word15K | Word100K | UniDic |
---|---|---|---|---|---|
aho-corasick (NFA) | 4.5 | 7.9 | 9.1 | 14.3 | 36.5 |
aho-corasick (DFA) | 0.8 | 1.9 | 2.3 | 3.7 | 23.9 |
daachorse (bytewise) | 3.2 | 3.5 | 3.6 | 4.2 | 11.4 |
daachorse (charwise) | 2.9 | 3.0 | 2.7 | 2.6 | 6.8 |
fst | 17.3 | 24.0 | 26.6 | 35.6 | 58.5 |
yada | 3.3 | 4.4 | 4.8 | 5.9 | 12.6 |
We measured the elapsed time [ms] to perform find_iter()
provided by the Aho-Corasick implementations.
Library | Word100 | Word5K | Word15K | Word100K | UniDic |
---|---|---|---|---|---|
aho-corasick (NFA) | 4.4 | 7.9 | 9.2 | 12.5 | 12.2 |
aho-corasick (DFA) | 0.8 | 1.8 | 2.3 | 3.2 | 5.5 |
daachorse (bytewise) | 3.3 | 3.5 | 3.6 | 3.7 | 1.5 |
daachorse (charwise) | 2.9 | 3.0 | 2.8 | 2.5 | 1.5 |
We measured the elapsed time [ms] to perform leftmost_find_iter()
(resp. find_iter()
) with the MatchKind::LeftmostLongest
specification provided by daachorse (resp. aho-corasick).
Library | Word100 | Word5K | Word15K | Word100K | UniDic |
---|---|---|---|---|---|
aho-corasick (NFA) | 4.3 | 7.6 | 8.7 | 14.4 | 29.4 |
aho-corasick (DFA) | 0.8 | 1.9 | 2.5 | 3.9 | 18.3 |
daachorse (bytewise) | 3.3 | 3.6 | 3.7 | 4.3 | 8.7 |
daachorse (charwise) | 3.2 | 3.5 | 3.4 | 3.4 | 5.3 |
We measured the elapsed time [ms] to perform leftmost_find_iter()
(resp. find_iter()
) with the MatchKind:: LeftmostFirst
specification provided by daachorse (resp. aho-corasick).
Library | Word100 | Word5K | Word15K | Word100K | UniDic |
---|---|---|---|---|---|
aho-corasick (NFA) | 4.2 | 7.7 | 8.9 | 12.8 | 12.3 |
aho-corasick (DFA) | 0.8 | 1.9 | 2.3 | 3.6 | 6.1 |
daachorse (bytewise) | 3.5 | 3.6 | 3.7 | 4.1 | 3.0 |
daachorse (charwise) | 3.0 | 3.4 | 3.3 | 3.2 | 2.3 |
We measured the memory usage [MiB] reported by heap_bytes()
provided by the Aho-Corasick implementations. For fst and yada, we measured the data buffer size obtained by their constructors.
Library | Word100 | Word5K | Word15K | Word100K | UniDic |
---|---|---|---|---|---|
aho-corasick (NFA) | 0.071 | 0.667 | 1.5 | 12.8 | 63.5 |
aho-corasick (DFA) | 0.276 | 17.268 | 44.0 | 763.4 | 2664.2 |
daachorse (bytewise) | 0.013 | 0.426 | 1.0 | 8.6 | 28.2 |
daachorse (charwise) | 0.016 | 0.549 | 1.3 | 11.2 | 25.3 |
fst | 0.001 | 0.109 | 0.1 | 1.0 | 2.8 |
yada | 0.005 | 0.346 | 0.3 | 2.9 | 9.4 |