-
Notifications
You must be signed in to change notification settings - Fork 14
Performance Comparison
This wiki shows the experimental results of time and memory for daachorse and other pattern-matching libraries.
We used the following two types of datasets for our experiments:
-
Word100, 5K, 15K
- Patterns: English words used in the aho-corasick crate for benchmarking.
- Text: The Adventures of Sherlock Holmes (by Arthur Conan Doyle) available at Project Gutenberg.
-
Word100K
- Patterns: 100K English words used in the fst crate for benchmarking.
- Text: The Adventures of Sherlock Holmes (by Arthur Conan Doyle) available at Project Gutenberg.
-
UniDic
- Patterns: 675K Japanese words from UniDic.
- Text: I Am a Cat (by Soseki Natsume) available at Aozora Bunko.
We compared the following Aho-Corasick implementations:
-
aho-corasick (v0.7.18)
- NFA
- DFA
- daachorse (v1.0.0, this library)
- bytewise
- charwise
In addition, we compared the following pattern matching automaton implementations:
The following is the specification of the used machine:
- CPU: Intel Core i9-12900K (30MB cache, 16 Core, 3.2GHz-5.2GHz)
- Memory: 64GiB
- OS: Ubuntu 22.04
The benchmark code can be found here.
We measured the execution time [ms] of the find_overlapping()
function provided by each Aho-Corasick implementation. For fst and yada, we performed common prefix searches from all positions in the target text.
Library | Words100 | Words5K | Words15K | Words100K | UniDic |
---|---|---|---|---|---|
aho-corasick (NFA) | 4.22 | 6.53 | 7.36 | 11.36 | 28.73 |
aho-corasick (DFA) | 0.62 | 1.11 | 1.41 | 2.54 | 16.31 |
daachorse (bytewise u64) | 3.09 | 3.17 | 3.08 | 3.31 | 9.56 |
daachorse (bytewise u32) | 3.08 | 3.15 | 3.07 | 3.32 | 9.56 |
daachorse (charwise u64) | 2.63 | 2.55 | 2.27 | 2.11 | 5.47 |
daachorse (charwise u32) | 2.67 | 2.57 | 2.30 | 2.14 | 5.52 |
fst | 12.76 | 17.67 | 19.75 | 26.57 | 44.23 |
yada | 3.06 | 4.23 | 4.35 | 4.71 | 9.54 |
We measured the execution time [ms] of the find_iter()
function provided by each Aho-Corasick implementation.
Library | Words100 | Words5K | Words15K | Words100K | UniDic |
---|---|---|---|---|---|
aho-corasick (NFA) | 4.26 | 6.50 | 7.39 | 10.13 | 10.17 |
aho-corasick (DFA) | 0.61 | 1.09 | 1.35 | 2.05 | 2.42 |
daachorse (bytewise u64) | 3.13 | 3.19 | 3.10 | 3.02 | 1.00 |
daachorse (bytewise u32) | 3.15 | 3.19 | 3.10 | 3.03 | 1.01 |
daachorse (charwise u64) | 2.64 | 2.56 | 2.26 | 1.98 | 0.91 |
daachorse (charwise u32) | 2.64 | 2.56 | 2.27 | 1.98 | 0.91 |
We measured the execution time [ms] of the leftmost_find_iter()
function (resp. find_iter()
) with the MatchKind::LeftmostLongest
specification provided by daachorse (resp. aho-corasick).
Library | Words100 | Words5K | Words15K | Words100K | UniDic |
---|---|---|---|---|---|
aho-corasick (NFA) | 4.06 | 6.29 | 7.09 | 10.97 | 23.77 |
aho-corasick (DFA) | 0.55 | 1.03 | 1.31 | 2.60 | 10.63 |
daachorse (bytewise u64) | 3.26 | 3.29 | 3.21 | 3.55 | 6.58 |
daachorse (bytewise u32) | 3.15 | 3.27 | 3.17 | 3.53 | 6.55 |
daachorse (charwise u64) | 2.82 | 2.70 | 2.41 | 2.32 | 4.25 |
daachorse (charwise u32) | 2.79 | 2.68 | 2.40 | 2.31 | 4.20 |
We measured the execution time [ms] of the leftmost_find_iter()
function (resp. find_iter()
) with the MatchKind:: LeftmostFirst
specification provided by daachorse (resp. aho-corasick).
Library | Words100 | Words5K | Words15K | Words100K | UniDic |
---|---|---|---|---|---|
aho-corasick (NFA) | 4.05 | 6.29 | 7.11 | 10.17 | 10.32 |
aho-corasick (DFA) | 0.55 | 1.04 | 1.30 | 2.33 | 3.45 |
daachorse (bytewise u64) | 3.15 | 3.25 | 3.18 | 3.34 | 1.53 |
daachorse (bytewise u32) | 3.15 | 3.25 | 3.19 | 3.34 | 1.53 |
daachorse (charwise u64) | 2.79 | 2.68 | 2.40 | 2.24 | 1.57 |
daachorse (charwise u32) | 2.73 | 2.65 | 2.35 | 2.20 | 1.58 |
We measured the memory usage [MiB] reported by heap_bytes()
provided by the Aho-Corasick implementations. For fst and yada, we measured the data buffer size obtained by their constructors.
Library | Word100 | Word5K | Word15K | Word100K | UniDic |
---|---|---|---|---|---|
aho-corasick (NFA) | 0.071 | 0.667 | 1.5 | 12.8 | 63.5 |
aho-corasick (DFA) | 0.276 | 17.268 | 44.0 | 763.4 | 2664.2 |
daachorse (bytewise u64) | 0.013 | 0.445 | 1.1 | 9.0 | 30.7 |
daachorse (bytewise u32) | 0.013 | 0.426 | 1.0 | 8.6 | 28.2 |
daachorse (charwise u64) | 0.016 | 0.568 | 1.4 | 11.6 | 27.8 |
daachorse (charwise u32) | 0.016 | 0.549 | 1.3 | 11.2 | 25.3 |
fst | 0.001 | 0.109 | 0.1 | 1.0 | 2.8 |
yada | 0.005 | 0.346 | 0.3 | 2.9 | 9.4 |