Skip to content

Performance Comparison

Koichi Akabe edited this page Jul 5, 2022 · 15 revisions

This wiki shows the experimental results of time and memory for daachorse and other pattern-matching libraries.

Experimental Setup

Datasets

We used the following two types of datasets for our experiments:

Implementations

We compared the following Aho-Corasick implementations:

  • aho-corasick (v0.7.18)
    • NFA
    • DFA
  • daachorse (v0.4.3, this library)
    • bytewise
    • charwise

In addition, we compared the following pattern matching automaton implementations:

Machine

The following is the specification of the used machine:

  • CPU: Intel(R) Core(TM) i7-8086K CPU @ 4.00GHz
  • Memory: 64GiB
  • OS: CentOS Linux release 7.5.1804 (Core)

Experimental Result

The benchmark code can be found here.

Overlapping Match

We measured the execution time [ms] of the find_overlapping() function provided by each Aho-Corasick implementation. For fst and yada, we performed common prefix searches from all positions in the target text.

Library Words100 Words5K Words15K Words100K UniDic
aho-corasick (NFA) 4.22 6.50 7.36 11.28 28.66
aho-corasick (DFA) 0.62 1.12 1.42 2.54 16.31
daachorse (bytewise) 3.09 3.16 3.07 3.30 9.52
daachorse (charwise) 2.60 2.51 2.23 2.10 5.50
fst 12.63 17.51 19.65 26.27 44.13
yada 3.10 4.21 4.28 4.68 9.48

Non-Overlapping Shortest Match

We measured the execution time [ms] of the find_iter() function provided by each Aho-Corasick implementation.

Library Words100 Words5K Words15K Words100K UniDic
aho-corasick (NFA) 4.28 6.49 7.33 10.12 10.26
aho-corasick (DFA) 0.61 1.09 1.35 2.04 2.39
daachorse (bytewise) 3.09 3.16 3.05 2.97 0.98
daachorse (charwise) 2.72 2.60 2.30 2.02 0.91

Non-Overlapping Leftmost-Longest Match

We measured the execution time [ms] of the leftmost_find_iter() function (resp. find_iter()) with the MatchKind::LeftmostLongest specification provided by daachorse (resp. aho-corasick).

Non-Overlapping Leftmost-First Match

We measured the execution time [ms] of the leftmost_find_iter() function (resp. find_iter()) with the MatchKind:: LeftmostFirst specification provided by daachorse (resp. aho-corasick).

Library Words100 Words5K Words15K Words100K UniDic
aho-corasick (NFA) 4.08 6.42 7.15 10.94 22.86
aho-corasick (DFA) 0.54 1.03 1.31 2.60 10.66
daachorse (bytewise) 3.16 3.26 3.17 3.51 6.51
daachorse (charwise) 2.79 2.68 2.39 2.31 4.22

Memory usage

We measured the memory usage [MiB] reported by heap_bytes() provided by the Aho-Corasick implementations. For fst and yada, we measured the data buffer size obtained by their constructors.

Library Word100 Word5K Word15K Word100K UniDic
aho-corasick (NFA) 0.071 0.667 1.5 12.8 63.5
aho-corasick (DFA) 0.276 17.268 44.0 763.4 2664.2
daachorse (bytewise) 0.013 0.426 1.0 8.6 28.2
daachorse (charwise) 0.016 0.549 1.3 11.2 25.3
fst 0.001 0.109 0.1 1.0 2.8
yada 0.005 0.346 0.3 2.9 9.4
Clone this wiki locally