Skip to content

Performance Comparison

Koichi Akabe edited this page Jun 7, 2022 · 15 revisions

This wiki shows the experimental results of time and memory for daachorse and other pattern matching libraries.

Experimental Setup

Datasets

We used the following two types of datasets for our experiments:

Implementations

We compared the following Aho-Corasick implementations:

  • aho-corasick (v0.7.18)
    • NFA
    • DFA
  • daachorse (v0.4.3, this library)
    • bytewise
    • charwise

In addition, we compared the following pattern matching automaton implementations:

Machine

The following is the specification of the used machine:

  • CPU: Intel(R) Core(TM) i7-8086K CPU @ 4.00GHz
  • Memory: 64GiB
  • OS: CentOS Linux release 7.5.1804 (Core)

Experimental Result

The benchmark code can be found here.

Overlapping Match

We measured the elapsed time [ms] to perform find_overlapping() provided by the Aho-Corasick implementations. For fst and yada, we performed common prefix searches from all positions in the target text.

Library Word100 Word5K Word15K Word100K UniDic
aho-corasick (NFA) 4.5 7.9 9.1 14.3 36.5
aho-corasick (DFA) 0.8 1.9 2.3 3.7 23.9
daachorse (bytewise) 3.2 3.5 3.6 4.2 11.4
daachorse (charwise) 2.9 3.0 2.7 2.6 6.8
fst 17.3 24.0 26.6 35.6 58.5
yada 3.3 4.4 4.8 5.9 12.6

Non-Overlapping Shortest Match

We measured the elapsed time [ms] to perform find_iter() provided by the Aho-Corasick implementations.

Library Word100 Word5K Word15K Word100K UniDic
aho-corasick (NFA) 4.4 7.9 9.2 12.5 12.2
aho-corasick (DFA) 0.8 1.8 2.3 3.2 5.5
daachorse (bytewise) 3.3 3.5 3.6 3.7 1.5
daachorse (charwise) 2.9 3.0 2.8 2.5 1.5

Non-Overlapping Leftmost-Longest Match

We measured the elapsed time [ms] to perform leftmost_find_iter() (resp. find_iter()) with the MatchKind::LeftmostLongest specification provided by daachorse (resp. aho-corasick).

Library Word100 Word5K Word15K Word100K UniDic
aho-corasick (NFA) 4.3 7.6 8.7 14.4 29.4
aho-corasick (DFA) 0.8 1.9 2.5 3.9 18.3
daachorse (bytewise) 3.3 3.6 3.7 4.3 8.7
daachorse (charwise) 3.2 3.5 3.4 3.4 5.3

Non-Overlapping Leftmost-First Match

We measured the elapsed time [ms] to perform leftmost_find_iter() (resp. find_iter()) with the MatchKind:: LeftmostFirst specification provided by daachorse (resp. aho-corasick).

Library Word100 Word5K Word15K Word100K UniDic
aho-corasick (NFA) 4.2 7.7 8.9 12.8 12.3
aho-corasick (DFA) 0.8 1.9 2.3 3.6 6.1
daachorse (bytewise) 3.5 3.6 3.7 4.1 3.0
daachorse (charwise) 3.0 3.4 3.3 3.2 2.3

Memory usage

We measured the memory usage [MiB] reported by heap_bytes() provided by the Aho-Corasick implementations. For fst and yada, we measured the data buffer size obtained by their constructors.

Library Word100 Word5K Word15K Word100K UniDic
aho-corasick (NFA) 0.071 0.667 1.5 12.8 63.5
aho-corasick (DFA) 0.276 17.268 44.0 763.4 2664.2
daachorse (bytewise) 0.013 0.426 1.0 8.6 28.2
daachorse (charwise) 0.016 0.549 1.3 11.2 25.3
fst 0.001 0.109 0.1 1.0 2.8
yada 0.005 0.346 0.3 2.9 9.4
Clone this wiki locally