This regex barometer defines a number of different benchmark models. Each benchmark model is generally intended to correspond to a "real world" regex task, whether that be searching line by line as a grep or log searching tool does, searching large haystacks all at once (perhaps memory mapped from disk), or even just measuring how long it takes to compile a regex.
The purpose of defining distinct models is to attempt to capture a broad range of regex tasks that represent or approximate real world workloads. Most regex benchmarks either define only one model, or even worse, combine multiple models together. Usually the only model measured by other regex benchmarks is "how many times does a regex match in a haystack." But real world tasks often involve extracting the positions that capturing groups match, and that task usually requires asking the regex engine to do more work.
A secondary purpose of defining distinct models is that many regex engines contain a number of different optimization techniques, and those techniques may or may not apply depending on the workload. For example, Rust's regex crate might use a bounded backtracking algorithm to resolve capturing groups when searching short haystacks, but might otherwise not use this technique at all. Therefore, such an optimization is more likely to show up when resolving capturing groups while, say, searching the lines in a log file, but is less likely to show up when searching a document that is megabytes in size all at once. A benchmark that doesn't include varied models will miss an opportunity to capture this nuance.
Of course, all models are wrong, but some are useful. Most regex benchmarks never set out to be exhaustive, and certainly this set of models is not exhaustive either. Balance is important. A single model is almost certainly the wrong balance to strike.
What follows is a description of each model and its motivation for existing.
This model, `compile`, measures the time it takes to compile a regex. Implementations of this model only measure the time it takes to build a regex to the point that it is ready for a search. Implementations must also execute the regex on a haystack and report the number of matches as a verification step. This verification step is explicitly not measured as part of the timing of this benchmark, but if an incorrect count is reported, the measurement for that specific regex engine will fail.
This model is important because regex compilation sometimes matters. While not all programmers are careful about making sure regex compilation happens no more than it needs to, many programmers are used to the idea that one should try to compile a regex just once if they can, and then reuse that same regex for all searches. Namely, re-compiling the same regex in a loop is usually a performance footgun.
Some higher level environments, such as Python's `re` module, will actually cache the compilation of regexes such that one doesn't usually need to care about when regex compilation happens at all. Instead, "just compile each regex once" happens automatically without users of the regex library caring about it. However, most lower level regex engine APIs do not include this sort of automatic caching behavior.
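As a rough illustration of the footgun (and of Python's cache softening it), one might time a search where the pattern is compiled on every call versus compiled once up front. This is only a sketch; the pattern, haystack and iteration count are arbitrary:

```python
import re
import timeit

haystack = "GET /index.html 200 1234"

# Compiled on every call. Python's internal compilation cache softens
# the blow here, but lower level regex APIs typically have no such cache.
per_call = timeit.timeit(
    lambda: re.compile(r"\b\d{3}\b").search(haystack), number=100_000)

# Compiled once and reused for every search.
regex = re.compile(r"\b\d{3}\b")
reused = timeit.timeit(lambda: regex.search(haystack), number=100_000)

print(per_call, reused)
```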
Since it is generally expected that compiling a regex will take longer than executing a search with that regex (for some reasonable haystack length), and since many are used to the notion of building a regex once and reusing it for the lifetime of a program, it follows that the compilation time of a regex is usually not a major concern. With that said, this is only true if compilation time is reasonable. For example, if a regex engine uses fully compiled DFAs for all regexes, then it would not be hard (especially with large Unicode character classes) to blow this "reasonableness" budget to the point that a single regex might take multiple seconds to compile. (Or perhaps much more, since DFA construction takes `O(2^n)` time in the worst case, where `n ~ len(pattern)`.)
Another reason regex compilation might matter is if the regex is being compiled in a context where a server is responding to a client request. In this case, the regex might be compiled once and only used once, such that compilation time, if it takes a long time, could have a material impact on the absolute response time of the request.
Overall, it is important to have some kind of barometer as to what compilation times look like for a particular regex engine. While the most obvious question is just a coarse "are they reasonable?", there are other use cases where compilation time might matter more.
This is currently the only model that measures the compilation time of a regex. (Except for perhaps `regex-redux`, but in that case, the regexes are simple enough and the haystack is big enough that compilation time doesn't factor into it so long as it's reasonable.) Namely, in most benchmarks, search times are so fast that if they included compilation time, then compilation time would dominate and the signal from the search time benchmarks would be greatly diminished or snuffed out completely.
The `count` model resembles what most other regex benchmarks do: it measures the time it takes to find all matches in a single haystack. The verification step simply confirms that the number of matches corresponds to what is expected.
Like most other regex benchmarks (but not all), this does not include the time it takes to compile the regex.
This model is the "main" barometer of regex engine speed. It is especially good at measuring throughput (for large haystacks) and latency (for very short haystacks). If you only looked at one model in this regex barometer, this would be the one to look at.
Implementations of this model may use whatever techniques are available to them to compute the total count of matches. For example, this might mean avoiding tracking capture group spans in a backtracker, or even avoiding finding the start of a match in automata oriented engines.
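For illustration, here is a minimal sketch of this model using Python's `re` module. The helper name `count_model` is just for this example; real implementations may use lower level APIs:

```python
import re

def count_model(pattern: str, haystack: str) -> int:
    # Compilation happens up front and is not part of what this model
    # measures; only the search below would be timed.
    regex = re.compile(pattern)
    return sum(1 for _ in regex.finditer(haystack))
```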
This model, `count-spans`, is like `count`, except it returns a sum of the lengths of all matches found in a single haystack. The verification step simply confirms that the sum matches what is expected.
The length of a match should ideally be counted in bytes, but it is also permissible to count code units. For example, .NET's regex engine can only run on sequences of UTF-16 code units, so deriving a length from anything other than UTF-16 code units would impose an overhead cost that is artificial to this benchmark. Benchmark definitions need to account for this by specifying different expected counts for regex engines that count something other than individual bytes.
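To make the difference concrete, the same text can have different lengths depending on the unit of measurement. A quick Python illustration:

```python
s = "a😀"
# The same string measured in bytes vs. UTF-16 code units.
print(len(s.encode("utf-8")))           # 5 bytes
print(len(s.encode("utf-16-le")) // 2)  # 3 UTF-16 code units
```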
For example, given the regex `[0-9]{2}|[a-z]` and the haystack `12a!!34`, the total sum reported should be `len(12) + len(a) + len(34) = 5`.
The purpose of this model is to force regex engines to compute the full bounds of a match, which includes both the start and end of the match. A regex engine may internally use any technique for reporting the bounds of a match. The important bit is that implementations of this model ask the regex engine for both the start and end of each match, and then sum the lengths of every match.
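A minimal sketch of this model in Python follows; the helper name `count_spans` is just for this example:

```python
import re

def count_spans(pattern: str, haystack: str) -> int:
    regex = re.compile(pattern)
    # Summing the length of every match forces the engine to report
    # both the start and the end of each match.
    return sum(m.end() - m.start() for m in regex.finditer(haystack))

# The example above: 2 + 1 + 2 = 5.
print(count_spans(r"[0-9]{2}|[a-z]", "12a!!34"))
```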
Having both `count` and `count-spans` models somewhat complicates the overall barometer, but the inclusion of Hyperscan in this barometer nearly demands that we do so. Namely, Hyperscan can have quite different performance characteristics depending on whether the start of the match is requested from it or not. Moreover, Hyperscan seems to have more stringent limits on the size of regexes it allows when the start of match is requested. For this reason, it's quite important to ensure we have a defined and supported way of modeling the performance of regex engines on tasks that do not require the start of a match.
This model, `count-captures`, is like `count`, but instead of counting the number of matches, it counts the number of matching capturing groups. For example, given the regex `([0-9])([0-9])|([a-z])` and the haystack `12a34`, the total number of matching capturing groups is `8`. Namely, there are 3 matches:

- `12` matches the first alternate, which contributes 1 implicit group (the overall match) and 2 explicit groups.
- `a` matches the second alternate, which contributes 1 implicit group and 1 explicit group.
- `34` matches the first alternate, which contributes 1 implicit group and 2 explicit groups.

That adds up to `1 + 2 + 1 + 1 + 1 + 2 = 8`.
The important bit to notice here is that not all capturing groups necessarily participate in a match. So it isn't always enough to simply report `1 + number-of-capturing-groups-in-pattern` for every match. The regex engine needs to actually compute which capturing groups participate in the match.
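A hedged Python sketch of the example above makes the counting rule concrete; note the explicit check for groups that didn't participate:

```python
import re

regex = re.compile(r"([0-9])([0-9])|([a-z])")
count = 0
for m in regex.finditer("12a34"):
    count += 1  # the implicit group, i.e., the overall match
    # m.groups() reports None for any group that didn't participate.
    count += sum(1 for g in m.groups() if g is not None)
print(count)  # 8
```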
The purpose of this model is to measure timings for a very common regex task, but one that can have a very different performance profile from the `count` model. Namely, some regex engines need to do a lot of additional work to report the spans for all capture groups that participate in a match. So even if a regex engine appears fast in the `count` model, it could be quite a bit slower in the `count-captures` model. If your primary use case includes using capturing groups in your regex, then the `count` model could give quite a misleading impression.
To simplify implementations of the model, one may assume that any regex measured under this model will never match the empty string. In particular, it means that one can iterate over all successive matches in a haystack with very simple code:
```
count = 0
at = 0
loop:
    captures = re.find_at(haystack, at)
    if not captures.is_match():
        break
    for capture in captures:
        if capture.is_match():
            count += 1
    # If one had to handle the case where at == captures.end()
    # (i.e., the match was empty), then this would lead to an
    # infinite loop.
    at = captures.end()
```
Note that the `count` benchmark model must handle the case of an empty match correctly. If it's simpler to use a regex engine's iterator that handles the empty match for you, then it's acceptable to use it here in this benchmark model. (Its cost is likely negligible, although one may test that empirically and use a simpler iteration protocol like the one above if that assumption proves to be false.)
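To see why empty matches are a hazard for naive iteration, consider a pattern like `a*` that can match the empty string at every position. In Python, for example:

```python
import re

# a* matches the empty string at every position of "bb", so a naive
# "resume at the end of the previous match" loop would never advance.
print([m.span() for m in re.finditer(r"a*", "bb")])
# [(0, 0), (1, 1), (2, 2)]
```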
This model, `grep`, measures the time it takes to iterate over every line in a haystack and report the number of such lines that contain at least one match of the regex. The verification step compares the total number of matching lines, and not the total number of matches. So for example, a line that contains 3 matches of a regex will only contribute 1 to the overall count.
Approximate pseudocode for the benchmark looks like this. The comments explain some of the particulars:
```
regex = ...
haystack = ...
count = 0
while not haystack.is_empty():
    line_end = haystack.find('\n')
    if line_end == -1:
        # If no more line terminators could be found, then the
        # entire remaining portion of the haystack corresponds
        # to the last line.
        line = haystack
        haystack = ""
    else:
        line = haystack[0..line_end]
        # Skip past the line terminator too, otherwise the next
        # iteration would find the same '\n' and never advance.
        haystack = haystack[line_end+1..haystack.len()]
    # This handles CRLF. If the line was terminated by \r\n,
    # then we strip the \r too.
    if not line.is_empty() and line[line.len()-1] == '\r':
        line = line[0..line.len()-1]
    # The regex engine shouldn't be given the line terminator.
    if regex.is_match(line):
        count += 1
print(count)
```
In this model, line iteration is actually included as part of the measurement. Including line iteration in the measurement both simplifies the model and more closely reflects reality. For example, if you can't separate Python's `re` regex engine from Python itself, and you need to accomplish the task of grepping a log file, then line iteration should be part of what's measured because it most closely resembles the real world task.
The purpose of this model is to capture a very common use case: iterate over the lines of a file and do "something" with the lines that match some regex pattern. In this model, we leave out the "something" and instead just measure how long it takes to find the set of lines that match a regex pattern. This of course resembles the workload of the venerable `grep` tool, hence the name of the model.
Note that "fast" implementations of grep tend to not work by splitting the haystack into lines and then running a search on each line. Instead, they split the haystack into very large chunks and then run a regex search on the chunk instead of each line individually. This approach is taken primarily for performance, so why don't we use that model here? For a few reasons:
- It's not what is typically done for ad hoc simplistic re-implementations of grep. In my experience, when you need to find matching lines in a haystack, you just iterate over the lines and run the regex.
- The model is much more complex. It requires somewhat careful buffer management and also some minor sophistication with the regex itself. Namely, in order for this technique to work, you need to guarantee that the regex does not match `\n` anywhere, or else it might report matches that span multiple lines, which would be an incorrect result for the high level task here. This means, for example, that you need to rewrite regexes that contain `\s` to instead use a character class that lacks `\n` (see the sketch after this list).
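For illustration only, here is one naive way such a rewrite might look in Python. A real implementation would need to parse the pattern, since a plain textual replacement mishandles things like escaped backslashes or `\s` inside an existing class:

```python
import re

# [^\S\n] means "whitespace, except \n": any character that is
# neither a non-whitespace character nor a newline.
chunk_unsafe = r"error:\s+\w+"
chunk_safe = chunk_unsafe.replace(r"\s", r"[^\S\n]")
print(re.findall(chunk_safe, "error:   disk\nfull"))
# ['error:   disk']
```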
Thus, we stick with a very simple model that also has the benefit of reflecting real world use cases.
This model, `grep-captures`, is similar to `grep` in that it works by iterating over every line in a haystack and searching it with a regex, except that instead of reporting the number of matching lines, it reports the total number of matching capturing groups. This includes multiple matches within the same line.
Approximate pseudocode for the benchmark looks like this. The comments explain some of the particulars:
```
regex = ...
haystack = ...
count = 0
while not haystack.is_empty():
    line_end = haystack.find('\n')
    if line_end == -1:
        # If no more line terminators could be found, then the
        # entire remaining portion of the haystack corresponds
        # to the last line.
        line = haystack
        haystack = ""
    else:
        line = haystack[0..line_end]
        # Skip past the line terminator too, otherwise the next
        # iteration would find the same '\n' and never advance.
        haystack = haystack[line_end+1..haystack.len()]
    # This handles CRLF. If the line was terminated by \r\n,
    # then we strip the \r too.
    if not line.is_empty() and line[line.len()-1] == '\r':
        line = line[0..line.len()-1]
    # The regex engine shouldn't be given the line terminator.
    # Remember, we need to find all matches within a line.
    for captures in regex.captures(line):
        for group in captures.groups():
            # Not all groups necessarily participate in a match!
            # So this must only include groups that match.
            if group is not None:
                count += 1
print(count)
```
The purpose of this model is very similar to `grep`, but it measures a task that usually requires more work. That is, while "just finding matching lines" is a common task unto itself, it is also common not just to find matching lines, but to pluck actual data out of each line. For example, you might use a regex to parse each line in an HTTP server log, and use capturing groups to extract pieces of data like the HTTP response code, the referrer and other things.
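A hedged sketch of that log-parsing task follows; the log format and pattern here are purely illustrative and not part of any benchmark definition:

```python
import re

line = ('127.0.0.1 - - [10/Oct/2000:13:55:36] '
        '"GET /a.png HTTP/1.0" 200 2326 "http://example.com/"')
pat = re.compile(
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)"')
m = pat.search(line)
if m is not None:
    print(m.group("status"), m.group("referrer"))
# 200 http://example.com/
```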
Just as with `count-captures`, this model is useful because reporting the spans of matching capturing groups usually requires more work than simply answering whether a line matches or not. Given how common this sort of task is, it's important to measure it as a distinct model.
Also as with `count-captures`, implementations of this model may assume that the regex being measured will never match the empty string. See the `count-captures` model for more details.
This is a port of the regex-redux benchmark from The Benchmark Game. Unlike most other regex benchmarks, `regex-redux` is more of a holistic benchmark that doesn't focus each unit of measurement on a single regex, but rather on a task that requires compiling and searching for many different regexes.
Implementations of this model must check that their output is equivalent to the following string:
```
agggtaaa|tttaccct 6
[cgt]gggtaaa|tttaccc[acg] 26
a[act]ggtaaa|tttacc[agt]t 86
ag[act]gtaaa|tttac[agt]ct 58
agg[act]taaa|ttta[agt]cct 113
aggg[acg]aaa|ttt[cgt]ccct 31
agggt[cgt]aa|tt[acg]accct 31
agggta[cgt]a|t[acg]taccct 32
agggtaa[cgt]|[acg]ttaccct 43
1016745
1000000
547899
```
Implementations must also report the length of the input at the end of execution (after all replacements have been made), and this is checked against the `count` field in the benchmark definition.
This is probably the weakest model in this particular regex barometer, but it is included due to the popularity of The Benchmark Game. It provides a way to connect the results for regex engines in this benchmark with the results in a different benchmark.
Do note though that the benchmarks aren't precisely equivalent. There are at least the following differences:
- Parallelism is forbidden.
- Unicode is disabled for all regex engines that participate since the benchmark definition from The Benchmark Game does not require any sort of Unicode support. Disabling Unicode support in the regex engines makes this more of an apples-to-apples comparison.
- The haystack used is smaller than the haystack used by the Benchmark Game. This makes running the benchmark faster, but the important bit is that the timings are not directly comparable.
This model embeds its own regexes, so the `regex` field must be empty, i.e., `regex = []`.