main: initial version of rebar
This was pulled out from the 'regex-cli' command as part of ongoing
regex-automata work and then almost entirely rewritten and reorganized.
BurntSushi committed Mar 7, 2023
1 parent ec4a00a commit 35e99e8
Showing 329 changed files with 3,452,112 additions and 1 deletion.
2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@
/target
/tmp
15 changes: 15 additions & 0 deletions .vim/coc-settings.json
@@ -0,0 +1,15 @@
{
  "rust-analyzer.linkedProjects": [
    "engines/hyperscan/Cargo.toml",
    "engines/pcre2/Cargo.toml",
    "engines/re2/Cargo.toml",
    "engines/regress/Cargo.toml",
    "engines/rust/aho-corasick/Cargo.toml",
    "engines/rust/memchr/Cargo.toml",
    "engines/rust/regex-automata/Cargo.toml",
    "engines/rust/regex/Cargo.toml",
    "engines/rust/regex-old/Cargo.toml",
    "engines/rust/regex-syntax/Cargo.toml",
    "Cargo.toml"
  ]
}
56 changes: 56 additions & 0 deletions BIAS.md
@@ -0,0 +1,56 @@
This documents what I believe to be the bias of this barometer. Bias is
important to explicitly describe because bias influences how one might interpret
results. For example, if the author of the barometer is also the author of
one of the regex engines included in the barometer (as is the case here), then
it's reasonable to assume that bias may implicitly or explicitly influence the
results to favor that regex engine.

The following is a list of biases that I was able to think of. Contributions
expanding this list are welcome.

* As mentioned above, I ([@BurntSushi]) authored both this barometer and the
[Rust regex crate]. The fact that the regex crate does well in this barometer
should perhaps be treated suspiciously. For example, even assuming good faith,
I may have selected a set of benchmarks that I knew well, and have thus spent
time optimizing for.
* The barometer represents a _curation_ of benchmarks, which implies someone
had to make a decision about not only which benchmarks to include, but also
which to exclude. Even if I hadn't also authored a regex engine included in
this barometer, this selection process would still be biased. My hope is that
this can be mitigated over time as the curated benchmarks are refined. We
should not add to the curated set without bound, but I do expect modifications
to it to be made. Ideally, the curated set would somehow approximate the set
of all regular expressions being executed in the wild, but this is of course
difficult to ascertain. So we wind up having to make guesses, and thus, bias
is introduced. I've also attempted to mitigate this bias by orienting some
proportion of benchmarks on regexes I've found used in other projects. (And of
course, the selection of those benchmarks is surely biased as well.)
* The analysis presented for each benchmark is heavily geared towards the
`rust/regex` engine. This is because I know that engine the best. I've also
found it somewhat difficult to understand what other engines actually do. The
source code of most regex engines is actually quite difficult to casually
browse. I often find it most difficult to get a high level picture of what's
happening. With that said, profiling programs can usually lead one to make
educated guesses as to what an engine is doing. Nevertheless, I welcome
contributions for improving analysis.
* The specific set of [benchmark models](MODELS.md) used represents bias
in how things are actually measured. While we try to do better than other
regex benchmarks by including multiple different types of measurements, we
of course cannot account for everything. For example, one common technique
used in practice, especially with automata-oriented regex engines, is to run
one simpler regex that might produce false positives and then another more
complex regex to eliminate the false positives that get by the first. This
might be because of performance, or simply because of a lack of features (like
look-around). Another example of a model that is not included is one that both
compiles and searches for a regex as a single unit of work. We instead split
this apart into separate "compile" and "search" models.
* The author of this barometer has a background principally in
automata-oriented regex engines. For this reason, all benchmarks in this
barometer measure true _regular_ expressions. More than that, they essentially
avoid any fancy features that no one knows how to implement efficiently, such
as arbitrary look-around. (Simple look-around assertions like `^`, `$` and `\b`
are used though.) This means that the barometer misses whole classes of
regexes that are just not measured here at all.
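
To make the two-regex technique mentioned in the benchmark models item above
concrete, here is a minimal sketch in Python. It is only an illustration of
the general idea: the patterns, the `find_dates` helper and the date example
are hypothetical and are not taken from the barometer or from any engine it
measures. A cheap, permissive regex finds candidates, and a stricter regex
then discards the false positives.

```python
import re
from typing import List

# Hypothetical patterns, chosen only for illustration.
# Pass 1: cheap and permissive; accepts impossible dates such as 2023-13-40.
CANDIDATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
# Pass 2: stricter; rejects out-of-range months and days.
STRICT = re.compile(r"\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])")


def find_dates(haystack: str) -> List[str]:
    """Return candidates from the first pass that survive the second pass."""
    confirmed = []
    for m in CANDIDATE.finditer(haystack):
        # Only the small candidate span is re-checked, not the whole haystack.
        if STRICT.fullmatch(m.group()):
            confirmed.append(m.group())
    return confirmed
```

A benchmark model for this workflow would have to measure both passes
together as one unit of work, which is why its absence is listed here as a
source of bias.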

[@BurntSushi]: https://github.com/BurntSushi
[Rust regex crate]: https://github.com/rust-lang/regex