automata: reduce regex contention considerably #1080
Conversation
One question I posed on Twitter was: what is the best case scenario here when all threads are sharing the same `Pool` (which will be much worse than the cloned case)? Including results here as a reference for the benchmark running on my machine. `THREADS=16` is expected to be slower (and to match your PR comment), and it does.

Max pool stack @ 16: this matches the case where the max pool stack is 8 and you run with 8 threads. That's good.

Max pool stack @ 32: this is consistently ~2x slower than max pool stack @ 16.

So I think this strategy of sharing a pool at 8 threads is a great first step and should be included. If someone has a clever idea for growing the stack when concurrent callers exceed the pool size in a way that isn't slow to manage, that would be worth pursuing. Even so, for multi-core apps that need "as fast as possible" throughput but also depend on regex, `cloned` is still 2-5x faster. It would be cool if there were a clippy lint that could warn you about this.
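For reference, the cloned-vs-shared pattern these numbers compare can be sketched with a stdlib-only harness. The `Worker` type below is a stand-in for a real `Regex` (names and the workload are illustrative, not from the actual benchmark); in the real case, cloning a `Regex` is cheap because it is reference counted internally.

```rust
use std::sync::Arc;
use std::thread;

// Stand-in for a compiled regex. Cloning a real `Regex` is cheap
// (reference counted) and gives the clone its own "owner" fast path.
#[derive(Clone)]
struct Worker {
    pattern: Arc<String>,
}

impl Worker {
    fn is_match(&self, haystack: &str) -> bool {
        haystack.contains(self.pattern.as_str())
    }
}

// A fixed amount of work per thread: 1000 short searches.
fn count_matches(w: &Worker) -> usize {
    (0..1000).filter(|_| w.is_match("xabcx")).count()
}

// Spawn `threads` workers; `cloned` picks between the two patterns
// the benchmark compares.
fn run(threads: usize, cloned: bool) -> usize {
    let shared = Arc::new(Worker { pattern: Arc::new("abc".to_string()) });
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let shared = Arc::clone(&shared);
            thread::spawn(move || {
                if cloned {
                    // Clone before searching: each thread owns its handle.
                    let own = (*shared).clone();
                    count_matches(&own)
                } else {
                    // Every thread searches through the one shared handle.
                    count_matches(&shared)
                }
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}
```

Both modes do the same work and return the same match counts; the difference is only in which synchronization path each search takes internally.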
> **Context:** A `Regex` uses internal mutable space (called a `Cache`)
> while executing a search. Since a `Regex` really wants to be easily
> shared across multiple threads simultaneously, it follows that a
> `Regex` either needs to provide search functions that accept a
> `&mut Cache` (thereby pushing synchronization to a problem for the
> caller to solve) or it needs to do synchronization itself. While
> there are lower level APIs in `regex-automata` that do the former,
> they are less convenient. The higher level APIs, especially in the
> `regex` crate proper, need to do some kind of synchronization to
> give a search the mutable `Cache` that it needs.
>
> The current approach to that synchronization essentially uses a
> `Mutex<Vec<Cache>>` with an optimization for the "owning" thread
> that lets it bypass the `Mutex`. The owning thread optimization
> makes it so the single threaded use case essentially doesn't pay for
> any synchronization overhead, and that all works fine. But once the
> `Regex` is shared across multiple threads, that `Mutex<Vec<Cache>>`
> gets hit. And if you're doing a lot of regex searches on short
> haystacks in parallel, that `Mutex` comes under extremely heavy
> contention. To the point that a program can slow down by enormous
> amounts.
>
> This PR attempts to address that problem.
>
> Note that it's worth pointing out that this issue can be worked
> around.
>
> The simplest work-around is to clone a `Regex` and send it to other
> threads instead of sharing a single `Regex`. This won't use any
> additional memory (a `Regex` is reference counted internally), but
> it will force each thread to use the "owner" optimization described
> above. This does mean, for example, that you can't share a `Regex`
> across multiple threads conveniently with a
> `lazy_static`/`OnceCell`/`OnceLock`/whatever.
>
> The other work-around is to use the lower level search APIs on a
> `meta::Regex` in the `regex-automata` crate. Those APIs accept a
> `&mut Cache` explicitly. In that case, you can use the
> `thread_local` crate or even an actual `thread_local!` or something
> else entirely.

I wish I could say this PR was a home run that fixed the contention
issues with `Regex` once and for all, but it's not. It just makes
things a fair bit better by switching from one stack to eight stacks
for the pool, plus a couple other heuristics. The stack is chosen by
doing `self.stacks[thread_id % 8]`. It's a pretty dumb strategy, but
it limits extra memory usage while at least reducing contention.
Obviously, it works a lot better for the 8-16 thread case, and while
it helps with the 64-128 thread case too, things are still pretty slow
there.

A benchmark for this problem is described in #934. We compare 8 and 16
threads, and for each thread count, we compare a `cloned` and a
`shared` approach. The `cloned` approach clones the regex before
sending it to each thread, whereas the `shared` approach shares a
single regex across multiple threads. The `cloned` approach is
expected to be fast (and it is) because it forces each thread into the
owner optimization. The `shared` approach, however, hits the shared
stack behind a mutex and suffers majorly from contention.

Here's what that benchmark looks like before this PR for 64 threads
(on a 24-core CPU).
```
$ hyperfine \
    "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro" \
    "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./tmp/repro-master"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro
  Time (mean ± σ):       9.0 ms ±   0.6 ms    [User: 128.3 ms, System: 5.7 ms]
  Range (min … max):     7.7 ms …  11.1 ms    278 runs

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./tmp/repro-master
  Time (mean ± σ):      1.938 s ±  0.036 s    [User: 4.827 s, System: 41.401 s]
  Range (min … max):    1.885 s …  1.992 s    10 runs

Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro' ran
  215.02 ± 15.45 times faster than
  'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./tmp/repro-master'
```

And here's what it looks like after this PR:

```
$ hyperfine \
    "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro" \
    "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro
  Time (mean ± σ):       9.0 ms ±   0.6 ms    [User: 127.6 ms, System: 6.2 ms]
  Range (min … max):     7.9 ms …  11.7 ms    287 runs

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./target/release/repro
  Time (mean ± σ):      55.0 ms ±   5.1 ms    [User: 1050.4 ms, System: 12.0 ms]
  Range (min … max):    46.1 ms …  67.3 ms    57 runs

Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro' ran
  6.09 ± 0.71 times faster than
  'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./target/release/repro'
```

So instead of things getting over 215x slower in the 64 thread case,
it "only" gets 6x slower.

Closes #934
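The `thread_id % 8` sharding idea can be illustrated with a small stdlib-only sketch. This is an illustration of the strategy, not the actual implementation in `regex-automata` (which has additional heuristics and an owner fast path); the `ShardedPool` name and the counter-based thread id are my own.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

// Eight stacks instead of one: threads are spread across shards, so
// two threads only contend when they happen to map to the same shard.
const SHARDS: usize = 8;

pub struct ShardedPool<T> {
    stacks: [Mutex<Vec<T>>; SHARDS],
    create: fn() -> T,
}

// A simple per-thread id: each thread grabs the next counter value
// the first time it asks, then reuses it.
fn thread_id() -> usize {
    static COUNTER: AtomicUsize = AtomicUsize::new(0);
    thread_local! {
        static ID: usize = COUNTER.fetch_add(1, Ordering::Relaxed);
    }
    ID.with(|id| *id)
}

impl<T> ShardedPool<T> {
    pub fn new(create: fn() -> T) -> Self {
        Self {
            stacks: std::array::from_fn(|_| Mutex::new(Vec::new())),
            create,
        }
    }

    // Pop from this thread's shard, creating a fresh value when empty.
    pub fn get(&self) -> T {
        let shard = thread_id() % SHARDS;
        self.stacks[shard]
            .lock()
            .unwrap()
            .pop()
            .unwrap_or_else(|| (self.create)())
    }

    // Return a value to this thread's shard for reuse.
    pub fn put(&self, value: T) {
        let shard = thread_id() % SHARDS;
        self.stacks[shard].lock().unwrap().push(value);
    }
}
```

Note the trade-off the PR description mentions: a value returned on one thread can land in a shard that another thread never reads from, so this caps contention at the cost of potentially creating more values than a single global stack would.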
See #1080 for thoughts on why this doesn't seem to help.
I also experimented with another optimization here, suggested by @Shnatsel, where, instead of creating new pool values when the pool is under heavy contention, we clone from an existing pool value. The idea is that cloning should be cheaper overall, especially for the lazy DFA, because it avoids searches needing to re-compute the transition table. This optimization does complicate the implementation somewhat, but it's not terrible. It also requires a different benchmark than the one in #934, since that one uses just an empty regex and doesn't really benefit from having parts of the transition table pre-computed. So I played around with that idea. I tweaked the benchmark to use

I wrote this on Discord explaining why I think the optimization doesn't help:
In short, there's a feedback loop where making search times faster doesn't necessarily lead to faster times overall, because it increases contention. Even though there is a plausibly small improvement in one benchmark, I've also seen it result in slower times overall. Since the win isn't clear and since it complicates the implementation, I'm going to forgo this optimization until new evidence emerges that suggests it's a good idea. If anyone else wants to experiment, the optimization is implemented in the
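The clone-instead-of-create idea can be sketched as follows. This is a hypothetical, stdlib-only illustration (the `ClonePool` and `template` names are mine, and the real experiment in this PR differs in detail): when the stack is empty, clone a retained value, which may carry warm state such as a lazy DFA's partially filled transition table, rather than building a new value from scratch.

```rust
use std::sync::Mutex;

// Hypothetical pool that prefers cloning an existing value over
// creating a brand new one when its stack runs dry.
struct ClonePool<T: Clone> {
    stack: Mutex<Vec<T>>,
    template: Mutex<Option<T>>,
    create: fn() -> T,
}

impl<T: Clone> ClonePool<T> {
    fn new(create: fn() -> T) -> Self {
        Self {
            stack: Mutex::new(Vec::new()),
            template: Mutex::new(None),
            create,
        }
    }

    fn get(&self) -> T {
        if let Some(v) = self.stack.lock().unwrap().pop() {
            return v;
        }
        // Stack empty: clone the retained template if there is one,
        // otherwise fall back to creating a fresh value.
        match &*self.template.lock().unwrap() {
            Some(t) => t.clone(),
            None => (self.create)(),
        }
    }

    fn put(&self, value: T) {
        // Retain the first returned value as the template for clones.
        let mut template = self.template.lock().unwrap();
        if template.is_none() {
            *template = Some(value.clone());
        }
        self.stack.lock().unwrap().push(value);
    }
}
```

As the comment above explains, faster per-search times can actually increase contention on the pool itself, which is why the measured win here was unclear.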
This PR is on crates.io