
automata: reduce regex contention considerably #1080

Merged 1 commit on Sep 2, 2023

Conversation

BurntSushi (Member) commented Aug 30, 2023

Context: A `Regex` uses internal mutable space (called a `Cache`)
while executing a search. Since a `Regex` really wants to be easily
shared across multiple threads simultaneously, it follows that a
`Regex` either needs to provide search functions that accept a
`&mut Cache` (thereby making synchronization a problem for the caller
to solve) or it needs to do synchronization itself. While there are
lower level APIs in `regex-automata` that do the former, they are
less convenient. The higher level APIs, especially in the `regex`
crate proper, need to do some kind of synchronization to give a
search the mutable `Cache` that it needs.

The current approach to that synchronization essentially uses a
`Mutex<Vec<Cache>>` with an optimization for the "owning" thread
that lets it bypass the `Mutex`. The owning thread optimization
makes it so the single threaded use case essentially doesn't pay for
any synchronization overhead, and that all works fine. But once the
`Regex` is shared across multiple threads, that `Mutex<Vec<Cache>>`
gets hit. And if you're doing a lot of regex searches on short
haystacks in parallel, that `Mutex` comes under extremely heavy
contention. To the point that a program can slow down by enormous
amounts.
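The shape of that approach can be sketched in plain Rust. This is a
simplified model, not the actual implementation: the real pool layers the
owner-thread fast path on top, and the `Pool` type here is hypothetical.

```rust
use std::sync::Mutex;

// A minimal sketch of a Mutex<Vec<T>>-style pool: `get` pops a cached
// value (or creates a fresh one if the pool is empty) and `put` returns
// a value so some later search can reuse it.
struct Pool<T> {
    values: Mutex<Vec<T>>,
    create: fn() -> T,
}

impl<T> Pool<T> {
    fn new(create: fn() -> T) -> Self {
        Pool { values: Mutex::new(Vec::new()), create }
    }

    fn get(&self) -> T {
        // Every caller contends on this one lock, which is exactly the
        // problem this PR is about.
        self.values.lock().unwrap().pop().unwrap_or_else(|| (self.create)())
    }

    fn put(&self, value: T) {
        self.values.lock().unwrap().push(value);
    }
}

fn main() {
    let pool: Pool<Vec<u8>> = Pool::new(Vec::new);
    let v = pool.get(); // pool is empty, so this creates a fresh value
    pool.put(v);        // returned for reuse
    let _v = pool.get(); // pops the same value back off the stack
    assert_eq!(pool.values.lock().unwrap().len(), 0);
    println!("ok");
}
```

Under one thread this is cheap; under many threads doing short searches,
nearly all time goes to fighting over that single `Mutex`.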

This PR attempts to address that problem.

It's worth pointing out that this issue can be worked around.

The simplest work-around is to clone a `Regex` and send it to other
threads instead of sharing a single `Regex`. This won't use any
additional memory (a `Regex` is reference counted internally),
but it will force each thread to use the "owner" optimization
described above. This does mean, for example, that you can't
share a `Regex` across multiple threads conveniently with a
`lazy_static`/`OnceCell`/`OnceLock`/whatever.

The other work-around is to use the lower level search APIs on a
`meta::Regex` in the `regex-automata` crate. Those APIs accept a
`&mut Cache` explicitly. In that case, you can use the `thread_local`
crate or even an actual `thread_local!` or something else entirely.
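The `thread_local!` pattern looks roughly like this. `SearchCache` and
`search_with` below are hypothetical stand-ins for `regex-automata`'s
`Cache` and its lower level `&mut Cache` search APIs; the point is only
the shape of the pattern, where each thread owns its own cache and no
locking happens at all.

```rust
use std::cell::RefCell;

// Hypothetical stand-in for regex-automata's `Cache` type.
#[derive(Default)]
struct SearchCache {
    scratch: Vec<u8>,
}

thread_local! {
    // Each thread lazily creates and exclusively owns its own cache.
    static CACHE: RefCell<SearchCache> = RefCell::new(SearchCache::default());
}

// A search routine in the style of the lower level APIs that take a
// `&mut Cache` explicitly (again, hypothetical, for illustration).
fn search_with(cache: &mut SearchCache, haystack: &str) -> bool {
    cache.scratch.clear();
    cache.scratch.extend_from_slice(haystack.as_bytes());
    !cache.scratch.is_empty()
}

// The convenient wrapper: grab this thread's cache, then search.
fn search(haystack: &str) -> bool {
    CACHE.with(|c| search_with(&mut c.borrow_mut(), haystack))
}

fn main() {
    assert!(search("abc"));
    assert!(!search(""));
    println!("ok");
}
```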

I wish I could say this PR was a home run that fixed the contention
issues with `Regex` once and for all, but it's not. It just makes
things a fair bit better by switching from one stack to eight stacks
for the pool, plus a couple other heuristics. The stack is chosen
by doing `self.stacks[thread_id % 8]`. It's a pretty dumb strategy,
but it limits extra memory usage while at least reducing contention.
Obviously, it works a lot better for the 8-16 thread case, and while
it helps with the 64-128 thread case too, things are still pretty slow
there.
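The sharding idea can be sketched like so. This is a simplified model:
the atomic counter here is a hand-rolled stand-in for the crate's
internal thread id assignment, and the real code adds further
heuristics on top.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

const MAX_POOL_STACKS: usize = 8;

// Hand out small incrementing ids, one per thread, so that threads
// spread roughly evenly across the shards.
static NEXT_THREAD_ID: AtomicUsize = AtomicUsize::new(0);

thread_local! {
    static THREAD_ID: usize = NEXT_THREAD_ID.fetch_add(1, Ordering::Relaxed);
}

// Instead of one Mutex<Vec<T>>, keep eight, and route each thread to
// stacks[thread_id % 8]. Threads mapping to different shards no longer
// contend with each other at all.
struct ShardedPool<T> {
    stacks: Vec<Mutex<Vec<T>>>,
    create: fn() -> T,
}

impl<T> ShardedPool<T> {
    fn new(create: fn() -> T) -> Self {
        let stacks = (0..MAX_POOL_STACKS)
            .map(|_| Mutex::new(Vec::new()))
            .collect();
        ShardedPool { stacks, create }
    }

    fn shard(&self) -> usize {
        THREAD_ID.with(|id| *id) % MAX_POOL_STACKS
    }

    fn get(&self) -> T {
        let i = self.shard();
        self.stacks[i].lock().unwrap().pop().unwrap_or_else(|| (self.create)())
    }

    fn put(&self, value: T) {
        let i = self.shard();
        self.stacks[i].lock().unwrap().push(value);
    }
}

fn main() {
    let pool: ShardedPool<Vec<u8>> = ShardedPool::new(Vec::new);
    let v = pool.get();
    pool.put(v);
    // This thread always maps to the same shard, so its value is reused.
    let _v = pool.get();
    println!("ok");
}
```

The trade-off is visible in the modulus: with more than eight active
threads, distinct threads start colliding on the same shard again,
which matches the observation that 64-128 threads remain slow.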

A benchmark for this problem is described in #934. We compare 8 and 16
threads, and for each thread count, we compare a `cloned` and `shared`
approach. The `cloned` approach clones the regex before sending it to
each thread, whereas the `shared` approach shares a single regex across
multiple threads. The `cloned` approach is expected to be fast (and
it is) because it forces each thread into the owner optimization. The
`shared` approach, however, hits the shared stack behind a mutex and
suffers badly from contention.

Here's what that benchmark looks like before this PR for 64 threads (on a
24-core CPU).

```
$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./tmp/repro-master"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro
  Time (mean ± σ):       9.0 ms ±   0.6 ms    [User: 128.3 ms, System: 5.7 ms]
  Range (min … max):     7.7 ms …  11.1 ms    278 runs

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./tmp/repro-master
  Time (mean ± σ):      1.938 s ±  0.036 s    [User: 4.827 s, System: 41.401 s]
  Range (min … max):    1.885 s …  1.992 s    10 runs

Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro' ran
  215.02 ± 15.45 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./tmp/repro-master'
```

And here's what it looks like after this PR:

```
$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro
  Time (mean ± σ):       9.0 ms ±   0.6 ms    [User: 127.6 ms, System: 6.2 ms]
  Range (min … max):     7.9 ms …  11.7 ms    287 runs

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./target/release/repro
  Time (mean ± σ):      55.0 ms ±   5.1 ms    [User: 1050.4 ms, System: 12.0 ms]
  Range (min … max):    46.1 ms …  67.3 ms    57 runs

Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro' ran
    6.09 ± 0.71 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./target/release/repro'
```

So instead of the shared approach being over 215x slower in the 64
thread case, it's "only" 6x slower.

Closes #934

film42 commented Aug 31, 2023

One question I posed on Twitter was: what is the best case scenario here when all threads share the same `Pool` (which will be much worse than the `cloned` case)? I hadn't noticed that the thread_id assignment is made by this library as an incrementing number, which means the benchmark where 8 threads run in parallel is the best case. And if you scale up to 16 threads with `MAX_POOL_STACKS=16`, you get similar performance to 8 threads. At 32 it begins to slow down further, and I suspect that's CPU contention starting to build up on the `Pool`. This was run on a 16c/32t Threadripper CPU.

Max pool stack @ 8. Including here as a reference for the benchmark running on my machine. THREADS=16 is expected to be slower (matching your PR comment), and it is.

```
Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=8 ./target/release/examples/repro
  Time (mean ± σ):      41.4 ms ±   6.7 ms    [User: 206.4 ms, System: 1.7 ms]
  Range (min … max):    30.2 ms …  57.4 ms    65 runs

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/examples/repro
  Time (mean ± σ):     132.9 ms ±  10.5 ms    [User: 956.1 ms, System: 210.2 ms]
  Range (min … max):   113.0 ms … 147.9 ms    24 runs
```

Max pool stack @ 16. This matches the case where the max pool stack is 8 and you run with 8 threads. That's good.

```
Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/examples/repro
  Time (mean ± σ):      34.2 ms ±   4.8 ms    [User: 314.5 ms, System: 3.1 ms]
  Range (min … max):    26.9 ms …  43.0 ms    77 runs
```

Max pool stack @ 32. This is consistently ~2x slower than Max pool stack @ 16.

```
Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=32 ./target/release/examples/repro
  Time (mean ± σ):      81.9 ms ±   6.7 ms    [User: 1667.0 ms, System: 8.9 ms]
  Range (min … max):    66.4 ms …  94.4 ms    33 runs
```

So I think this strategy of sharing a pool at 8 threads is a great first step and should be included. If someone has a clever idea for growing the stack when concurrent callers exceed the pool size in a way that isn't slow to manage, that would be worth pursuing.

Even so, for multi-core apps that need "as fast as possible" throughput but also depend on regex, `cloned` is still 2-5x faster. It would be cool if there were a clippy lint that could warn you about this.

BurntSushi changed the title from "automata: reduce regex contention somewhat" to "automata: reduce regex contention considerably" on Sep 1, 2023.

BurntSushi added a commit that referenced this pull request on Sep 2, 2023: "See #1080 for thoughts on why this doesn't seem to help."
BurntSushi (Member, Author) commented:
I also experimented with another optimization here suggested by @Shnatsel where, instead of creating new pool values when it's under heavy contention, we clone from an existing pool value. The idea here is that cloning should be cheaper overall especially when it comes to the lazy DFA because it will avoid searches needing to re-compute the transition table. This optimization does complicate the implementation somewhat, but it's not terrible. This optimization also requires a different benchmark than the one in #934, since the one in #934 is just an empty regex and doesn't really benefit from having parts of the transition table pre-computed.

So I played around with that idea. I tweaked the benchmark to use \pL{10}[A-Z] as a pattern and πρςστυφχψωZ as a haystack (among other sizes). I chose that pattern and haystack because it should cause the lazy DFA to compute some new transition for each position in the haystack, thus making cache reuse as "valuable" as possible. With the cloning optimization, I was able to observe some very small improvement, but timings usually seemed to come out in a wash.
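The experimental strategy can be sketched as follows. This is a
simplified model of the idea, not the code in the actual branch: the
`warm` slot and `ClonePool` name are inventions for illustration, and
`Vec<u8>` stands in for a lazy DFA cache whose contents are expensive
to recompute from scratch.

```rust
use std::sync::Mutex;

// When the pool runs dry under contention, clone a previously returned
// "warm" value instead of creating a cold one, so a new consumer starts
// with the already-computed state (e.g. lazy DFA transitions).
struct ClonePool<T: Clone> {
    values: Mutex<Vec<T>>,
    warm: Mutex<Option<T>>,
    create: fn() -> T,
}

impl<T: Clone> ClonePool<T> {
    fn new(create: fn() -> T) -> Self {
        ClonePool {
            values: Mutex::new(Vec::new()),
            warm: Mutex::new(None),
            create,
        }
    }

    fn get(&self) -> T {
        if let Some(v) = self.values.lock().unwrap().pop() {
            return v;
        }
        // Pool is empty, i.e. demand exceeds supply: prefer cloning a
        // warm value over creating a cold one.
        match &*self.warm.lock().unwrap() {
            Some(v) => v.clone(),
            None => (self.create)(),
        }
    }

    fn put(&self, value: T) {
        let mut warm = self.warm.lock().unwrap();
        if warm.is_none() {
            *warm = Some(value.clone());
        }
        drop(warm);
        self.values.lock().unwrap().push(value);
    }
}

fn main() {
    let pool: ClonePool<Vec<u8>> = ClonePool::new(Vec::new);
    let mut v = pool.get();
    v.push(42); // pretend this is a computed transition
    pool.put(v);
    let _a = pool.get(); // pops the stored value
    let b = pool.get();  // pool empty again: clones the warm value
    assert_eq!(b, vec![42]);
    println!("ok");
}
```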

I wrote this on Discord explaining why I think the optimization doesn't help:

> @Shnatsel my theory after trying some more inputs and thinking about
> it a bit more is that there is a feedback loop. when contention is
> low, cloning from an existing cache doesn't help because, well, the
> cloning optimization only really kicks in when contention is high.
> one thing that limits contention is longer search times. so as you
> grow the size of the regex and/or haystack, search times increase
> and contention drops because more time is being spent in the regex
> engine. therefore, it seems likely that the cloning optimization is
> only going to have a chance of helping when the search time is very
> short. but when the search time is very short, there's less of an
> opportunity (i think) for the cache reuse to actually help.
>
> i have confirmed, via profiling, that without the cloning
> optimization, the search itself is spending a bit more time building
> the transition table. and that with the cloning optimization,
> that part of the profile essentially disappears. so the cloning
> optimization is behaving as intended, but it's just not helping as
> much as i think we hoped it would. namely, even if it makes the
> search times faster, that in turn is going to increase contention and
> apparently negate the gains.

In short, there's a feedback loop where making search times faster doesn't necessarily lead to overall faster times because it increases contention.

Even though there is a plausibly small improvement in one benchmark, I've also seen it result in slower times overall. Since the win isn't clear and since it complicates the implementation, I'm going to forgo this optimization until new evidence emerges that suggests it's a good idea.

If anyone else wants to experiment, the optimization is implemented in the ag/betta-pool-clone-optimization branch. I probably won't keep that branch around forever, but I'll leave it up for now.

@BurntSushi BurntSushi merged commit f578d74 into master Sep 2, 2023
16 checks passed
@BurntSushi BurntSushi deleted the ag/betta-pool branch September 2, 2023 14:42
@BurntSushi BurntSushi restored the ag/betta-pool branch September 2, 2023 15:13
@BurntSushi BurntSushi deleted the ag/betta-pool branch September 2, 2023 15:14
BurntSushi (Member, Author) commented:
This PR is on crates.io regex 1.9.5 and regex-automata 0.3.8.

Successfully merging this pull request may close these issues.

sharing one regex across many threads can lead to big slowdowns due to mutex contention