sharing one regex across many threads can lead to big slowdowns due to mutex contention #934
Note: a prerequisite to understanding this comment is probably to read the comments in the code snippet above. I am unsure of how to fix this or whether it's even possible to. My guess is that in order to fix this, we will need a more sophisticated concurrent data structure for providing mutable scratch space while also employing some kind of compacting strategy. Alternatively, we could just document this as a known limitation.
Is exposing a lower-level API that allows one to pass in scratch space an option at all?
In case anyone has the same issue I did, here's the work-around I'm using:

```rust
struct TLSRegex {
    base: regex::bytes::Regex,
    local: thread_local::ThreadLocal<regex::bytes::Regex>,
}

impl TLSRegex {
    pub fn is_match(&self, path: &[u8]) -> bool {
        self.local.get_or(|| self.base.clone()).is_match(path)
    }
}
```

Basically: clone the base `Regex` once per thread, on first use. This is using the `thread_local` crate.
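For illustration, here is how such a wrapper might be used; the pattern, the `OnceLock` setup, and `matches_any` are hypothetical additions, not part of the original comment:

```rust
use std::sync::OnceLock;

static RE: OnceLock<TLSRegex> = OnceLock::new();

// Each worker thread that calls this lazily clones the base `Regex` on
// first use, so every subsequent search hits the fast "owner" path.
fn matches_any(paths: &[&[u8]]) -> usize {
    let re = RE.get_or_init(|| TLSRegex {
        base: regex::bytes::Regex::new(r"ab+c").unwrap(),
        local: thread_local::ThreadLocal::new(),
    });
    paths.iter().filter(|p| re.is_match(p)).count()
}
```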
It's technically an option, but not one I'd ever sign off on personally: the higher level API would essentially need to be duplicated to accept scratch space. But, yes, I do plan on exposing such an API! Just not in the `regex` crate proper. I'm also planning on exposing the underlying regex engines in `regex-automata`, whose lower level search APIs accept a `&mut Cache` explicitly. Although you will have to do some kind of synchronization if you're calling a regex from multiple threads and want to share the same mutable scratch space across multiple threads. And if you're not running regexes simultaneously then there won't be any contention and you won't be bitten by this issue in the first place.
Right.
Also note that the design space is larger than what I've implemented. RE2, for example, shares the same mutable scratch space across multiple threads simultaneously. But that means you now have atomics and mutexes and overhead inside the lazy DFA search code itself. It's not a path I want to go down personally.
I believe 1.4.3 is the last version that doesn't perform poorly in multithreaded scenarios?
I wouldn't say that. The current version of `regex` trades that multithreaded throughput for bounded memory usage: the old `thread_local`-based approach (used up through 1.4.3) could hold on to a cache for every thread that ever ran a search.
You're also ignoring the fact that you can clone the `Regex` and send a copy to each thread, which sidesteps the contention entirely.
Right. I apologize for that.
My application is exactly that: throwing lots of threads (rayon-style) at matching regexes. And I also agree that unavoidable memory leaks (in some cases) are far worse than work-aroundable performance degradation. Still, I think it's important to have a point of reference that one may compare to. I made a decision to downgrade and pin `regex` to 1.4.3.
Pinning to an old and unsupported version of `regex` isn't something I'd recommend, though: you give up all future fixes and improvements.
Maybe it could be nice to have a feature on the `regex` crate that restores the old `thread_local`-based caching behavior?
I'm pretty sure that this can never happen. It could conceivably be accomplished in two ways. The first is to add a new non-default feature that, when present, causes `regex` to switch to a `thread_local`-based pool. But Cargo features are supposed to be additive: they are unified across the whole dependency graph, so one dependency enabling the feature would silently change runtime behavior (and memory usage) for every other user of `regex` in the build. The second way is to add a new feature that merely exposes the alternative strategy as opt-in API, but that balloons the API surface for what is really an implementation detail. So this idea is a dead end as far as I can see. Even if one did find a way to do this with crate features, I do not like this path at all. I would need some pretty compelling testimony to go down this path, including exhausting alternatives.
The issue, AIUI, is not "uses a single regex from many threads simultaneously." My understanding is still that the unit of work has to be small, or else there shouldn't be an opportunity for so much contention.
Can we please not do this hyperbole here? This word is waaaaaay over-used. Its use isn't going to have a positive effect here.
Once #656 is done, the lower level APIs in `regex-automata` will provide a way around this. I'd also like to point out that nobody has tried to actually fix the performance problem. Literally zero attempts. I would love for an expert in concurrent data structures to sink their teeth into this problem. Once #656 is done, I can try to advertise this need more broadly. But I don't have the bandwidth to mentor it right now before #656.
Sorry, it wasn't my intention to use hyperbole. It's just that, with all due respect, the performance difference between "finishes in 10 minutes" and "doesn't finish in 24 hours" (which I have personally encountered due to this issue; only after I left my PC running overnight and discovered in the morning that my program hadn't finished did I attach a profiler, analyze it, and end up here in this issue) does qualify as pretty much broken in my opinion. If a program is so slow that it can't give me an answer in a reasonable time then it's just as if it couldn't give it to me at all, i.e. just as if it was broken. Especially since this is a fairly subtle trap that's relatively easy to trigger (especially with e.g. `rayon`).
@koute It would just be better to focus on what's actually happening instead of just saying "broken." Pretty much everyone has a reason why they use the word "broken," but the word "broken" doesn't actually help anyone understand anything. I would personally love to get a repro of your particular issue. "Finishing in minutes" versus "finishing in days" is indeed a very big difference and I would love to have a reproduction for that so that I (or others) have a real use case in hand to optimize for. I agree it's a trap that one can fall into, but again, as I've said, my understanding is that "just using rayon" isn't enough to provoke this issue. But maybe my mental model needs updating. Without reproductions, that's hard.
Fair enough. Apologies. I should have immediately explained myself.
AFAIK from what you've described, your mental model of the issue is mostly correct, or at least it matches what I've seen. Basically: run on a hugely multicore machine, saturate every hardware thread (I have 64 hardware threads) and do a lot of (small) regex matches. I can't really share the code of my exact case, but I actually might have a reasonable real world test case for you, as it just happens that recently during the weekend I encountered this issue in one of the crates in the wild and put up a PR optimizing it:
```rust
use rayon::prelude::*;

fn main() {
    let data = include_str!("dracula.txt");
    let det = lingua::LanguageDetectorBuilder::from_all_languages().build();
    let xs: Vec<_> = std::iter::repeat(data).take(100000).collect();
    xs.into_par_iter().for_each(|s| {
        let len = s.chars().map(|ch| ch.len_utf8()).take(2048).sum::<usize>();
        std::hint::black_box(det.detect_language_of(&s[..len]));
    });
}
```
And just for reference, here are the numbers when running on a single thread (basically just replace `into_par_iter` with `into_iter` in the snippet above):
So when using `rayon` across all hardware threads, it ends up dramatically slower than the single-threaded run. Of course these are the results on my machine (Threadripper 3970x, 256GB of 2667 MT/s RAM); your mileage may vary. And here's the commit which got rid of the shared `Regex` bottleneck. You could argue that maybe it's not an appropriate use of the library.
That's lovely, thank you! And no, it's definitely a valid use. I'd love for this case to be fast out of the box. Maybe there is a simpler solution that I haven't thought of.
Maybe the following simple fine-grained locking scheme would be a good enough compromise in practice?
The current fast path for the creator thread could be used as-is in this scheme, and the allocation of the stacks could be delayed until the first slow path execution happens. Using only the first N stacks increases the cache reuse when there are only a few concurrent users, even if the threads are short-lived, and allows the scheme to adjust itself as the contention grows.
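The numbered steps of the original proposal aren't preserved above, but a minimal sketch of the general shape — several independently locked stacks, shard selection by thread ID, and creating a fresh value instead of waiting on a contended lock — might look like this (all names and the shard-selection policy are illustrative):

```rust
use std::sync::Mutex;

// A pool of caches split across independently locked stacks. Contention
// is limited to threads that happen to collide on the same shard.
struct ShardedPool<T> {
    shards: Vec<Mutex<Vec<T>>>,
    create: fn() -> T,
}

impl<T> ShardedPool<T> {
    fn new(nshards: usize, create: fn() -> T) -> Self {
        let shards = (0..nshards).map(|_| Mutex::new(Vec::new())).collect();
        ShardedPool { shards, create }
    }

    fn get(&self, thread_id: usize) -> T {
        let shard = &self.shards[thread_id % self.shards.len()];
        // Pop a cached value if the lock is free; otherwise create a
        // fresh one rather than waiting on a contended lock.
        match shard.try_lock() {
            Ok(mut stack) => stack.pop().unwrap_or_else(|| (self.create)()),
            Err(_) => (self.create)(),
        }
    }

    fn put(&self, thread_id: usize, value: T) {
        // If the shard is contended, simply drop the value; it will be
        // re-created later. This bounds memory at the cost of
        // occasionally repeating allocation work.
        if let Ok(mut stack) = self.shards[thread_id % self.shards.len()].try_lock() {
            stack.push(value);
        }
    }
}
```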
I'd love to try (4) first just on its own. But it turns out the ABA problem is hard to avoid. I don't quite grok your step (3), and specifically, its relationship to (1). Do we need to do (1)? I'm also not sure if your strategy avoids the main downside of the `thread_local` approach: memory usage that scales with the total number of threads that ever run a search rather than the number running simultaneously.
Step 1 is needed to explicitly know the number of threads simultaneously using a Regex, so that we can "scale up".
I keep seeing this pop up in various contexts. I'm going to take the conn here and try to work on the idea proposed above. I've seen way too many folks unfortunately contorting themselves to work around this.
From the test case in the OP, here's where things stand in the status quo:
If I replace the `Mutex<Vec<Cache>>` with a lock-free stack from crossbeam, here's where things end up:
Which is a big improvement, but there's still a huge cliff. Going to continue with @himikof's idea above. (I tried crossbeam just as a way to get a point of reference using something that is lock-free. I'm not going to add a dependency on crossbeam for regex.)
I tried to get a sense of what the "best" case could be here. The test above uses 16 threads, so I used a pool of 16 caches, one per thread:
So this ends up being a significant improvement, but it's still nearly 10 times slower than the "owner" path. It's an order of magnitude improvement, but still seems disheartening to be honest.
@himikof (or anyone else) - Is a simplistic hash function sufficient here? Specifically:
It seems like I can just do `thread_id % N` to pick a stack.
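As a sketch of where such a thread ID could come from: stable Rust doesn't expose `ThreadId` as an integer, so one common workaround is a thread-local counter. This is hypothetical illustration code, not the crate's actual implementation:

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicUsize, Ordering};

// Hand out a small, dense integer ID the first time a thread asks.
static NEXT_ID: AtomicUsize = AtomicUsize::new(0);

thread_local! {
    static THREAD_ID: Cell<Option<usize>> = Cell::new(None);
}

fn thread_id() -> usize {
    THREAD_ID.with(|id| match id.get() {
        Some(n) => n,
        None => {
            let n = NEXT_ID.fetch_add(1, Ordering::Relaxed);
            id.set(Some(n));
            n
        }
    })
}

// With a power-of-two shard count, the modulo reduces to a bit mask.
fn shard_index(nshards: usize) -> usize {
    debug_assert!(nshards.is_power_of_two());
    thread_id() & (nshards - 1)
}
```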
> **Context:** A `Regex` uses internal mutable space (called a `Cache`) while executing a search. Since a `Regex` really wants to be easily shared across multiple threads simultaneously, it follows that a `Regex` either needs to provide search functions that accept a `&mut Cache` (thereby pushing synchronization to a problem for the caller to solve) or it needs to do synchronization itself. While there are lower level APIs in `regex-automata` that do the former, they are less convenient. The higher level APIs, especially in the `regex` crate proper, need to do some kind of synchronization to give a search the mutable `Cache` that it needs.
>
> The current approach to that synchronization essentially uses a `Mutex<Vec<Cache>>` with an optimization for the "owning" thread that lets it bypass the `Mutex`. The owning thread optimization makes it so the single threaded use case essentially doesn't pay for any synchronization overhead, and that all works fine. But once the `Regex` is shared across multiple threads, that `Mutex<Vec<Cache>>` gets hit. And if you're doing a lot of regex searches on short haystacks in parallel, that `Mutex` comes under extremely heavy contention. To the point that a program can slow down by enormous amounts.
>
> This PR attempts to address that problem.
>
> Note that it's worth pointing out that this issue can be worked around.
>
> The simplest work-around is to clone a `Regex` and send it to other threads instead of sharing a single `Regex`. This won't use any additional memory (a `Regex` is reference counted internally), but it will force each thread to use the "owner" optimization described above. This does mean, for example, that you can't share a `Regex` across multiple threads conveniently with a `lazy_static`/`OnceCell`/`OnceLock`/whatever.
>
> The other work-around is to use the lower level search APIs on a `meta::Regex` in the `regex-automata` crate. Those APIs accept a `&mut Cache` explicitly. In that case, you can use the `thread_local` crate or even an actual `thread_local!` or something else entirely.

I wish I could say this PR was a home run that fixed the contention issues with `Regex` once and for all, but it's not. It just makes things a little better by switching from one stack to eight stacks for the pool. The stack is chosen by doing `self.stacks[thread_id % 8]`. It's a pretty dumb strategy, but it limits extra memory usage while at least reducing contention. Obviously, it works a lot better for the 8-16 thread case, and while it helps with the 64-128 thread case too, things are still pretty slow there.

A benchmark for this problem is described in #934. We compare 8 and 16 threads, and for each thread count, we compare a `cloned` and `shared` approach. The `cloned` approach clones the regex before sending it to each thread, whereas the `shared` approach shares a single regex across multiple threads. The `cloned` approach is expected to be fast (and it is) because it forces each thread into the owner optimization. The `shared` approach, however, hits the shared stack behind a mutex and suffers majorly from contention.

Here's what that benchmark looks like before this PR:

```
$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=8 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=8 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=8 ./target/release/repro
  Time (mean ± σ):       2.3 ms ±   0.4 ms    [User: 9.4 ms, System: 3.1 ms]
  Range (min … max):     1.8 ms …   3.5 ms    823 runs

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=8 ./target/release/repro
  Time (mean ± σ):     161.6 ms ±   8.0 ms    [User: 472.4 ms, System: 477.5 ms]
  Range (min … max):   150.7 ms … 176.8 ms    18 runs

Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=8 ./target/release/repro' ran
   70.06 ± 11.43 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=8 ./target/release/repro'

$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro
  Time (mean ± σ):       3.5 ms ±   0.5 ms    [User: 26.1 ms, System: 5.2 ms]
  Range (min … max):     2.8 ms …   5.7 ms    576 runs

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro
  Time (mean ± σ):     433.9 ms ±   7.2 ms    [User: 1402.1 ms, System: 4377.1 ms]
  Range (min … max):   423.9 ms … 444.4 ms    10 runs

Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro' ran
  122.25 ± 15.80 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro'
```

And here's what it looks like after this PR:

```
$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=8 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=8 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=8 ./target/release/repro
  Time (mean ± σ):       2.2 ms ±   0.4 ms    [User: 8.5 ms, System: 3.7 ms]
  Range (min … max):     1.7 ms …   3.4 ms    781 runs

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=8 ./target/release/repro
  Time (mean ± σ):      24.6 ms ±   1.8 ms    [User: 141.0 ms, System: 1.2 ms]
  Range (min … max):    20.8 ms …  27.3 ms    116 runs

Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=8 ./target/release/repro' ran
   10.94 ± 2.05 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=8 ./target/release/repro'

$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro
  Time (mean ± σ):       3.6 ms ±   0.4 ms    [User: 26.8 ms, System: 4.4 ms]
  Range (min … max):     2.8 ms …   5.4 ms    574 runs

  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro
  Time (mean ± σ):      99.4 ms ±   5.4 ms    [User: 935.0 ms, System: 133.0 ms]
  Range (min … max):    85.6 ms … 109.9 ms    27 runs

Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro' ran
   27.95 ± 3.48 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro'
```

So instead of things getting over 123x slower in the 16 thread case, it "only" gets 28x slower.

Other ideas for future work:

* Instead of a `Vec<Mutex<Vec<Cache>>>`, use a `Vec<LockFreeStack<Cache>>`. I'm not sure this will fully resolve the problem, but it's likely to make it better I think. AFAIK, the main technical challenge here is coming up with a lock-free stack in the first place that avoids the ABA problem. Crossbeam in theory provides some primitives to help with this (epochs), but I don't want to add any new dependencies.
* Think up a completely different approach to the problem. I'm drawing a blank. (The `thread_local` crate is one such avenue, and the regex crate actually used to use `thread_local` for exactly this. But it led to huge memory usage in environments with lots of threads. Specifically, I believe its memory usage scales with the total number of threads that run a regex search, whereas I want memory usage to scale with the total number of threads *simultaneously* running a regex search.)

Ref #934. If folks have insights or opinions, I'd appreciate if they shared them in #934 instead of this PR. :-) Thank you!
Did you test using a spinlock for the stack, and not the mutex? It seems that operations on the stack should be very fast (pop/push is basically incrementing/decrementing an integer and swapping a pointer), so the spinlock should not wait that long.
Spinlocks are basically impossible to do well in user space. See: https://matklad.github.io/2020/01/02/spinlocks-considered-harmful.html — the non-std version of the `Pool` already has to resort to something like a spin lock, and that's bad enough.
What if instead of a spinlock, we make the CPU search for an unused cache? Basically your implementation, but after lookup, if "our" cache is busy, we try to acquire other caches as well? And if none are free, we fall back to the slow impl.
And also we could put the vec under an rwlock for reallocation, and grow it when all caches are used, so it would scale with the number of parallel executions, but not with the core count.
I'm not sure I see how to make either of those ideas work. I don't think your description fully accounts for everything here, but I'm not sure. Do you want to experiment with those ideas? The benchmark I'm using is detailed in the OP of this issue.
For example, things like "we try to acquire other caches as well" and "when all caches are used" seem like essential pieces of your ideas. I'm not sure how to deal with those efficiently.
That is very cool. I hadn't heard of restartable sequences before. How do you cheaply get the current CPU number? That seems like the tricky bit.
Hmmmm, right! So instead of keeping a bunch of stacks around, I just keep one value per slot. If I stick with thread IDs, I pick the slot via thread ID and CAS out the pointer. If it's null, then I create a new cache. This sounds very plausible. I will experiment with this!
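A sketch of that slot design (names illustrative, not the crate's actual code); using an unconditional `swap` rather than a compare-and-swap loop sidesteps ABA entirely, since a taken slot simply reads as null:

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

// One cache per slot. `get` swaps the pointer out; null means the slot
// was empty (or another thread got there first), so we build a fresh
// cache instead of blocking.
struct Slots<T> {
    slots: Vec<AtomicPtr<T>>,
    create: fn() -> T,
}

impl<T> Slots<T> {
    fn get(&self, i: usize) -> Box<T> {
        let p = self.slots[i].swap(ptr::null_mut(), Ordering::AcqRel);
        if p.is_null() {
            Box::new((self.create)())
        } else {
            // SAFETY: non-null slot pointers only ever come from
            // `Box::into_raw` in `put` below.
            unsafe { Box::from_raw(p) }
        }
    }

    fn put(&self, i: usize, value: Box<T>) {
        let old = self.slots[i].swap(Box::into_raw(value), Ordering::AcqRel);
        if !old.is_null() {
            // Another thread refilled the slot in the meantime; drop the
            // extra cache rather than leak it.
            drop(unsafe { Box::from_raw(old) });
        }
    }
}
```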
I've misinterpreted the current implementation, but after some time I could get something faster on my 6-core Ryzen 5 3600X remote machine (sadly, I don't have a CPU with many threads right now). A janky implementation (it 100% needs cleaning) is here: ag/betta-pool...leviska:regex:ag/betta-pool. For the stack I've fallen back a little toward a simpler scheme.
If I'm not mistaken, then from (4) we should always have an allocated prefix and an unallocated suffix, so a search from the start will encounter allocated caches first.
The naive implementation starting from 0 wasn't fast, and my assumption is that the first N caches are probably always in use, so I start from a "random" id (just the thread id). But this can ruin our 4th invariant, so we first (fast path) try to get an already-allocated cache from anywhere, and if we fail (slow path), we try to find the first unused cache, allocating it if necessary. My benchmark results are something like this, but I'll agree that this needs better testing.

TL;DR: Baseline:

New:
It's platform-dependent. If you're using Linux rseq, the rseq struct where you stash your pc offsets also contains a kernel-updated cache of the CPU number. Otherwise, you call getcpu(2), which the kernel guarantees is implemented as efficiently as possible (on most platforms, it just hits the vDSO and never does a context switch). The thing to be aware of is that in any non-rseq setting, the CPU number you get back is not guaranteed to mean anything beyond "this thread executed on this core some time in the past". In practice, it should be fine for a key into a table as long as the table elements are mutated atomically.
Correct. Note that if you get a lot of threads hammering the same slot, they will effectively be allocating their own caches all of the time, which is probably unavoidable. Ideally you should try to minimize collisions, so that the threads=nprocs case is still fast and anything hammering the regex harder than that should consider copying the regex object. The main benefit to using the CPU number instead in this case is that there is some pretense of that being somewhat more uniformly-distributed. What's nice about this is that if you can detect at runtime (e.g. by using dlsym to see if a particular glibc symbol exists) whether you support restartable sequences, you can branch to either using the CASing version or the rseq version.
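For reference, a hedged sketch of keying by CPU number on Linux; this assumes a dependency on the `libc` crate and uses `sched_getcpu(3)`, and the result is only a load-spreading hint, never a correctness guarantee:

```rust
// Best-effort current-CPU index for picking a slot. The value can be
// stale by the time it's used; that's fine for spreading load, but the
// slot itself must still be mutated atomically.
#[cfg(target_os = "linux")]
fn current_cpu_hint() -> usize {
    // sched_getcpu(3) returns -1 on error; fall back to slot 0.
    let cpu = unsafe { libc::sched_getcpu() };
    if cpu < 0 { 0 } else { cpu as usize }
}

#[cfg(not(target_os = "linux"))]
fn current_cpu_hint() -> usize {
    0 // No cheap portable equivalent; fall back to thread IDs instead.
}
```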
@leviska Thanks for taking a crack at it! So there are two things that immediately jump out to me in your patch. First is that the mutex for a stack now gets held until the value is returned, which could be quite some time. It's unclear what kind of impact that could have, but it probably is not great. The second is that it looks like you've effectively introduced a spin-lock via a loop without bound that keeps trying until it finds a free cache.
@mcy I think restartable sequences are probably a bit too much for me to stomach at the moment, but I'll make a note of them in the comments as possible future work. I'm hoping to be able to make an improvement here without going down the super-platform-specific route. The idea of using the core ID makes a lot of sense to me. I totally buy your argument. Something platform-specific on Linux using getcpu(2) might be worth exploring later.
I don't have a mutex for the stack at all... Do you mean the individual cache mutex? If so, I don't understand why holding it for a long time can be bad.
Well, yes and kinda no. The spinlock here is more about safety guarantees, and should be triggered very rarely, if ever, so it should not affect performance: if you make the stack array very big (let's say infinitely big), then there will always be free space to allocate, so the inner loop would never actually spin. While making it infinitely big is impossible, in the current implementation the array is actually unbounded, so the length is kinda infinite; but in my implementation we should preallocate, instead of allocating on demand (I'm talking about the vec of pointers, not the actual caches). And we probably can assume a nice upper bound for this array, something like a small multiple of the core count. I think it's a balance between allocating too much and blocking: if a person starts 1000 threads on an 8-CPU machine and they all call regex, do we want to allocate 1000 caches or block until some threads finish?
Yes. Because, as far as I can tell, it means that while one thread is running a search, any other thread that hashes to the same cache is blocked for the entire duration of that search. You also have the problem that a single long-running search can hold up everything behind it.
Sorry to ask this, but did you read @matklad's blog that I linked? It isn't just about performance, it's about deadlocking and other perverse behaviors. See also this discussion where spin locks are talked about as well. (If you want to further discuss spin locks, I suggest perhaps taking it to urlo or something. I don't want to get into a spinlock debate here. My current position is basically, "never use them unless there is literally no other choice." That doesn't apply in this situation, so I'm not going to use them.) |
It is not for the faint of heart. If you ever do get interested, you can hmu on twitter (@/drawsmiguel, priv rn but feel free to request) or just poke junyer and ask for Miguel.
You pay a dynamic cross-module call usually. If you want to bound the number of indices, I suggest picking a power of two. If you're feeling spicy you can do a two-level radix tree (of, say, eight entries each level) where the outer level is allocated eagerly and the inner levels are allocated on-demand and initialized racily (as I do in my post). Also junyer randomly shared https://www.usenix.org/system/files/atc19-dice.pdf which feels like it has relevant ideas. I haven't read it fully, but a global table of per-CPU slots seems closely related.
Does this sound better? ag/betta-pool...leviska:regex:leviska/pool — the benchmarks are +- the same as with my previous implementation; on 64 threads (with a 6-core/12-thread CPU) both implementations behave about the same.
As I said, the spinlocks were just a hack to guarantee that some value will be found. I didn't mean that they are good; I just explained why, for demonstrating the idea, they were the fastest thing to write. But as I previously said, there are at least two ways to remove them, and I've implemented the first: just falling back to the old implementation.
Because we actually never block on mutexes, we can replace them with a much simpler non-blocking mutex based on a single atomic, and this mutex and its guard are `Sync`. I've implemented the basic idea; there's probably a better implementation in the open, I just wanted to test.
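One plausible shape for that "non-blocking mutex based on a single atomic" (a sketch, not the actual patch): a try-only lock that either acquires immediately or fails, so callers fall back to another cache instead of spinning or parking:

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicBool, Ordering};

struct TryLock<T> {
    locked: AtomicBool,
    value: UnsafeCell<T>,
}

// SAFETY: the `locked` flag ensures at most one thread touches `value`.
unsafe impl<T: Send> Sync for TryLock<T> {}

impl<T> TryLock<T> {
    fn new(value: T) -> Self {
        TryLock { locked: AtomicBool::new(false), value: UnsafeCell::new(value) }
    }

    // Either acquires immediately or fails; never blocks or spins.
    fn try_lock(&self) -> Option<TryLockGuard<'_, T>> {
        if self.locked.swap(true, Ordering::Acquire) {
            None
        } else {
            Some(TryLockGuard { lock: self })
        }
    }
}

struct TryLockGuard<'a, T> {
    lock: &'a TryLock<T>,
}

impl<T> std::ops::Deref for TryLockGuard<'_, T> {
    type Target = T;
    fn deref(&self) -> &T {
        unsafe { &*self.lock.value.get() }
    }
}

impl<T> std::ops::DerefMut for TryLockGuard<'_, T> {
    fn deref_mut(&mut self) -> &mut T {
        unsafe { &mut *self.lock.value.get() }
    }
}

impl<T> Drop for TryLockGuard<'_, T> {
    fn drop(&mut self) {
        self.lock.locked.store(false, Ordering::Release);
    }
}
```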
I didn't get that; each search is independent of the others, and yes, if all the caches are used, no other executors could run. In the old implementation I had a blocking prototype for handling the out-of-caches situation, because it wasn't really relevant to the idea. This implementation fixes that.
@leviska Oh I see, I didn't realize the spin lock was a stand-in. I'll take a closer look at your update later. Thank you!
I've tried to investigate the "lock-free stack" approach, using a 128-bit compare-exchange (pointer plus counter) to avoid the ABA problem:

```
# cosmicexplorer@lightning-strike-teravolts: ~/tools/regex/tmp-benchmark 16:20:45
; hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro
  Time (mean ± σ):       5.7 ms ±   0.5 ms    [User: 65.2 ms, System: 2.0 ms]
  Range (min … max):     4.6 ms …   8.3 ms    506 runs

  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro
  Time (mean ± σ):     516.6 ms ±  16.3 ms    [User: 7298.5 ms, System: 5.6 ms]
  Range (min … max):   483.7 ms … 532.4 ms    10 runs

Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro' ran
   90.81 ± 8.96 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro'
```
I used the 128-bit compare-exchange to pair the stack head pointer with a version counter. I'm going to try packing bits into the pointer like crossbeam does, so that a 64-bit atomic suffices. EDIT: It's also possible I've used too-restrictive memory orderings.
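For the curious, the "pack bits into the pointer" trick works because highly aligned nodes have zero low bits, which can hold an ABA generation tag while the whole (pointer, tag) pair still fits in one 64-bit CAS. A sketch of just that ingredient (not a full lock-free stack; the 512-byte alignment matches the `Node` shown later in this thread):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// With 512-byte-aligned nodes, the low 9 bits of every node address are
// zero, so a small generation tag can live there.
const TAG_BITS: u32 = 9;
const TAG_MASK: u64 = (1 << TAG_BITS) - 1;

fn pack(ptr: u64, tag: u64) -> u64 {
    debug_assert_eq!(ptr & TAG_MASK, 0, "node must be 512-byte aligned");
    ptr | (tag & TAG_MASK)
}

fn unpack(word: u64) -> (u64, u64) {
    (word & !TAG_MASK, word & TAG_MASK)
}

// One CAS attempt of a pop: bump the tag on success so that a node freed
// and reallocated at the same address can't masquerade as "unchanged".
fn try_pop_step(head: &AtomicU64, observed: u64, next_ptr: u64) -> bool {
    let (_, tag) = unpack(observed);
    let new = pack(next_ptr, tag.wrapping_add(1) & TAG_MASK);
    head.compare_exchange(observed, new, Ordering::AcqRel, Ordering::Acquire)
        .is_ok()
}
```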
(Quoting the same context as the PR description above.)

I wish I could say this PR was a home run that fixed the contention issues with `Regex` once and for all, but it's not. It just makes things a fair bit better by switching from one stack to eight stacks for the pool, plus a couple other heuristics. The stack is chosen by doing `self.stacks[thread_id % 8]`. It's a pretty dumb strategy, but it limits extra memory usage while at least reducing contention. Obviously, it works a lot better for the 8-16 thread case, and while it helps with the 64-128 thread case too, things are still pretty slow there.

A benchmark for this problem is described in #934. We compare 8 and 16 threads, and for each thread count, we compare a `cloned` and `shared` approach. The `cloned` approach clones the regex before sending it to each thread, whereas the `shared` approach shares a single regex across multiple threads. The `cloned` approach is expected to be fast (and it is) because it forces each thread into the owner optimization. The `shared` approach, however, hits the shared stack behind a mutex and suffers majorly from contention.

Here's what that benchmark looks like before this PR for 64 threads (on a 24-core CPU):

```
$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./tmp/repro-master"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro
  Time (mean ± σ):       9.0 ms ±   0.6 ms    [User: 128.3 ms, System: 5.7 ms]
  Range (min … max):     7.7 ms …  11.1 ms    278 runs

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./tmp/repro-master
  Time (mean ± σ):      1.938 s ±  0.036 s    [User: 4.827 s, System: 41.401 s]
  Range (min … max):    1.885 s …  1.992 s    10 runs

Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro' ran
  215.02 ± 15.45 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./tmp/repro-master'
```

And here's what it looks like after this PR:

```
$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro
  Time (mean ± σ):       9.0 ms ±   0.6 ms    [User: 127.6 ms, System: 6.2 ms]
  Range (min … max):     7.9 ms …  11.7 ms    287 runs

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./target/release/repro
  Time (mean ± σ):      55.0 ms ±   5.1 ms    [User: 1050.4 ms, System: 12.0 ms]
  Range (min … max):    46.1 ms …  67.3 ms    57 runs

Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro' ran
    6.09 ± 0.71 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./target/release/repro'
```

So instead of things getting over 215x slower in the 64 thread case, it "only" gets 6x slower.

Closes #934
OK, so it turns out that false sharing appears to be the primary culprit here. Using multiple stacks and ensuring each mutex is in its own cache line leads to considerable improvement. For example, if I try @leviska's approach as-is, then I get:
But if I add a cache-line-aligned wrapper (e.g. `#[repr(align(64))]`) around each mutex so that each one sits in its own cache line, then I get:
But it turns out I'm able to do as well (even a little better) by sticking with a simple mutex strategy combined with creating new values if there's contention (i.e., if `try_lock` fails):
(I also tried experimenting with an optimization idea from @Shnatsel where instead of creating a fresh `Cache`, we clone an existing one that has already been "warmed up" by previous searches.)
Yeah I have no doubt that it isn't so complex if you use dependencies to either do it for you or give you access to the requisite primitives to do so (i.e., 128-bit atomics in this case). I really cannot personally justify taking a dependency for this, so it would mean re-rolling a lot of that soup. Not impossible, and can be done, but annoying. And yeah, interesting how much slower it is. Did you use a single stack shared across all threads? Or 8 sharded stacks? If the former, my guess based on my run-in with false sharing is exactly that...
Yeah I explicitly decided long ago not to go down this route. The lazy DFA isn't the only thing that needs a mutable cache. And IIRC, RE2's lazy DFA is not lock free. I'm pretty sure there are mutexes inside of there. The downside of RE2's approach is that your lazy DFA becomes a lot more complicated IMO. As do the mutable scratch types for other engines (such as the PikeVM). And I don't actually know whether RE2 suffers from contention issues or not. Possibly not. Anyway, given what's in #1080 now (just updated), we've gone from 215x slower for 64 threads:
To 6x slower:
That's good enough of a win for me. I just hope it doesn't introduce other unforeseen problems. Unfortunately, it can be somewhat difficult to observe them. (For example, excessive memory usage.)
I understand the strategy of creating new DFAs will add a lot of variance depending on the exact regex and input used. The performance of the lazy DFA depends heavily on how warmed up its cache already is.
If the atomic is usually only read, not written or compare-exchanged, then AFAIK contention should not be an issue. Did you compare-exchange it every time? The case of the slot being empty should be uncommon, so it is probably faster to do a read first, and only compare-exchange if that finds that the slot is empty. That's 2 operations in the uncommon case, but read-only access in the common one. Also, you could only try to stash it once in every N iterations, which both gives the DFA more time to mature and reduces contention. Finally, it could also be suffering from false sharing. Try wrapping it in a cache-line-aligned wrapper (e.g. `#[repr(align(64))]`). (None of these three items are mutually exclusive; they can and probably should be applied together.)
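A sketch of the read-before-CAS stash described above (illustrative only): the hot path is a plain load, and the compare-exchange — with its cache-line invalidation — only happens when the slot looks empty:

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

// Try to publish a "warmed up" cache for other threads to clone from.
// Hands the box back to the caller if the slot is already occupied or
// if we lose the race.
fn maybe_stash<T>(slot: &AtomicPtr<T>, warmed: Box<T>) -> Result<(), Box<T>> {
    // Common case: slot occupied. A plain read generates no write traffic.
    if !slot.load(Ordering::Relaxed).is_null() {
        return Err(warmed);
    }
    let p = Box::into_raw(warmed);
    match slot.compare_exchange(ptr::null_mut(), p, Ordering::Release, Ordering::Relaxed) {
        Ok(_) => Ok(()),
        // Lost the race: reclaim the box and hand it back.
        Err(_) => Err(unsafe { Box::from_raw(p) }),
    }
}
```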
Yes I did all that. Didn't cmpxchg every time. It's not my preferred explanation. I think it's just that cloning an existing lazy DFA is not as big of a win as I thought. Actually, of course, the benchmark I'm using is not stressing the lazy DFA's cache. So I'll need to tweak the benchmark and re-measure. Blech.
Yeah, but the shared atomic is shared across all threads whereas the stacks are keyed by thread ID. I guess we could shard the caches that we copy from too? Like I said, I'm not a fan of the explanation that false sharing was impacting the cloning optimization. It's just something to consider and test.
That's nice. I just want to add that in my implementation we 100% could use less strict orderings (I've just used `SeqCst` everywhere, because I'm not that good at them) and probably gain some more perf, especially if false sharing makes such a big difference.
It's worth experimenting with, but after re-measuring @Shnatsel's cloning optimization with a better benchmark, I'm probably going to move on. The contention problem isn't fully solved, but it's in a much more palatable state. But I definitely encourage others to keep working on this. I'd be happy to merge simpler patches. (My general prior is that tricky concurrent code has to pull a lot of weight to justify its maintenance burden.)
A single stack shared across all threads!
Thanks so much for the clear feedback!
I do not have enough experience with atomics to expect sizable perf improvements from this approach given the success we've just had here, but someone else who's more familiar with this sort of thing might be able to take the harness I created (see #934 (comment)) and improve it.
Hmmm, just a note: I think false sharing may indeed be relevant for the lock-free stack approach. As I increase the alignment of individual linked list entries in the lock-free stack (each of which is allocated in a `Box`), things improve:

```rust
use std::mem;

/* Align to 512 to decrease false sharing. */
#[repr(C, align(512))]
struct Node<T> {
    pub value: mem::ManuallyDrop<T>,
    pub next: *mut Node<T>,
}
```

We go from 90x -> 80x slower, or 500ms to 470ms (see #934 (comment) to compare to prior results):

```
# cosmicexplorer@lightning-strike-teravolts: ~/tools/regex/tmp-benchmark 18:46:30
; hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro
  Time (mean ± σ):       5.9 ms ±   0.5 ms    [User: 68.3 ms, System: 2.0 ms]
  Range (min … max):     4.7 ms …   8.3 ms    418 runs

  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro
  Time (mean ± σ):     471.5 ms ±  13.2 ms    [User: 6521.3 ms, System: 52.6 ms]
  Range (min … max):   449.8 ms … 487.3 ms    10 runs

Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro' ran
   79.69 ± 6.75 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro'
```

I'm having trouble with the bit twiddling necessary to use 64-bit atomics like crossbeam, and am probably going to give up, but if the above behavior looks suspicious to anyone, please feel free to follow up.
I'm hopeful that this is now largely fixed in `regex 1.9`.
To reproduce, create a `Cargo.toml` and, in the same directory, a `main.rs` containing the benchmark code, then build and run the benchmark.

As noted in the comments in the code above, the only difference between these two benchmarks is that `cloned` creates a fresh `Regex` for each thread whereas `shared` uses the same `Regex` across multiple threads. This leads to contention on a single `Mutex` in the latter case and overall very poor performance. (See the code comments above for more info.)

Comparing `perf` profiles of the `cloned` and `shared` benchmarks confirms this: in the `shared` benchmark, virtually all of the time is being spent locking and unlocking the mutex.