sharing one regex across many threads can lead to big slowdowns due to mutex contention #934

Closed
BurntSushi opened this issue Dec 14, 2022 · 50 comments · Fixed by #1080

@BurntSushi (Member)

To reproduce, create a Cargo.toml with this:

[package]
name = "regex-contention-repro-work"
version = "0.1.0"
edition = "2021"

[[bin]]
name = "repro"
path = "main.rs"

[dependencies]
anyhow = "1.0.66"
regex = "1.7.0"

And in the same directory, create a main.rs containing:

use std::{
    io::Write,
    sync::Arc,
    time::{Duration, Instant},
};

use regex::{Regex, RegexBuilder};

const ITERS: usize = 100_000;
const PATTERN: &str = "";
const HAYSTACK: &str = "ZQZQZQZQ";

#[derive(Debug)]
struct Benchmark {
    re: Regex,
    threads: u32,
}

impl Benchmark {
    fn cloned(&self) -> anyhow::Result<Duration> {
        let start = Instant::now();
        let mut handles = vec![];
        for _ in 0..self.threads {
            // When we clone the regex like this, it does NOT make a complete
            // copy of all of its internal state, but it does create an entirely
            // fresh pool from which to get mutable scratch space for each
            // search. Basically, a 'Regex' internally looks like this:
            //
            //   struct Regex {
            //     // Among other things, this contains the literal
            //     // prefilters and the Thompson VM bytecode
            //     // instructions.
            //     read_only: Arc<ReadOnly>,
            //     // Contains space used by the regex matcher
            //     // during search time. e.g., The DFA transition
            //     // table for the lazy DFA or the set of active
            //     // threads for the Thompson NFA simulation.
            //     pool: Pool<ScratchSpace>,
            //   }
            //
            // That is, a regex already internally uses reference counting,
            // so cloning it does not create an entirely separate copy of the
            // data. It's effectively free. However, cloning it does create
            // an entirely fresh 'Pool'. It specifically does not reuse pools
            // across cloned regexes, and it does this specifically so that
            // callers have a path that permits them to opt out of contention
            // on the pool.
            //
            // Namely, when a fresh pool is created, it activates a special
            // optimization for the first thread that accesses the pool.
            // That thread gets access to a special value ONLY accessible
            // to that thread, whereas all other threads accessing the pool
            // get sent through the "slow" path via a mutex. When a lot of
            // threads share the same regex **with the same pool**, this
            // mutex comes under very heavy contention.
            //
            // It is worth pointing out that the mutex is NOT held for the
            // duration of the search. Effectively what happens is:
            //
            //   is "first" thread optimization active?
            //   NO: mutex lock
            //       pop pointer out of the pool
            //       mutex unlock
            //   do a search
            //   is "first" thread optimization active?
            //   NO: mutex lock
            //       push pointer back into pool
            //       mutex unlock
            //
            // So in the case where "do a search" is extremely fast, i.e., when
            // the haystack is tiny, as in this case, the mutex contention ends
            // up dominating the runtime. As the number of threads increases,
            // the contention gets worse and worse and thus runtime blows up.
            //
            // But, all of that contention can be avoided by giving each thread
            // a fresh regex and thus each one gets its own pool and each
            // thread gets the "first" thread optimization applied. So the
            // internal access for the mutable scratch space now looks like
            // this:
            //
            //   is "first" thread optimization active?
            //   YES: return pointer to special mutable scratch space
            //   do a search
            //   is "first" thread optimization active?
            //   YES: do nothing
            //
            // So how to fix this? Well, it's kind of hard. The regex crate
            // used to use the 'thread_local' crate that optimized this
            // particular access pattern and essentially kept a hash table
            // keyed on thread ID. But this led to other issues. Specifically,
            // its memory usage scaled with the number of active threads using
            // a regex, whereas the current approach scales with the number of
            // active threads *simultaneously* using a regex.
            //
            // I am not an expert on concurrent data structures though, so
            // there is likely a better approach. But the idea here is indeed
            // to make it possible to opt out of contention by being able to
            // clone the regex. Once you do that, there are **zero** competing
            // resources between the threads.
            //
            // Why not just do this in all cases? Well, I guess I would if I
            // could, but I don't know how. The reason why explicit cloning
            // permits one to opt out is that each thread is handed its own
            // copy of the regex and its own pool, and that is specifically
            // controlled by the caller. I'm not sure how to do that from
            // within the regex library itself, since it isn't really aware of
            // threads per se.
            let re = self.re.clone();
            handles.push(std::thread::spawn(move || {
                let mut matched = 0;
                for _ in 0..ITERS {
                    if re.is_match(HAYSTACK) {
                        matched += 1;
                    }
                }
                matched
            }));
        }
        let mut matched = 0;
        for h in handles {
            matched += h.join().unwrap();
        }
        assert!(matched > 0);
        Ok(Instant::now().duration_since(start))
    }

    fn shared(&self) -> anyhow::Result<Duration> {
        let start = Instant::now();
        let mut handles = vec![];
        // We clone the regex into an Arc but then share it across all threads.
        // Each thread in turn competes with the single regex's shared memory
        // pool for mutable scratch space to use during a search. This is what
        // ultimately caused this 'shared' benchmark to be much slower than the
        // 'cloned' benchmark when run with many threads. Indeed, profiling it
        // reveals that most of the time is spent in the regex internal 'Pool'
        // type's 'get' and 'get_slow' methods.
        let re = Arc::new(self.re.clone());
        for _ in 0..self.threads {
            let re = Arc::clone(&re);
            handles.push(std::thread::spawn(move || {
                let mut matched = 0;
                for _ in 0..ITERS {
                    if re.is_match(HAYSTACK) {
                        matched += 1;
                    }
                }
                matched
            }));
        }
        let mut matched = 0;
        for h in handles {
            matched += h.join().unwrap();
        }
        assert!(matched > 0);
        Ok(Instant::now().duration_since(start))
    }
}

fn main() -> anyhow::Result<()> {
    let threads: u32 = std::env::var("REGEX_BENCH_THREADS")?.parse()?;
    let re = RegexBuilder::new(PATTERN)
        .unicode(false)
        .dfa_size_limit(50 * (1 << 20))
        .build()?;
    let benchmark = Benchmark { re, threads };
    let which = std::env::var("REGEX_BENCH_WHICH")?;
    let duration = match &*which {
        "cloned" => benchmark.cloned(),
        "shared" => benchmark.shared(),
        unknown => anyhow::bail!("unrecognized REGEX_BENCH_WHICH={}", unknown),
    };
    writeln!(std::io::stdout(), "{:?}", duration)?;
    Ok(())
}

Now build and run the benchmark:

$ cargo build --release 
$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro
  Time (mean ± σ):       6.5 ms ±   1.2 ms    [User: 55.1 ms, System: 3.8 ms]
  Range (min … max):     0.1 ms …  10.5 ms    254 runs
 
  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
 
Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro
  Time (mean ± σ):     530.5 ms ±  12.6 ms    [User: 1886.3 ms, System: 4994.7 ms]
  Range (min … max):   514.2 ms … 552.4 ms    10 runs
 
Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro' ran
   81.66 ± 15.52 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro'

As noted in the comments in the code above, the only difference between these two benchmarks is that cloned creates a fresh Regex for each thread whereas shared uses the same Regex across multiple threads. This leads to contention on a single Mutex in the latter case and overall very poor performance. (See the code comments above for more info.)

We can confirm this by looking at a profile. Here is a screenshot from perf for the cloned benchmark:

[perf screenshot: cloned benchmark]

And now for the shared benchmark:

[perf screenshot: shared benchmark]

As we can see, in the shared benchmark, virtually all of the time is being spent locking and unlocking the mutex.

@BurntSushi BurntSushi added the bug label Dec 14, 2022
@BurntSushi (Member Author)

Note: A prerequisite to understanding this comment is probably to read the comments in the code snippet above.

I am unsure of how to fix this or whether it's even possible to. The regex crate used to use thread_local for this before going back to a simpler mutex scheme, but the way thread_local works led to big memory "leaks" in certain environments. Namely, thread_local's memory usage, AIUI, scales with the number of regexes used across multiple threads while the regex crate's current implementation only scales with the number of regexes used simultaneously across multiple threads. So in the former case, if you have A LOT of threads running and using regexes, even if they aren't all being used simultaneously, then you wind up with exorbitant memory usage. But in the latter case, only the number of threads executing simultaneous regex searches matters for how much memory is used.

My guess is that in order to fix this, we will need a more sophisticated concurrent data structure for providing mutable scratch space while also employing some kind of compacting strategy.

Alternatively, we could mark this as wontfix and simply ask folks to clone their regexes before sending them to separate threads. A Regex already uses reference counting internally, so this won't copy all of the read-only data inside a Regex. But it does effectively put you back into the same memory usage profile as thread_local: the amount of mutable scratch space you'll end up using is proportional to the total number of threads using a regex.
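For illustration, the clone-before-send pattern looks something like this (a minimal sketch):

```rust
use regex::Regex;

fn main() {
    let re = Regex::new(r"\w+").unwrap();
    let handles: Vec<_> = (0..8)
        .map(|_| {
            // Cheap: the read-only parts are behind an Arc, but this
            // clone gets a fresh Pool, so each thread takes the "owner"
            // fast path described above.
            let re = re.clone();
            std::thread::spawn(move || re.is_match("hello world"))
        })
        .collect();
    for h in handles {
        assert!(h.join().unwrap());
    }
}
```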

@osiewicz

Is exposing a lower-level API that allows one to pass in scratch space an option at all?
My use case is that I have a regex that might be called from multiple threads, though only one thread at a time should be evaluating a given regex. I could store the scratch space externally and then pass it in, not having to go through internal synchronization in the process.

@Alphare commented Dec 14, 2022

In case anyone has the same issue I did of using regex in a very parallel context with rayon (which AFAIK has no way of letting you control the initialization of a thread, which makes sense given its magic threadpool), here's the simple workaround that removes the contention that regex itself is responsible for:

struct TLSRegex {
    base: regex::bytes::Regex,
    local: thread_local::ThreadLocal<regex::bytes::Regex>,
}

impl TLSRegex {
    pub fn is_match(&self, path: &[u8]) -> bool {
        self.local.get_or(|| self.base.clone()).is_match(path)
    }
}

Basically: clone the Regex for every thread lazily on first access. As detailed above, this reuses all of the "read-only" compilation of the regex, only creating new scratch space for the DFA for each thread, so the overhead is actually low (negligible depending on your use-case? I haven't seen it show up in profiling).

This is using the thread_local crate mentioned above, which comes with the warnings laid out by @BurntSushi above about memory usage, by design.
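For reference, constructing and using one of these might look like the following (a hypothetical constructor for the type above; assumes `regex` and `thread_local` as dependencies):

```rust
// Hypothetical helper for the TLSRegex struct shown above.
fn tls_regex(pattern: &str) -> TLSRegex {
    TLSRegex {
        base: regex::bytes::Regex::new(pattern).unwrap(),
        local: thread_local::ThreadLocal::new(),
    }
}

fn main() {
    let re = tls_regex(r"foo");
    // Safe to call from many rayon tasks; each thread lazily clones
    // `base` exactly once on first use.
    assert!(re.is_match(b"foobar"));
}
```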

@BurntSushi (Member Author) commented Dec 14, 2022

Is exposing lower-level API that allows one to pass in scratch space an option at all?

It's technically an option, but not one I'd ever sign off on personally. The regex API is essentially duplicated for &str and &[u8]. Duplicating each of those again just to accept scratch space to work around a problem that one can already work around is not worth it.

But, yes, I do plan on exposing such an API! Just not in regex, but rather, in regex-automata. You can see what it might look like here: https://burntsushi.net/stuff/tmp-do-not-link-me/regex-automata/regex_automata/meta/struct.Regex.html#method.try_find (Excuse the lack of docs. I'm actively working on that component. You'll find more docs on the lower level regex engines. For example: https://burntsushi.net/stuff/tmp-do-not-link-me/regex-automata/regex_automata/hybrid/regex/struct.Regex.html#method.find)

I'm also planning on exposing the underlying Pool type in regex-automata too: https://burntsushi.net/stuff/tmp-do-not-link-me/regex-automata/regex_automata/util/pool/struct.Pool.html

Although you will have to do some kind of synchronization if you're calling a regex from multiple threads and want to share the same mutable scratch space across multiple threads. And if you're not running regexes simultaneously then there won't be any contention and you won't be bitten by this issue in the first place.

The regex-automata plans are tracked by #656.
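As a rough sketch of what that lower level calling convention looks like (using names from the docs linked above; the released API may differ slightly):

```rust
use regex_automata::{meta::Regex, Input};

fn main() {
    let re = Regex::new(r"[a-z]+").unwrap();
    // The caller owns the mutable scratch space and hands it to each
    // search explicitly, so the library does no synchronization at all.
    let mut cache = re.create_cache();
    let m = re.search_with(&mut cache, &Input::new("xyz"));
    assert!(m.is_some());
}
```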

only creating new scratch space for the DFA for each thread, so the overhead is actually low (negligible depending on your use-case? I haven't seen it show up in profiling).

Right. The thread_local crate will give you speed, as long as you can deal with its memory profile. (AFAIK, its memory profile is only an issue when you have lots of threads. If you keep your thread pool limited to some reasonable number then I don't think you can run into any problems. Problems only tended to appear in managed environments, e.g., using the regex crate from a Go program.)

@BurntSushi (Member Author)

Also note that the design space is larger than what I've implemented. RE2, for example, shares the same mutable scratch space across multiple threads simultaneously. But that means you now have atomics and mutexes and overhead inside the lazy DFA search code itself. It's not a path I want to go down personally.

@WGH- commented May 6, 2023

I believe 1.4.3 is the last version that doesn't perform poorly in multithreaded scenarios?

@BurntSushi (Member Author)

I wouldn't say that. The current version of regex doesn't "perform poorly in multi-threaded scenarios." That's far too broad of a statement.

regex 1.4.3 is the last release to have an optional dependency on thread_local. In some cases it can perform better in highly contended scenarios where you're searching short haystacks. But it can also result in exorbitant memory usage in some cases, which was why that strategy was removed.

You're also ignoring the fact that you can clone the Regex when sending it to another thread. This won't use any extra memory and it eliminates the contention problem.

@WGH- commented May 6, 2023

The current version of regex doesn't "perform poorly in multi-threaded scenarios." That's far too broad of a statement.

Right. I apologize for that.

In some cases it can perform better in highly contended scenarios where you're searching short haystacks. But it can also result in exorbitant memory usage in some cases, which was why that strategy was removed.

My application is exactly that: throwing lots of threads (rayon-style) at matching regexes.

And I also agree that unavoidable memory leaks (in some cases) are far worse than a work-aroundable performance degradation.

Still, I think it's important to have a point of reference that one may compare to. I decided to downgrade and pin the regex version for the time being (I initially noticed this when I cargo update'd everything), and maybe other people will want to as well.

@BurntSushi (Member Author)

Pinning to an old and unsupported version of regex doesn't seem like a good idea, and certainly not one that should be promoted or even suggested.

@koute (Member) commented May 8, 2023

Alternatively, we could mark this as wontfix and simply ask folks to clone their regexes before sending them to separate threads.

Maybe it could be nice to have a feature on regex which would control whether Regexes are Sync? Otherwise this issue is somewhat of a trap, and it's only going to get worse as more people get hugely multicore machines. On my machine (32 cores/64 threads) the performance degradation is so bad that I'd probably go as far as calling it broken. It'd be nice to be able to statically make sure that the scratch space is never implicitly shared.

@BurntSushi (Member Author) commented May 8, 2023

Maybe it could be nice to have a feature on regex which would control whether Regexes are Sync?

I'm pretty sure that this can never happen. It can be accomplished in two ways. The first is to add a new non-default feature that, when present, causes Regex to not implement Sync. Imagine a case where a library foo enables this feature. Then anyone depending on it and using regex elsewhere will also get that feature enabled. And that can in turn introduce a build failure that will be impossible to work around without fixing foo.

The second way is to add a new feature sync that is enabled by default, that when turned off will result in Regex not implementing Sync. The problem there is that regex = { version = "1.8", default-features = false, features = ["std"] } is a supported configuration today. But with this new feature, folks using that configuration could very well end up getting a build failure because they don't have sync in the feature list.

So this idea is a dead end as far as I can see. Even if one did find a way to do this with crate features, I do not like this path at all. I would need some pretty compelling testimony to go down this path, including exhausting alternatives.
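To make the breakage concrete: a ubiquitous pattern like the one below only compiles because Regex is Sync, since a static requires its value to be Sync.

```rust
use std::sync::OnceLock;

use regex::Regex;

// This only compiles because `Regex: Sync`. If a feature toggled
// anywhere in the dependency graph removed that impl, every crate
// doing this would stop building.
static RE: OnceLock<Regex> = OnceLock::new();

fn re() -> &'static Regex {
    RE.get_or_init(|| Regex::new(r"\w+").unwrap())
}

fn main() {
    assert!(re().is_match("abc"));
}
```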

Otherwise this issue is somewhat of a trap and it's only going to get worse as more people get hugely mulicore machines.

The issue, AIUI, is not "uses a single regex from many threads simultaneously." My understanding is still that the unit of work has to be small, or else there shouldn't be an opportunity for so much contention.

broken

Can we please not do this hyperbole here? This word is waaaaaay over-used. Its use isn't going to have a positive effect here.

It'd be nice to be able to statically make sure that the scratch space is never implicitly shared.

Once #656 is done and regex-automata 0.3 is out, you should be able to do this with a meta::Regex from regex-automata. See the docs here: https://burntsushi.net/stuff/tmp-do-not-link-me/regex/regex_automata/meta/struct.Regex.html#synchronization-and-cloning

I'd also like to point out that nobody has tried to actually fix the performance problem. Literally zero attempts. I would love for an expert in concurrent data structures to sink their teeth into this problem. Once #656 is done, I can try to advertise this need more broadly. But I don't have the bandwidth to mentor it right now before #656.

@koute (Member) commented May 8, 2023

Can we please not do this hyperbole here? This word is waaaaaay over-used. Its use isn't going to have a positive effect here.

Sorry, it wasn't my intention to use hyperbole. It's just that, with all due respect, the performance difference between "finishes in 10 minutes" and "doesn't finish in 24 hours" (which I have personally encountered due to this issue; only once I left my PC running overnight and discovered in the morning that my program hadn't finished did I attach a profiler, analyze it, and end up here in this issue) does qualify as pretty much broken in my opinion. If a program is so slow that it can't give me an answer in a reasonable time then it's just as if it couldn't give it to me at all, i.e. just as if it were broken. Especially since this is a fairly subtle trap that's relatively easy to trigger (especially with e.g. rayon-like concurrency) and relatively tricky to diagnose (it's not immediately obvious that any locks are involved, and you need to use a profiler to see what's going on). So this was not an attempt at hyperbole; from my perspective it literally broke my program because it couldn't give me an answer in a reasonable time. (: Sorry if it came out that way.

@BurntSushi (Member Author)

@koute It would just be better to focus on what's actually happening instead of just saying "broken." Pretty much everyone has a reason why they use the word "broken," but the word "broken" doesn't actually help anyone understand anything.

I would personally love to get a repro of your particular issue. "finishing in minutes" versus "finishing in days" is indeed a very big difference and I would love to have a reproduction for that so that myself (or others) have a way of getting a real use case on in their hands to optimize for.

I agree it's a trap that one can fall into, but again, as I've said, my understanding is that "just using rayon" isn't enough to provoke this issue. But maybe my mental model needs updating. Without reproductions, that's hard.

@koute (Member) commented May 8, 2023

It would just be better to focus on what's actually happening instead of just saying "broken." Pretty much everyone has a reason why they use the word "broken," but the word "broken" doesn't actually help anyone understand anything.

Fair enough. Apologies. I should have immediately explained myself.

I agree it's a trap that one can fall into, but again and I've said now, my understanding is that "just using rayon" isn't enough to provoke this issue. But maybe my mental model needs updating. Without reproductions, that's hard.

AFAIK, from what you've described, your mental model of the issue is mostly correct, or at least it matches what I've seen. Basically: run on a hugely multicore machine, saturate every hardware thread (I have 64 hardware threads), and do a lot of (small) regex matches.

I can't really share the code of my exact case, but I actually might have a reasonable real world test case for you, as it just happens that I recently encountered this issue over the weekend in one of the crates in the wild and put up a PR optimizing it.

  1. Set up a reproduction and download some test data:
$ cargo new --bin repro
$ cd repro
$ cargo add lingua --git https://github.com/koute/lingua-rs.git --rev 92f8d5d1e078cfafcbc60d52b6f6821a65ded422
$ cargo add rayon
$ curl "https://www.gutenberg.org/cache/epub/345/pg345.txt" -o src/dracula.txt
  2. Paste this in src/main.rs:
use rayon::prelude::*;

fn main() {
    let data = include_str!("dracula.txt");
    let det = lingua::LanguageDetectorBuilder::from_all_languages().build();
    let xs: Vec<_> = std::iter::repeat(data).take(100000).collect();
    xs.into_par_iter().for_each(|s| {
        let len = s.chars().map(|ch| ch.len_utf8()).take(2048).sum::<usize>();
        std::hint::black_box(det.detect_language_of(&s[..len]));
    });
}
  3. Run it:
$ cargo build --release && time cargo run --release

real	2m0.792s
user	23m49.315s
sys	96m4.691s
  4. Now replace the dependency with one which doesn't use regex for the hot path (so there's no lock contention happening):
$ cargo remove lingua
$ cargo add lingua --git https://github.com/koute/lingua-rs.git --rev 304cb5a113ddbb968fab2ef45e41b686e42e5aa8
  5. Rerun it:
$ cargo build --release && time cargo run --release

real	0m5.737s
user	5m23.120s
sys	0m0.247s

And just for reference, here are the numbers when running on a single thread (basically just replace into_par_iter with into_iter):

# Before my optimizations (uses regex):
real	3m39.142s
user	3m39.041s
sys	0m0.100s

# After my optimizations (doesn't use regex):
real	3m35.342s
user	3m35.204s
sys	0m0.043s

So when using regex, jumping from 1 thread to 64 threads only made the program 1m39s faster (the runtime is still ~54% of the single-threaded run) while wasting over an hour and a half of CPU time. When not using regex, jumping from 1 thread to 64 threads made the program 3m30s faster (~2% of the single-threaded runtime) and wasted less than two minutes of CPU time. When running single-threaded, we can see that my changes didn't affect the non-contended case much.

Of course these are the results on my machine (Threadripper 3970x, 256GB of 2667 MT/s RAM); your mileage may vary.

And here's the commit which got rid of regex usage so that you can see for what exactly that crate is using regex: koute/lingua-rs@304cb5a

You could argue that maybe it's not an appropriate use of regex, but nevertheless this is what I've encountered in the wild.

@BurntSushi (Member Author)

That's lovely, thank you!

And no, it's definitely a valid use. I'd love for Regex to "just work" from multiple threads simultaneously in all/most cases. Basically, we need something with the speed of the thread_local crate and the memory usage properties of the current implementation. So, e.g., that might mean a compacting thread_local or something. I dunno.

Maybe there is a simpler solution that I haven't thought of.

@himikof commented May 17, 2023

Maybe the following simple fine-grained locking scheme would be a good enough compromise in practice?

  1. Track the current and maximum number of concurrent users for a given Regex using a pair of u32 in a single AtomicU64. This seems about the same overhead as cloning+dropping a single Arc on a Regex execution. An approximate number using only relaxed operations should be enough.
  2. Replace the stack: Mutex<Vec<Box<T>>> with multiple stacks: either a statically-sized array of 32-64 entries, or a growable "concurrent vector" similar to the one currently used in the thread_local implementation.
  3. Uniformly hash the thread ids into the first N stacks, where N is a function of the logarithm of the maximum number of concurrent users. Maybe using prime numbers for a better distribution. Each thread will only access a single "cache slot" (until the number N grows, but that would be infrequent), and the contention for a single lock will be greatly reduced.
  4. Optionally the Mutex-protected stacks could be replaced by a lock-free stack implementation. Not sure if the complexity would be worth it, though.

The current fast path for the creator thread could be used as-is in this scheme, and the allocation of the stacks could be delayed until the first slow path execution happens.

Using only the first N stacks increases the cache reuse when there are only a few concurrent users, even if the threads are short-lived, and allows the scheme to adjust itself as the contention grows.
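A rough sketch of the sharding skeleton described above (a toy with hypothetical names; it ignores the owner fast path, the concurrent-user counter from step 1, and lazy allocation):

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

// Sequential per-thread IDs, similar in spirit to what the regex
// crate's Pool uses internally.
fn thread_id() -> usize {
    static COUNTER: AtomicUsize = AtomicUsize::new(0);
    thread_local! {
        static ID: Cell<Option<usize>> = Cell::new(None);
    }
    ID.with(|id| match id.get() {
        Some(n) => n,
        None => {
            let n = COUNTER.fetch_add(1, Ordering::Relaxed);
            id.set(Some(n));
            n
        }
    })
}

// A toy sharded pool: a thread parks and retrieves values via the
// stack chosen by hashing its ID, so threads mapping to different
// shards never touch the same Mutex.
struct ShardedPool<T> {
    shards: Vec<Mutex<Vec<Box<T>>>>,
    create: fn() -> T,
}

impl<T> ShardedPool<T> {
    fn new(nshards: usize, create: fn() -> T) -> Self {
        let shards = (0..nshards).map(|_| Mutex::new(Vec::new())).collect();
        ShardedPool { shards, create }
    }

    fn get(&self) -> Box<T> {
        let shard = &self.shards[thread_id() % self.shards.len()];
        let popped = shard.lock().unwrap().pop();
        popped.unwrap_or_else(|| Box::new((self.create)()))
    }

    fn put(&self, value: Box<T>) {
        let shard = &self.shards[thread_id() % self.shards.len()];
        shard.lock().unwrap().push(value);
    }
}

fn main() {
    let pool = ShardedPool::new(8, || vec![0u8; 16]);
    let buf = pool.get();
    pool.put(buf);
}
```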

@BurntSushi (Member Author)

I'd love to try (4) first just on its own. But it turns out the ABA problem is hard to avoid.

I don't quite grok your step (3), and specifically, its relationship to (1). Do we need to do (1)?

I'm also not sure if your strategy avoids the main downside of thread_local: that memory usage scales to the maximum number of threads using a Regex instead of the maximum number of threads simultaneously using a Regex.

@himikof commented May 17, 2023

Step 1 is needed to explicitly know the number of threads simultaneously using a Regex, so that we can "scale up".
Essentially an atomic counter is incremented before acquiring a cache from the pool and decremented after releasing it, and the maximum value (M) of this counter for this regex object is kept, too.

For example, let's assume that N=log2(M)+1¹. Then a new object starts with a single "fast-path" slot. As soon as any non-creator thread accesses it, a single stack is used (N=1). If another thread uses this object afterwards, the top of this stack is reused. As soon as any two threads use it simultaneously, M=2 and N=2. Any two threads either use two different stacks, or a single stack: the worst case memory usage eventually is M*N ~ M log(M). It cannot grow any larger unless M grows (if there are more than two threads using it concurrently), regardless of the total number of threads that'd ever used it. And it is the worst case bound: if the thread id hash distribution is uniform and N is large, then the expected total memory usage is still O((M / N) * N) = O(M).
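Restating the claimed bounds in symbols:

```math
N = \log_2(M) + 1, \qquad \text{worst case memory} \le M \cdot N = O(M \log M), \qquad \text{expected memory} = O\!\left(\frac{M}{N} \cdot N\right) = O(M)
```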

Footnotes

  1. I'm not sure that a simple logarithmic function is the best choice here; maybe it should start linear and only then become logarithmic, but the total memory usage is still bounded only by a function of M.

@BurntSushi (Member Author)

@himikof That does indeed sound promising! I invite folks to experiment with that in a PR after #656 lands. :-)

@BurntSushi (Member Author)

I keep seeing this pop up in various contexts. I'm going to take the conn here and try to work on the idea proposed above. I've seen way too many folks unfortunately contorting themselves to work-around this.

@BurntSushi (Member Author)

From the test case in the OP, here's where things stand in the status quo:

$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro
  Time (mean ± σ):       3.5 ms ±   0.4 ms    [User: 24.7 ms, System: 5.3 ms]
  Range (min … max):     2.6 ms …   5.3 ms    557 runs

  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro
  Time (mean ± σ):     518.1 ms ±   7.0 ms    [User: 1302.7 ms, System: 5529.3 ms]
  Range (min … max):   502.9 ms … 528.4 ms    10 runs

Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro' ran
  147.32 ± 18.57 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro'

If I replace the Mutex<Vec<Box<T>>> with SegQueue<Box<T>> from crossbeam, then it gets to:

$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro
  Time (mean ± σ):       3.5 ms ±   0.4 ms    [User: 25.1 ms, System: 5.7 ms]
  Range (min … max):     2.9 ms …   5.0 ms    566 runs

  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro
  Time (mean ± σ):     121.3 ms ±  10.9 ms    [User: 1497.0 ms, System: 2.5 ms]
  Range (min … max):   103.2 ms … 147.2 ms    23 runs

Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro' ran
   34.90 ± 5.38 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro'

Which is a big improvement, but there's still a huge cliff.

Going to continue with @himikof's idea above.

(I tried crossbeam just as a way to get a point of reference using something that is lock-free. I'm not going to add a dependency on crossbeam for regex.)
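For reference, the experiment was roughly this shape (a sketch only; the real Pool keeps its owner fast path on top of whatever the slow path is):

```rust
use crossbeam::queue::SegQueue;

// Rough shape of the experiment: swap the Mutex<Vec<Box<T>>> stack
// for a lock-free queue, creating a fresh value when the queue is
// empty.
struct QueuePool<T> {
    queue: SegQueue<Box<T>>,
    create: fn() -> T,
}

impl<T> QueuePool<T> {
    fn get(&self) -> Box<T> {
        self.queue
            .pop()
            .unwrap_or_else(|| Box::new((self.create)()))
    }

    fn put(&self, value: Box<T>) {
        self.queue.push(value);
    }
}

fn main() {
    let pool = QueuePool { queue: SegQueue::new(), create: || vec![0u8; 16] };
    let buf = pool.get();
    pool.put(buf);
}
```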

@BurntSushi (Member Author)

I tried to get a sense of what the "best" case could be here. The test above uses 16 threads, so I used a Vec<Mutex<Vec<Box<T>>>> and computed the index by taking thread_id % 16. (A naive approach that won't provide our "memory usage scales with the maximum number of simultaneous uses" desire.) So in this approach, we still have the "owner" optimization, but each thread should wind up with its own stack and thus the mutex lock/unlock should always be uncontended. I verified this by replacing stack.lock().unwrap() with stack.try_lock().unwrap(). If there were any contention, I'd expect the latter to fail. The benchmark program does not fail. If I up the thread count to 32, then I get a panic because at some point contention occurs (because there are only 16 stacks).

$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro
  Time (mean ± σ):       3.6 ms ±   0.5 ms    [User: 25.7 ms, System: 5.1 ms]
  Range (min … max):     2.8 ms …   5.7 ms    575 runs

  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro
  Time (mean ± σ):      32.5 ms ±   3.1 ms    [User: 376.5 ms, System: 2.5 ms]
  Range (min … max):    27.5 ms …  42.5 ms    83 runs

Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro' ran
    9.13 ± 1.48 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro'

So this ends up being a significant improvement, but it's still nearly 10 times slower than the "owner" path. It's an order of magnitude improvement, but still seems disheartening to be honest.

@BurntSushi (Member Author)

@himikof (or anyone else) - Is a simplistic hash function sufficient here? Specifically:

Uniformly hash the thread ids into the first N stacks, where N is a function of the logarithm of the maximum number of concurrent users. Maybe using prime numbers for a better distribution. Each thread will only access a single "cache slot" (until the number N grows, but that would be infrequent), and the contention for a single lock will be greatly reduced.

It seems like I can just do thread_id % N? Where N is the number of stacks. Thread IDs are allocated sequentially. Thoughts?

BurntSushi added a commit that referenced this issue Aug 30, 2023
> **Context:** A `Regex` uses internal mutable space (called a `Cache`)
> while executing a search. Since a `Regex` really wants to be easily
> shared across multiple threads simultaneously, it follows that a
> `Regex` either needs to provide search functions that accept a `&mut
> Cache` (thereby pushing synchronization to a problem for the caller
> to solve) or it needs to do synchronization itself. While there are
> lower level APIs in `regex-automata` that do the former, they are
> less convenient. The higher level APIs, especially in the `regex`
> crate proper, need to do some kind of synchronization to give a
> search the mutable `Cache` that it needs.
>
> The current approach to that synchronization essentially uses a
> `Mutex<Vec<Cache>>` with an optimization for the "owning" thread
> that lets it bypass the `Mutex`. The owning thread optimization
> makes it so the single threaded use case essentially doesn't pay for
> any synchronization overhead, and that all works fine. But once the
> `Regex` is shared across multiple threads, that `Mutex<Vec<Cache>>`
> gets hit. And if you're doing a lot of regex searches on short
> haystacks in parallel, that `Mutex` comes under extremely heavy
> contention. To the point that a program can slow down by enormous
> amounts.
>
> This PR attempts to address that problem.
>
> Note that it's worth pointing out that this issue can be worked
> around.
>
> The simplest work-around is to clone a `Regex` and send it to other
> threads instead of sharing a single `Regex`. This won't use any
> additional memory (a `Regex` is reference counted internally),
> but it will force each thread to use the "owner" optimization
> described above. This does mean, for example, that you can't
> share a `Regex` across multiple threads conveniently with a
> `lazy_static`/`OnceCell`/`OnceLock`/whatever.
>
> The other work-around is to use the lower level search APIs on a
> `meta::Regex` in the `regex-automata` crate. Those APIs accept a
> `&mut Cache` explicitly. In that case, you can use the `thread_local`
> crate or even an actual `thread_local!` or something else entirely.

I wish I could say this PR was a home run that fixed the contention
issues with `Regex` once and for all, but it's not. It just makes
things a little better by switching from one stack to eight stacks for
the pool. The stack is chosen by doing `self.stacks[thread_id % 8]`.
It's a pretty dumb strategy, but it limits extra memory usage while at
least reducing contention. Obviously, it works a lot better for the
8-16 thread case, and while it helps with the 64-128 thread case too,
things are still pretty slow there.

A benchmark for this problem is described in #934. We compare 8 and 16
threads, and for each thread count, we compare a `cloned` and `shared`
approach. The `cloned` approach clones the regex before sending it to
each thread where as the `shared` approach shares a single regex across
multiple threads. The `cloned` approach is expected to be fast (and
it is) because it forces each thread into the owner optimization. The
`shared` approach, however, hit the shared stack behind a mutex and
suffers majorly from contention.

Here's what that benchmark looks like before this PR.

```
$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=8 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=8 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=8 ./target/release/repro
  Time (mean ± σ):       2.3 ms ±   0.4 ms    [User: 9.4 ms, System: 3.1 ms]
  Range (min … max):     1.8 ms …   3.5 ms    823 runs

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=8 ./target/release/repro
  Time (mean ± σ):     161.6 ms ±   8.0 ms    [User: 472.4 ms, System: 477.5 ms]
  Range (min … max):   150.7 ms … 176.8 ms    18 runs

Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=8 ./target/release/repro' ran
   70.06 ± 11.43 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=8 ./target/release/repro'

$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro
  Time (mean ± σ):       3.5 ms ±   0.5 ms    [User: 26.1 ms, System: 5.2 ms]
  Range (min … max):     2.8 ms …   5.7 ms    576 runs

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro
  Time (mean ± σ):     433.9 ms ±   7.2 ms    [User: 1402.1 ms, System: 4377.1 ms]
  Range (min … max):   423.9 ms … 444.4 ms    10 runs

Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro' ran
  122.25 ± 15.80 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro'
```

And here's what it looks like after this PR:

```
$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=8 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=8 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=8 ./target/release/repro
  Time (mean ± σ):       2.2 ms ±   0.4 ms    [User: 8.5 ms, System: 3.7 ms]
  Range (min … max):     1.7 ms …   3.4 ms    781 runs

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=8 ./target/release/repro
  Time (mean ± σ):      24.6 ms ±   1.8 ms    [User: 141.0 ms, System: 1.2 ms]
  Range (min … max):    20.8 ms …  27.3 ms    116 runs

Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=8 ./target/release/repro' ran
   10.94 ± 2.05 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=8 ./target/release/repro'

$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro
  Time (mean ± σ):       3.6 ms ±   0.4 ms    [User: 26.8 ms, System: 4.4 ms]
  Range (min … max):     2.8 ms …   5.4 ms    574 runs

  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro
  Time (mean ± σ):      99.4 ms ±   5.4 ms    [User: 935.0 ms, System: 133.0 ms]
  Range (min … max):    85.6 ms … 109.9 ms    27 runs

Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro' ran
   27.95 ± 3.48 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro'
```

So instead of things getting over 123x slower in the 16 thread case, it
"only" gets 28x slower.

Other ideas for future work:

* Instead of a `Vec<Mutex<Vec<Cache>>>`, use a
`Vec<LockFreeStack<Cache>>`. I'm not sure this will fully resolve the
problem, but it's likely to make it better I think. AFAIK, the main
technical challenge here is coming up with a lock-free stack in the
first place that avoids the ABA problem. Crossbeam in theory provides
some primitives to help with this (epochs), but I don't want to add any
new dependencies.
* Think up a completely different approach to the problem. I'm drawing
a blank. (The `thread_local` crate is one such avenue, and the regex
crate actually used to use `thread_local` for exactly this. But
it led to huge memory usage in environments with lots of threads.
Specifically, I believe its memory usage scales with the total number
of threads that run a regex search, where as I want memory usage to
scale with the total number of threads *simultaneously* running a regex
search.)

Ref #934. If folks have insights or opinions, I'd appreciate if they
shared them in #934 instead of this PR. :-) Thank you!
@leviska commented Aug 31, 2023

Did you test using a spinlock for the stack instead of the mutex? It seems that operations on the stack should be very fast (pop/push is basically incrementing/decrementing an integer and swapping a pointer), so a spinlock should not wait that long.

@BurntSushi (Member Author)

Spinlocks are basically impossible to do in user space. See: https://matklad.github.io/2020/01/02/spinlocks-considered-harmful.html

The non-std version of a Pool uses a spin lock because there just isn't a better choice (short of a lock free stack, but as mentioned, such things are difficult to implement correctly without a GC). Using it with std enabled would be a cure worse than the disease, even if it were faster. Besides, a mutex is likely already spinning to some degree before yielding to an OS primitive.
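For context, the no-std fallback is roughly this shape (a sketch of a minimal spin lock; exactly the construction the linked article cautions against when an OS mutex is available):

```rust
use core::hint::spin_loop;
use core::sync::atomic::{AtomicBool, Ordering};

// With no OS to yield to, all a waiter can do is spin politely.
struct SpinLock {
    locked: AtomicBool,
}

impl SpinLock {
    const fn new() -> SpinLock {
        SpinLock { locked: AtomicBool::new(false) }
    }

    fn lock(&self) {
        while self
            .locked
            .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            spin_loop();
        }
    }

    fn unlock(&self) {
        self.locked.store(false, Ordering::Release);
    }
}

fn main() {
    static LOCK: SpinLock = SpinLock::new();
    LOCK.lock();
    LOCK.unlock();
}
```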

@leviska commented Aug 31, 2023

What if instead of a spinlock, we make the CPU search for an unused cache? Basically your implementation, but after lookup, if "our" cache is busy, we try to acquire other caches as well? And if none are free we fall back to the slow impl.

@leviska commented Aug 31, 2023

We could also put the vec under an rwlock for reallocation, and grow it when all caches are used, so it would scale with the number of parallel executions rather than with the core count.

@BurntSushi (Member Author)

I'm not sure I see how to make either of those ideas work. I don't think your description fully accounts for everything here, but I'm not sure. Do you want to experiment with those ideas? The benchmark I'm using is detailed in the OP of this issue.

@BurntSushi (Member Author)

For example, things like "we try to acquire other caches aswell" and "when all caches are used" seem like essential pieces to your ideas. I'm not sure how to deal with those efficiently.

@BurntSushi (Member Author)

An alternative approach that avoids needing to hash the thread ID is to use a per-CPU cache. I describe a fairly general data structure for this purpose here: https://mcyoung.xyz/2023/03/29/rseq-checkout/#a-checkout-desk, which assumes you have access to an OS-specific cooperative multitasking feature (e.g. restartable sequences on Linux).

That is very cool. I hadn't heard of restartable sequences before. How do you cheaply get the current CPU number? That seems like the tricky bit.

Alternatively, you could still use a thread's CPU number as the key into the table of caches, but instead of doing it locklessly, cas out the pointers from the table.

Hmmmm, right! So instead of keeping a bunch of stacks around. I just keep one value per slot. If I stick with thread IDs, I pick the slot via thread ID, CAS out the pointer. If it's null, then I create a new cache. This sounds very plausible. I will experiment with this!
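A sketch of that CAS idea (hypothetical code, not the eventual patch): each slot parks at most one cache, get swaps it out, and put tries to park it back, dropping the extra on a collision. (A real version would also impl Drop to free a parked value.)

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

struct Slot<T> {
    ptr: AtomicPtr<T>,
}

impl<T> Slot<T> {
    const fn empty() -> Slot<T> {
        Slot { ptr: AtomicPtr::new(ptr::null_mut()) }
    }

    fn get(&self, create: impl FnOnce() -> T) -> Box<T> {
        let p = self.ptr.swap(ptr::null_mut(), Ordering::AcqRel);
        if p.is_null() {
            // Slot was empty (or another thread took the cache):
            // allocate a fresh one.
            Box::new(create())
        } else {
            // SAFETY: a non-null pointer in the slot always comes from
            // Box::into_raw in `put`, and the swap gave us exclusive
            // ownership of it.
            unsafe { Box::from_raw(p) }
        }
    }

    fn put(&self, value: Box<T>) {
        let p = Box::into_raw(value);
        if self
            .ptr
            .compare_exchange(ptr::null_mut(), p, Ordering::AcqRel, Ordering::Relaxed)
            .is_err()
        {
            // Slot already occupied: drop the extra cache.
            unsafe { drop(Box::from_raw(p)) };
        }
    }
}

fn main() {
    let slot: Slot<Vec<u8>> = Slot::empty();
    let cache = slot.get(|| vec![0u8; 16]);
    slot.put(cache);
}
```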

@leviska commented Aug 31, 2023

I'd misinterpreted the current implementation, but after some time I could get something faster on my 6-core Ryzen 5 3600X remote machine (sadly, I don't have a CPU with many threads right now)

Janky implementation (it 100% needs cleaning) is here ag/betta-pool...leviska:regex:ag/betta-pool
The idea is similar to thread_local: let's say that we have a cache for each thread
But then, instead of using this cache directly, let's lazily allocate a new cache, only if all current ones are in use

For the stack I've fallen back a bit to stacks: Vec<Mutex<Option<Box<T>>>>
And the invariants are these (the first 3 are obvious):

  1. If cache[i] is not allocated, stack[i] will store None
  2. For each index we allocate a cache only once
  3. If someone is using the cache, then it's locked
  4. And if all caches are either empty or locked, then we get the first non-locked index and allocate there

If I'm not mistaken, then from 4 we should always have an allocated prefix and an unallocated suffix

Then, if we search from the start and try_lock each cache, we should either:

  1. Find a free allocated cache
  2. Find a free unallocated cache, which will be the first index-wise

The naive implementation starting from 0 wasn't fast, and my assumption is that the first N caches are probably always in use, so I start from a "random" id (just THREAD_ID) and go forward

But this can ruin our 4th invariant, so we first (fast path) try to get an allocated cache from anywhere, and if we fail (slow path), we try to find the first unused cache, and if it's not allocated, then allocate it

My benchmark results are something like this, but I'll agree that this needs better testing

TLDR:
2 threads: 7.7ms (old) vs 6.7ms (new)
6 threads (core-native): 17.7ms (old) vs 11.4ms (new)
12 threads (thread-native): 65.1ms (old) vs 20.5ms (new)

Baseline

ubuntu@main:~/regex/bench-cores$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=2 ../target/release/bench_cores" "REGEX_BENCH
_WHICH=shared REGEX_BENCH_THREADS=2 ../target/release/bench_cores"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=2 ../target/release/bench_cores
  Time (mean ± σ):       4.1 ms ±   1.1 ms    [User: 6.8 ms, System: 0.6 ms]
  Range (min … max):     2.3 ms …   7.4 ms    619 runs
 
  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
 
Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=2 ../target/release/bench_cores
  Time (mean ± σ):       7.7 ms ±   1.5 ms    [User: 13.6 ms, System: 0.5 ms]
  Range (min … max):     4.5 ms …  11.1 ms    369 runs
 
  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
 
Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=2 ../target/release/bench_cores' ran
    1.85 ± 0.62 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=2 ../target/release/bench_cores'

ubuntu@main:~/regex/bench-cores$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=6 ../target/release/bench_cores" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=6 ../target/release/bench_cores"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=6 ../target/release/bench_cores
  Time (mean ± σ):       4.8 ms ±   1.1 ms    [User: 18.4 ms, System: 1.6 ms]
  Range (min … max):     2.4 ms …   8.8 ms    379 runs
 
  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
 
Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=6 ../target/release/bench_cores
  Time (mean ± σ):      17.7 ms ±   1.6 ms    [User: 84.1 ms, System: 0.7 ms]
  Range (min … max):    13.9 ms …  21.5 ms    137 runs
 
Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=6 ../target/release/bench_cores' ran
    3.72 ± 0.95 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=6 ../target/release/bench_cores'

ubuntu@main:~/regex/bench-cores$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=12 ../target/release/bench_cores" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=12 ../target/release/bench_cores"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=12 ../target/release/bench_cores
  Time (mean ± σ):       5.0 ms ±   0.7 ms    [User: 39.5 ms, System: 1.6 ms]
  Range (min … max):     3.9 ms …   8.3 ms    377 runs
 
  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
 
Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=12 ../target/release/bench_cores
  Time (mean ± σ):      65.1 ms ±   4.0 ms    [User: 377.7 ms, System: 76.3 ms]
  Range (min … max):    53.0 ms …  73.0 ms    45 runs
 
Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=12 ../target/release/bench_cores' ran
   13.05 ± 2.00 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=12 ../target/release/bench_cores'

New

ubuntu@main:~/regex/bench-cores$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=2 ../target/release/bench_cores" "REGEX_BENCH
_WHICH=shared REGEX_BENCH_THREADS=2 ../target/release/bench_cores"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=2 ../target/release/bench_cores
  Time (mean ± σ):       4.1 ms ±   1.1 ms    [User: 6.4 ms, System: 0.8 ms]
  Range (min … max):     2.4 ms …   7.5 ms    634 runs
 
  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
 
Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=2 ../target/release/bench_cores
  Time (mean ± σ):       6.7 ms ±   1.2 ms    [User: 11.4 ms, System: 0.6 ms]
  Range (min … max):     4.1 ms …   9.3 ms    332 runs
 
  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
 
Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=2 ../target/release/bench_cores' ran
    1.63 ± 0.51 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=2 ../target/release/bench_cores'

ubuntu@main:~/regex/bench-cores$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=6 ../target/release/bench_cores" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=6 ../target/release/bench_cores"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=6 ../target/release/bench_cores
  Time (mean ± σ):       5.0 ms ±   1.0 ms    [User: 18.8 ms, System: 1.9 ms]
  Range (min … max):     2.5 ms …   7.5 ms    413 runs
 
  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
 
Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=6 ../target/release/bench_cores
  Time (mean ± σ):      11.4 ms ±   0.8 ms    [User: 58.9 ms, System: 0.8 ms]
  Range (min … max):     9.3 ms …  14.9 ms    195 runs
 
Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=6 ../target/release/bench_cores' ran
    2.29 ± 0.51 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=6 ../target/release/bench_cores'

ubuntu@main:~/regex/bench-cores$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=12 ../target/release/bench_cores" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=12 ../target/release/bench_cores"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=12 ../target/release/bench_cores
  Time (mean ± σ):       5.2 ms ±   0.6 ms    [User: 41.0 ms, System: 1.3 ms]
  Range (min … max):     4.1 ms …   8.6 ms    338 runs
 
  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
 
Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=12 ../target/release/bench_cores
  Time (mean ± σ):      20.5 ms ±   1.5 ms    [User: 197.1 ms, System: 1.0 ms]
  Range (min … max):    15.6 ms …  23.9 ms    142 runs
 
Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=12 ../target/release/bench_cores' ran
    3.97 ± 0.54 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=12 ../target/release/bench_cores'

@mcy commented Aug 31, 2023

How do you cheaply get the current CPU number? That seems like the tricky bit.

It's platform-dependent. If you're using a Linux rseq, the rseq struct where you stash your pc offsets also contains a kernel-updated cache of the CPU number. Otherwise, you call getcpu(2), which the kernel guarantees is implemented as efficiently as possible (on most platforms, it just hits the VDSO and never does a context switch). The thing to be aware of is that in any non-rseq setting, the CPU number you get back is not guaranteed to mean anything beyond "this thread executed on this core some time in the past". In practice, it should be fine for a key into a table as long as the table elements are mutated atomically.

If I stick with thread IDs, I pick the slot via thread ID, CAS out the pointer. If it's null, then I create a new cache.

Correct. Note that if you get a lot of threads hammering the same slot, they will effectively be allocating their own caches all of the time, which is probably unavoidable. Ideally you should try to minimize collisions, so that the threads=nprocs case is still fast and anything hammering the regex harder than that should consider copying the regex object.

The main benefit to using the CPU number instead in this case is that there is some pretense of that being somewhat more uniformly-distributed. What's nice about this is that if you can detect at runtime (e.g. by using dlsym to see if a particular glibc symbol exists) whether you support restartable sequences, you can branch to either using the CASing version or the rseq version.
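
To make the slot scheme concrete, a rough sketch (all names invented; a real version also needs a Drop impl to free a stashed value):

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

// One slot of the hypothetical table; `T` stands in for the regex scratch space.
struct Slot<T> {
    ptr: AtomicPtr<T>,
}

impl<T: Default> Slot<T> {
    fn acquire(&self) -> Box<T> {
        // Swap the stashed pointer out; if the slot was empty, build a
        // fresh cache instead of blocking.
        let p = self.ptr.swap(ptr::null_mut(), Ordering::AcqRel);
        if p.is_null() {
            Box::new(T::default())
        } else {
            unsafe { Box::from_raw(p) }
        }
    }

    fn release(&self, value: Box<T>) {
        let p = Box::into_raw(value);
        // Try to stash the value back; if another thread won the slot in the
        // meantime, just drop ours (this is the "allocating their own caches
        // all of the time" cost when many threads hammer one slot).
        if self
            .ptr
            .compare_exchange(ptr::null_mut(), p, Ordering::AcqRel, Ordering::Acquire)
            .is_err()
        {
            drop(unsafe { Box::from_raw(p) });
        }
    }
}
```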

@BurntSushi
Member Author

@leviska Thanks for taking a crack at it! So there are two things that immediately jump out to me in your patch. First is that the mutex for a stack now gets held until the value is returned, which could be quite some time. It's unclear what kind of impact that could have, but it probably is not great. The second is that it looks like you've effectively introduced a spin-lock via a loop without bound that keeps trying Mutex::try_lock over and over again. We have to avoid spin locks unless there is quite literally no other choice. (I would take the status quo over a spin lock.)

@BurntSushi
Member Author

@mcy I think restartable sequences are probably a bit too much for me to stomach at the moment, but I'll make a note of them in the comments as possible future work. I'm hoping to be able to make an improvement here without going down the super-platform-specific route.

The idea for using core ID makes a lot of sense to me. I totally buy your argument. Something platform specific on Linux using getcpu is light enough that I can stomach that. I'm still curious whether it will be cheap enough, but there's only one way to find out!

@leviska

leviska commented Aug 31, 2023

First is that the mutex for a stack now gets held until the value is returned, which could be quite some time. It's unclear what kind of impact that could have, but it probably is not great.

I don't have a mutex for the stack at all... Do you mean the individual cache mutexes? If so, I don't understand why holding one for a long time would be bad.

The second is that it looks like you've effectively introduced a spin-lock

Well, yes and kinda no. The spinlock here is more about safety guarantees, and it should be triggered very rarely, if ever, so it should not affect performance: if you make the stack array very big (let's say infinitely big), then there will always be free space to allocate, so the inner for loop should always finish on the first pass and the outer loop should never iterate twice. This was a proof-of-concept solution, and that part was not ideal.

While making it infinitely big is impossible, in the current implementation the array is actually unbounded, so the length is effectively infinite; in my implementation we would preallocate instead of allocating on demand (I'm talking about the vec of pointers, not the actual caches).

And we can probably assume a nice upper bound for this array, something like numcpu + eps, because we should only ever have numcpu executors at once (there is no reason to have more in a CPU-bound task). There can be a problem if a library user starts more threads than the system has cores and they all start to call regex; that's the only case (I think) where the spinlock will actually trigger.

I think it's a balance between allocating too much and blocking (if a person starts 1000 threads on an 8-CPU machine and they all call regex, do we want to allocate 1000 caches or block until some threads finish?).
If we want to allocate, then we could probably introduce another stack just for that, with a single mutex on it. That is, have two stacks: this fast one, and a second one that can be anything, for example your current implementation. Because if we have many more threads than cores (reminder: we can allocate something like 2*numcpu, so here "many more" applies), is there a point in optimizing?
If we want to block, I think we could introduce a condvar that holds how many caches are free. Then, if all caches are used and we can't allocate a new one, we wait for it to become nonzero after a notify_one, and we update it whenever a cache is acquired or released.

@BurntSushi
Member Author

BurntSushi commented Aug 31, 2023

Do you mean individual cache mutex? If so, I don't understand why holding it for a long time can be bad

Yes. Because, as far as I can tell, it means that while N searches are running (where N==available_parallelism()), no other searches can run. Everyone else is blocked on those N searches.

You also have the problem that a MutexGuard can't be sent to other threads. Since these cache values are held for the duration of a regex iterator and since those iterators are (today) Send and Sync, we really just cannot have a MutexGuard returned by the pool here.
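
(A quick illustration of that constraint, with a hypothetical assertion helper:)

```rust
// std's MutexGuard is !Send, so it can't ride along inside a Send iterator.
fn assert_send<T: Send>() {}

fn main() {
    assert_send::<Vec<u8>>(); // compiles
    // Fails to compile if uncommented: `MutexGuard<'_, ()>` is `!Send`.
    // assert_send::<std::sync::MutexGuard<'static, ()>>();
}
```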

The spinlock here is more about safety guarantees, and should be triggered very rarely, if any, so it should not affect performance

Sorry to ask this, but did you read @matklad's blog that I linked? It isn't just about performance, it's about deadlocking and other perverse behaviors. See also this discussion where spin locks are talked about as well.

(If you want to further discuss spin locks, I suggest perhaps taking it to urlo or something. I don't want to get into a spinlock debate here. My current position is basically, "never use them unless there is literally no other choice." That doesn't apply in this situation, so I'm not going to use them.)

@mcy

mcy commented Aug 31, 2023

I think restartable sequences are probably a bit too much for me to stomach at the moment

It is not for the faint of heart. If you ever do get interested, you can hmu on twitter (@/drawsmiguel, priv rn but feel free to request) or just poke junyer and ask for Miguel.

I'm still curious whether it will be cheap enough, but there's only one way to find out!

You pay a dynamic cross-module call usually. If you want to bound the number of indices, I suggest picking a power of two. If you're feeling spicy you can do a two-level radix tree (of, say, eight entries each level) where the outer level is allocated eagerly and the inner levels are allocated on-demand and initialized racily (as I do in my post).
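
A sketch of that two-level, racily initialized table, as I understand the idea (everything here is invented; blocks are intentionally leaked, so a real version needs Drop):

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

const FANOUT: usize = 8; // power of two, per the suggestion above

struct Radix<T> {
    // Outer level allocated eagerly; inner blocks allocated on demand.
    outer: [AtomicPtr<[T; FANOUT]>; FANOUT],
}

impl<T: Default> Radix<T> {
    fn new() -> Self {
        Radix { outer: std::array::from_fn(|_| AtomicPtr::new(ptr::null_mut())) }
    }

    fn slot(&self, index: usize) -> &T {
        let (hi, lo) = ((index / FANOUT) % FANOUT, index % FANOUT);
        let cell = &self.outer[hi];
        let mut block = cell.load(Ordering::Acquire);
        if block.is_null() {
            // Racy init: allocate a block, try to install it, and free ours
            // if another thread won the race.
            let fresh: *mut [T; FANOUT] =
                Box::into_raw(Box::new(std::array::from_fn(|_| T::default())));
            match cell.compare_exchange(ptr::null_mut(), fresh, Ordering::AcqRel, Ordering::Acquire) {
                Ok(_) => block = fresh,
                Err(winner) => {
                    drop(unsafe { Box::from_raw(fresh) });
                    block = winner;
                }
            }
        }
        // In real use `T` would be an atomic slot or a padded mutex.
        unsafe { &(*block)[lo] }
    }
}
```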

Also junyer randomly shared https://www.usenix.org/system/files/atc19-dice.pdf which feels like it has relevant ideas. I haven't read it fully but a global table of thread_id ^ lock_addr is very interesting.

@leviska

leviska commented Aug 31, 2023

Does this sound better? ag/betta-pool...leviska:regex:leviska/pool
The same idea, just a little clearer

The benchmarks are +- the same as with my previous implementation; on 64 threads (with a 6-core/12-thread CPU), both versions (your baseline and this new one) are +- the same.

If you want to further discuss spin locks

As I said, the spinlocks were just a hack to guarantee that some value would be found. I didn't mean that they are good; I was just explaining why, for demonstrating the idea, they were fast to write. But as I said before, there are at least two ways to remove them, and I've implemented the first: just falling back to the old implementation.

You also have the problem that a MutexGuard can't be sent to other threads.

Because we actually never block on mutexes, we can replace them with a much simpler non-blocking mutex based on a single atomic, and this mutex and its guard are Sync. I've implemented the basic idea; there's probably a better implementation out in the open, I just wanted to test.
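
Roughly the shape such a lock could take (a hedged sketch with invented names, not the actual patch):

```rust
use std::cell::UnsafeCell;
use std::ops::{Deref, DerefMut};
use std::sync::atomic::{AtomicBool, Ordering};

// A try-lock-only mutex built on one AtomicBool. Unlike std's MutexGuard,
// the guard can be Send, because unlock is just an atomic store and isn't
// tied to the thread that locked.
pub struct TryMutex<T> {
    locked: AtomicBool,
    value: UnsafeCell<T>,
}

// Safety: `value` is only reachable through a guard, and the `locked` flag
// ensures at most one guard exists at a time.
unsafe impl<T: Send> Sync for TryMutex<T> {}

pub struct Guard<'a, T> {
    lock: &'a TryMutex<T>,
}

// Safety: the guard has exclusive access, and unlocking from another thread
// is fine here (it's a plain store, not a pthread unlock).
unsafe impl<T: Send> Send for Guard<'_, T> {}
// Safety: &Guard hands out &T, so sharing a guard requires T: Sync.
unsafe impl<T: Sync> Sync for Guard<'_, T> {}

impl<T> TryMutex<T> {
    pub fn new(value: T) -> Self {
        TryMutex { locked: AtomicBool::new(false), value: UnsafeCell::new(value) }
    }

    /// Never blocks: returns None if the lock is already held.
    pub fn try_lock(&self) -> Option<Guard<'_, T>> {
        self.locked
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
            .then(|| Guard { lock: self })
    }
}

impl<T> Deref for Guard<'_, T> {
    type Target = T;
    fn deref(&self) -> &T {
        unsafe { &*self.lock.value.get() }
    }
}

impl<T> DerefMut for Guard<'_, T> {
    fn deref_mut(&mut self) -> &mut T {
        unsafe { &mut *self.lock.value.get() }
    }
}

impl<T> Drop for Guard<'_, T> {
    fn drop(&mut self) {
        self.lock.locked.store(false, Ordering::Release);
    }
}
```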

it means that while N searches are running (where N==available_parallelism()), no other searches can run. Everyone else is blocked on those N searches.

I didn't get that: each search is independent of the others, and yes, if all the caches are used, no other executors can run. In the old implementation I had a blocking prototype for handling the out-of-caches situation, because it wasn't really relevant to the idea. This implementation fixes that.

@BurntSushi
Member Author

@leviska Oh I see, I didn't realize the spin lock was a stand-in. I'll take a closer look at your update later. Thank you!

@cosmicexplorer
Contributor

cosmicexplorer commented Sep 1, 2023

I've tried to investigate the "lock-free stack" approach, using a 128-bit compare_exchange_weak() atomic operation with a counter to address the ABA problem. The diff is visible at https://github.com/rust-lang/regex/compare/ag/betta-pool...cosmicexplorer:regex:regex-cache-investigation?expand=1, and it depends on the extremely small lock-free stack impl at https://github.com/cosmicexplorer/minimal-lock-free-stack. I think this approach actually isn't as complex as feared, but unfortunately my initial attempt was much slower than the mutex approach, taking about 500ms in hyperfine:

# cosmicexplorer@lightning-strike-teravolts: ~/tools/regex/tmp-benchmark 16:20:45
; hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro
  Time (mean ± σ):       5.7 ms ±   0.5 ms    [User: 65.2 ms, System: 2.0 ms]
  Range (min … max):     4.6 ms …   8.3 ms    506 runs
 
  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
 
Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro
  Time (mean ± σ):     516.6 ms ±  16.3 ms    [User: 7298.5 ms, System: 5.6 ms]
  Range (min … max):   483.7 ms … 532.4 ms    10 runs
 
Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro' ran
   90.81 ± 8.96 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro'

perf is also quite clear about the reason why: it's the latency of the 128-bit compare-exchange operation:

  41.92%  repro    repro              [.] portable_atomic::imp::x86_64::cmpxchg16b
  30.71%  repro    repro              [.] regex_automata::util::pool::inner::Pool<T,F>::get_slow
  23.83%  repro    repro              [.] minimal_lock_free_stack::Stack<T>::push
   0.72%  repro    repro              [.] <regex_automata::meta::strategy::Core as regex_automata::meta::strategy::Strategy>::search_half
   0.66%  repro    repro              [.] portable_atomic::imp::x86_64::atomic_load_vmovdqa
   0.60%  repro    repro              [.] regex_automata::hybrid::search::find_fwd
   0.57%  repro    repro              [.] regex::regex::string::Regex::is_match_at
   0.33%  repro    repro              [.] regex_automata::util::empty::skip_splits_fwd

I used the 128-bit cmpxchg16b x86 instruction because I didn't want to have to stuff bits in a pointer like crossbeam does, and because it offers greater protection against ABA, but it's possible that instruction is much slower than e.g. a 64-bit atomic instruction on my laptop. Anyone with an ARM box is free to try it out to see if their atomics are faster.

I'm going to try packing bits into the pointer like crossbeam does now and seeing if 64-bit atomics improve anything, but I'm not terribly hopeful. It seems like @leviska's approach or modifying Cache to allow lock-free updates like RE2 does are more likely to improve performance here.

EDIT: It's also possible I've used too-restrictive Orderings, but I don't believe that's the case. See:

  1. pop(): .compare_exchange_weak(cur_top, new_top, Ordering::AcqRel, Ordering::Acquire)
  2. push(): compare_exchange_weak(cur_top, new_top, Ordering::Release, Ordering::Relaxed)
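
For anyone curious, "packing bits into the pointer" here means exploiting alignment: nodes that are at least 8-byte aligned always have three zero low bits, which can carry a small version tag so a 64-bit CAS can stand in for the 128-bit one. crossbeam's tagged pointers use the same alignment trick; the helpers below are invented and give much weaker ABA protection than a full-width counter:

```rust
const TAG_BITS: usize = 0b111; // usable only if align_of::<T>() >= 8

fn pack<T>(ptr: *mut T, tag: usize) -> usize {
    debug_assert!(std::mem::align_of::<T>() >= 8);
    debug_assert_eq!(ptr as usize & TAG_BITS, 0);
    (ptr as usize) | (tag & TAG_BITS)
}

fn unpack<T>(word: usize) -> (*mut T, usize) {
    ((word & !TAG_BITS) as *mut T, word & TAG_BITS)
}
```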

BurntSushi added a commit that referenced this issue Sep 1, 2023
> **Context:** A `Regex` uses internal mutable space (called a `Cache`)
> while executing a search. Since a `Regex` really wants to be easily
> shared across multiple threads simultaneously, it follows that a
> `Regex` either needs to provide search functions that accept a `&mut
> Cache` (thereby pushing synchronization to a problem for the caller
> to solve) or it needs to do synchronization itself. While there are
> lower level APIs in `regex-automata` that do the former, they are
> less convenient. The higher level APIs, especially in the `regex`
> crate proper, need to do some kind of synchronization to give a
> search the mutable `Cache` that it needs.
>
> The current approach to that synchronization essentially uses a
> `Mutex<Vec<Cache>>` with an optimization for the "owning" thread
> that lets it bypass the `Mutex`. The owning thread optimization
> makes it so the single threaded use case essentially doesn't pay for
> any synchronization overhead, and that all works fine. But once the
> `Regex` is shared across multiple threads, that `Mutex<Vec<Cache>>`
> gets hit. And if you're doing a lot of regex searches on short
> haystacks in parallel, that `Mutex` comes under extremely heavy
> contention. To the point that a program can slow down by enormous
> amounts.
>
> This PR attempts to address that problem.
>
> Note that it's worth pointing out that this issue can be worked
> around.
>
> The simplest work-around is to clone a `Regex` and send it to other
> threads instead of sharing a single `Regex`. This won't use any
> additional memory (a `Regex` is reference counted internally),
> but it will force each thread to use the "owner" optimization
> described above. This does mean, for example, that you can't
> share a `Regex` across multiple threads conveniently with a
> `lazy_static`/`OnceCell`/`OnceLock`/whatever.
>
> The other work-around is to use the lower level search APIs on a
> `meta::Regex` in the `regex-automata` crate. Those APIs accept a
> `&mut Cache` explicitly. In that case, you can use the `thread_local`
> crate or even an actual `thread_local!` or something else entirely.

I wish I could say this PR was a home run that fixed the contention
issues with `Regex` once and for all, but it's not. It just makes
things a fair bit better by switching from one stack to eight stacks
for the pool, plus a couple other heuristics. The stack is chosen
by doing `self.stacks[thread_id % 8]`. It's a pretty dumb strategy,
but it limits extra memory usage while at least reducing contention.
Obviously, it works a lot better for the 8-16 thread case, and while
it helps with the 64-128 thread case too, things are still pretty slow
there.

A benchmark for this problem is described in #934. We compare 8 and 16
threads, and for each thread count, we compare a `cloned` and `shared`
approach. The `cloned` approach clones the regex before sending it to
each thread, whereas the `shared` approach shares a single regex across
multiple threads. The `cloned` approach is expected to be fast (and
it is) because it forces each thread into the owner optimization. The
`shared` approach, however, hits the shared stack behind a mutex and
suffers majorly from contention.

Here's what that benchmark looks like before this PR for 64 threads (on a
24-core CPU).

```
$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./tmp/repro-master"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro
  Time (mean ± σ):       9.0 ms ±   0.6 ms    [User: 128.3 ms, System: 5.7 ms]
  Range (min … max):     7.7 ms …  11.1 ms    278 runs

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./tmp/repro-master
  Time (mean ± σ):      1.938 s ±  0.036 s    [User: 4.827 s, System: 41.401 s]
  Range (min … max):    1.885 s …  1.992 s    10 runs

Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro' ran
  215.02 ± 15.45 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./tmp/repro-master'
```

And here's what it looks like after this PR:

```
$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro
  Time (mean ± σ):       9.0 ms ±   0.6 ms    [User: 127.6 ms, System: 6.2 ms]
  Range (min … max):     7.9 ms …  11.7 ms    287 runs

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./target/release/repro
  Time (mean ± σ):      55.0 ms ±   5.1 ms    [User: 1050.4 ms, System: 12.0 ms]
  Range (min … max):    46.1 ms …  67.3 ms    57 runs

Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro' ran
    6.09 ± 0.71 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./target/release/repro'
```

So instead of things getting over 215x slower in the 64 thread case, it
"only" gets 6x slower.

Closes #934
@BurntSushi
Member Author

OK, so it turns out that false sharing appears to be the primary culprit here. Using multiple stacks and ensuring each mutex is in its own cache line leads to considerable improvement. For example, if I try @leviska's approach as-is, then I get:

$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro
  Time (mean ± σ):       8.6 ms ±   0.5 ms    [User: 118.0 ms, System: 6.3 ms]
  Range (min … max):     7.4 ms …  10.5 ms    277 runs

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./target/release/repro
  Time (mean ± σ):     178.7 ms ±  17.0 ms    [User: 3865.6 ms, System: 4.7 ms]
  Range (min … max):   146.0 ms … 213.7 ms    19 runs

But if I add a repr(C, align(64)) wrapper struct around its TryMutex, then I get:

$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro
  Time (mean ± σ):       8.6 ms ±   0.6 ms    [User: 118.6 ms, System: 6.8 ms]
  Range (min … max):     7.4 ms …  10.6 ms    308 runs

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./target/release/repro
  Time (mean ± σ):      61.2 ms ±   8.8 ms    [User: 1272.5 ms, System: 4.9 ms]
  Range (min … max):    38.4 ms …  81.9 ms    47 runs

But it turns out I'm able to do as well (even a little better) by sticking with a simple mutex strategy combined with creating new values if there's contention (i.e., Mutex::try_lock fails):

$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro
  Time (mean ± σ):       9.1 ms ±   0.6 ms    [User: 127.7 ms, System: 6.0 ms]
  Range (min … max):     7.9 ms …  11.3 ms    279 runs

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./target/release/repro
  Time (mean ± σ):      53.8 ms ±   3.6 ms    [User: 1037.8 ms, System: 10.9 ms]
  Range (min … max):    48.4 ms …  62.7 ms    50 runs
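
For concreteness, here's a condensed sketch of that strategy (my reconstruction with invented names, not the actual code in #1080). Dropping values on a contended put is what bounds memory while keeping the hot path from ever waiting:

```rust
use std::sync::Mutex;

// Each shard sits on its own cache line so neighboring mutexes don't
// false-share.
#[repr(C, align(64))]
struct Shard<T>(Mutex<Vec<T>>);

struct Pool<T> {
    shards: Vec<Shard<T>>,
    create: fn() -> T,
}

impl<T> Pool<T> {
    fn new(num_shards: usize, create: fn() -> T) -> Self {
        let shards = (0..num_shards).map(|_| Shard(Mutex::new(Vec::new()))).collect();
        Pool { shards, create }
    }

    fn get(&self, thread_id: usize) -> T {
        let shard = &self.shards[thread_id % self.shards.len()];
        // Never wait on a contended shard: build a fresh value instead.
        match shard.0.try_lock() {
            Ok(mut stack) => stack.pop().unwrap_or_else(|| (self.create)()),
            Err(_) => (self.create)(),
        }
    }

    fn put(&self, thread_id: usize, value: T) {
        let shard = &self.shards[thread_id % self.shards.len()];
        // If the shard is contended, drop the value rather than block.
        if let Ok(mut stack) = shard.0.try_lock() {
            stack.push(value);
        }
    }
}
```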

(I also tried experimenting with an optimization idea from @Shnatsel where instead of creating a fresh Cache when under high contention, we instead clone an existing one. This has the benefit of potentially reusing some parts of the lazy DFA's transition table that had been previously computed. But in all my attempts for this, the timings were actually a little slower. This is somewhat surprising, but my best guess is that generating a fresh cache is perhaps not as expensive as I thought, and this optimization tends to add more pressure on a single atomic? Not sure.)

@cosmicexplorer

I think this approach actually isn't as complex as feared, but unfortunately my initial attempt was much slower than the mutex approach, taking about 500ms in hyperfine:

Yeah I have no doubt that it isn't so complex if you use dependencies to either do it for you or give you access to the requisite primitives to do so (i.e., 128-bit atomics in this case). I really cannot personally justify taking a dependency for this, so it would mean re-rolling a lot of that soup. Not impossible, and can be done, but annoying.

And yeah, interesting how much slower it is. Did you use a single stack shared across all threads? Or 8 sharded stacks? If the former, my guess based on my run-in with false sharing is exactly that...

modifying Cache to allow lock-free updates like RE2 does

Yeah I explicitly decided long ago not to go down this route. The lazy DFA isn't the only thing that needs a mutable cache. And IIRC, RE2's lazy DFA is not lock free. I'm pretty sure there are mutexes inside of there. The downside of RE2's approach is that your lazy DFA becomes a lot more complicated IMO. As do the mutable scratch types for other engines (such as the PikeVM). And I don't actually know whether RE2 suffers from contention issues or not. Possibly not.

Anyway, given what's in #1080 now (just updated), we've gone from 215x slower for 64 threads:

$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./tmp/repro-master"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro
  Time (mean ± σ):       9.0 ms ±   0.6 ms    [User: 128.3 ms, System: 5.7 ms]
  Range (min … max):     7.7 ms …  11.1 ms    278 runs

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./tmp/repro-master
  Time (mean ± σ):      1.938 s ±  0.036 s    [User: 4.827 s, System: 41.401 s]
  Range (min … max):    1.885 s …  1.992 s    10 runs

Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro' ran
  215.02 ± 15.45 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./tmp/repro-master'

To 6x slower:

$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro
  Time (mean ± σ):       9.0 ms ±   0.6 ms    [User: 127.6 ms, System: 6.2 ms]
  Range (min … max):     7.9 ms …  11.7 ms    287 runs

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./target/release/repro
  Time (mean ± σ):      55.0 ms ±   5.1 ms    [User: 1050.4 ms, System: 12.0 ms]
  Range (min … max):    46.1 ms …  67.3 ms    57 runs

Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro' ran
    6.09 ± 0.71 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./target/release/repro'

That's good enough of a win for me. I just hope it doesn't introduce other unforeseen problems. Unfortunately, it can be somewhat difficult to observe them. (For example, excessive memory usage.)

BurntSushi added a commit that referenced this issue Sep 1, 2023
@Shnatsel
Member

Shnatsel commented Sep 1, 2023

I understand the strategy of creating new DFAs will add a lot of variance depending on the exact regex and input used. The performance of repr(C, align(64)) mutexes should be more predictable. (Although I may be biased here, because I'm pretty sure I also proposed this approach.)

(I also tried experimenting with an optimization idea from @Shnatsel where instead of creating a fresh Cache when under high contention, we instead clone an existing one. This has the benefit of potentially reusing some parts of the lazy DFA's transition table that had been previously computed. But in all my attempts for this, the timings were actually a little slower. This is somewhat surprising, but my best guess is that generating a fresh cache is perhaps not as expensive as I thought, and this optimization tends to add more pressure on a single atomic? Not sure.)

If the atomic is usually only read, not written or compare-exchanged, then AFAIK contention should not be an issue. Did you compare-exchange it every time? The case of the slot being empty should be uncommon, so it is probably faster to do a read first, and only compare-exchange if that finds that the slot is empty. That's 2 operations in the uncommon case, but read-only access in the common one.

Also, you could try to stash it only once every N iterations, which both gives the DFA more time to mature and reduces contention.

Finally, it could also be suffering from false sharing. Try wrapping it in repr(C, align(64)) and see if that helps.

(none of these three items are mutually exclusive, they can and probably should be applied together)
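
A sketch of the first suggestion (read before compare-exchange), with invented names; the "once every N iterations" idea would just gate this call behind a counter:

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

// Only attempt the RMW when a plain load says the slot is empty; the common
// occupied case then costs one read and no cache-line ownership transfer.
fn try_stash<T>(slot: &AtomicPtr<T>, candidate: *mut T) -> bool {
    if !slot.load(Ordering::Relaxed).is_null() {
        return false; // occupied: skip the compare-exchange entirely
    }
    slot.compare_exchange(ptr::null_mut(), candidate, Ordering::Release, Ordering::Relaxed)
        .is_ok()
}
```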

@BurntSushi
Member Author

If the atomic is usually only read, not written or compare-exchanged, then AFAIK contention should not be an issue. Did you compare-exchange it every time? The case of the slot being empty should be uncommon, so it is probably faster to do a read first, and only compare-exchange if that finds that the slot is empty. That's 2 operations in the uncommon case, but read-only access in the common one.

Yes I did all that. Didn't cmpxchg every time. It's not my preferred explanation. I think it's just that cloning an existing lazy DFA is not as big of a win as I thought. Actually, of course, the benchmark I'm using is not stressing the lazy DFA's cache. So I'll need to tweak the benchmark and re-measure. Blech.

Finally, it could also be suffering from false sharing. Try wrapping it in repr(C, align(64)) and see if that helps.

Yeah, but the shared atomic is shared across all threads, whereas the stacks are keyed by thread ID. I guess we could shard the caches that we copy from too? Like I said, I'm not a fan of the explanation that false sharing was impacting the cloning optimization. It's just something to consider and test.

@leviska

leviska commented Sep 1, 2023

That's nice. I just want to add that in my implementation we could definitely use less strict orderings (I just used SeqCst everywhere, because I'm not that good at them) and probably gain some more perf, especially if false sharing makes such a big difference.
But your solution does have nice perf.
What if we track how many caches we've allocated in total, and how many are in use right now per stack? Something with two atomics. It wouldn't be exact, but it could help us decide whether to allocate a new cache or block on the mutex, to reduce overall memory usage.
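
Something like this, perhaps (a rough sketch of the two-counter bookkeeping with invented names; the counts are heuristics, so Relaxed operations suffice):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

struct PoolStats {
    allocated: AtomicUsize, // caches ever created
    in_use: AtomicUsize,    // caches currently checked out
}

impl PoolStats {
    fn on_get(&self) {
        self.in_use.fetch_add(1, Ordering::Relaxed);
    }
    fn on_put(&self) {
        self.in_use.fetch_sub(1, Ordering::Relaxed);
    }
    // Approximate by design: a few extra allocations past the cap are fine.
    fn should_allocate(&self, cap: usize) -> bool {
        if self.allocated.load(Ordering::Relaxed) >= cap {
            return false; // prefer blocking/reusing over allocating
        }
        self.allocated.fetch_add(1, Ordering::Relaxed);
        true
    }
}
```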

@BurntSushi
Member Author

It's worth experimenting with, but after re-measuring @Shnatsel's cloning optimization with a better benchmark, I'm probably going to move on. The contention problem isn't fully solved, but it's in a much more palatable state.

But definitely encourage others to keep working on this. I'd be happy to merge simpler patches. (My general prior is that tricky unsafe should come with sizeable perf improvements.)

@cosmicexplorer
Contributor

@BurntSushi:

Did you use a single stack shared across all threads? Or 8 sharded stacks? If the former, my guess based on my run-in with false sharing is exactly that...

A single stack shared across all threads!

Yeah I have no doubt that it isn't so complex if you use dependencies to either do it for you or give you access to the requisite primitives to do so (i.e., 128-bit atomics in this case). I really cannot personally justify taking a dependency for this, so it would mean re-rolling a lot of that soup. Not impossible, and can be done, but annoying.

Thanks so much for the clear feedback! If AtomicUsize is enough bits for crossbeam, then it's probably enough bits for us too, and that avoids the dependency on portable-atomic that I used for the 128-bit cmpxchg.

(My general prior is that tricky unsafe should come with sizeable perf improvements.)

I do not have enough experience with atomics to expect sizable perf improvements from this approach given the success we've just had here, but someone else who's more familiar with this sort of thing might be able to take the harness I created (see #934 (comment)) and improve it.

@cosmicexplorer
Contributor

Hmmm, just a note: I think false sharing may indeed be relevant for the lock-free stack approach. As I increase the alignment of individual linked-list entries in the lock-free stack (each of which is allocated in a Box), the performance improves slightly, but reliably. When I make this change (cosmicexplorer/minimal-lock-free-stack@be269d8):

/* Align to 512 to decrease false sharing. */
#[repr(C, align(512))]
struct Node<T> {
  pub value: mem::ManuallyDrop<T>,
  pub next: *mut Node<T>,
}

We go from 90x -> 80x slower, or 500ms to 470ms (see #934 (comment) to compare to prior results):

# cosmicexplorer@lightning-strike-teravolts: ~/tools/regex/tmp-benchmark 18:46:30 
; hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro
  Time (mean ± σ):       5.9 ms ±   0.5 ms    [User: 68.3 ms, System: 2.0 ms]
  Range (min … max):     4.7 ms …   8.3 ms    418 runs
 
  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.
 
Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro
  Time (mean ± σ):     471.5 ms ±  13.2 ms    [User: 6521.3 ms, System: 52.6 ms]
  Range (min … max):   449.8 ms … 487.3 ms    10 runs
 
Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=16 ./target/release/repro' ran
   79.69 ± 6.75 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=16 ./target/release/repro'

I'm having trouble with the bit twiddling necessary to use 64-bit atomics like crossbeam does, and am probably going to give up, but if the above behavior looks suspicious to anyone, please feel free to follow up.

BurntSushi added a commit that referenced this issue Sep 2, 2023
@BurntSushi
Member Author

BurntSushi commented Sep 2, 2023

I'm hopeful that this is now largely fixed in regex 1.9.5.
