
Comparison against alternative crates? #39

Open
TheButlah opened this issue Apr 11, 2021 · 26 comments

TheButlah commented Apr 11, 2021

Hi, I'm considering using this crate but am unsure whether the performance is any better than the other SPSC wait-free ring buffers. A comparison to crossbeam_channel might also be merited, since I could see using their bounded queues in a very similar way to the ring buffer.

mgeier (Owner) commented Apr 11, 2021

Thanks for bringing this up! This is an interesting but complicated topic ...


First, the simple thing: SPSC is always faster than MPSC/SPMC/MPMC.
If not, the SPSC implementation sucks (or the other one is incorrect).
SPSC is just a much simpler problem with a much simpler solution. Not only is it less complex, it also needs much less code and is typically easier to understand. On the flip side, it's also much more limited.

All crossbeam queues are slower than this SPSC implementation (and any other reasonable SPSC implementation).
That's why crossbeam-rs/crossbeam#338 was started, which also includes a few benchmarks.
The rtrb implementation is based on that PR, which was never merged.


When it comes to comparing SPSC implementations, you should consider multiple aspects:

  • correctness
  • API
  • performance

Correctness is very hard to check, because all SPSC implementations use a fair amount of unsafe code.
It's basically impossible to test; you'll have to analyze the code to convince yourself that it is correct.

I think that rtrb is correct, but who am I to say?

I think the most important difference between the existing SPSC crates is their API.
Many crates only work for u8 values, some work with arbitrary T.
The simplest API is to push() and pop() single items.
Some crates (e.g. rtrb and https://github.com/agerasev/ringbuf) allow handling multiple items at once, which is more efficient than repeated push() and/or pop().
The most efficient way to write multiple items will involve an unsafe API (because of uninitialized memory).
IMHO the API of rtrb is nicer than that of ringbuf, but I'm of course very biased.

What kind of operations are you planning to use?
Which of the provided APIs do you prefer?

If there is something missing in the API of rtrb, please let me know!

Finally, performance ...

It's very hard to create meaningful benchmarks. I've created some in the benches directory, which can be run with cargo bench.
The single-threaded benchmarks are quite reliable and reproducible, but also mostly useless.
The more interesting part is the behavior when two threads are involved, but the benchmarks are potentially less reproducible due to unpredictable OS thread scheduling.

If you want to compare rtrb to other crates, you can copy the benchmarking code and adapt it accordingly.
Some time ago, I started working on a comparison between crates (https://github.com/mgeier-forks/rtrb/tree/performance-comparison), but I haven't finished it. If you want, you can try it anyway by running cargo bench in the performance-comparison subdirectory.

All this is probably meaningless, because the benchmark code will most likely not reflect your actual usage pattern, and the real-life performance differences between SPSC implementations are probably negligibly small anyway.

mgeier (Owner) commented Apr 11, 2021

Keeping in mind that all benchmarks are wrong, here is one result from running the two-threads benchmark from the https://github.com/mgeier-forks/rtrb/tree/performance-comparison branch on my laptop:

[violin plot: two-threads benchmark results]

You should of course not trust the good result for rtrb, because I created the benchmark (and I selected the result from a few runs).

Changes of about 10% are common between runs; sometimes there are even 20% changes.

If you have any ideas how to make the results more stable (or more meaningful in general), please let me know!

ghost commented Apr 12, 2021

[benchmark results screenshot]
My test result.

ghost commented Apr 16, 2021

with

[profile.bench]
lto = true
opt-level = 3
codegen-units = 1

and a tweaked kernel, the results are amazingly fast:

[benchmark results screenshot]

mgeier (Owner) commented Apr 16, 2021

I've just tried it with the suggested [profile.bench] settings:

[violin plot: benchmark results with the suggested profile settings]

The differences are not that big, but it looks like the pop() timings improve in most crates.

Could that mean that maybe an #[inline] is missing in one of the functions in the pop() code path?

tweaked my kernel, resulting in a very amazing speed.

Are you talking about changes to the Linux scheduler?

This might have an influence on how the secondary thread (the one that generates contention on the atomic variables) is scheduled and therefore distort the measurements.

ghost commented Apr 16, 2021

I followed the Red Hat low-latency tuning guide (https://access.redhat.com/sites/default/files/attachments/201501-perf-brief-low-latency-tuning-rhel7-v1.1.pdf): I disabled hyperthreading and applied all the kernel command-line options mentioned in the doc, and I use the low-latency tuned-adm profile. I also use cpuset to clear the cores I use for the benchmark.

mgeier (Owner) commented Apr 20, 2021

Thanks @zhenpingfeng for the information about latency tuning. This looks very promising and I'll have a closer look when I have more time.

In the meantime, I've modified the benchmark code a bit: #42.
I hope this makes it more reproducible and also more meaningful.

I also modified the performance comparison and created a new branch: https://github.com/mgeier-forks/rtrb/tree/performance-comparison2

[violin plot: performance-comparison2 benchmark results]

mgeier (Owner) commented Apr 20, 2021

@TheButlah Coming back to your original question about crossbeam-channel ...

I tried my latest benchmark with crossbeam-queue and concurrent-queue:

[violin plot: benchmark results including crossbeam-queue and concurrent-queue]

I must say I'm quite surprised how fast they are!

In the uncontended case they are quite a bit slower than most SPSC implementations but faster than ringbuf!

In the contended case it's much closer. crossbeam-queue is only a little bit slower than the SPSC implementations, and concurrent-queue is even faster than ringbuf.

ghost commented Apr 21, 2021

[benchmark results screenshot]
My benchmark2 result.

ghost commented Apr 21, 2021

use rtrb::RingBuffer;
use std::thread;
use std::time::Instant;


fn main() {
    let (mut producer, mut consumer) = RingBuffer::<Instant>::new(2).split();

    thread::spawn(move || {
        loop {
            match consumer.pop().ok() {
                Some(now) => {
                    println!("{:?}", now.elapsed());
                }
                None => { continue; }
            }
        }
    });

    loop {
        let now = Instant::now();
        let _ = producer.push(now).is_ok();
    }
}

I have some questions. The elapsed time I get with the above code is about 4 µs (FIFO scheduler). Am I using this library wrong? Or does sending an Instant struct really take this long? How can I reduce the latency to the nanosecond level?

New update:

use rtrb::RingBuffer;
use std::thread;
use std::time::Instant;

fn main() {
    let num_msg: usize = 1_000_000;

    let (mut producer, mut consumer) = RingBuffer::<Instant>::new(2).split();

    thread::spawn(move || {
        let mut n = 0;
        loop {
            match consumer.pop().ok() {
                Some(now) => {
                    let _ = now.elapsed();
                    if n != num_msg {
                        n += 1;
                    } else {
                        break;
                    }
                }
                None => continue,
            }
        }
    });

    let now = Instant::now();
    for _ in 0..num_msg {
        while producer.push(now).is_err() {}
    }

    println!("avg {:?}", now.elapsed() / num_msg as u32);
}

After removing the println! macro, it now only costs around 120ns per send. Problem solved.

mgeier (Owner) commented Apr 21, 2021

Thanks @zhenpingfeng for running the benchmarks again, the results seem pretty much consistent with mine, which is good!

The idea of sending an Instant through the ring buffer is brilliant!
However, in your updated code, this isn't really used anymore, right?
You might as well send any other payload since you are ignoring the value obtained from pop().

BTW, I think you could replace your match statement with if let, and the continue statement is superfluous since it is the last statement in the loop anyway.

ghost commented Apr 21, 2021

Actually, the elapsed() function call does have an effect: if I remove it, each send only costs around 100ns.

mgeier (Owner) commented Apr 22, 2021

Actually, the elapsed() function call does have an effect: if I remove it, each send only costs around 100ns.

Yes, sure, the elapsed() method is called, but the result is ignored, so you might as well remove it.

We are not interested in benchmarking the elapsed() method, right?

If it takes 100ns without the elapsed() call, this perfectly matches your result from above (#39 (comment)), doesn't it?

ghost commented Aug 21, 2022

An update for rtrb 0.2.2 and rustc 1.65.0-nightly under 5.15.0-46-generic Ubuntu:

[benchmark results screenshot]

mgeier (Owner) commented Aug 21, 2022

Thanks @zhenpingfeng for the updated measurement!

It's interesting that npnc seems to have gotten slightly worse in the contended case. But maybe that's just measurement inaccuracy.

BTW, there is a new branch with additional crates for performance comparison: https://github.com/mgeier-forks/rtrb/tree/performance-comparison3. And I've removed spsc-bounded-queue because it has been yanked from crates.io.

ghost commented Aug 21, 2022

[benchmark results screenshot]
The performance-comparison3 result.

mgeier (Owner) commented Aug 23, 2022

Here's the result on my laptop (Intel(R) Core(TM) i5-7Y54):

[violin plot: benchmark results]

jmakov commented Jul 4, 2023

Can we get these kinds of plots in the README on the landing page? Super interesting.

mgeier (Owner) commented Dec 2, 2023

Thanks for the suggestion @jmakov! I'm hesitant to provide "official" plots in the README, because all benchmarks are wrong. But I'm adding a few sentences with a link to this issue here, where everyone can comment on their flaws and can provide opposing results.

See #107.

kasparthommen commented

Hi @mgeier, have you ever tried including Disruptor-inspired crates in your benchmarks, like any of the following?

mgeier (Owner) commented May 2, 2024

Thanks for the hint @kasparthommen, I have never heard of it!

I didn't quite understand the API though ... I have added the performance comparison to the codebase (see #123); would you like to open a PR adding the crates you suggested?

Speaking of which, I have updated the benchmarks recently, so I think it's time to share some plots again. I did those on Linux, with an Intel(R) Core(TM) i5-7Y54 CPU.

I split the benchmarks into two parts. One uses a very small buffer size (only 2 elements!), which means there is a lot of contention and many of the attempted read and write operations will fail:

[plot: benchmark results for the small buffer]

The other benchmark uses a very large buffer size, and therefore there is no contention at all, so every single intended read and write operation will succeed:

[plot: benchmark results for the large buffer]

I think those are the worst-case and best-case scenarios, respectively, and any real use case will be somewhere in between.

Note that npnc is best in the uncontended case (likely because it uses the power-of-2 index wrapping trick), but it is significantly worse in the contended case.

On the other hand, omango is the best in the contended case, but it takes more than twice as long as the leaders in the uncontended case.

If anyone else wants to share their results (especially on other CPU architectures), please go ahead!

boranby commented May 27, 2024

  • AMD EPYC 9374F
  • RHEL 9.4 RT

[violin plots: large-buffer and small-buffer benchmark results]

I ran two-threads-large without magnetic to see the others in more detail.
[violin plot: large-buffer benchmark without magnetic]

boranby commented Dec 27, 2024

AMD EPYC 9374F
RHEL 9.5 RT
Heavily tuned for low latency.

I pinned the benchmark push/pop threads to isolated cores. Results differ depending on which cores are used, due to the L3 cache layout: the AMD EPYC 9374F has 256 MB of L3 cache in total, but it is not a monolithic cache; each group of 4 cores has 32 MB. I ran the benchmarks on cores 4 and 5, which share the same L3 cache, and on cores 4 and 8, which don't.

Core 4,5 (Same L3)

[violin plots: large-buffer and small-buffer results]

Core 4,8 (Different L3)

[violin plots: large-buffer and small-buffer results]

mgeier (Owner) commented Dec 29, 2024

Thanks a lot @boranby for your benchmark results, that's very interesting!

It looks like the uncontended case is more or less the same (the larger variance in the second measurement might be random and might reduce when repeating the measurement?).
I guess this can be explained by the fact that there is hardly any communication between threads in that case, so the L3 layout doesn't matter much.

In the contended case, the time sometimes more than triples, wow!
The relative speeds of the different implementations seem to stay similar, though, with one big exception: the crossbeam queue beats everyone else (when running on separate L3 caches), which is very surprising (at least to me).

Any ideas why this may be?

One other thing (unrelated to the L3 cache layout) that I already noticed in your previous results (#39 (comment)) is that "rtrb" performs quite a lot worse than "crossbeam-pr338" in the contended case. I would expect (or at least hope for) no difference between the two, because "rtrb" is a fork of "pr338". I have made a few changes, but I didn't intend to decrease the performance!
In the above measurements on Intel CPUs, there is no big difference.

I don't know what causes the difference, but I would like to find out. One change that might be the culprit is #48. I have just created #132 which reverts this change.

@boranby could you please try to run the benchmark on #132 and post the results there?

If that's not it, does anyone have an idea what else it could be?

boranby commented Dec 29, 2024

Hi @mgeier, first of all, thank you for your work on this crate!
My previous results (#39 (comment)) were probably run with the producer and consumer on the same core, and they might even have run on non-isolated cores, so they might be misleading.

Again running on isolated cores, this time with #132 via my fork https://github.com/boranby/rtrb/tree/cache-head-and-tail:

Core 4,5 (Same L3)

[violin plots: large-buffer and small-buffer results]

Core 4,8 (Different L3)

[violin plots: large-buffer and small-buffer results]

Edit: I also ran without the change to verify the system, and the result is almost the same as #39 (comment).

mgeier (Owner) commented Dec 29, 2024

Thanks for the new results for #132!

It looks like this indeed improved the performance of rtrb, hooray!

In the case of different L3 caches, it is still very slightly slower than "pr338", but this might be negligible.

If anyone has an idea how to further improve the performance of rtrb, please let me know!
