Comparison against alternative crates? #39
Thanks for bringing this up! This is an interesting but complicated topic ...

First, the simple thing: SPSC is always faster than MPSC/SPMC/MPMC. When it comes to comparing SPSC implementations, you should consider multiple aspects:

Correctness is very hard to check, because all SPSC implementations use a fair amount of unsafe code. I think the most important difference between the existing SPSC crates is their API. What kind of operations are you planning to use? If there is something missing in the API of rtrb, please let me know.

Finally, performance ... It's very hard to create meaningful benchmarks. I've created some in the benches directory, which can be run with cargo bench. If you want to compare other crates ...

All this is probably meaningless, because the benchmark code will most likely not reflect your actual usage pattern, and the real-life performance differences between SPSC implementations are probably negligibly small anyway.
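For readers who haven't used any of these crates, here is a minimal sketch of the push/pop style of API being discussed, assuming the rtrb API where RingBuffer::new(capacity) returns a producer/consumer pair (older versions used a separate split() step, so check the docs for the exact signatures):

```rust
use rtrb::RingBuffer;

fn main() {
    // Create a bounded SPSC ring buffer with room for 1024 elements.
    // The producer and consumer halves can be moved to different threads.
    let (mut producer, mut consumer) = RingBuffer::<u64>::new(1024);

    let writer = std::thread::spawn(move || {
        for i in 0..100 {
            // push() fails (without blocking) if the buffer is full.
            while producer.push(i).is_err() {
                std::thread::yield_now();
            }
        }
    });

    let reader = std::thread::spawn(move || {
        for _ in 0..100 {
            // pop() fails (without blocking) if the buffer is empty.
            loop {
                if let Ok(value) = consumer.pop() {
                    assert!(value < 100);
                    break;
                }
                std::thread::yield_now();
            }
        }
    });

    writer.join().unwrap();
    reader.join().unwrap();
}
```

Most of the SPSC crates mentioned in this thread offer some variation of this non-blocking push/pop pair; the differences are mainly in chunk-based access, blocking variants, and how full/empty conditions are reported.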
Keeping in mind that all benchmarks are wrong, here is one result from running the benchmarks. You should of course not trust the good result for ... Changes of about 10% are common between runs; sometimes there are even 20% changes. If you have any ideas how to make the results more stable (or more meaningful in general), please let me know!
I've just tried it with the suggested ... The differences are not that big, but it looks like the ... Could that mean that maybe an ...?
Are you talking about changes to the Linux scheduler? This might have an influence on how the secondary thread (the one that generates contention on the atomic variables) is scheduled and therefore distort the measurements.
I followed the Red Hat low-latency tuning guide, https://access.redhat.com/sites/default/files/attachments/201501-perf-brief-low-latency-tuning-rhel7-v1.1.pdf. I disabled hyperthreading and applied all the kernel command-line options mentioned in the doc, and I use the low-latency tuned-adm profile. I also use cpuset to clear the cores I use for the benchmark.
Thanks @zhenpingfeng for the information about latency tuning. This looks very promising and I'll have a closer look when I have more time. In the meantime, I've modified the benchmark code a bit: #42. I also modified the performance comparison and created a new branch: https://github.com/mgeier-forks/rtrb/tree/performance-comparison2
@TheButlah Coming back to your original question about crossbeam_channel: I tried my latest benchmark with it, and I must say I'm quite surprised how fast they are! In the uncontended case they are quite a bit slower than most SPSC implementations but faster than ... In the contended case it's much closer.
I have some questions. The elapsed time I get by using the above code is about 4 µs (FIFO scheduler). Is my method of using this library wrong? Or does sending an Instant struct really take this amount of time? How can I reduce the latency to the nanosecond level?

New update: after removing the println! macro, it now only costs around 120 ns per send. Problem solved.
Thanks @zhenpingfeng for running the benchmarks again; the results seem pretty much consistent with mine, which is good! The idea of sending an Instant ... BTW, I think you could replace your ...
Actually, the elapsed function call works; if I remove it, each send only costs around 100 ns.
Yes, sure, the elapsed() call works. We are not interested in benchmarking the elapsed() call itself, though. If it takes 100 ns without the elapsed() call, that's the more meaningful number.
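To make the measurement approach discussed above concrete, here is a rough sketch (the code actually used isn't shown in this thread, so this is only an assumption about its shape): timestamps are sent through the queue, latencies are accumulated on the receiving side, and nothing is printed inside the hot loop. It uses the same rtrb API as in the sketch further up.

```rust
use std::time::{Duration, Instant};
use rtrb::RingBuffer;

fn main() {
    const N: usize = 100_000;
    let (mut producer, mut consumer) = RingBuffer::<Instant>::new(1024);

    let sender = std::thread::spawn(move || {
        for _ in 0..N {
            // Send the current timestamp; spin if the buffer happens to be full.
            while producer.push(Instant::now()).is_err() {}
        }
    });

    let receiver = std::thread::spawn(move || {
        let mut total = Duration::ZERO;
        for _ in 0..N {
            // Spin until a timestamp arrives, then accumulate the latency.
            loop {
                if let Ok(sent) = consumer.pop() {
                    total += sent.elapsed();
                    break;
                }
            }
        }
        total
    });

    sender.join().unwrap();
    let total = receiver.join().unwrap();
    // Print only once, after the measurement loop, so that println! does not
    // distort the per-send numbers.
    println!("average latency: {:?}", total / N as u32);
}
```

Note that the reported number still includes queueing delay and the cost of taking the timestamps themselves, which is exactly why removing the per-iteration println! (and, for comparison, the elapsed() call) changes the result so much.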
Thanks @zhenpingfeng for the updated measurement! It's interesting that ... BTW, there is a new branch with additional crates for performance comparison: https://github.com/mgeier-forks/rtrb/tree/performance-comparison3. And I've removed ...
Can we get this kind of picture in the README on the landing page? Super interesting.
Thanks for the hint @kasparthommen, I had never heard of it! I didn't quite understand the API though ... I have added the performance comparison to the codebase (see #123); would you like to open a PR adding the crates you suggested?

Speaking of which, I have updated the benchmarks recently, so I think it's time to share some plots again. I did those on Linux, with an Intel(R) Core(TM) i5-7Y54 CPU. I split the benchmarks into two parts. One uses a very small buffer size (only 2 elements!), which means there is a lot of contention and many of the attempted read and write operations will fail. The other benchmark uses a very large buffer size, and therefore no contention at all, so that every single intended read and write operation will succeed. I think those are the worst-case and best-case scenarios, respectively, and any real use case will be somewhere in between.

Note that ... On the other hand, ...

If anyone else wants to share their results (especially on other CPU architectures), please go ahead!
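The two scenarios described above can be illustrated with a small experiment (this is not the actual code from the benches directory, just a sketch): with a capacity of 2, most first push attempts fail because the buffer is usually full, while with a huge capacity essentially every attempt succeeds on the first try.

```rust
use rtrb::RingBuffer;

/// Counts how many push attempts succeed on the first try, which is a rough
/// proxy for how much contention a given capacity produces.
fn first_try_success_rate(capacity: usize, items: usize) -> f64 {
    let (mut producer, mut consumer) = RingBuffer::<usize>::new(capacity);
    let mut first_try = 0;

    let reader = std::thread::spawn(move || {
        let mut received = 0;
        while received < items {
            if consumer.pop().is_ok() {
                received += 1;
            }
        }
    });

    for i in 0..items {
        if producer.push(i).is_ok() {
            first_try += 1;
        } else {
            // Buffer was full: retry until there is room again.
            while producer.push(i).is_err() {}
        }
    }

    reader.join().unwrap();
    first_try as f64 / items as f64
}

fn main() {
    // Capacity 2: lots of contention, many failed attempts (worst case).
    println!("capacity 2:    {:.2}", first_try_success_rate(2, 1_000_000));
    // Huge capacity: effectively no contention (best case).
    println!("capacity 2^20: {:.2}", first_try_success_rate(1 << 20, 1_000_000));
}
```

Any real workload sits somewhere between these two extremes, which is why the plots are split into a contended and an uncontended part.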
AMD EPYC 9374F. I pinned the benchmark push/pop threads to isolated cores and get different results on different cores due to the L3 cache layout: the AMD EPYC 9374F has 256 MB of L3 cache in total, but it is not a monolithic cache; each group of 4 cores shares 32 MB. I ran the benchmarks on cores 4 and 5, which share the same L3 cache, and on cores 4 and 8, which don't.

Cores 4,5 (same L3):
Cores 4,8 (different L3):
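The comment above doesn't say how the threads were pinned; one way to do it from inside the benchmark process is the core_affinity crate, as in the sketch below (this is only an assumption about the setup, it may just as well have been done with taskset or cset):

```rust
use core_affinity::CoreId;
use rtrb::RingBuffer;

fn main() {
    let (mut producer, mut consumer) = RingBuffer::<u64>::new(2);

    // Pin the producer to core 4 and the consumer to core 5 (same L3 slice on
    // this particular CPU); use cores 4 and 8 instead to cross an L3 boundary.
    let push_thread = std::thread::spawn(move || {
        core_affinity::set_for_current(CoreId { id: 4 });
        for i in 0..1_000_000u64 {
            while producer.push(i).is_err() {}
        }
    });
    let pop_thread = std::thread::spawn(move || {
        core_affinity::set_for_current(CoreId { id: 5 });
        for _ in 0..1_000_000u64 {
            while consumer.pop().is_err() {}
        }
    });

    push_thread.join().unwrap();
    pop_thread.join().unwrap();
}
```

Crossing an L3 boundary means every cache-line transfer for the shared indices and slots has to go through the interconnect instead of the shared L3, which is consistent with the much larger times in the contended case.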
Thanks a lot @boranby for your benchmark results, that's very interesting! It looks like the uncontended case is more or less the same (the larger variance in the second measurement might be random and might shrink when the measurement is repeated?). In the contended case, the time sometimes more than triples, wow! Any ideas why that might be?

One other thing (unrelated to the L3 cache layout) that I already noticed in your previous results (#39 (comment)) is that "rtrb" performs quite a lot worse than "crossbeam-pr338" in the contended case. I would expect (or at least hope for) no differences between the two, because "rtrb" is a fork of "pr338". I have made a few changes, but I didn't intend to decrease the performance! I don't know what causes the difference, but I would like to find out. One change that might be the culprit is #48. I have just created #132, which reverts this change. @boranby could you please try to run the benchmark on #132 and post the results there? If that's not it, does anyone have an idea what else it could be?
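For readers not familiar with the change being discussed: judging only from the branch name in the next comment ("cache-head-and-tail"), the optimization in question appears to be the common SPSC trick of keeping a thread-local copy of the other side's index, so the shared atomic is reloaded only when the cached value suggests the buffer is full (or empty). The following is purely an illustrative sketch of that idea, written without unsafe code by restricting elements to u64; it is not rtrb's actual implementation.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
use std::sync::Arc;

struct Shared {
    slots: Vec<AtomicU64>, // element storage (u64 only, to keep the sketch free of unsafe code)
    head: AtomicUsize,     // index of the next slot to read, advanced by the consumer
    tail: AtomicUsize,     // index of the next slot to write, advanced by the producer
}

struct Producer {
    shared: Arc<Shared>,
    cached_head: usize, // local copy of `head`, refreshed only when the buffer looks full
}

impl Producer {
    fn push(&mut self, value: u64) -> Result<(), u64> {
        let s = &self.shared;
        let tail = s.tail.load(Ordering::Relaxed); // we are the only writer of `tail`
        if tail - self.cached_head == s.slots.len() {
            // Looks full according to the cached index: reload the shared one.
            self.cached_head = s.head.load(Ordering::Acquire);
            if tail - self.cached_head == s.slots.len() {
                return Err(value); // really full
            }
        }
        s.slots[tail % s.slots.len()].store(value, Ordering::Relaxed);
        s.tail.store(tail + 1, Ordering::Release);
        Ok(())
    }
}

struct Consumer {
    shared: Arc<Shared>,
    cached_tail: usize, // local copy of `tail`, refreshed only when the buffer looks empty
}

impl Consumer {
    fn pop(&mut self) -> Option<u64> {
        let s = &self.shared;
        let head = s.head.load(Ordering::Relaxed); // we are the only writer of `head`
        if head == self.cached_tail {
            // Looks empty according to the cached index: reload the shared one.
            self.cached_tail = s.tail.load(Ordering::Acquire);
            if head == self.cached_tail {
                return None; // really empty
            }
        }
        let value = s.slots[head % s.slots.len()].load(Ordering::Relaxed);
        s.head.store(head + 1, Ordering::Release);
        Some(value)
    }
}

fn new(capacity: usize) -> (Producer, Consumer) {
    let shared = Arc::new(Shared {
        slots: (0..capacity).map(|_| AtomicU64::new(0)).collect(),
        head: AtomicUsize::new(0),
        tail: AtomicUsize::new(0),
    });
    let producer = Producer { shared: Arc::clone(&shared), cached_head: 0 };
    let consumer = Consumer { shared, cached_tail: 0 };
    (producer, consumer)
}

fn main() {
    let (mut p, mut c) = new(4);
    assert!(p.push(42).is_ok());
    assert_eq!(c.pop(), Some(42));
    assert_eq!(c.pop(), None);
}
```

The point of the cached indices is that in the common (non-full, non-empty) case each side only touches atomics it writes itself, so the two threads stop bouncing the same cache lines back and forth, which matters most in exactly the contended, different-L3 configuration measured above.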
Hi @mgeier, first of all thank you for your work on this crate! Again running on isolated cores, this time with #132 applied on my fork (https://github.com/boranby/rtrb/tree/cache-head-and-tail):

Cores 4,5 (same L3):
Cores 4,8 (different L3):

Edit: I also ran without the change to verify the system, and the result is almost the same as #39 (comment).
Thanks for the new results for #132! It looks like this indeed improved the performance of rtrb. In the case of different L3 caches, it is still very slightly slower than "pr338", but this might be negligible. If anyone has an idea how to further improve the performance of rtrb, please let me know!
Hi, I'm considering using this crate but am unsure whether the performance is any better than the other SPSC wait-free ring buffers. A comparison to crossbeam_channel might also be merited, since I could see using their bounded queues in a very similar way to the ring buffer.
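For context, this is roughly how a crossbeam_channel bounded queue would be used in the same role as the ring buffer (a sketch for illustration, not taken from the benchmarks in this thread):

```rust
use crossbeam_channel::{bounded, TryRecvError, TrySendError};

fn main() {
    // A bounded channel behaves much like an SPSC ring buffer in this usage,
    // although it also supports multiple producers and consumers.
    let (sender, receiver) = bounded::<u64>(1024);

    let push_thread = std::thread::spawn(move || {
        for i in 0..100_000u64 {
            let mut item = i;
            loop {
                match sender.try_send(item) {
                    Ok(()) => break,
                    // Buffer full: spin and retry with the returned value.
                    Err(TrySendError::Full(v)) => item = v,
                    Err(TrySendError::Disconnected(_)) => return,
                }
            }
        }
    });

    let pop_thread = std::thread::spawn(move || {
        let mut count = 0u64;
        loop {
            match receiver.try_recv() {
                Ok(_) => count += 1,
                Err(TryRecvError::Empty) => {
                    if count == 100_000 {
                        break;
                    }
                }
                Err(TryRecvError::Disconnected) => break,
            }
        }
        count
    });

    push_thread.join().unwrap();
    assert_eq!(pop_thread.join().unwrap(), 100_000);
}
```

The try_send/try_recv calls play the same role as push/pop on the ring buffer, which is why the two can be compared in the same benchmark harness; the channel pays some extra cost for being MPMC-capable, which is what the measurements in this thread try to quantify.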