Use of std::atomic can slow down multithreaded tests #452
Comments
Here is a reproducer:
    #include <doctest.h>
    #include <thread>
    #include <vector>

    // 4.364 seconds: std::atomic<int> numAssertsFailedCurrentTest_atomic
    // 0.755 seconds: MultiLaneAtomic<int> numAssertsCurrentTest_atomic
    TEST_CASE("MultiLaneAtomic") {
        static constexpr auto numIters = size_t(10000000);
        auto threads = std::vector<std::thread>();
        for (size_t i = 0; i < std::thread::hardware_concurrency(); ++i) {
            threads.emplace_back([] {
                for (size_t it = 0; it < numIters; ++it) {
                    REQUIRE(it < numIters);
                }
            });
        }
        for (auto& thread : threads) {
            thread.join();
        }
    }
It is my understanding that this solution would work even if there are more threads/cores than the 32 lanes (by default) - in every
Regarding the timing - the example code ran on my machine for 4.5 seconds vs 3.2 seconds with this atomic class, so there was some gain indeed. Maybe the 3.5-to-1 difference you are observing is more exaggerated because there are more cores/threads on your machine? I tested with 12 cores - maybe you have a lot more?
If that's the case (bigger payoff for higher multi-threadedness), maybe it's worth going forward with a PR - even if everyone has to pay the 10% increase when not using concurrency. It would also be trivial to surround the entire atomic class with an
I think the lanes should be a compile-time define so that it's configurable (something like
You can put this new atomic class right above
Exactly, that's the idea. The goal is to spread the threads across the lanes; if some threads share a lane it's not a problem, just a bit of a slowdown between those threads.
What compiler are you using? I saw on godbolt that Visual Studio seems to produce much more code than g++ or clang++ does, so it might not be as good on Windows. I have an Intel i7-8700 with 6 cores/12 threads, and used clang++ with
Whoops - rookie mistake - wasn't specifying
Just tested both with clang 9 and gcc 9 on Ubuntu 19.10 and now I see the same 3x speedup - this does indeed seem worthwhile!
Adds the configuration option `DOCTEST_CONFIG_NO_MULTI_LANE_ATOMICS` to disable multi lane atomics. This can speed up assertions in highly parallel tests by a factor of 3 and more, with a slight slowdown for the single threaded case. Closes #452
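As a rough illustration of how this option might be used: the define would go before the include in the source file that implements doctest. This is only a sketch, assuming the common single-file setup with `DOCTEST_CONFIG_IMPLEMENT_WITH_MAIN`; a project that implements the library elsewhere would place the define there instead.

```cpp
// Sketch, assuming this is the one source file that implements doctest.
// Opts out of multi lane atomics, e.g. for a mostly single threaded test suite.
#define DOCTEST_CONFIG_NO_MULTI_LANE_ATOMICS
#define DOCTEST_CONFIG_IMPLEMENT_WITH_MAIN
#include <doctest.h>
```

Leaving the define out keeps the multi lane behaviour, which is the case that benefits highly parallel tests.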
Merged the PR - closing this issue as well. Will release a new version probably sometime in January - use the
Description
Having lots of `REQUIRE` in a multithreaded test is relatively slow due to the use of `std::atomic`.
I have a test that performs a lot of asserts, and is called in parallel. It's 240016891 `REQUIRE` statements. Without the `REQUIRE`, the test takes 0.37 seconds on my machine. When I add the `REQUIRE`, the test takes 8.05 seconds.
It seems that most of the slowdown comes from the use of `std::atomic` for the variable `numAssertsCurrentTest_atomic`. The problem is that in my case 12 threads basically block each other by continuously incrementing the atomic, which leads to lots of cache invalidation for the other threads.
I've played a bit with the code, and have come up with a multi-lane implementation of atomic. This splits the atomic into multiple atomics, each sitting on a different cache line, and each thread operates on a different atomic so they can't block each other. This speeds up the test from 8.05 seconds to 2.28 seconds on my machine.
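The multi-lane idea described above can be sketched roughly as follows. This is a minimal illustration, not the code from this issue or from doctest: the template parameters `NumLanes` and `CacheLineSize`, the member names, the hash-based lane selection, and the helper `countInParallel` are all assumptions made for the example.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical sketch of a multi-lane counter; names and layout are
// assumptions, not the implementation discussed in this thread.
template <typename T, std::size_t NumLanes = 32, std::size_t CacheLineSize = 64>
class MultiLaneAtomic {
    // Each lane is aligned (and therefore padded) to its own cache line,
    // so increments on different lanes do not invalidate each other.
    struct alignas(CacheLineSize) Lane {
        std::atomic<T> value{0};
    };
    Lane lanes_[NumLanes];

    // Pick a lane once per thread by hashing the thread id.
    // Threads that land on the same lane still count correctly;
    // they only contend with each other.
    static std::size_t laneIndex() {
        thread_local const std::size_t idx =
            std::hash<std::thread::id>()(std::this_thread::get_id()) % NumLanes;
        return idx;
    }

public:
    // Increment only the calling thread's lane.
    void increment() {
        lanes_[laneIndex()].value.fetch_add(1, std::memory_order_relaxed);
    }

    // Sum all lanes. The result is exact once all writers have finished,
    // for example after their threads have been joined.
    T load() const {
        T sum = 0;
        for (const Lane& lane : lanes_) {
            sum += lane.value.load(std::memory_order_relaxed);
        }
        return sum;
    }
};

// Usage: deliberately use fewer lanes (4) than threads to show that
// lane sharing does not lose increments.
inline long countInParallel(std::size_t numThreads, long itersPerThread) {
    MultiLaneAtomic<long, 4> counter;
    std::vector<std::thread> threads;
    for (std::size_t t = 0; t < numThreads; ++t) {
        threads.emplace_back([&counter, itersPerThread] {
            for (long i = 0; i < itersPerThread; ++i) {
                counter.increment();
            }
        });
    }
    for (auto& th : threads) {
        th.join();
    }
    return counter.load();
}
```

The trade-off in this sketch is that reading the value has to sum every lane, so reads get slower while writes get cheaper. That suits a counter that is incremented on every assert but read only occasionally.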
Steps to reproduce
Create a test that spawns many threads, each calling `REQUIRE` in a tight loop.
Put this into `doctest.h`:
Then, replace the line
    std::atomic<int> numAssertsCurrentTest_atomic;
with
    MultiLaneAtomic<int> numAssertsCurrentTest_atomic;
The test will run significantly faster with the `MultiLaneAtomic<int>`.
If you think this is worthwhile, I could create a pull request for this.