-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
testFWCoreUtilities fails in ARM IBs #34647
Comments
A new Issue was created by @iarspider . @Dr15Jones, @perrotta, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign core |
New categories assigned: core @Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks |
Seems that the test is failing consistently, and started to fail before CMSSW_12_0_X_2021-07-19-2300 / CMSSW_12_0_X_2021-07-20-1100 that are the earliest IBs that have test results still available. |
I'm unable to reproduce the failure locally. Likely related to the load of the ARM nodes (that isn't too high at the moment). I'm thinking to enable the printout cmssw/FWCore/Utilities/test/reusableobjectholder_t.cppunit.cpp Lines 240 to 242 in 23f1aad
(but before the asserts) to see a bit more what is going on when the asserts fail. |
Done in #34651 |
Here is the printout from CMSSW_12_0_X_2021-07-28-2300
Then to think how the number can be 3. |
Or even 4: |
Perhaps more significant, that failure is on ppc64. I had been unable to reproduce the crash on my Apple Silicon M1, so I was leaning towards the theory that it could be another issue with the idiosyncrasies of the Cavium ThunderX cores, but if we're seeing it on Power9 too that's a different story. |
It is still failing in both aarch64 and ppc64le (but with a 7 in ppc!):
|
@Dr15Jones Maybe we should consider just removing the test. We haven't really had capacity to deeply understand the behavior (of TBB) on ARM and PPC, and (as discussed at the time) the doesn't assert strictly necessary behavior. |
We could remove it but it is a useful reminder that something is going on with the atomics on those platforms and it would be good to figure out exactly what is happening (since the codes INTENT is to only give 3 objects at max even though it strictly isn't a problem mechanically if it does more). |
I'd vote for keeping it, understanding what's going on is still on my todo list |
I spent some time on this issue this morning, and I think I traced to the oneTBB implementation of bool internal_try_pop( void* dst ) {
ticket_type k;
do {
k = my_queue_representation->head_counter.load(std::memory_order_relaxed);
do {
if (static_cast<std::ptrdiff_t>(my_queue_representation->tail_counter.load(std::memory_order_relaxed) - k) <= 0) {
// Queue is empty
return false;
}
// Queue had item with ticket k when we looked. Attempt to get that item.
// Another thread snatched the item, retry.
} while (!my_queue_representation->head_counter.compare_exchange_strong(k, k + 1));
} while (!my_queue_representation->choose(k).pop(dst, k, *my_queue_representation, my_allocator));
return true;
} Due to the first Changing it to |
Note that this is fixed in the template <typename QueueRep, typename Allocator>
std::pair<bool, ticket_type> internal_try_pop_impl(void* dst, QueueRep& queue, Allocator& alloc ) {
ticket_type ticket{};
do {
// Basically, we need to read `head_counter` before `tail_counter`. To achieve it we build happens-before on `head_counter`
ticket = queue.head_counter.load(std::memory_order_acquire);
do {
if (static_cast<std::ptrdiff_t>(queue.tail_counter.load(std::memory_order_relaxed) - ticket) <= 0) { // queue is empty
// Queue is empty
return { false, ticket };
}
// Queue had item with ticket k when we looked. Attempt to get that item.
// Another thread snatched the item, retry.
} while (!queue.head_counter.compare_exchange_strong(ticket, ticket + 1));
} while (!queue.choose(ticket).pop(dst, ticket, queue, alloc));
return { true, ticket };
} |
Thanks @fwyzard for digging out the cause! I see the fix in oneTBB landed with uxlfoundation/oneTBB#782, that did not enter in the latest v2021.8.0 release. @smuzaffar Should we try adding a patch, or wait for a new release? |
This might have something to do with some other crashes, so a patch would be nice. |
I am opening cmsdist Pr with the https://patch-diff.githubusercontent.com/raw/oneapi-src/oneTBB/pull/782.patch patch |
ARM unit tests seem to be clean since CMSSW_13_1_X_2023-03-03-2300. I'd wait for another week or so to see results of more tests. On the other hand, I wasn't able to to tell from the IB dashboard the IB where cms-sw/cmsdist#8355 was included. I see
I don't see any cmsdist changes in the IBs between those two (for el8_amd64_gcc11), but according to the merge time stamps cms-sw/cmsdist#8355 was merged between the other two PRs. |
@makortel , IB dashboard only shows merge requests and for cms-sw/cmsdist#8355 we squash and merge the PR that is why it is not visible on the dash-board. cms-sw/cmsdist#8355 was merged 2 days ago so it should be in last few IBs |
I have not seen any unit test failures on ARM since the fix, so it we can close the issue. |
+core |
@cmsbuild, please close |
This issue is fully signed and ready to be closed. |
Log: https://cmssdt.cern.ch/SDT/cgi-bin/logreader/cc8_aarch64_gcc9/CMSSW_12_0_X_2021-07-19-2300/unitTestLogs/FWCore/Utilities#/121-121
The text was updated successfully, but these errors were encountered: