Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

testFWCoreUtilities fails in ARM IBs #34647

Closed
iarspider opened this issue Jul 27, 2021 · 24 comments
Closed

testFWCoreUtilities fails in ARM IBs #34647

iarspider opened this issue Jul 27, 2021 · 24 comments

Comments

@iarspider
Copy link
Contributor

Log: https://cmssdt.cern.ch/SDT/cgi-bin/logreader/cc8_aarch64_gcc9/CMSSW_12_0_X_2021-07-19-2300/unitTestLogs/FWCore/Utilities#/121-121

===== Test "testFWCoreUtilities" ====
Running ...............................F...

reusableobjectholder_t.cppunit.cpp:240:Assertion
Test name: reusableobjectholder_test::testSimultaneousUse
assertion failed
- Expression: t1ItemsSeen.size() > 0 && t1ItemsSeen.size() < 3

Failures !!!
Run: 34   Failure total: 1   Failures: 1   Errors: 0

---> test testFWCoreUtilities had ERRORS
TestTime:29
^^^^ End Test testFWCoreUtilities ^^^^
@cmsbuild
Copy link
Contributor

A new Issue was created by @iarspider .

@Dr15Jones, @perrotta, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Copy link
Contributor

assign core

@cmsbuild
Copy link
Contributor

New categories assigned: core

@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Copy link
Contributor

Seems that the test is failing consistently, and started to fail before CMSSW_12_0_X_2021-07-19-2300 / CMSSW_12_0_X_2021-07-20-1100 that are the earliest IBs that have test results still available.

@makortel
Copy link
Contributor

I'm unable to reproduce the failure locally. Likely related to the load of the ARM nodes (that isn't too high at the moment). I'm thinking to enable the printout

CPPUNIT_ASSERT(t1ItemsSeen.size() > 0 && t1ItemsSeen.size() < 3);
CPPUNIT_ASSERT(t2ItemsSeen.size() > 0 && t2ItemsSeen.size() < 3);
//std::cout <<" # seen: "<<t1ItemsSeen.size() <<" "<<t2ItemsSeen.size()<<std::endl;

(but before the asserts) to see a bit more what is going on when the asserts fail.

@makortel
Copy link
Contributor

I'm thinking to enable the printout (but before the asserts) to see a bit more what is going on when the asserts fail.

Done in #34651

@makortel
Copy link
Contributor

Here is the printout from CMSSW_12_0_X_2021-07-28-2300

===== Test "testFWCoreUtilities" ====
Running ............................... # seen: 3 3
F...

reusableobjectholder_t.cppunit.cpp:241:Assertion
Test name: reusableobjectholder_test::testSimultaneousUse
assertion failed
- Expression: t1ItemsSeen.size() > 0 && t1ItemsSeen.size() < 3

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_aarch64_gcc9/CMSSW_12_0_X_2021-07-28-2300/unitTestLogs/FWCore/Utilities#/122-122

Then to think how the number can be 3.

@perrotta
Copy link
Contributor

@dan131riley
Copy link

Then to think how the number can be 3.

Or even 4:

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_ppc64le_gcc9/CMSSW_12_0_X_2021-07-29-2300/unitTestLogs/FWCore/Utilities#/111

Perhaps more significant, that failure is on ppc64. I had been unable to reproduce the crash on my Apple Silicon M1, so I was leaning towards the theory that it could be another issue with the idiosyncrasies of the Cavium ThunderX cores, but if we're seeing it on Power9 too that's a different story.

@aandvalenzuela
Copy link
Contributor

It is still failing in both aarch64 and ppc64le (but with a 7 in ppc!):

===== Test "testFWCoreUtilities" ====
Running ................ # seen: 7 7
F..................

reusableobjectholder_t.cppunit.cpp:241:Assertion
Test name: reusableobjectholder_test::testSimultaneousUse
assertion failed
- Expression: t1ItemsSeen.size() > 0 && t1ItemsSeen.size() < 3

Failures !!!
Run: 34   Failure total: 1   Failures: 1   Errors: 0

---> test testFWCoreUtilities had ERRORS

@makortel
Copy link
Contributor

@Dr15Jones Maybe we should consider just removing the test. We haven't really had capacity to deeply understand the behavior (of TBB) on ARM and PPC, and (as discussed at the time) the doesn't assert strictly necessary behavior.

@Dr15Jones
Copy link
Contributor

We could remove it but it is a useful reminder that something is going on with the atomics on those platforms and it would be good to figure out exactly what is happening (since the codes INTENT is to only give 3 objects at max even though it strictly isn't a problem mechanically if it does more).

@dan131riley
Copy link

I'd vote for keeping it, understanding what's going on is still on my todo list

@fwyzard
Copy link
Contributor

fwyzard commented Mar 3, 2023

I spent some time on this issue this morning, and I think I traced to the oneTBB implementation of concurrent_queue::try_pop() (see include/oneapi/tbb/concurrent_queue.h#L180-L195):

    bool internal_try_pop( void* dst ) {
        ticket_type k;
        do {
            k = my_queue_representation->head_counter.load(std::memory_order_relaxed);
            do {
                if (static_cast<std::ptrdiff_t>(my_queue_representation->tail_counter.load(std::memory_order_relaxed) - k) <= 0) {
                    // Queue is empty
                    return false;
                }

                // Queue had item with ticket k when we looked. Attempt to get that item.
                // Another thread snatched the item, retry.
            } while (!my_queue_representation->head_counter.compare_exchange_strong(k, k + 1));
        } while (!my_queue_representation->choose(k).pop(dst, k, *my_queue_representation, my_allocator));
        return true;
    }

Due to the first std::memory_order_relaxed, try_pop may sometime fail spuriously on architectures that have a relaxed memory ordering like Power and ARM.

Changing it to std::memory_order_acquire and rebuilding the test fixed it for me on a Power 8 machine (run successfully 20 times out of 20).

@fwyzard
Copy link
Contributor

fwyzard commented Mar 3, 2023

Note that this is fixed in the master branch of oneTBB (see include/oneapi/tbb/concurrent_queue.h#L31-L47):

template <typename QueueRep, typename Allocator>
std::pair<bool, ticket_type> internal_try_pop_impl(void* dst, QueueRep& queue, Allocator& alloc ) {
    ticket_type ticket{};
    do {
        // Basically, we need to read `head_counter` before `tail_counter`. To achieve it we build happens-before on `head_counter`
        ticket = queue.head_counter.load(std::memory_order_acquire);
        do {
            if (static_cast<std::ptrdiff_t>(queue.tail_counter.load(std::memory_order_relaxed) - ticket) <= 0) { // queue is empty
                // Queue is empty
                return { false, ticket };
            }
            // Queue had item with ticket k when we looked.  Attempt to get that item.
            // Another thread snatched the item, retry.
        } while (!queue.head_counter.compare_exchange_strong(ticket, ticket + 1));
    } while (!queue.choose(ticket).pop(dst, ticket, queue, alloc));
    return { true, ticket };
}

@makortel
Copy link
Contributor

makortel commented Mar 3, 2023

Thanks @fwyzard for digging out the cause! I see the fix in oneTBB landed with uxlfoundation/oneTBB#782, that did not enter in the latest v2021.8.0 release.

@smuzaffar Should we try adding a patch, or wait for a new release?

@dan131riley
Copy link

This might have something to do with some other crashes, so a patch would be nice.

@smuzaffar
Copy link
Contributor

I am opening cmsdist Pr with the https://patch-diff.githubusercontent.com/raw/oneapi-src/oneTBB/pull/782.patch patch

@makortel
Copy link
Contributor

makortel commented Mar 7, 2023

ARM unit tests seem to be clean since CMSSW_13_1_X_2023-03-03-2300. I'd wait for another week or so to see results of more tests.

On the other hand, I wasn't able to to tell from the IB dashboard the IB where cms-sw/cmsdist#8355 was included. I see

I don't see any cmsdist changes in the IBs between those two (for el8_amd64_gcc11), but according to the merge time stamps cms-sw/cmsdist#8355 was merged between the other two PRs.

@smuzaffar
Copy link
Contributor

@makortel , IB dashboard only shows merge requests and for cms-sw/cmsdist#8355 we squash and merge the PR that is why it is not visible on the dash-board. cms-sw/cmsdist#8355 was merged 2 days ago so it should be in last few IBs

@makortel
Copy link
Contributor

I have not seen any unit test failures on ARM since the fix, so it we can close the issue.

@makortel
Copy link
Contributor

+core

@makortel
Copy link
Contributor

@cmsbuild, please close

@cmsbuild
Copy link
Contributor

This issue is fully signed and ready to be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants