After we updated oneTBB from 2021.8.0 to 2021.9.0, we started to see occasional segfaults in concurrent_queue on ARM. The stack trace of the segfault looks like this:
#4 <signal handler called>
#5 0x000040008006b318 in void tbb::detail::d2::concurrent_queue<mkfit::MkFinder*, tbb::detail::d1::cache_aligned_allocator<mkfit::MkFinder*> >::internal_push<mkfit::MkFinder* const&>(mkfit::MkFinder* const&) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02791/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_2_X_2023-06-25-2300/lib/el8_aarch64_gcc11/libRecoTrackerMkFitCore.so
#6 0x0000400080075128 in mkfit::MkBuilder::findTracksCloneEngine(mkfit::SteeringParams::IterationType_e)::{lambda(int)#1}::operator()(int) const::{lambda(tbb::detail::d1::blocked_range<int> const&)#1}::operator()(tbb::detail::d1::blocked_range<int> const&) const [clone .lto_priv.0] () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02791/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_2_X_2023-06-25-2300/lib/el8_aarch64_gcc11/libRecoTrackerMkFitCore.so
#7 0x000040008007758c in tbb::detail::d1::start_for<tbb::detail::d1::blocked_range<int>, mkfit::MkBuilder::findTracksCloneEngine(mkfit::SteeringParams::IterationType_e)::{lambda(int)#1}::operator()(int) const::{lambda(tbb::detail::d1::blocked_range<int> const&)#1}, tbb::detail::d1::auto_partitioner const>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02791/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_2_X_2023-06-25-2300/lib/el8_aarch64_gcc11/libRecoTrackerMkFitCore.so
#8 0x0000400032d58774 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=0x40008f545000, this=0x400033c33d80) at /data/cmsbld/jenkins_b/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_aarch64_gcc11/external/tbb/v2021.9.0-d9b6b79f96fc04849cfc544d9852057d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#9 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x400033c33d80) at /data/cmsbld/jenkins_b/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_aarch64_gcc11/external/tbb/v2021.9.0-d9b6b79f96fc04849cfc544d9852057d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#10 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins_b/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_aarch64_gcc11/external/tbb/v2021.9.0-d9b6b79f96fc04849cfc544d9852057d/tbb-v2021.9.0/src/tbb/task_dispatcher.cpp:168
#11 0x0000400080079918 in void tbb::detail::d1::dynamic_grainsize_mode<tbb::detail::d1::adaptive_mode<tbb::detail::d1::auto_partition_type> >::work_balance<tbb::detail::d1::start_for<tbb::detail::d1::blocked_range<unsigned long>, tbb::detail::d2::parallel_for_body_wrapper<__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, mkfit::MkBuilder::findTracksCloneEngine(mkfit::SteeringParams::IterationType_e)::{lambda(int)#1}, int>, tbb::detail::d1::auto_partitioner const>, tbb::detail::d1::blocked_range<unsigned long> >(tbb::detail::d1::start_for<tbb::detail::d1::blocked_range<unsigned long>, tbb::detail::d2::parallel_for_body_wrapper<__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, mkfit::MkBuilder::findTracksCloneEngine(mkfit::SteeringParams::IterationType_e)::{lambda(int)#1}, int>, tbb::detail::d1::auto_partitioner const>&, tbb::detail::d1::blocked_range<unsigned long>&, tbb::detail::d1::execution_data&) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02791/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_2_X_2023-06-25-2300/lib/el8_aarch64_gcc11/libRecoTrackerMkFitCore.so
#12 0x0000400080079f6c in tbb::detail::d1::start_for<tbb::detail::d1::blocked_range<unsigned long>, tbb::detail::d2::parallel_for_body_wrapper<__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, mkfit::MkBuilder::findTracksCloneEngine(mkfit::SteeringParams::IterationType_e)::{lambda(int)#1}, int>, tbb::detail::d1::auto_partitioner const>::execute(tbb::detail::d1::execution_data&) [clone .lto_priv.0] () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02791/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_2_X_2023-06-25-2300/lib/el8_aarch64_gcc11/libRecoTrackerMkFitCore.so
#13 0x0000400032d58774 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=0x400033c3c400, this=0x400033c33d80) at /data/cmsbld/jenkins_b/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_aarch64_gcc11/external/tbb/v2021.9.0-d9b6b79f96fc04849cfc544d9852057d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#14 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x400033c33d80) at /data/cmsbld/jenkins_b/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_aarch64_gcc11/external/tbb/v2021.9.0-d9b6b79f96fc04849cfc544d9852057d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#15 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins_b/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_aarch64_gcc11/external/tbb/v2021.9.0-d9b6b79f96fc04849cfc544d9852057d/tbb-v2021.9.0/src/tbb/task_dispatcher.cpp:168
#16 0x000040008006d1bc in tbb::detail::d2::for_each_root_task<__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, mkfit::MkBuilder::findTracksCloneEngine(mkfit::SteeringParams::IterationType_e)::{lambda(int)#1}, int, std::random_access_iterator_tag>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02791/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_2_X_2023-06-25-2300/lib/el8_aarch64_gcc11/libRecoTrackerMkFitCore.so
#17 0x0000400032d58774 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=0xfffff1ae8200, this=0x400033c33d80) at /data/cmsbld/jenkins_b/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_aarch64_gcc11/external/tbb/v2021.9.0-d9b6b79f96fc04849cfc544d9852057d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#18 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x400033c33d80) at /data/cmsbld/jenkins_b/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_aarch64_gcc11/external/tbb/v2021.9.0-d9b6b79f96fc04849cfc544d9852057d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#19 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins_b/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_aarch64_gcc11/external/tbb/v2021.9.0-d9b6b79f96fc04849cfc544d9852057d/tbb-v2021.9.0/src/tbb/task_dispatcher.cpp:168
#20 0x000040008006d858 in void tbb::detail::d2::parallel_for_each<__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, mkfit::MkBuilder::findTracksCloneEngine(mkfit::SteeringParams::IterationType_e)::{lambda(int)#1}>(__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, mkfit::MkBuilder::findTracksCloneEngine(mkfit::SteeringParams::IterationType_e)::{lambda(int)#1} const&) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02791/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_2_X_2023-06-25-2300/lib/el8_aarch64_gcc11/libRecoTrackerMkFitCore.so
#21 0x000040008006f4a0 in mkfit::MkBuilder::findTracksCloneEngine(mkfit::SteeringParams::IterationType_e) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02791/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_2_X_2023-06-25-2300/lib/el8_aarch64_gcc11/libRecoTrackerMkFitCore.so
#22 0x000040007fffb940 in mkfit::run_OneIteration(mkfit::TrackerInfo const&, mkfit::IterationConfig const&, mkfit::EventOfHits const&, std::vector<std::vector<bool, std::allocator<bool> > const*, std::allocator<std::vector<bool, std::allocator<bool> > const*> > const&, mkfit::MkBuilder&, std::vector<mkfit::Track, std::allocator<mkfit::Track> >&, std::vector<mkfit::Track, std::allocator<mkfit::Track> >&, bool, bool, bool) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02791/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_2_X_2023-06-25-2300/lib/el8_aarch64_gcc11/libRecoTrackerMkFitCMS.so
#23 0x000040007ff69390 in tbb::detail::d1::task_arena_function<MkFitProducer::produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const::{lambda()#1}, void>::operator()() const () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02791/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_2_X_2023-06-25-2300/lib/el8_aarch64_gcc11/pluginRecoTrackerMkFitPlugins.so
#24 0x0000400032d516e4 in operator() (__closure=<optimized out>) at /data/cmsbld/jenkins_b/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_aarch64_gcc11/external/tbb/v2021.9.0-d9b6b79f96fc04849cfc544d9852057d/tbb-v2021.9.0/src/tbb/arena.cpp:757
#25 tbb::detail::d0::try_call_proxy<tbb::detail::r1::isolate_within_arena(tbb::detail::d1::delegate_base&, intptr_t)::<lambda()> >::on_completion<tbb::detail::r1::isolate_within_arena(tbb::detail::d1::delegate_base&, intptr_t)::<lambda()> > (on_completion_body=..., this=<optimized out>) at /data/cmsbld/jenkins_b/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_aarch64_gcc11/external/tbb/v2021.9.0-d9b6b79f96fc04849cfc544d9852057d/tbb-v2021.9.0/src/tbb/../../include/oneapi/tbb/detail/_template_helpers.h:230
#26 tbb::detail::r1::isolate_within_arena (d=..., isolation=<optimized out>) at /data/cmsbld/jenkins_b/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_aarch64_gcc11/external/tbb/v2021.9.0-d9b6b79f96fc04849cfc544d9852057d/tbb-v2021.9.0/src/tbb/arena.cpp:758
#27 0x000040007ff6e4ac in MkFitProducer::produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02791/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_2_X_2023-06-25-2300/lib/el8_aarch64_gcc11/pluginRecoTrackerMkFitPlugins.so
#28 0x00004000313003d0 in edm::global::EDProducerBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02791/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_2_X_2023-06-25-2300/lib/el8_aarch64_gcc11/libFWCoreFramework.so
#29 0x00004000312fa090 in edm::WorkerT<edm::global::EDProducerBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02791/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_2_X_2023-06-25-2300/lib/el8_aarch64_gcc11/libFWCoreFramework.so
#30 0x0000400031286e08 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02791/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_2_X_2023-06-25-2300/lib/el8_aarch64_gcc11/libFWCoreFramework.so
#31 0x0000400031287570 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02791/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_2_X_2023-06-25-2300/lib/el8_aarch64_gcc11/libFWCoreFramework.so
#32 0x0000400031825160 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02791/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_2_X_2023-06-25-2300/lib/el8_aarch64_gcc11/libFWCoreConcurrency.so
#33 0x0000400032d58774 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=0x4000ec159300, this=0x400033c33d80) at /data/cmsbld/jenkins_b/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_aarch64_gcc11/external/tbb/v2021.9.0-d9b6b79f96fc04849cfc544d9852057d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#34 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x400033c33d80) at /data/cmsbld/jenkins_b/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_aarch64_gcc11/external/tbb/v2021.9.0-d9b6b79f96fc04849cfc544d9852057d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#35 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins_b/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_aarch64_gcc11/external/tbb/v2021.9.0-d9b6b79f96fc04849cfc544d9852057d/tbb-v2021.9.0/src/tbb/task_dispatcher.cpp:168
#36 0x00004000311ff4d4 in edm::FinalWaitingTask::wait() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02791/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_2_X_2023-06-25-2300/lib/el8_aarch64_gcc11/libFWCoreFramework.so
#37 0x000040003120c698 in edm::EventProcessor::processRuns() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02791/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_2_X_2023-06-25-2300/lib/el8_aarch64_gcc11/libFWCoreFramework.so
#38 0x000040003120cc50 in edm::EventProcessor::runToCompletion() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02791/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_2_X_2023-06-25-2300/lib/el8_aarch64_gcc11/libFWCoreFramework.so
#39 0x000000000040b9c8 in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#40 0x0000400032d510a8 in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins_b/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_aarch64_gcc11/external/tbb/v2021.9.0-d9b6b79f96fc04849cfc544d9852057d/tbb-v2021.9.0/src/tbb/arena.cpp:688
#41 0x000000000040f74c in main::{lambda()#1}::operator()() const ()
#42 0x0000000000406fc4 in main ()
(The particular ARM CPU is a Cavium ThunderX, which, to my understanding, has an even more relaxed memory model than most other ARM CPUs.)
I noticed in the diff between 2021.8 and 2021.9 that many atomic loads and stores in _concurrent_queue_base.h were made more relaxed, and I am wondering whether any of those could play a role in these crashes.
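To illustrate the kind of ordering concern I have in mind, here is a minimal sketch using only the C++ standard library. It is not oneTBB's code: `Slot`, `produce`, `consume`, and `run_once` are hypothetical names made up for this example. A producer writes a plain payload and then publishes it through an atomic flag. The release store paired with the acquire load is what guarantees that the consumer sees the payload. If both were weakened to `std::memory_order_relaxed`, the read of `payload` would be a data race, and a weakly ordered CPU would be allowed to show the flag before the payload.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-in for a queue slot: plain data plus a publication flag.
// This is an illustration only, not how oneTBB lays out its queue pages.
struct Slot {
    int payload = 0;                 // non-atomic data written by the producer
    std::atomic<bool> ready{false};  // set once the payload is valid
};

// Producer: write the payload first, then publish it with a release store.
// The release store keeps the payload write ordered before the flag write.
void produce(Slot& s, int value) {
    s.payload = value;
    s.ready.store(true, std::memory_order_release);
}

// Consumer: spin until the flag is observed with an acquire load.
// The acquire load synchronizes with the release store above, so the
// payload read that follows is guaranteed to see the producer's write.
int consume(Slot& s) {
    while (!s.ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return s.payload;
}

// Run one producer/consumer handoff and return what the consumer observed.
int run_once(int value) {
    Slot s;
    int seen = -1;
    std::thread consumer([&] { seen = consume(s); });
    std::thread producer([&] { produce(s, value); });
    producer.join();
    consumer.join();
    return seen;
}
```

With the orderings as written, `run_once(42)` returns 42 on any conforming implementation. A relaxed variant may still appear to work on x86 because of its stronger hardware ordering, which is why a problem of this kind could surface only on ARM.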
That is definitely possible. Unfortunately, our tests don't reproduce this issue, even under Thread Sanitizer.
If it is a memory-ordering problem, we will not be able to pin it down from a crash stack alone.
Could you please rebuild oneTBB with -DTBB_SANITIZE=thread and run your application against that build? If there is a happens-before issue, Thread Sanitizer should detect it.
Never mind. I have now realized that all the concurrent_queue changes in 2021.9 came from #782, and that we had earlier used 2021.8 with #782 applied on top of it (for more than 3 months before updating to 2021.9). It therefore seems unlikely that the new crashes we've observed are caused by the changes in concurrent_queue specifically.