
Possible concurrency issue in concurrent_queue? #1139

Closed
makortel opened this issue Jun 27, 2023 · 2 comments

Comments

@makortel

Hi,

After we updated oneTBB from 2021.8.0 to 2021.9.0, we started to see occasional segfaults in concurrent_queue on ARM. The stack trace of the segfault looks like this:

#4  <signal handler called>
#5  0x000040008006b318 in void tbb::detail::d2::concurrent_queue<mkfit::MkFinder*, tbb::detail::d1::cache_aligned_allocator<mkfit::MkFinder*> >::internal_push<mkfit::MkFinder* const&>(mkfit::MkFinder* const&) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02791/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_2_X_2023-06-25-2300/lib/el8_aarch64_gcc11/libRecoTrackerMkFitCore.so
#6  0x0000400080075128 in mkfit::MkBuilder::findTracksCloneEngine(mkfit::SteeringParams::IterationType_e)::{lambda(int)#1}::operator()(int) const::{lambda(tbb::detail::d1::blocked_range<int> const&)#1}::operator()(tbb::detail::d1::blocked_range<int> const&) const [clone .lto_priv.0] () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02791/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_2_X_2023-06-25-2300/lib/el8_aarch64_gcc11/libRecoTrackerMkFitCore.so
#7  0x000040008007758c in tbb::detail::d1::start_for<tbb::detail::d1::blocked_range<int>, mkfit::MkBuilder::findTracksCloneEngine(mkfit::SteeringParams::IterationType_e)::{lambda(int)#1}::operator()(int) const::{lambda(tbb::detail::d1::blocked_range<int> const&)#1}, tbb::detail::d1::auto_partitioner const>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02791/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_2_X_2023-06-25-2300/lib/el8_aarch64_gcc11/libRecoTrackerMkFitCore.so
#8  0x0000400032d58774 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=0x40008f545000, this=0x400033c33d80) at /data/cmsbld/jenkins_b/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_aarch64_gcc11/external/tbb/v2021.9.0-d9b6b79f96fc04849cfc544d9852057d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#9  tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x400033c33d80) at /data/cmsbld/jenkins_b/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_aarch64_gcc11/external/tbb/v2021.9.0-d9b6b79f96fc04849cfc544d9852057d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#10 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins_b/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_aarch64_gcc11/external/tbb/v2021.9.0-d9b6b79f96fc04849cfc544d9852057d/tbb-v2021.9.0/src/tbb/task_dispatcher.cpp:168
#11 0x0000400080079918 in void tbb::detail::d1::dynamic_grainsize_mode<tbb::detail::d1::adaptive_mode<tbb::detail::d1::auto_partition_type> >::work_balance<tbb::detail::d1::start_for<tbb::detail::d1::blocked_range<unsigned long>, tbb::detail::d2::parallel_for_body_wrapper<__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, mkfit::MkBuilder::findTracksCloneEngine(mkfit::SteeringParams::IterationType_e)::{lambda(int)#1}, int>, tbb::detail::d1::auto_partitioner const>, tbb::detail::d1::blocked_range<unsigned long> >(tbb::detail::d1::start_for<tbb::detail::d1::blocked_range<unsigned long>, tbb::detail::d2::parallel_for_body_wrapper<__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, mkfit::MkBuilder::findTracksCloneEngine(mkfit::SteeringParams::IterationType_e)::{lambda(int)#1}, int>, tbb::detail::d1::auto_partitioner const>&, tbb::detail::d1::blocked_range<unsigned long>&, tbb::detail::d1::execution_data&) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02791/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_2_X_2023-06-25-2300/lib/el8_aarch64_gcc11/libRecoTrackerMkFitCore.so
#12 0x0000400080079f6c in tbb::detail::d1::start_for<tbb::detail::d1::blocked_range<unsigned long>, tbb::detail::d2::parallel_for_body_wrapper<__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, mkfit::MkBuilder::findTracksCloneEngine(mkfit::SteeringParams::IterationType_e)::{lambda(int)#1}, int>, tbb::detail::d1::auto_partitioner const>::execute(tbb::detail::d1::execution_data&) [clone .lto_priv.0] () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02791/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_2_X_2023-06-25-2300/lib/el8_aarch64_gcc11/libRecoTrackerMkFitCore.so
#13 0x0000400032d58774 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=0x400033c3c400, this=0x400033c33d80) at /data/cmsbld/jenkins_b/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_aarch64_gcc11/external/tbb/v2021.9.0-d9b6b79f96fc04849cfc544d9852057d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#14 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x400033c33d80) at /data/cmsbld/jenkins_b/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_aarch64_gcc11/external/tbb/v2021.9.0-d9b6b79f96fc04849cfc544d9852057d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#15 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins_b/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_aarch64_gcc11/external/tbb/v2021.9.0-d9b6b79f96fc04849cfc544d9852057d/tbb-v2021.9.0/src/tbb/task_dispatcher.cpp:168
#16 0x000040008006d1bc in tbb::detail::d2::for_each_root_task<__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, mkfit::MkBuilder::findTracksCloneEngine(mkfit::SteeringParams::IterationType_e)::{lambda(int)#1}, int, std::random_access_iterator_tag>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02791/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_2_X_2023-06-25-2300/lib/el8_aarch64_gcc11/libRecoTrackerMkFitCore.so
#17 0x0000400032d58774 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=0xfffff1ae8200, this=0x400033c33d80) at /data/cmsbld/jenkins_b/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_aarch64_gcc11/external/tbb/v2021.9.0-d9b6b79f96fc04849cfc544d9852057d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#18 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x400033c33d80) at /data/cmsbld/jenkins_b/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_aarch64_gcc11/external/tbb/v2021.9.0-d9b6b79f96fc04849cfc544d9852057d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#19 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins_b/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_aarch64_gcc11/external/tbb/v2021.9.0-d9b6b79f96fc04849cfc544d9852057d/tbb-v2021.9.0/src/tbb/task_dispatcher.cpp:168
#20 0x000040008006d858 in void tbb::detail::d2::parallel_for_each<__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, mkfit::MkBuilder::findTracksCloneEngine(mkfit::SteeringParams::IterationType_e)::{lambda(int)#1}>(__gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int const*, std::vector<int, std::allocator<int> > >, mkfit::MkBuilder::findTracksCloneEngine(mkfit::SteeringParams::IterationType_e)::{lambda(int)#1} const&) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02791/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_2_X_2023-06-25-2300/lib/el8_aarch64_gcc11/libRecoTrackerMkFitCore.so
#21 0x000040008006f4a0 in mkfit::MkBuilder::findTracksCloneEngine(mkfit::SteeringParams::IterationType_e) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02791/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_2_X_2023-06-25-2300/lib/el8_aarch64_gcc11/libRecoTrackerMkFitCore.so
#22 0x000040007fffb940 in mkfit::run_OneIteration(mkfit::TrackerInfo const&, mkfit::IterationConfig const&, mkfit::EventOfHits const&, std::vector<std::vector<bool, std::allocator<bool> > const*, std::allocator<std::vector<bool, std::allocator<bool> > const*> > const&, mkfit::MkBuilder&, std::vector<mkfit::Track, std::allocator<mkfit::Track> >&, std::vector<mkfit::Track, std::allocator<mkfit::Track> >&, bool, bool, bool) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02791/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_2_X_2023-06-25-2300/lib/el8_aarch64_gcc11/libRecoTrackerMkFitCMS.so
#23 0x000040007ff69390 in tbb::detail::d1::task_arena_function<MkFitProducer::produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const::{lambda()#1}, void>::operator()() const () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02791/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_2_X_2023-06-25-2300/lib/el8_aarch64_gcc11/pluginRecoTrackerMkFitPlugins.so
#24 0x0000400032d516e4 in operator() (__closure=<optimized out>) at /data/cmsbld/jenkins_b/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_aarch64_gcc11/external/tbb/v2021.9.0-d9b6b79f96fc04849cfc544d9852057d/tbb-v2021.9.0/src/tbb/arena.cpp:757
#25 tbb::detail::d0::try_call_proxy<tbb::detail::r1::isolate_within_arena(tbb::detail::d1::delegate_base&, intptr_t)::<lambda()> >::on_completion<tbb::detail::r1::isolate_within_arena(tbb::detail::d1::delegate_base&, intptr_t)::<lambda()> > (on_completion_body=..., this=<optimized out>) at /data/cmsbld/jenkins_b/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_aarch64_gcc11/external/tbb/v2021.9.0-d9b6b79f96fc04849cfc544d9852057d/tbb-v2021.9.0/src/tbb/../../include/oneapi/tbb/detail/_template_helpers.h:230
#26 tbb::detail::r1::isolate_within_arena (d=..., isolation=<optimized out>) at /data/cmsbld/jenkins_b/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_aarch64_gcc11/external/tbb/v2021.9.0-d9b6b79f96fc04849cfc544d9852057d/tbb-v2021.9.0/src/tbb/arena.cpp:758
#27 0x000040007ff6e4ac in MkFitProducer::produce(edm::StreamID, edm::Event&, edm::EventSetup const&) const () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02791/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_2_X_2023-06-25-2300/lib/el8_aarch64_gcc11/pluginRecoTrackerMkFitPlugins.so
#28 0x00004000313003d0 in edm::global::EDProducerBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02791/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_2_X_2023-06-25-2300/lib/el8_aarch64_gcc11/libFWCoreFramework.so
#29 0x00004000312fa090 in edm::WorkerT<edm::global::EDProducerBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02791/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_2_X_2023-06-25-2300/lib/el8_aarch64_gcc11/libFWCoreFramework.so
#30 0x0000400031286e08 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02791/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_2_X_2023-06-25-2300/lib/el8_aarch64_gcc11/libFWCoreFramework.so
#31 0x0000400031287570 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02791/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_2_X_2023-06-25-2300/lib/el8_aarch64_gcc11/libFWCoreFramework.so
#32 0x0000400031825160 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02791/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_2_X_2023-06-25-2300/lib/el8_aarch64_gcc11/libFWCoreConcurrency.so
#33 0x0000400032d58774 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=0x4000ec159300, this=0x400033c33d80) at /data/cmsbld/jenkins_b/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_aarch64_gcc11/external/tbb/v2021.9.0-d9b6b79f96fc04849cfc544d9852057d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#34 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x400033c33d80) at /data/cmsbld/jenkins_b/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_aarch64_gcc11/external/tbb/v2021.9.0-d9b6b79f96fc04849cfc544d9852057d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#35 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins_b/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_aarch64_gcc11/external/tbb/v2021.9.0-d9b6b79f96fc04849cfc544d9852057d/tbb-v2021.9.0/src/tbb/task_dispatcher.cpp:168
#36 0x00004000311ff4d4 in edm::FinalWaitingTask::wait() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02791/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_2_X_2023-06-25-2300/lib/el8_aarch64_gcc11/libFWCoreFramework.so
#37 0x000040003120c698 in edm::EventProcessor::processRuns() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02791/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_2_X_2023-06-25-2300/lib/el8_aarch64_gcc11/libFWCoreFramework.so
#38 0x000040003120cc50 in edm::EventProcessor::runToCompletion() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02791/el8_aarch64_gcc11/cms/cmssw/CMSSW_13_2_X_2023-06-25-2300/lib/el8_aarch64_gcc11/libFWCoreFramework.so
#39 0x000000000040b9c8 in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#40 0x0000400032d510a8 in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins_b/workspace/jenkins-test-bootstrap/toolconf/BUILD/el8_aarch64_gcc11/external/tbb/v2021.9.0-d9b6b79f96fc04849cfc544d9852057d/tbb-v2021.9.0/src/tbb/arena.cpp:688
#41 0x000000000040f74c in main::{lambda()#1}::operator()() const ()
#42 0x0000000000406fc4 in main ()

(The particular ARM CPU is a Cavium ThunderX, which, to my understanding, has an even more relaxed memory model than most other ARM CPUs.)

I noticed from the diff between 2021.8 and 2021.9 that many atomic loads and stores in _concurrent_queue_base.h were made more relaxed, and am wondering if any of those could play a role in these crashes?
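To illustrate the kind of ordering such changes can affect, here is a minimal sketch of a release/acquire hand-off using only the C++ standard library. It is not oneTBB's code: `Slot`, `handoff_once`, and `count_mismatches` are hypothetical names invented for this example. The producer writes a plain payload and then publishes it through an atomic flag; the consumer waits on the flag before reading the payload.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-in for a queue slot; NOT oneTBB's actual data layout.
struct Slot {
    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};  // publication flag
};

// The producer writes the payload, then publishes it with a release store.
// The consumer spins on an acquire load, so the payload write
// happens-before the payload read. Returns the value the consumer saw.
int handoff_once(int value) {
    Slot slot;
    int seen = -1;
    std::thread consumer([&] {
        while (!slot.ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = slot.payload;  // safe: ordered after the producer's write
    });
    std::thread producer([&] {
        slot.payload = value;
        slot.ready.store(true, std::memory_order_release);
    });
    producer.join();
    consumer.join();  // join() makes `seen` visible to the caller
    return seen;
}

// Repeats the hand-off and counts how often the consumer saw a wrong value.
int count_mismatches(int iterations) {
    int mismatches = 0;
    for (int i = 0; i < iterations; ++i) {
        if (handoff_once(i) != i) ++mismatches;
    }
    return mismatches;
}
```

If both the store and the load on `ready` were weakened to `std::memory_order_relaxed`, the access to `payload` would become a data race. x86 hardware tends to hide that kind of mistake, whereas weakly ordered ARM cores can expose it, which is why a bug of this class might only show up after moving to aarch64.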

@pavelkumbrasev
Contributor

pavelkumbrasev commented Jun 27, 2023

That is definitely possible. Unfortunately, our tests don't reproduce this issue, even under Thread Sanitizer.
If it is a memory-ordering problem, we will not be able to pin it down from a crash stack alone.
Could you please rebuild oneTBB with -DTBB_SANITIZE=thread and run your application against that build? If there is a happens-before violation, Thread Sanitizer should detect it.

@makortel
Author

Never mind. I have now realized that all the concurrent_queue changes in 2021.9 came from #782, and that we had already been running 2021.8 with #782 applied on top of it for more than 3 months before updating to 2021.9. It therefore seems unlikely that the new crashes we've observed are caused by the concurrent_queue changes specifically.

Sorry for the noise.
