-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Crash in egHLT::trigTools::getFiltersPassed CMSSW_8_1_THREADED_X_2016-06-12-2300 #14897
Comments
A new Issue was created by @bouril Dimitri Bourilkov. @davidlange6, @smuzaffar, @Degano, @davidlt, @Dr15Jones can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
@dan131riley This appears to have happen again but this time two threads in the same job crashed ! That's a new one to me: https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/slc6_amd64_gcc530/CMSSW_8_1_THREADED_X_2016-06-28-2300/pyRelValMatrixLogs/run/250205.0_ZTT_13+ZTT_13INPUT+DIGIPRMXUP15_PU25+RECOPRMXUP15_PU25+HARVESTUP15_PU25/step3_ZTT_13+ZTT_13INPUT+DIGIPRMXUP15_PU25+RECOPRMXUP15_PU25+HARVESTUP15_PU25.log
|
Something I had missed before is the crash happens in |
So I think I figured out why line 56 is size_t filterNrInEvt = trigEvt->filterIndex(edm::InputTag(filters[filterNrInVec],"",hltTag).encode()); What this line does is create a temporary Although this is just wasting CPU I don't see why it would lead to a crash. |
I got rid of the unnecessary conversions #15051 |
assign core |
New categories assigned: core @Dr15Jones,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks |
I'm having a hard time imagining how edm::InputTag().encode() produces a string that crashes edm::InputTag(). Also odd that in both the latest two crashes another thread is in basically the same spot deep inside TotemRPGeometryESModule. Will see if these reproduce... |
I agree about |
I noticed that there are two different crashes, one is |
So far I've completely failed to reproduce this one, even running hundreds of (automated) trials of three different workflows, so it seems like I'm missing some necessary condition. Both crashes are in calls to TriggerEvent::filterIndex(), which converts the InputTag back to a string:
so the net effect (without fixes) is something like
and InputTag(string) does a bunch more string manipulations (I'm thinking about a test making InputTag(string) explicit just to see how much we're using it in implicit conversions). There's a lot of memory allocations and frees, and I'm guessing maybe some of those are hitting mutexes in jemalloc. So I've been thinking about how mutexes can open up race windows. What I mean is that with something like:
in practical terms there's no race condition on gMinuit, it'll be in a register so in our use it is only a theoretical race--the practical race is on corrupting the specials list with simultaneous updates. Add a mutex:
and that fixes the race updating the specials lists, but it also turns the gMinuit race from theoretical to real with a huge window--the threads pile up at the mutex and it forces a memory read of gMinuit afterward. My hunch is that there's some pattern of allocations that's hitting a mutex in the allocator, some memory overwrite is happening while threads are sitting on that mutex, the crash happens after return from the allocator, and that synchronization point is how we can end up with two threads crashing at the same spot at the same time. I've tried inserting nano-sleeps into InputTag(string) to widen the window, but that doesn't do it, will have to try with a mutex and a random sleep so the threads are synced. There might be some utility to an LD_PRELOAD that adds random delays to every mutex acquisition, or some other way of adding artificial jitter. I know of one profiling tool that slows down sections of code for profiling purposes (https://github.com/plasma-umass/coz), but haven't found any debugging tools that systematically introduce synchronization jitter. Or I could be off in completely the wrong direction... -dan |
Seems like a reasonable guess to me. The idea of the mutex timing causing the appearance of synchronization like behavior is really interesting. |
with the new stack-trace-pause handlers. Two threads are in tokenize, three are in InputTag, all four are in egHLT::trigTools::getFiltersPassed. Could there be an issue with tokenize?
|
This code has a weird behavior. The line size_t filterNrInEvt = trigEvt->filterIndex(edm::InputTag(filters[filterNrInVec].first,"",hltTag).encode()); creates a temporary I'm going to create a pull request which removes the call to |
@dan131riley Do you want to see if there is a problem in tokenizer? |
Now reading the whole thread, I see that I had fixed other |
I'll take another look at it--I already ran through the bizarre chain of construct/tokenize/construct once and didn't spot anything, but could have missed something. I had also started on tracking down all the implicit uses of the InputTag(string) constructor (by making it explicit), should finish that, there's a few other places where we're calling it without any reason. |
although that should be safe, maybe there is a problem with the code the compiler generates? Also, there is no reason to have that as a function local |
Make it file scope and we can use it here too: http://cmslxr.fnal.gov/source/FWCore/Utilities/src/InputTag.cc#0054 std::vectorstd::string tokens = tokenize(s, ":"); args to tokenize are (const string& input, const string& separator) and then it immediately does separator.c_str(), so we're doing string(":").c_str()... |
This one is very interesting--two threads in EgHLTOfflineSource, one in TrackingMonitor (our other favorite module, #15145) and one just having left it. The actual crash is in jemalloc, underneath edm::InputTag::encode(). This adds some support to the theory that something else is corrupting the memory allocator (which EgHLTOfflineSource vigorously exercises with its InputTag thrashing), and raises the possibility that the something else could be TrackingMonitor.
|
Note that this crash happened on record 8, while #15145 seems to be an initialization bug, so maybe different bugs but possibly both in TrackingMonitor? Re-running memcheck on this wf... |
I think we have another piece of the puzzle here. The Address Santitizer found the following:
The point of code is here
Notice the fixed sized array without any bounds checking. If we wrote passed the end we'd be doing bad stuff to the stack which could cause all kinds of chaos. |
Good find. I would bet on strtok() being the cause of the crash. It writes null characters into the saved string, and the corruption looks like thread 0 writing a null into a localDescript that was on the stack of thread 2. Good to get both problems fixed, I'll bet it finally resolves this issue. |
@Dr15Jones , should we close this now or you would like to keep it open? |
+1 |
This issue is fully signed and ready to be closed. |
Let's go ahead and close |
Build CMSSW_8_1_THREADED_X_2016-06-12-2300
1335.0 Wjet_Pt_80_120_13
egHLT::trigTools::getFiltersPassed
https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/slc6_amd64_gcc530/CMSSW_8_1_THREADED_X_2016-06-12-2300/pyRelValMatrixLogs/run/1335.0_Wjet_Pt_80_120_13+Wjet_Pt_80_120_13INPUT+DIGIUP15+RECOUP15+HARVESTUP15/step3_Wjet_Pt_80_120_13+Wjet_Pt_80_120_13INPUT+DIGIUP15+RECOUP15+HARVESTUP15.log
Thread 3 (Thread 0x7f20409fe700 (LWP 8330)):
#0 0x0000003fccadf283 in poll () from /lib64/libc.so.6
#1 0x00007f2150dd9d74 in full_read.constprop () from /cvmfs/cms-ib.cern.ch/week1/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-06-12-2300/lib/slc6_amd64_gcc530/pluginFWCoreServicesPlugins.so
#2 0x00007f2150dd9fda in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms-ib.cern.ch/week1/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-06-12-2300/lib/slc6_amd64_gcc530/pluginFWCoreServicesPlugins.so
#3 0x00007f2150dda15b in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/week1/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-06-12-2300/lib/slc6_amd64_gcc530/pluginFWCoreServicesPlugins.so
#4
#5 0x00007f2156bea1f2 in edm::InputTag::InputTag(std::string const&) () from /cvmfs/cms-ib.cern.ch/week1/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-06-12-2300/lib/slc6_amd64_gcc530/libFWCoreUtilities.so
#6 0x00007f210c80500b in egHLT::trigTools::getFiltersPassed(std::vector<std::pair<std::string, int>, std::allocator<std::pair<std::string, int> > > const&, trigger::TriggerEvent const*, std::string const&, egHLT::TrigCodes const&) () from /cvmfs/cms-ib.cern.ch/week1/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-06-12-2300/lib/slc6_amd64_gcc530/pluginDQMOfflineTrigger.so
#7 0x00007f210c718dc5 in egHLT::OffHelper::setTrigInfo(edm::Event const&, egHLT::OffEvt&, egHLT::TrigCodes const&) () from /cvmfs/cms-ib.cern.ch/week1/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-06-12-2300/lib/slc6_amd64_gcc530/pluginDQMOfflineTrigger.so
#8 0x00007f210c71cc19 in egHLT::OffHelper::makeOffEvt(edm::Event const&, edm::EventSetup const&, egHLT::OffEvt&, egHLT::TrigCodes const&) () from /cvmfs/cms-ib.cern.ch/week1/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-06-12-2300/lib/slc6_amd64_gcc530/pluginDQMOfflineTrigger.so
#9 0x00007f210c80b938 in EgHLTOfflineSource::analyze(edm::Event const&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/week1/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-06-12-2300/lib/slc6_amd64_gcc530/pluginDQMOfflineTrigger.so
#10 0x00007f2156fa9e9a in edm::stream::EDAnalyzerAdaptorBase::doEvent(edm::EventPrincipal const&, edm::EventSetup const&, edm::ActivityRegistry_, edm::ModuleCallingContext const_) () from /cvmfs/cms-ib.cern.ch/week1/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-06-12-2300/lib/slc6_amd64_gcc530/libFWCoreFramework.so
#11 0x00007f2156ef078f in edm::WorkerTedm::stream::EDAnalyzerAdaptorBase::implDo(edm::EventPrincipal const&, edm::EventSetup const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/week1/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-06-12-2300/lib/slc6_amd64_gcc530/libFWCoreFramework.so
#12 0x00007f2156ea450c in decltype ({parm#1}()) edm::convertException::wrap<bool edm::Worker::doWork<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetup const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const_)::{lambda()#1}>(bool edm::Worker::doWork<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetup const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const_)::{lambda()#1}) () from /cvmfs/cms-ib.cern.ch/week1/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-06-12-2300/lib/slc6_amd64_gcc530/libFWCoreFramework.so
#13 0x00007f2156ea46d2 in bool edm::Worker::doWork<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetup const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms-ib.cern.ch/week1/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-06-12-2300/lib/slc6_amd64_gcc530/libFWCoreFramework.so
#14 0x00007f2156ecff5e in decltype ({parm#1}()) edm::convertException::wrap<void edm::Path::processOneOccurrence<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetup const&, edm::StreamID const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const_)::{lambda()#1}>(void edm::Path::processOneOccurrence<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetup const&, edm::StreamID const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const_)::{lambda()#1}) () from /cvmfs/cms-ib.cern.ch/week1/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-06-12-2300/lib/slc6_amd64_gcc530/libFWCoreFramework.so
#15 0x00007f2156ed0212 in void edm::Path::processOneOccurrence<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetup const&, edm::StreamID const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms-ib.cern.ch/week1/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-06-12-2300/lib/slc6_amd64_gcc530/libFWCoreFramework.so
#16 0x00007f2156ed0635 in void edm::StreamSchedule::processOneEvent<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal&, edm::EventSetup const&, bool)::{lambda()#1}::operator()() const () from /cvmfs/cms-ib.cern.ch/week1/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-06-12-2300/lib/slc6_amd64_gcc530/libFWCoreFramework.so
#17 0x00007f2156ed092b in decltype ({parm#1}()) edm::convertException::wrap<void edm::StreamSchedule::processOneEvent<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal&, edm::EventSetup const&, bool)::{lambda()#1}>(void edm::StreamSchedule::processOneEvent<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal&, edm::EventSetup const&, bool)::{lambda()#1}) () from /cvmfs/cms-ib.cern.ch/week1/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-06-12-2300/lib/slc6_amd64_gcc530/libFWCoreFramework.so
#18 0x00007f2156ed0b9f in void edm::StreamSchedule::processOneEvent<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal&, edm::EventSetup const&, bool) () from /cvmfs/cms-ib.cern.ch/week1/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-06-12-2300/lib/slc6_amd64_gcc530/libFWCoreFramework.so
#19 0x00007f2156f47122 in edm::EventProcessor::processEvent(unsigned int) () from /cvmfs/cms-ib.cern.ch/week1/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-06-12-2300/lib/slc6_amd64_gcc530/libFWCoreFramework.so
#20 0x00007f2156f47857 in edm::EventProcessor::processEventsForStreamAsync(unsigned int, std::atomic*) () from /cvmfs/cms-ib.cern.ch/week1/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-06-12-2300/lib/slc6_amd64_gcc530/libFWCoreFramework.so
#21 0x00007f2156f4e104 in edm::StreamProcessingTask::execute() () from /cvmfs/cms-ib.cern.ch/week1/slc6_amd64_gcc530/cms/cmssw/CMSSW_8_1_THREADED_X_2016-06-12-2300/lib/slc6_amd64_gcc530/libFWCoreFramework.so
#22 0x00007f2155c8e9b5 in tbb::internal::custom_schedulertbb::internal::IntelSchedulerTraits::local_wait_for_all (this=0x7f21324a7e00, parent=..., child=) at ../../src/tbb/custom_scheduler.h:463
#23 0x00007f2155c881a4 in tbb::internal::arena::process (this=this@entry=0x7f21547df780, s=...) at ../../src/tbb/arena.cpp:156
#24 0x00007f2155c87264 in tbb::internal::market::process (this=0x7f2154807c00, j=...) at ../../src/tbb/market.cpp:502
#25 0x00007f2155c83736 in tbb::internal::rml::private_worker::run (this=0x7f2154626180) at ../../src/tbb/private_server.cpp:275
#26 0x00007f2155c83989 in tbb::internal::rml::private_worker::thread_routine (arg=) at ../../src/tbb/private_server.cpp:228
#27 0x0000003fcce07aa1 in start_thread () from /lib64/libpthread.so.0
#28 0x0000003fccae8aad in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f203feff700 (LWP 8331)):
The text was updated successfully, but these errors were encountered: