random errors in fastsim addons #24051
A new Issue was created by @davidlange6 David Lange. @davidlange6, @Dr15Jones, @smuzaffar, @fabiocos, @kpedro88 can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign fastsim |
New categories assigned: fastsim @mdhildreth,@ssekmen,@lveldere,@civanch you have been requested to review this Pull request/Issue and eventually sign? Thanks |
Indeed, I saw it once in one of my PRs (#23703 (comment)) and then it went away. I ran valgrind, but didn't find any memory corruption, just a leak (see #23795). |
Looks like this has been fixed. We have not seen such random errors in PR tests. |
Here is another similar crash from #31245 (comment) in
@smuzaffar Should we consider reopening the issue? |
Here is a similar crash from cms-sw/cmsdist#6343 (comment) in
|
#32152 shows similar stack trace inside |
Here is a similar crash from #32782 (comment) in
|
Here is a similar crash from #36100 (comment) in
|
(@smuzaffar Any idea why the |
@makortel , I think I know the reason but in order to confirm I need to remove your comment |
@makortel , the old bot, which did not keep track of L2 tenures, removed the label because @kpedro88's L2 tenure ended on 1st Sep 2021. At that point the old bot did not recognize #24051 (comment), and it removed the label when #24051 (comment) was added on 21st Nov 2021. The new bot properly keeps track of L2 tenures, treats #24051 (comment) as a valid comment, and keeps the label. |
Thanks @smuzaffar for the forensic analysis :) |
Just to add, I came across this old issue because of an e-mail from @sarafiorendi saying that similar errors are apparently happening in production. |
Hi, yes indeed, I'm running into issues when running the LHEGS step of some FastSim samples. The crashes also occur when the job is executed locally, and not always at the same event. They can be "reproduced" with the setup/running information from [3] or [4]. The same samples were generated successfully in the FullSim campaign, so I tend to think it's something FastSim-specific. [1] Module: FastSimProducer:fastSimProducer (crashed) [2] Module: FastSimProducer:fastSimProducer (crashed) [3] https://cms-pdmv.cern.ch/mcm/public/restapi/requests/get_test/SUS-RunIISpring22UL18FSwmLHEGSPremix-00005 |
Let's see if tagging @cms-sw/fastsim-l2 @cms-sw/trk-dpg-l2 @cms-sw/geometry-l2 helps move this forward |
This issue was discussed at the SIM meeting [1], and a likely cause is the increased memory consumption of FastSim. We've seen marginally worsening RSS over time, particularly when running with multi-threading [2]. Kevin Pedro recommended running Valgrind on this recipe to look for memory leaks or other problems. It was also suggested at one point to create a dedicated issue for the non-optimal memory topic. I wonder if there's an experienced person (or failing that, a twiki with Valgrind/cmssw or VTune documentation) to help try to solve this. [1] https://indico.cern.ch/event/1236460/contributions/5231552/attachments/2579663/4448912/FastSimNewsJan2023.pdf |
I can't really think of how memory exhaustion could lead to a segmentation fault. Typical symptoms for memory exhaustion are |
Crash in #41282 (comment) in
|
When you start executing this script, it repeatedly crashes after this. The sample request fails validation and has been stuck for several months. |
I'm not able to reproduce. My long test actually got stuck(?) in Pythia8 in event 6001 (or rather, I stopped the test after 9 hours within the event; the last stack trace was
@vhegde91 Can you give pointers to logs of the crashes or copy the stack trace of a crash here? |
@makortel , here is the log that I got after running on lxplus727: https://vhegde.web.cern.ch/vhegde/MCsampleTests/lxplus727_T5WG_SUS-RunIISpring21UL16FSGSPremixLLPBugFix-00031_v1.log This is what I copied from the terminal print out. (I did not include the first few lines). |
Thanks, the stack trace is indeed related
|
Hi all, |
Hi @ALL I would like to investigate this further but I am not able to reproduce the crash - I'm working on lxplus727 and I can run the script, the last command being
and it processes all 5k events. Is the issue that requests are stuck in an official test, or was this a private unit test that crashed after 739 events? Any tips on how I could reproduce the crash would be helpful. |
Hi @davidlange6, I think this can be closed, since I believe the problem was solved. |
Could you then sign the issue? |
+1 |
This issue is fully signed and ready to be closed. |
@cmsbuild, please close |
I've seen this failure a few times in PR tests:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-24037/29441/addOnTests/fastsim1/cmsDriver.py_TTbar_13TeV_TuneCUETP8M1_cfi_--conditions_auto:run2_mc_l1stage1_--fast__-n_100_--eventcontent_AODSIM,DQM_--relval_100000,1000_-s_GEN,SIM,.log
Thread 4 (Thread 0x7fc81b7fe700 (LWP 15838)):
#0 0x0000003752adf403 in poll () from /lib64/libc.so.6
#1 0x00007fc883259fe7 in full_read.constprop () from /cvmfs/cms-ib.cern.ch/nweek-02534/slc6_amd64_gcc700/cms/cmssw/CMSSW_10_3_X_2018-07-24-1100/lib/slc6_amd64_gcc700/pluginFWCoreServicesPlugins.so
#2 0x00007fc88325a67c in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms-ib.cern.ch/nweek-02534/slc6_amd64_gcc700/cms/cmssw/CMSSW_10_3_X_2018-07-24-1100/lib/slc6_amd64_gcc700/pluginFWCoreServicesPlugins.so
#3 0x00007fc88325b6e9 in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/nweek-02534/slc6_amd64_gcc700/cms/cmssw/CMSSW_10_3_X_2018-07-24-1100/lib/slc6_amd64_gcc700/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007fc86d7c468a in TBLayer::groupedCompatibleDetsV(TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&, std::vector<DetGroup, std::allocator<DetGroup> >&) const () from /cvmfs/cms-ib.cern.ch/nweek-02534/slc6_amd64_gcc700/cms/cmssw/CMSSW_10_3_X_2018-07-24-1100/lib/slc6_amd64_gcc700/libRecoTrackerTkDetLayers.so
#6 0x00007fc86d735a94 in GeometricSearchDet::compatibleDetsV(TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&, std::vector<std::pair<GeomDet const*, TrajectoryStateOnSurface>, std::allocator<std::pair<GeomDet const*, TrajectoryStateOnSurface> > >&) const () from /cvmfs/cms-ib.cern.ch/nweek-02534/slc6_amd64_gcc700/cms/cmssw/CMSSW_10_3_X_2018-07-24-1100/lib/slc6_amd64_gcc700/libTrackingToolsDetLayers.so
#7 0x00007fc86d735a15 in GeometricSearchDet::compatibleDets(TrajectoryStateOnSurface const&, Propagator const&, MeasurementEstimator const&) const () from /cvmfs/cms-ib.cern.ch/nweek-02534/slc6_amd64_gcc700/cms/cmssw/CMSSW_10_3_X_2018-07-24-1100/lib/slc6_amd64_gcc700/libTrackingToolsDetLayers.so
#8 0x00007fc86a9d957a in fastsim::TrackerSimHitProducer::interact(fastsim::Particle&, fastsim::SimplifiedGeometry const&, std::vector<std::unique_ptr<fastsim::Particle, std::default_delete<fastsim::Particle> >, std::allocator<std::unique_ptr<fastsim::Particle, std::default_delete<fastsim::Particle> > > >&, RandomEngineAndDistribution const&) () from /cvmfs/cms-ib.cern.ch/nweek-02534/slc6_amd64_gcc700/cms/cmssw/CMSSW_10_3_X_2018-07-24-1100/lib/slc6_amd64_gcc700/pluginFastSimulationSimplifiedGeometryPropagatorAuto.so
#9 0x00007fc86a9f0f29 in FastSimProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/nweek-02534/slc6_amd64_gcc700/cms/cmssw/CMSSW_10_3_X_2018-07-24-1100/lib/slc6_amd64_gcc700/pluginFastSimulationSimplifiedGeometryPropagatorAuto.so
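For anyone comparing the traces collected in this thread, here is a minimal Python sketch that pulls the first frame below the signal handler out of a gdb-style trace like the one above. It is not part of CMSSW or cms-bot. The names `parse_frames` and `first_user_frame` are made up for illustration, the `SAMPLE` text is abbreviated (shortened library paths and argument list), and the sketch assumes the handler frame is called `sig_dostack_then_abort`, as it is in the trace above.

```python
import re

# gdb-style frame: "#5 0x00007f... in Symbol(args) const () from /path/to/lib.so"
# The address, symbol, and "from <library>" parts are all optional.
FRAME_RE = re.compile(
    r"^#(\d+)\s*(?:0x[0-9a-fA-F]+\s+in\s+)?(.*?)(?:\s+from\s+(\S+))?$"
)

# Abbreviated excerpt for illustration only: paths and arguments are shortened.
SAMPLE = """\
#2 0x00007fc88325a67c in edm::service::InitRootHandlers::stacktraceFromThread() () from /path/to/pluginFWCoreServicesPlugins.so
#3 0x00007fc88325b6e9 in sig_dostack_then_abort () from /path/to/pluginFWCoreServicesPlugins.so
#4
#5 0x00007fc86d7c468a in TBLayer::groupedCompatibleDetsV(TrajectoryStateOnSurface const&) const () from /path/to/libRecoTrackerTkDetLayers.so
"""


def parse_frames(text):
    """Return a list of (index, symbol, library) tuples, one per frame line."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            index, symbol, library = match.groups()
            frames.append((int(index), symbol.strip(), library))
    return frames


def first_user_frame(frames, handler="sig_dostack_then_abort"):
    """Return the first real frame after the signal-handler frame, or None.

    Frames with no symbol, or with a gdb marker such as
    "<signal handler called>", are skipped.
    """
    seen_handler = False
    for index, symbol, library in frames:
        if handler in symbol:
            seen_handler = True
            continue
        if seen_handler and symbol and not symbol.startswith("<"):
            return index, symbol, library
    return None


if __name__ == "__main__":
    print(first_user_frame(parse_frames(SAMPLE)))
```

Feed it only the trace portion of a log: any other line that starts with `#` followed by digits (an issue reference, for example) would be parsed as a frame.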