Segfault in data processing #34835

Closed
kskovpen opened this issue Aug 10, 2021 · 24 comments

@kskovpen

Hello,

We are observing a segmentation violation in one of the data processing workflows. Full log is available here:

https://cms-unified.web.cern.ch/cms-unified/joblogs/haozturk_r-1-Run2018D_EGamma_12Nov2019_UL2018_210804_153732_5282/139/DataProcessing/176bb428-0e57-467d-9216-cbda860fc7c8-0-3-logArchive/job/WMTaskSpace/cmsRun1/cmsRun1-stdout.log.trunc.txt

The issue is also reproducible locally and might be coming from TrackingToolsTrackAssociator. If someone could have a look, please let us know!

PdmV @bbilin @jmartinb, also for @haozturk

@cmsbuild

A new Issue was created by @kskovpen .

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@Dr15Jones

The thread that crashed reports the following backtrace:

#3  0x00002b37eea7f438 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_4/lib/slc7_amd64_gcc700/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00002b3803e9b366 in SiPixelTemplate2D::interpolate(int, float, float, float, float) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_4/lib/slc7_amd64_gcc700/libCondFormatsSiPixelTransient.so
#6  0x00002b3803a55516 in SiPixelTemplateReco2D::PixelTempReco2D(int, float, float, float, float, int, int, SiPixelTemplateReco2D::ClusMatrix&, SiPixelTemplate2D&, float&, float&, float&, float&, float&, float&, int&, float&, int&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_4/lib/slc7_amd64_gcc700/libRecoLocalTrackerSiPixelRecHits.so
#7  0x00002b3803a5f8e0 in PixelCPEClusterRepair::callTempReco2D(PixelCPEBase::DetParam const&, PixelCPEClusterRepair::ClusterParamTemplate&, SiPixelTemplateReco2D::ClusMatrix&, int, Point3DBase<float, LocalTag>&) const () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_4/lib/slc7_amd64_gcc700/libRecoLocalTrackerSiPixelRecHits.so
#8  0x00002b3803a613ce in PixelCPEClusterRepair::localPosition(PixelCPEBase::DetParam const&, PixelCPEBase::ClusterParam&) const () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_4/lib/slc7_amd64_gcc700/libRecoLocalTrackerSiPixelRecHits.so
#9  0x00002b3803a5ae66 in PixelClusterParameterEstimator::getParameters(SiPixelCluster const&, GeomDet const&, TrajectoryStateOnSurface const&) const () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_4/lib/slc7_amd64_gcc700/libRecoLocalTrackerSiPixelRecHits.so
#10 0x00002b38033d4bd9 in TkClonerImpl::makeShared(SiPixelRecHit const&, TrajectoryStateOnSurface const&) const () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_4/lib/slc7_amd64_gcc700/libRecoTrackerTransientTrackingRecHit.so
#11 0x00002b37fab4973a in SiPixelRecHit::cloneSH_(TkCloner const&, TrajectoryStateOnSurface const&) const () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_4/lib/slc7_amd64_gcc700/libDataFormatsTrackerRecHit2D.so
#12 0x00002b3804c7546d in KFTrajectorySmoother::trajectory(Trajectory const&) const () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_4/lib/slc7_amd64_gcc700/libTrackingToolsTrackFitters.so
#13 0x00002b3804c50242 in (anonymous namespace)::KFFittingSmoother::smoothingStep(Trajectory&&) const () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_4/lib/slc7_amd64_gcc700/pluginTrackingToolsTrackFittersPlugins.so
#14 0x00002b3804c51b8d in (anonymous namespace)::KFFittingSmoother::fitOne(TrajectorySeed const&, std::vector<std::shared_ptr<TrackingRecHit const>, std::allocator<std::shared_ptr<TrackingRecHit const> > > const&, TrajectoryStateOnSurface const&, TrajectoryFitter::fitType) const () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_4/lib/slc7_amd64_gcc700/pluginTrackingToolsTrackFittersPlugins.so
#15 0x00002b3804c49a7e in (anonymous namespace)::FlexibleKFFittingSmoother::fitOne(TrajectorySeed const&, std::vector<std::shared_ptr<TrackingRecHit const>, std::allocator<std::shared_ptr<TrackingRecHit const> > > const&, TrajectoryStateOnSurface const&, TrajectoryFitter::fitType) const () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_4/lib/slc7_amd64_gcc700/pluginTrackingToolsTrackFittersPlugins.so
#16 0x00002b38320135d2 in TrackProducerAlgorithm<reco::Track>::buildTrack(TrajectoryFitter const*, Propagator const*, std::vector<AlgoProductTraits<reco::Track>::AlgoProduct, std::allocator<AlgoProductTraits<reco::Track>::AlgoProduct> >&, std::vector<std::shared_ptr<TrackingRecHit const>, std::allocator<std::shared_ptr<TrackingRecHit const> > >&, TrajectoryStateOnSurface&, TrajectorySeed const&, float, reco::BeamSpot const&, edm::RefToBase<TrajectorySeed>, int, signed char) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_4/lib/slc7_amd64_gcc700/libRecoTrackerTrackProducer.so
#17 0x00002b3831dc9602 in TrackProducerAlgorithm<reco::Track>::runWithCandidate(TrackingGeometry const*, MagneticField const*, std::vector<TrackCandidate, std::allocator<TrackCandidate> > const&, TrajectoryFitter const*, Propagator const*, TransientTrackingRecHitBuilder const*, reco::BeamSpot const&, std::vector<AlgoProductTraits<reco::Track>::AlgoProduct, std::allocator<AlgoProductTraits<reco::Track>::AlgoProduct> >&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_4/lib/slc7_amd64_gcc700/pluginRecoEgammaEgammaPhotonProducers.so
#18 0x00002b383510a078 in TrackProducer::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_4/lib/slc7_amd64_gcc700/pluginRecoTrackerTrackProducerPlugins.so

@Dr15Jones

assign reconstruction

@cmsbuild

New categories assigned: reconstruction

@slava77,@perrotta,@jpata you have been requested to review this Pull request/Issue and eventually sign? Thanks

@slava77

slava77 commented Aug 10, 2021

@cms-sw/trk-dpg-l2
please take a look.

This is CMSSW_10_6_4_patch1
If I'm not mistaken, the config is here https://cmsweb.cern.ch/couchdb/reqmgr_config_cache/0628749ed3ef23f28a8cc86fb829c87e/configFile
The data is from run 322106 LumiSection 140 in Run2018D EGamma PD.

@slava77

slava77 commented Aug 10, 2021

> The data is from run 322106 LumiSection 140 in Run2018D EGamma PD.

I was trying to find some info about this LS in OMS and for some reason the details there start from LS 889
https://cmsoms.cern.ch/cms/runs/report?cms_run=322106&cms_run_sequence=GLOBAL-RUN

@kskovpen

After applying sorting on LS in the web interface, I can see it :)

@slava77

slava77 commented Aug 10, 2021

https://cmsoms.cern.ch/cms/runs/lumisection?cms_run=322106&cms_run_sequence=GLOBAL-RUN
Via a different link, I found that it is at least all green in DCS.

Where are the JSON files these days? My old bookmark to https://cms-service-dqm.web.cern.ch/cms-service-dqm/CAF/certification/ no longer works.

@kskovpen

We've checked that this run and LS are in the official json files.
The new location is at https://cms-service-dqmdc.web.cern.ch/CAF/certification/

@slava77

slava77 commented Aug 10, 2021

> We've checked that this run and LS are in the official json files.
> The new location is at https://cms-service-dqmdc.web.cern.ch/CAF/certification/

Thanks.
It sounds like some effort to understand what happened is in order.

@slava77

slava77 commented Aug 10, 2021

@Dr15Jones @makortel
The log does not say which event was being processed when the crash happened,
only that Module: TrackProducer:lowPtTripletStepTracks (crashed).
Is this information available in a more recent release, so that debugging like this can be easier?

@mmusich

mmusich commented Aug 11, 2021

tagging @OzAmram and @ferencek as well.

@kskovpen, this seems to come from the 2018 Ultra-Legacy processing. Can you clarify how it is possible that it comes up only now? Has this LS been failing since the start of the campaign?

@kskovpen

> tagging @OzAmram and @ferencek as well.
>
> @kskovpen, this seems to come from the 2018 Ultra-Legacy processing. Can you clarify how it is possible that it comes up only now? Has this LS been failing since the start of the campaign?

Apparently, there have been multiple attempts to recover this failing job, all resulting in the same issue mentioned above. We are now tracking down the last remaining pieces/issues in the tails of the UL processing.

@OzAmram

OzAmram commented Aug 11, 2021

> @cms-sw/trk-dpg-l2
> please take a look.
>
> This is CMSSW_10_6_4_patch1
> If I'm not mistaken, the config is here https://cmsweb.cern.ch/couchdb/reqmgr_config_cache/0628749ed3ef23f28a8cc86fb829c87e/configFile
> The data is from run 322106 LumiSection 140 in Run2018D EGamma PD.

Has anyone been able to reproduce this error locally?

When I try to run on what I think is the offending file, /store/data/Run2018D/EGamma/RAW/v1/000/322/106/00000/5A6F133C-02AF-E811-9E63-02163E019F2D.root, with the above config in CMSSW_10_6_4_patch1, I get "NULL pointer to FEDRawData for FED" errors rather than the segfault. I'm assuming something about my setup is incorrect.

@mmusich

mmusich commented Aug 11, 2021

@kskovpen

> The issue is also reproducible locally and might be coming from TrackingToolsTrackAssociator. If someone could have a look, please let us know!

Can you please provide instructions to reproduce it locally?

@OzAmram

OzAmram commented Aug 11, 2021

We were able to reproduce the crash, which comes specifically from event 196973498.
The crash occurs in a call to SiPixelTemplate2D::interpolate for which both input track angles are NaN. I'm not sure what upstream is causing the NaNs, but a simple fix is to check that both angles are finite before proceeding and to return a failure if they are not. There is already a fallback mechanism in case of interpolation failure, normally used for track angles outside the template acceptance.

Here is a branch with this simple fix, and I have tested that it does indeed fix the crash. Let me know how everyone would like to proceed.
CMSSW_10_6_X...OzAmram:Template2D_isFinite_fix
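
For readers who have not opened the branch, the guard described above might look roughly like the following. This is only a minimal sketch and not the code in the branch: `interpolateSketch`, `choosePositionSource`, `PositionSource`, and the parameter names `cotAlpha` and `cotBeta` are hypothetical stand-ins, and the real `SiPixelTemplate2D::interpolate` has a different signature and body.

```cpp
#include <cmath>

// Hypothetical stand-in for the template interpolation step.
// cotAlpha and cotBeta are assumed names for the two track-angle inputs.
bool interpolateSketch(float cotAlpha, float cotBeta) {
  // Reject NaN or infinite angles before they are used, and report
  // failure so that the caller can take its fallback path.
  if (!std::isfinite(cotAlpha) || !std::isfinite(cotBeta)) {
    return false;
  }
  // ... the actual template lookup and interpolation would go here ...
  return true;
}

// Hypothetical caller: pick the fallback when interpolation fails.
enum class PositionSource { Template2D, Fallback };

PositionSource choosePositionSource(float cotAlpha, float cotBeta) {
  return interpolateSketch(cotAlpha, cotBeta) ? PositionSource::Template2D
                                              : PositionSource::Fallback;
}
```

With this shape, a NaN angle takes the same failure path that is already used for angles outside the template acceptance, so no new error handling is needed in the caller.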

@kskovpen

Thanks @OzAmram for the speed-of-light fix - I was just about to provide the recipe, but you were faster. As for how to proceed, I guess @slava77 would have some ideas.

@slava77

slava77 commented Aug 11, 2021

> Here is a branch with this simple fix, and I have tested that it does indeed fix the crash. Let me know how everyone would like to proceed.
> CMSSW_10_6_X...OzAmram:Template2D_isFinite_fix

The standard procedure is to apply the update in master and then consider a backport.

@kskovpen if we make an update in the software, is the production machinery capable of rerunning this LS in a new release?
If that is possible, my guess is that an update on top of CMSSW_10_6_4_patch2 will be needed (the reference/crashing job is in CMSSW_10_6_4_patch1).

If recovery in the same target dataset is not possible, then we should at least apply a fix in 10_6_X for possible new campaigns (although I doubt that we'd have any).

@OzAmram

OzAmram commented Aug 11, 2021

OK, I went ahead and made a PR for master while we decide on the plan for a backport. The PR is #34846.

@kskovpen

> > Here is a branch with this simple fix, and I have tested that it does indeed fix the crash. Let me know how everyone would like to proceed.
> > CMSSW_10_6_X...OzAmram:Template2D_isFinite_fix
>
> The standard procedure is to apply the update in master and then consider a backport.
>
> @kskovpen if we make an update in the software, is the production machinery capable of rerunning this LS in a new release?
> If that is possible, my guess is that an update on top of CMSSW_10_6_4_patch2 will be needed (the reference/crashing job is in CMSSW_10_6_4_patch1).
>
> If recovery in the same target dataset is not possible, then we should at least apply a fix in 10_6_X for possible new campaigns (although I doubt that we'd have any).

Thanks @slava77 for your input. Let me see if @haozturk or @justinasr have had such experience in the past, i.e. whether it would be possible to rerun the workflow in the updated CMSSW release.

@justinasr

> Let me see if @haozturk or @justinasr have had such experience in the past, i.e. whether it would be possible to rerun the workflow in the updated CMSSW release.

The usual and only procedure from our side (i.e. within the capabilities of the PdmV machinery) would be to reset and resubmit the whole request with the new release. If we make a new request to rerun only that run/lumisection, it will end up producing a separate output dataset (like an extension), which is probably not desirable.
I think we could manually craft some requests/workflows if needed, but the bigger question is whether the computing side allows changing the release of selected jobs and rerunning them.

@makortel

> The log does not say which event was being processed when the crash happened,
> only that Module: TrackProducer:lowPtTripletStepTracks (crashed).
> Is this information available in a more recent release, so that debugging like this can be easier?

No, for crashes we don't report event (or lumi or run) numbers.

@slava77

slava77 commented Sep 10, 2021

@cmsbuild

This issue is fully signed and ready to be closed.

@qliphy qliphy closed this as completed Sep 10, 2021