Two relvals failing in Geometry #32181

Closed
mrodozov opened this issue Nov 19, 2020 · 24 comments

@mrodozov
Contributor

Dears,

In last night's IB there are two DD4hep failures:
https://cmssdt.cern.ch/SDT/html/cmssdt-ib/#/relVal/CMSSW_11_2/2020-11-18-2300?selectedArchs=slc7_amd64_gcc820&selectedFlavors=X&selectedStatus=failed

The failing relvals were added in #32096.

@mrodozov
Contributor Author

assign geometry

@cmsbuild
Contributor

New categories assigned: geometry

@Dr15Jones,@cvuosalo,@mdhildreth,@makortel,@ianna,@civanch you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Contributor

A new Issue was created by @mrodozov Mircho Rodozov.

@Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Contributor

The error (from 11624.911) is

PersistencyIO    INFO  +++ Set Streamer to dd4hep::OpaqueDataBlock
CompactLoader    INFO  +++ Processing compact file: /cvmfs/cms-ib.cern.ch/nweek-02655/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-11-18-2300/src/Geometry/CMSCommonData/data/dd4hep/cmsExtendedGeometry2021.xml with flag (null)
DD4CMS           INFO  +++ Processing the CMS detector description file:///cvmfs/cms-ib.cern.ch/nweek-02655/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-11-18-2300/src/Geometry/CMSCommonData/data/dd4hep/cmsExtendedGeometry2021.xml
Detector         INFO  *********** Created World volume with size: 10100 10100 45000
Detector         INFO  +++ Patching names of anonymous shapes....
DDDefinition     INFO  +++ Finished processing file:///cvmfs/cms-ib.cern.ch/nweek-02655/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_X_2020-11-18-2300/src/Geometry/CMSCommonData/data/dd4hep/cmsExtendedGeometry2021.xml
----- Begin Fatal Exception 19-Nov-2020 08:44:52 CET-----------------------
An exception of category 'StdException' occurred while
   [0] Running EventSetup component GEMGeometryESModule/'gemGeometry
Exception Message:
A std::exception was thrown.
dd4hep:  : value=384 [Evaluation error]
----- End Fatal Exception -------------------------------------------------

@civanch
Contributor

civanch commented Nov 19, 2020

@cvuosalo , @slomeo , what is different for GEM from your local runs?

@silviodonato
Contributor

Perhaps the error in ASAN might be helpful https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_amd64_gcc820/CMSSW_11_2_ASAN_X_2020-11-18-2300/pyRelValMatrixLogs/run/11624.911_TTbar_13+2021_DD4hep+TTbar_13TeV_TuneCUETP8M1_GenSim+Digi+Reco+HARVEST+ALCA/step1_TTbar_13+2021_DD4hep+TTbar_13TeV_TuneCUETP8M1_GenSim+Digi+Reco+HARVEST+ALCA.log#/

%MSG-i ThreadStreamSetup:  (NoModuleName) 19-Nov-2020 09:07:30 CET pre-events
setting # threads 4
setting # streams 4
%MSG
WARNING: MCParticlePairFilter : size of some vectors not matching with 2!!
PersistencyIO    INFO  +++ Set Streamer to dd4hep::OpaqueDataBlock
CompactLoader    INFO  +++ Processing compact file: /cvmfs/cms-ib.cern.ch/nweek-02655/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_ASAN_X_2020-11-18-2300/src/Geometry/CMSCommonData/data/dd4hep/cmsExtendedGeometry2021.xml with flag (null)
PluginService    ERROR Factory requested: DDDefinition_XML_reader (N10__cxxabiv120__function_type_infoE) :bad any_cast
PluginService    ERROR Stub is invalid!
----- Begin Fatal Exception 19-Nov-2020 09:08:26 CET-----------------------
An exception of category 'StdException' occurred while
   [0] Processing global begin Run run: 1
   [1] Calling method for module OscarMTProducer/'g4SimHits'
   [2] Using EventSetup component DDCompactViewESProducer/'' to make data DDCompactView/'' in record IdealGeometryRecord
   [3] Running EventSetup component DDDetectorESProducer/'
Exception Message:
A std::exception was thrown.
dd4hep: Failed to locate plugin to interprete files of type "DDDefinition" - no factory:DDDefinition_XML_reader. 		No factory with name Create(DDDefinition_XML_reader) for type DDDefinition_XML_reader found.
		Please check library load path and/or plugin factory name.
dd4hep: while parsing /cvmfs/cms-ib.cern.ch/nweek-02655/slc7_amd64_gcc820/cms/cmssw/CMSSW_11_2_ASAN_X_2020-11-18-2300/src/Geometry/CMSCommonData/data/dd4hep/cmsExtendedGeometry2021.xml
dd4hep: with plugin:DD4hep_CompactLoader
----- End Fatal Exception -------------------------------------------------

@silviodonato
Contributor

Meanwhile I've found something weird

cmsDriver.py TTbar_13TeV_TuneCUETP8M1_cfi  --conditions auto:phase1_2021_realistic -n 10 --era Run3 --eventcontent FEVTDEBUG --procModifiers dd4hep --relval 9000,50 -s GEN,SIM --datatier GEN-SIM --beamspot Run3RoundOptics25ns13TeVLowSigmaZ --geometry DD4hepExtended2021 --fileout file:step1.root  --no_exec

edmConfigDump  TTbar_13TeV_TuneCUETP8M1_cfi_GEN_SIM.py > dump.py

[sdonato@cmsanalysis DD4HEP]$ cat dump.py  | grep DDD
    fromDDD = cms.bool(False),
    DDDetector = cms.ESInputTag("",""),
    fromDDD = cms.bool(False),
    fromDDD = cms.untracked.bool(False)
    fromDDD = cms.bool(False)
process.hcalDDDRecConstants = cms.ESProducer("HcalDDDRecConstantsESModule",
process.hcalDDDSimConstants = cms.ESProducer("HcalDDDSimConstantsESModule",
    fromDDD = cms.bool(False),
    DDDetector = cms.ESInputTag("",""),
    fromDDD = cms.bool(False),
    fromDDD = cms.bool(True)
    fromDDD = cms.bool(True)
    fromDDD = cms.bool(False)
process.DDDetectorESProducer = cms.ESSource("DDDetectorESProducer",

fromDDD = cms.bool(True) looks suspicious

@cvuosalo
Contributor

In CMSSW_11_2_X_2020-11-18-2300, I ran:

runTheMatrix.py -l 11624.911,11642.911 --ibeos

Both ran all steps successfully and did not show the errors mentioned in this issue.
What is the difference between my test and an IB test?

@cvuosalo
Contributor

Note that in the IB, 11624.911 shows this error:

 [0] Running EventSetup component GEMGeometryESModule/'gemGeometry
Exception Message:
A std::exception was thrown.
dd4hep:  : value=384 [Evaluation error]

while 11642.911 shows this one:

   [1] Calling method for module SiStripFEDMonitorPlugin/'siStripFEDMonitor'
   [2] Using EventSetup component TkDetMapESProducer/'' to make data TkDetMap/'' in record TrackerTopologyRcd
   [3] Running EventSetup component TrackerGeometricDetESModule/'trackerNumberingGeometry
Exception Message:
A std::exception was thrown.
dd4hep: Evaluator : systax error : value=0.10349 [Evaluation error]

Could they be spurious? Could we really have two different real errors that somehow do not show up in the PR tests or direct runTheMatrix tests?

@makortel
Contributor

One clear difference between PR and IB tests is that PR tests are single-thread and IB tests are multi-threaded.

@silviodonato
Contributor

Right, the PR tests are indeed OK: #31220 (comment)

@silviodonato
Contributor

The command used in the IB test is runTheMatrix.py -l limited -i all --job-reports -t 4 --ibeos

@silviodonato
Contributor

@cvuosalo @bsunanda is this line OK? It consumes DDDetector if fromDD4hep_ is set, and we have DDDetector = cms.ESInputTag("","")
https://github.com/cms-sw/cmssw/blob/master/Geometry/DTGeometryBuilder/plugins/DTGeometryESModule.cc#L111

process.DTGeometryESModule = cms.ESProducer("DTGeometryESModule",
    DDDetector = cms.ESInputTag("",""),
    alignmentsLabel = cms.string(''),
    appendToDataLabel = cms.string(''),
    applyAlignment = cms.bool(True),
    attribute = cms.string('MuStructure'),
    fromDD4hep = cms.bool(True),
    fromDDD = cms.bool(False),
    value = cms.string('MuonBarrelDT')
)

process.idealForDigiDTGeometry = cms.ESProducer("DTGeometryESModule",
    DDDetector = cms.ESInputTag("",""),
    alignmentsLabel = cms.string('fakeForIdeal'),
    appendToDataLabel = cms.string('idealForDigi'),
    applyAlignment = cms.bool(False),
    attribute = cms.string('MuStructure'),
    fromDD4hep = cms.bool(True),
    fromDDD = cms.bool(False),
    value = cms.string('MuonBarrelDT')
)

@cvuosalo
Contributor

I ran the workflows with "-t 4" to allow multi-threading. One ran to completion successfully, and one crashed in step3, with a different message than before.
Multi-threading seems to induce semi-random behavior, at least in step3. Somehow the threads are occasionally stomping on each other. How can this problem be debugged?
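
One low-tech starting point for an intermittent failure like this is to measure how often it happens by repeating the same command and counting the exit codes. The sketch below is generic and makes one loud assumption: that the command returns a non-zero exit code when a workflow fails, which should be checked for `runTheMatrix.py` before trusting the tally. `tally_exit_codes` is a hypothetical helper name.

```python
import subprocess
from collections import Counter


def tally_exit_codes(command, repeats=5, cwd=None):
    """Run `command` (a list of arguments) `repeats` times.

    Returns a Counter mapping each exit code to how often it occurred.
    Assumption: a failing run is visible as a non-zero exit code.
    """
    counts = Counter()
    for _ in range(repeats):
        completed = subprocess.run(
            command,
            cwd=cwd,
            stdout=subprocess.DEVNULL,  # output is discarded in this sketch
            stderr=subprocess.DEVNULL,
        )
        counts[completed.returncode] += 1
    return counts
```

With the flags quoted in this thread, a call could look like `tally_exit_codes(["runTheMatrix.py", "-l", "11624.911", "-t", "4", "--ibeos"], repeats=10)`, ideally with a fresh `cwd` per campaign so that repeated runs do not overwrite each other's output. Comparing the tally for `-t 1` and `-t 4` would give a rough failure rate for each mode.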

@fabiocos
Contributor

@cvuosalo I confirm: in single-threaded mode the workflow runs smoothly, while in multi-threaded mode I got a failure in step2 due to the evaluation of a constant in dd4hep. It looks like the issue depends on the memory access of a single job, which would explain why it is not systematically reproducible.

@cvuosalo
Contributor

Can we run these two workflows in the IB tests in single-threaded mode for now?

@silviodonato
Contributor

> Can we run these two workflows in the IB tests in single-threaded mode for now?

The PR tests run single-threaded, so DD4hep is already checked in each PR, and that causes no problems.
For the IB tests, I think it is good to keep them multi-threaded even if they are crashing right now.

@makortel
Contributor

#32249 suggests a culprit for the failures with multiple threads

@makortel
Contributor

The DD4hep workflows also seem to generate a large number of differences (possibly randomly) in PR tests, see e.g. #32270 (comment). Should we consider removing them from the PR tests until they become more stable?

@bsunanda
Contributor

bsunanda commented Nov 25, 2020 via email

@makortel
Contributor

The ASAN failure in #32181 (comment) reproduces when run with single thread. Maybe that would be a good starting point for further debugging? (also #32181 (comment) and #32181 (comment))

@silviodonato
Contributor

Solved by #32371.
Please note that #32249 is still open.

@cvuosalo
Contributor

cvuosalo commented Dec 9, 2020

In CMSSW_11_3_X_2020-12-08-1100, I have confirmed that workflows 11624.911 and 11642.911 run successfully to completion with 1000 events in both single- and multi-threaded mode.

@dan131riley

> Perhaps the error in ASAN might be helpful

I think the ASAN problem comes from the dd4hep plugin manager trying to use the type of the DDDefinition_XML_reader plugin before the plugin has been loaded. It is unrelated to the other exception. Unfortunately, I don't see an obvious way to fix it.
