Long startup time for simplest workflows #43062
Comments
A new Issue was created by @felicepantaleo Felice Pantaleo. @antoniovilela, @makortel, @Dr15Jones, @rappoccio, @smuzaffar, @sextonkennedy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign core,upgrade |
New categories assigned: core,upgrade @Dr15Jones,@AdrianoDee,@makortel,@smuzaffar,@srimanob you have been requested to review this Pull request/Issue and eventually sign? Thanks |
So I used my recent changes to the Tracer service to gather statistics on your job. What I found was the following breakdown
So the obvious speed-ups would be to
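(For reference, a minimal sketch of enabling the Tracer service in a job configuration; the exact options added by the recent changes for these statistics are not shown in this thread, so only the plain service is enabled here.)

```python
# Minimal sketch: attach the framework Tracer service to a configuration so it
# reports the begin/end of framework transitions. The process name is a
# stand-in and the extra statistics options are omitted.
import FWCore.ParameterSet.Config as cms

process = cms.Process("DEMO")          # stand-in for the real workflow process
process.add_(cms.Service("Tracer"))    # enable transition tracing
```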
|
Thanks very much @Dr15Jones for the analysis. |
I investigated the first 3 steps of the workflow with VTune (as @felicepantaleo told me it felt like all the steps have a long startup time). On step 1 (GEN-SIM) the job divides clearly into three phases. The first phase is the overall job startup (shown in the top thread), taking about 102 seconds of wall clock time and 85 seconds of CPU time. The CPU time divides into
Here is a flame graph of what happens in
The question is then: does it really need to be this expensive? @cms-sw/geometry-l2
The second phase (middle thread in the timeline) is the Geant4 initialization in the
The third phase (mostly in the bottom thread in the timeline) is the actual event processing (most of the time is spent in Geant4, so it shows up in a separate non-TBB thread), taking about 45 seconds for 10 events. |
assign geometry |
New categories assigned: geometry @Dr15Jones,@civanch,@bsunanda,@makortel,@mdhildreth you have been requested to review this Pull request/Issue and eventually sign? Thanks |
The step 2 (DIGI-HLT) took in total 240 seconds wall clock time and 220 seconds CPU time. The configuration defines 901 EDModules, 208 ESModules, 57 (End)Paths, 116 Sequences, and 157 Tasks. The startup phase (72 seconds CPU time) divides into
- 5 seconds of reading configuration
- 46 seconds in EventProcessor constructor
  - 27 seconds in StreamSchedule constructor, i.e. loading libraries, creating modules, registering data products (of which a lot of time is spent in cling), and all that
  - 12 seconds spent in Maker::makeModule, of which 8 seconds in creating edm::stream modules. This 8 seconds increases with the number of streams, but is run serially!
    - 3 seconds in L1TPFCaloProducer constructor
      - Most of the time seems to be spent in reading data from a ROOT file via l1tpf::corrector::init_()
    - 2 seconds in DTTrigPhase2Prod constructor
      - Most of the time seems to be spent in GlobalCoordsObtainer::generate_luts()
    - 2 seconds in l1tpf::PFClusterProducerFromHGC3DClusters constructor
      - Most of the time seems to be spent in creating the cut parser, and creating the expression parser in the l1tpf::HGC3DClusterEgID constructor
    - 0.3 seconds in MixingModule constructor, rest is below that
- 1 second constructing the source
@cms-sw/l1-l2 Is there anything that could be done to speed up the constructors of the aforementioned 3 L1T modules?
Next is the beginRun transition (including most, if not all, of EventSetup), about 100 seconds of CPU time, divided into
Then comes the event data processing, 63 seconds for 10 events
|
assign l1 |
New categories assigned: l1 @epalencia,@aloeliger you have been requested to review this Pull request/Issue and eventually sign? Thanks |
The step 3 (RECO+PAT+VALIDATION+DQM) took 290 seconds wall clock time with 260 seconds of CPU time. The configuration defines 5431(!) EDModules, 483 ESModules, 72 (End)Paths, 1291 Sequences, and 758 Tasks. The startup phase took about 170 seconds wall clock time (150 seconds CPU time), divided into (in CPU time)
@cms-sw/xpog-l2 I found that the
The beginRun transition took about 93 seconds of wall clock time (87 seconds of CPU time), divided into (in CPU time)
Event data processing and shutdown took about 23 seconds in wall clock time (22.5 seconds in CPU time), divided into (in CPU time)
|
assign xpog |
New categories assigned: xpog @vlimant,@simonepigazzini you have been requested to review this Pull request/Issue and eventually sign? Thanks |
assign db
For the cost of
|
New categories assigned: db @francescobrivio,@saumyaphor4252,@perrotta,@consuegs you have been requested to review this Pull request/Issue and eventually sign? Thanks |
+core I think our part (in the analysis) is done |
re MuonOffsetESProducer and mtdNumberingGeometry - are these not just the unlikely module that builds the entire geometry? [which might be much faster if it were in the db] |
As a test I did
Then in python I did
and then
So using a pickle file is 25x faster.
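(The exact commands are not included above; as an illustration only, a test along these lines could look roughly like the following, with made-up file and module names.)

```python
# Hypothetical sketch: compare re-building the configuration from the
# cmsDriver.py-generated python file against loading a pickled copy of the
# already-built cms.Process. Names below are placeholders, not from this thread.
import importlib
import pickle
import time

start = time.time()
cfg = importlib.import_module("step3_RAW2DIGI_RECO_VALIDATION_DQM")  # placeholder config module
print("building the configuration took %.1f s" % (time.time() - start))

# Dump the fully-built process once...
with open("step3.pkl", "wb") as f:
    pickle.dump(cfg.process, f)

# ...then time how long it takes to load it back.
start = time.time()
with open("step3.pkl", "rb") as f:
    process = pickle.load(f)
print("loading the pickled process took %.1f s" % (time.time() - start))
```

A pickled process can then be handed to cmsRun through a small wrapper configuration that just unpickles it and assigns it to `process`, which is essentially how production jobs already ship their configurations. |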
it also seems that Phase 2 is still using the old DD instead of DD4Hep. |
I believe that is true - (so startup will get slower with DD4Hep I guess) |
@Dr15Jones asked for the flame graph for |
looking at some of the others here
probably helps a lot (I don't really understand the resize(...) followed by a memcpy that is going on, so there might be another factor to gain in this code) |
Do we really gain something in performance with dd4hep? It was also a question I got at CHEP, but I never had a clear answer on the gain. By the way, we are trying to move. The last validation still saw a few issues, and we plan to do it again with the coming 13_3_0_pre4 |
My take is we'd rather spend time optimizing code using DD4Hep than spend the time on the obsolete DD. |
This should not even be executed in phase2 |
I confirm that in this workflow the |
I thought we cleaned that here #42794 |
The reported behavior was in CMSSW_13_2_0_pre3 (the job used |
@gpetruc Yours is the name I recall in connection with the correlator team and Phase2L1ParticleFlow (that is correct, right? My apologies if not). From my look at the L1TPFCaloProducer code, 3 correctors get produced from a string, an int, and a debug bool:
cmssw/L1Trigger/Phase2L1ParticleFlow/plugins/L1TPFCaloProducer.cc Lines 73 to 77 in 245daa4
which calls this constructor here:
cmssw/L1Trigger/Phase2L1ParticleFlow/src/corrector.cc Lines 34 to 38 in 245daa4
calling init_ with two strings and two bools here:
cmssw/L1Trigger/Phase2L1ParticleFlow/src/corrector.cc Lines 52 to 83 in 245daa4
This is mostly file checks, which eventually calls another init_ here:
cmssw/L1Trigger/Phase2L1ParticleFlow/src/corrector.cc Lines 85 to 144 in 245daa4
and if the emulation flag is set, it calls another initEmulation_ here:
cmssw/L1Trigger/Phase2L1ParticleFlow/src/corrector.cc Lines 146 to 175 in 245daa4
Notably, each of the init functions does some loop through TFile keys, which would be my first instinct to check for slowdown in this constructor. Would it be possible to ask if you or one of the people responsible for Phase2L1ParticleFlow could take a look at this and see where the slowdown occurs and whether it could be reworked?
@NTrevisani You are listed as having most recently added this line:
cmssw/L1Trigger/DTTriggerPhase2/plugins/DTTrigPhase2Prod.cc Line 197 in 245daa4
to DTTriggerPhase2Prod, which in turn calls a for loop doing a long list of mathematical functions here (listed with the atan calculation):
cmssw/L1Trigger/DTTriggerPhase2/src/GlobalCoordsObtainer.cc Lines 43 to 195 in 245daa4
which seems a primary candidate for the slowdown in this function. Could you, or someone responsible for this code, look into the performance of this and see if it could be reworked? If not, would it maybe be possible to cache or look up the results of these math functions/loops? At a quick pass these seem to be calculated against some constants.
@rekovic & @cerminar You are listed as authors on PFClusterProducerFromHGC3DClusters and HGC3DClusterEgID. This one I cannot find a likely candidate for slow-down on, other than perhaps the instantiation of the TMVA and the creation of another corrector (see the first section). Could you, or someone responsible for this code, look into the performance of this and see if it could be reworked? |
Hi,
I can have a look but not with terribly high priority - after all, we're discussing shaving off seconds of initialization; once one runs a workflow with more events it becomes irrelevant.
The two PF modules mentioned have to load a bunch of calibration histograms (in several bins of eta and em fraction), and BDTs for cluster identification, so they do need some time.
Also, eventually both features will anyway have to be replaced with bitwise-accurate emulators (timescale of some months, at most a year).
Giovanni
On Fri 20 Oct 2023, 17:54, Andrew Loeliger wrote:
…
|
it is on our to-do list to review how PAT is constructed and simplify it together with dropping |
As an additional test, I dumped the data associated to the global tag used by the job into an HDF5 file by doing
As the tool only handles full global tags and not individual tags, I changed the cmsRun configuration to read the global tag from the HDF5 file but had the individual tags still come from CondDBESSource. The change to the configuration was the addition of the following lines:

```python
process.GlobalTag.globaltag = ""
process.H5Tag = cms.ESSource("CondHDF5ESSource",
                             filename = cms.untracked.string("133X_mcRun4_realistic_v1.h5conf"),
                             globalTag = cms.string("133X_mcRun4_realistic_v1"),
                             excludeRecords = cms.vstring("SiPixelGenErrorDBObjectRcd", "SiPixelLorentzAngleRcd",
                                                          "SiPixelLorentzAngleSimRcd", "SiPixelTemplateDBObjectRcd",
                                                          "TrackerAlignmentErrorExtendedRcd",
                                                          "TrackerAlignmentRcd",
                                                          "TrackerSurfaceDeformationRcd"))
```

Running the original job at FNAL, it took 140s for the job to retrieve all the IOVs during the beginRun phase. With this change, it now takes 1.5s. |
So for local tests one can first dump the global tag and then run. |
The CondCore/CondHDF5ESSource package was first introduced in CMSSW_13_3_0_pre2. See #42431. |
@Dr15Jones - what is the role of excludeRecords here? Just an ability to skip records known to not be needed? Is that responsible for much of the (very interesting..) speedup? |
For the configuration given, the CondDBESSource overrides the GlobalTag for those records and gives explicit tags to be used (see below). The CondHDF5ESSource (and its accompanying scripts) only handles using a GlobalTag (they could be extended to handle explicit tags as well), so I needed to still get these explicit tags (and their associated Records) from CondDBESSource. In order to avoid having HDF5 and the DB fight each other over which one would deliver the Record (one could use an ESAlias to explicitly set which one wins), I decided to just have the CondHDF5ESSource not deliver those Records at all.

```python
toGet = cms.VPSet(
cms.PSet(
connect = cms.string('frontier://FrontierProd/CMS_CONDITIONS'),
record = cms.string('SiPixelGenErrorDBObjectRcd'),
snapshotTime = cms.string('2021-04-17 20:00:00'),
tag = cms.string('SiPixelGenErrorDBObject_phase2_IT_v7.0.2_25x100_v2_mc')
),
cms.PSet(
connect = cms.string('frontier://FrontierProd/CMS_CONDITIONS'),
record = cms.string('SiPixelLorentzAngleRcd'),
snapshotTime = cms.string('2021-03-16 20:00:00.000'),
tag = cms.string('SiPixelLorentzAngle_phase2_T25_v0_mc')
),
cms.PSet(
connect = cms.string('frontier://FrontierProd/CMS_CONDITIONS'),
label = cms.untracked.string('forWidth'),
record = cms.string('SiPixelLorentzAngleRcd'),
snapshotTime = cms.string('2020-02-23 14:00:00.000'),
tag = cms.string('SiPixelLorentzAngle_phase2_T19_mc_forWidthEmpty')
),
cms.PSet(
connect = cms.string('frontier://FrontierProd/CMS_CONDITIONS'),
label = cms.untracked.string('fromAlignment'),
record = cms.string('SiPixelLorentzAngleRcd'),
snapshotTime = cms.string('2020-02-23 14:00:00.000'),
tag = cms.string('SiPixelLorentzAngle_phase2_T19_mc_forWidthEmpty')
),
cms.PSet(
connect = cms.string('frontier://FrontierProd/CMS_CONDITIONS'),
record = cms.string('SiPixelLorentzAngleSimRcd'),
snapshotTime = cms.string('2021-03-16 20:00:00.000'),
tag = cms.string('SiPixelSimLorentzAngle_phase2_T25_v0_mc')
),
cms.PSet(
connect = cms.string('frontier://FrontierProd/CMS_CONDITIONS'),
record = cms.string('SiPixelTemplateDBObjectRcd'),
snapshotTime = cms.string('2021-04-17 20:00:00'),
tag = cms.string('SiPixelTemplateDBObject_phase2_IT_v7.0.2_25x100_v2_mc')
),
cms.PSet(
connect = cms.string('frontier://FrontierProd/CMS_CONDITIONS'),
record = cms.string('TrackerAlignmentErrorExtendedRcd'),
snapshotTime = cms.string('2023-03-16 15:30:00'),
tag = cms.string('TrackerAlignmentErrorsExtended_Upgrade2026_T25_design_v0')
),
cms.PSet(
connect = cms.string('frontier://FrontierProd/CMS_CONDITIONS'),
record = cms.string('TrackerAlignmentRcd'),
snapshotTime = cms.string('2023-03-16 15:30:00'),
tag = cms.string('TrackerAlignment_Upgrade2026_T25_design_v0')
),
cms.PSet(
connect = cms.string('frontier://FrontierProd/CMS_CONDITIONS'),
record = cms.string('TrackerSurfaceDeformationRcd'),
snapshotTime = cms.string('2023-03-16 15:30:00'),
tag = cms.string('TrackerSurfaceDeformations_Upgrade2026_Zero')
)
)
```
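(As an aside on the parenthetical above about explicitly setting which source wins: the usual mechanism for that in a CMSSW configuration is an es_prefer statement. A minimal, purely illustrative sketch follows, assuming the H5Tag source defined earlier; it is not what was done in this test, where excludeRecords was used instead.)

```python
# Illustrative only: declare that, for any record both sources can provide,
# the CondHDF5ESSource labelled "H5Tag" (defined above) should win.
# Record-by-record preferences are also possible; in the test described
# above the conflict was instead avoided with excludeRecords.
process.es_prefer_H5Tag = cms.ESPrefer("CondHDF5ESSource", "H5Tag")
```
|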
For the MTD geometry construction part, please have a look at #43124. The DD4hep version is definitely less problematic as far as MTD is concerned, but it is not the production default yet. Profiling 24807.911 with IgProf:
|
@Dr15Jones I wasn't able to understand... is the pickled file faster than the original configuration build by |
Faster than the original one from cmsDriver.py. |
+geometry it seems that the main problem in geometry connected with MTD is fixed. |
Developing reconstruction algorithms for Phase-2 is becoming more and more difficult and frustrating as CMSSW startup times have increased in the last few years.
This is what happens when we want to reconstruct 10 events of a single charged pion shot in front of HGCAL with a single thread.
The workflow is 24896.0_CloseByPGun_CE_E_Front_120um+2026D98 using CMSSW_13_2_0_pre3.
It takes 4.5 minutes to launch the job and 30 seconds to reconstruct 10 events with a single thread.
Reducing the startup time would drastically reduce the number of coffees we have every day while increasing morale.
@waredjeb @makortel fyi