g4SimHitsCTPPSPixelHits not found #31991

Closed
silviodonato opened this issue Oct 30, 2020 · 29 comments

@silviodonato
Contributor

Since CMSSW_11_2 2020-10-29-2300 we get

----- Begin Fatal Exception 30-Oct-2020 07:26:13 CET-----------------------
An exception of category 'ProductNotFound' occurred while
   [0] Processing  Event run: 1 lumi: 36 event: 3505 stream: 2
   [1] Running path 'HLTAnalyzerEndpath'
   [2] Prefetching for module L1TRawToDigi/'hltGtStage2Digis'
   [3] Prefetching for module RawDataCollectorByLabel/'rawDataCollector'
   [4] Prefetching for module CTPPSPixelDigiToRaw/'ctppsPixelRawData'
   [5] Calling method for module CTPPSPixelDigiProducer/'RPixDetDigitizer'
Exception Message:
Principal::getByToken: Found zero products matching all criteria
Looking for type: CrossingFrame<PSimHit>
Looking for module label: mix
Looking for productInstanceName: g4SimHitsCTPPSPixelHits

   Additional Info:
      [a] If you wish to continue processing events after a ProductNotFound exception,
add "SkipEvent = cms.untracked.vstring('ProductNotFound')" to the "options" PSet in the configuration.

----- End Fatal Exception -------------------------------------------------

because of #31943

https://cmssdt.cern.ch/SDT/html/cmssdt-ib/#/relVal/CMSSW_11_2/2020-10-29-2300?selectedArchs=cc8_amd64_gcc8&selectedFlavors=X&selectedStatus=failed
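When going through many failed relval logs, the three "Looking for" lines are what identify the missing product. A minimal sketch of pulling them out of a log excerpt follows; `parse_missing_product` is a hypothetical helper written for illustration and is not part of CMSSW.

```python
import re

# Excerpt copied from the exception quoted in this issue.
EXCEPTION_TEXT = """\
Principal::getByToken: Found zero products matching all criteria
Looking for type: CrossingFrame<PSimHit>
Looking for module label: mix
Looking for productInstanceName: g4SimHitsCTPPSPixelHits
"""


def parse_missing_product(text):
    """Map each 'Looking for <key>: <value>' line to a dict entry.

    Hypothetical helper for illustration only; not a CMSSW API.
    """
    pattern = re.compile(r"^Looking for ([^:]+):[ \t]*(.*)$", re.MULTILINE)
    return {key.strip(): value.strip() for key, value in pattern.findall(text)}


missing = parse_missing_product(EXCEPTION_TEXT)
print(missing)
```

The resulting dictionary holds the product type, the module label and the product instance name, which is enough to group failures that share the same missing product.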

@silviodonato
Contributor Author

assign simulation

@cmsbuild
Contributor

New categories assigned: simulation

@mdhildreth,@civanch you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Contributor

A new Issue was created by @silviodonato Silvio Donato.

@Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@civanch
Contributor

civanch commented Oct 30, 2020

@silviodonato, the situation is simple: PPS hits are not present in the premix sample. When @mundim tested #31943, things were working: hits were produced in step1 and the product existed at step2. Once it was merged, we started to face the situation that none of the previously produced MC samples contain PPS hits.

How can we resolve this situation in an optimal way? Can it be resolved in the python scripts or at the level of the producers?

@civanch
Contributor

civanch commented Oct 30, 2020

Maybe a solution would be to have separate WFs for Run-3 PPS developments? In the other WFs we could leave PPS simulation disabled.

@davidlange6
Contributor

davidlange6 commented Oct 30, 2020 via email

@silviodonato
Contributor Author

Thanks, I would temporarily revert the PR and look for a solution next week.
This problem looks similar to issue #29216

@mundim
Contributor

mundim commented Oct 31, 2020

Hi all. Sorry, I was unable to look at my email today. The comment from @civanch said it all. I'll look at the other PR and see if there is anything I should do. Thank you all.

@silviodonato
Contributor Author

Ok, I removed it for the time being
#31997

@silviodonato
Contributor Author

I opened back #32003 as a placeholder.

How can we have problems with the premixing samples even in no-pileup workflows?

For instance
11634.0 TTbar_14TeV+2021+TTbar_14TeV_TuneCP5_GenSimINPUT+Digi+Reco+HARVEST+ALCA

is made of
step1
dasgoclient --limit 0 --query 'file dataset=/RelValTTbar_14TeV/CMSSW_10_6_1-106X_mcRun3_2021_realistic_v1_rsb-v1/GEN-SIM site=T2_CH_CERN' | ibeos-lfn-sort -u > step1_dasquery.log 2>&1

and step2
cmsDriver.py step2 --conditions auto:phase1_2021_realistic -s DIGI:pdigi_valid,L1,DIGI2RAW,HLT:@relval2021 --datatier GEN-SIM-DIGI-RAW -n 10 --geometry DB:Extended --era Run3 --eventcontent FEVTDEBUGHLT --customise Validation/Performance/TimeMemorySummary.customiseWithTimeMemorySummary --prefix '/data/cmsbld/jenkins/workspace/ib-run-relvals/cms-bot/monitor_workflow.py timeout --signal SIGTERM 9000 ' --filein filelist:step1_dasquery.log --fileout file:step2.root --suffix "-j JobReport2.xml " --nThreads 4 > step2_TTbar_14TeV+2021+TTbar_14TeV_TuneCP5_GenSimINPUT+Digi+Reco+HARVEST+ALCA.log 2>&1

And we get an error in step2:

----- Begin Fatal Exception 30-Oct-2020 07:08:15 CET-----------------------
An exception of category 'ProductNotFound' occurred while
   [0] Processing  Event run: 1 lumi: 13 event: 1208 stream: 1
   [1] Running path 'HLTAnalyzerEndpath'
   [2] Prefetching for module L1TRawToDigi/'hltGtStage2Digis'
   [3] Prefetching for module RawDataCollectorByLabel/'rawDataCollector'
   [4] Prefetching for module CTPPSPixelDigiToRaw/'ctppsPixelRawData'
   [5] Calling method for module CTPPSPixelDigiProducer/'RPixDetDigitizer'
Exception Message:
Principal::getByToken: Found zero products matching all criteria
Looking for type: CrossingFrame<PSimHit>
Looking for module label: mix
Looking for productInstanceName: g4SimHitsCTPPSPixelHits

   Additional Info:
      [a] If you wish to continue processing events after a ProductNotFound exception,
add "SkipEvent = cms.untracked.vstring('ProductNotFound')" to the "options" PSet in the configuration.

----- End Fatal Exception -------------------------------------------------

Am I missing something?

@silviodonato
Contributor Author

Just to keep track of the errors we got
image

@davidlange6
Contributor

davidlange6 commented Nov 3, 2020 via email

@silviodonato
Contributor Author

I investigated why 11634.0 works in the PR test of #32003 and why 11634.0 crashed in CMSSW_11_2 2020-10-29-2300 (see above).
The reason is that in the IB tests we use the -i all --ibeos options, which run

11834.0_TTbar_14TeV+2021PU+TTbar_14TeV_TuneCP5_GenSimINPUT+DigiPU+RecoPU+HARVESTPU+Nano

dasgoclient --limit 0 --query 'file dataset=/RelValTTbar_14TeV/CMSSW_10_6_1-106X_mcRun3_2021_realistic_v1_rsb-v1/GEN-SIM site=T2_CH_CERN' | ibeos-lfn-sort > step1_dasquery.log  2>&1
 
cmsDriver.py step2  --conditions auto:phase1_2021_realistic --pileup_input das:/RelValMinBias_14TeV/CMSSW_10_6_1-106X_mcRun3_2021_realistic_v1_rsb-v1/GEN-SIM -n 10 --era Run3 --eventcontent FEVTDEBUGHLT -s DIGI:pdigi_valid,L1,DIGI2RAW,HLT:@relval2021 --datatier GEN-SIM-DIGI-RAW --pileup Run3_Flat55To75_PoissonOOTPU --geometry DB:Extended -n 1 --filein filelist:step1_dasquery.log --fileout file:step2.root  > step2_TTbar_14TeV+2021PU+TTbar_14TeV_TuneCP5_GenSimINPUT+DigiPU+RecoPU+HARVESTPU+Nano.log  2>&1

which takes as input the already existing GEN-SIM files

/store/relval/CMSSW_10_6_1/RelValTTbar_14TeV/GEN-SIM/106X_mcRun3_2021_realistic_v1_rsb-v1/10000/065743BF-0813-6D4B-8F5D-0804551AF358.root
/store/relval/CMSSW_10_6_1/RelValTTbar_14TeV/GEN-SIM/106X_mcRun3_2021_realistic_v1_rsb-v1/10000/4842CFA0-95DE-5D48-8061-11A08BF5C32A.root
/store/relval/CMSSW_10_6_1/RelValTTbar_14TeV/GEN-SIM/106X_mcRun3_2021_realistic_v1_rsb-v1/10000/C8386AF9-BC0E-504D-9555-B3DE021FAC36.root

On the contrary, the default runTheMatrix.py -l 11834.0 regenerates the GEN-SIM

11834.0_TTbar_14TeV+2021PU+TTbar_14TeV_TuneCP5_GenSim+DigiPU+RecoPU+HARVESTPU+Nano

cmsDriver.py TTbar_14TeV_TuneCP5_cfi  --conditions auto:phase1_2021_realistic -n 10 --era Run3 --eventcontent FEVTDEBUG --relval 9000,100 -s GEN,SIM --datatier GEN-SIM --beamspot Run3RoundOptics25ns13TeVLowSigmaZ --geometry DB:Extended -n 1 --fileout file:step1.root  > step1_TTbar_14TeV+2021PU+TTbar_14TeV_TuneCP5_GenSim+DigiPU+RecoPU+HARVESTPU+Nano.log  2>&1

cmsDriver.py step2  --conditions auto:phase1_2021_realistic --pileup_input das:/RelValMinBias_14TeV/CMSSW_10_6_1-106X_mcRun3_2021_realistic_v1_rsb-v1/GEN-SIM -n 10 --era Run3 --eventcontent FEVTDEBUGHLT -s DIGI:pdigi_valid,L1,DIGI2RAW,HLT:@relval2021 --datatier GEN-SIM-DIGI-RAW --pileup Run3_Flat55To75_PoissonOOTPU --geometry DB:Extended -n 1 --filein  file:step1.root  --fileout file:step2.root  > step2_TTbar_14TeV+2021PU+TTbar_14TeV_TuneCP5_GenSim+DigiPU+RecoPU+HARVESTPU+Nano.log  2>&1
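The difference between the two setups is visible in the workflow names themselves. A minimal sketch, assuming the steps are the `+`-separated fields of the name and that both names have the same number of fields (`differing_steps` is a hypothetical helper):

```python
# Workflow names copied from this comment.
IB_NAME = "11834.0_TTbar_14TeV+2021PU+TTbar_14TeV_TuneCP5_GenSimINPUT+DigiPU+RecoPU+HARVESTPU+Nano"
DEFAULT_NAME = "11834.0_TTbar_14TeV+2021PU+TTbar_14TeV_TuneCP5_GenSim+DigiPU+RecoPU+HARVESTPU+Nano"


def differing_steps(name_a, name_b):
    """Return (position, field_a, field_b) for every '+'-separated field that differs.

    Hypothetical helper; zip() stops at the shorter name, so names with a
    different number of fields are only compared up to the common length.
    """
    fields_a = name_a.split("+")
    fields_b = name_b.split("+")
    return [(i, a, b) for i, (a, b) in enumerate(zip(fields_a, fields_b)) if a != b]


diff = differing_steps(IB_NAME, DEFAULT_NAME)
print(diff)
```

Only the GEN-SIM field differs (`GenSimINPUT` versus `GenSim`), which matches the observation that the IB setup reads existing GEN-SIM files while the default setup regenerates them.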

@silviodonato
Contributor Author

@cms-sw/pdmv-l2 we need to update https://github.com/cms-sw/cmssw/blob/master/Configuration/PyReleaseValidation/python/relval_steps.py#L3238 to a sample made with CMSSW_11_2_0_pre6 or later (i.e. including #30575, which integrates PPS into the simulation).
The same thing has to be done also for the pileup samples, otherwise the pileup samples will not include PPS.
It looks like the new pileup samples need to be defined here https://github.com/cms-sw/cmssw/blob/master/Configuration/PyReleaseValidation/python/relval_steps.py#L674 and used in the rest of the code.
Please note that PPS has been included only in the Run-3 simulation.
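As a rough way to spot input samples that are too old, the release embedded in the dataset name can be compared with CMSSW_11_2_0_pre6. This is only a sketch: `release_key` and `release_of_dataset` are hypothetical helpers, the dataset layout is assumed from the names quoted in this thread, and release names with other suffixes (for example patch releases) are not handled.

```python
import re


def release_key(release):
    """Turn a name like 'CMSSW_11_2_0_pre6' into a sortable tuple (hypothetical helper)."""
    match = re.fullmatch(r"CMSSW_(\d+)_(\d+)_(\d+)(?:_pre(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised release name: {release}")
    major, minor, patch, pre = match.groups()
    # A pre-release sorts before the final release with the same numbers.
    pre_rank = int(pre) if pre is not None else float("inf")
    return (int(major), int(minor), int(patch), pre_rank)


def release_of_dataset(dataset):
    """Extract the release, assuming a '/<primary>/<release>-<rest>/<tier>' layout."""
    processed = dataset.split("/")[2]
    return processed.split("-")[0]


MINIMUM = "CMSSW_11_2_0_pre6"
dataset = "/RelValTTbar_14TeV/CMSSW_10_6_1-106X_mcRun3_2021_realistic_v1_rsb-v1/GEN-SIM"

release = release_of_dataset(dataset)
new_enough = release_key(release) >= release_key(MINIMUM)
print(release, new_enough)
```

For the GEN-SIM sample currently used, the check reports that the release is older than the minimum, consistent with the missing PPS hits.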

@makortel
Contributor

makortel commented Nov 6, 2020

I wonder if a way to add -i all to PR tests would be useful in general (@smuzaffar). Of course, the challenge would then be to remember to also run that test for PRs where it would be relevant, so it is not obvious that it would really improve the situation compared to catching it in IBs.

@smuzaffar
Contributor

How about we add an extra test where we run with -i all --command "-n 1"? This should catch any misconfiguration and any missing input data.

@silviodonato
Contributor Author

@makortel @makortel I'm ok with adding -i all --command "-n 1". I'm a bit afraid that it will take a lot of time even if we run a single event. Perhaps --maxStep=2 might help to limit the running time

@silviodonato
Contributor Author

@cms-sw/pdmv-l2 we can discuss #31991 (comment) in the ORP meeting

@smuzaffar
Contributor

cms-sw/cms-bot#1412 adds a new test which runs runTheMatrix.py -all --maxStep=0, which basically runs all the das query commands for all of our workflows. It takes only 5 to 8 minutes to run. Once this is merged, the bot will report the failed das queries on the PR summary page (https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-118832/10645/summary.html).

@makortel
Contributor

I'm a bit confused how das query alone would have shown that using the old/existing GEN-SIM would lead to failures in step2.

@smuzaffar
Contributor

No, it is not going to solve that issue. It only covers the case where a new dataset is added that is not used by the short matrix. Those errors will be caught by this test.

@silviodonato
Contributor Author

Thanks @smuzaffar. About #31991 (comment), I think the current situation where we catch the error in the IB is not so bad, but of course it would be good to have the possibility to run the PR tests with -i all. This option might also be useful to speed up PR tests that do not affect gen-sim. On the other hand, it is good to test the workflows without -i all at least in the PR tests.

@smuzaffar
Contributor

@silviodonato , I think we can easily add -i all --command "-n 1" --maxStep=2 and only run those workflows where we use input data in step1. I will add such a test to see how much time it will take.
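A minimal sketch of how such a command line could be assembled from a list of workflow numbers. Only options already quoted in this thread are used; `relval_command` is a hypothetical helper, and passing several workflows to `-l` as a comma-separated list is an assumption of the sketch.

```python
import shlex


def relval_command(workflow_ids, events=1, max_step=2):
    """Build a runTheMatrix.py command line as a shell-quoted string (hypothetical helper)."""
    args = [
        "runTheMatrix.py",
        "-i", "all",                  # reuse existing input data instead of regenerating it
        "--command", f"-n {events}",  # extra option forwarded to each step
        f"--maxStep={max_step}",      # stop after this step
    ]
    if workflow_ids:
        # Assumption: several workflows are given as one comma-separated list.
        args += ["-l", ",".join(workflow_ids)]
    return shlex.join(args)


cmd = relval_command(["11634.0", "11834.0"])
print(cmd)
```

Using `shlex.join` keeps the quoting of `-n 1` intact, so the string can be pasted into a shell or split back into arguments with `shlex.split`.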

@Martin-Grunewald
Contributor

Looks like CMSSW_11_2_0_pre6 files do NOT help the failing TSG tests... #31991 (comment)

Will try CMSSW_11_2_0_pre8 files.

@silviodonato
Contributor Author

solved by #32003 and #32140 (see #32125)

@smuzaffar
Contributor

@silviodonato, @makortel, as the bot now runs tests in parallel (on different machines), do we want to run an additional relval test with -i all --command "-n 1" --maxStep=2? It should not increase the PR test time, as the normal relval/comparison job takes 2 hours and running relvals with these selected options takes around 1 hour. So hopefully, with enough resources, the new test can finish before the standard relval tests. cms-sw/cms-bot#1464 adds this additional test. If this type of test is still useful, I can merge the cms-bot PR.

@silviodonato
Contributor Author

@smuzaffar If you think this does not take too many resources, I'm ok with it. You might discuss it tomorrow.
I'm a bit confused by the timing you quoted, 2 hours vs 1 hour.
I expected a larger difference between -i all --command "-n 1" --maxStep=2 and --command "-n 10"

@smuzaffar
Contributor

For the normal relval tests, i.e. short matrix + some selected wfs (around 40 wfs), it takes 1 hour for runTheMatrix and 1 hour for the comparison, i.e. 2 hours overall. Running -i all --command "-n 1" --maxStep=2 for all the wfs with an input dataset (576 wfs in total) needs just 1 hour. As both tests will run in parallel, the overall PR tests should still finish within 2 to 2.5 hours.
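The estimate can be written out as a small calculation. This is only a sketch using the round figures quoted here; it ignores queueing and any shared setup time.

```python
# Figures quoted in this comment, in hours.
standard_relvals = 1.0    # runTheMatrix for the short matrix plus selected workflows
comparison = 1.0          # comparison step that follows the standard relvals
input_data_relvals = 1.0  # -i all --command "-n 1" --maxStep=2 over 576 workflows

# The two jobs run in parallel, so the slower chain sets the wall-clock time.
standard_chain = standard_relvals + comparison
wall_clock = max(standard_chain, input_data_relvals)
print(wall_clock)
```

With these numbers the standard relval plus comparison chain remains the slower one, so the extra test does not extend the overall time as long as enough machines are available.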

@silviodonato
Contributor Author

Ok, now I understand why it takes 1 hour (576 wfs vs 40 wfs).
This test would be very useful also to check the technical functioning of step-2 of many more workflows.

8 participants