Add GEN/LHE weight validation subdirectories for GEN Relval #36994

Merged (13 commits) on Mar 7, 2022

SanghyunKo
Contributor

PR description:

Adding Relval subdirectories GenWeight & LHEWeight to detect oddities in the GEN/LHE weight contents and to avoid the kind of issues we have experienced before, such as #36705, #27918, cms-sw/cmsdist#6688, hypernews, and many others.

Each GEN/LHE subdirectory contains the number of weights, the distribution of weights (normalized to the nominal), and the leading lepton/jet Pt & η. In addition, the GEN directory contains the ISR/FSR up/down (2 or 1/2) variations and their ratio to the nominal, while the LHE directory contains the envelope of scale variations and the PDF uncertainty (± RMS) and their ratio to the nominal. The code assumes 9 scale variations & 103 PDF variations, which should hold in most cases.
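
For readers unfamiliar with the quantities named above, here is a minimal stand-alone sketch of a scale-variation envelope and a PDF RMS, both taken relative to the nominal weight. It is not the code in this PR: the function name `summarizeWeights`, the struct `WeightSummary`, and the weight layout (nominal at index 0, then the scale variations, then the PDF variations) are assumptions made only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical summary of one event's weights, all relative to the nominal.
struct WeightSummary {
  double scaleUp;    // largest scale variation / nominal
  double scaleDown;  // smallest scale variation / nominal
  double pdfRms;     // RMS of (PDF variation / nominal - 1)
};

// Assumed layout (illustration only): weights[0] is the nominal weight,
// followed by nScale scale variations and then nPdf PDF variations.
// Returns false and leaves `out` untouched if the vector is too short
// or the nominal weight is zero.
bool summarizeWeights(const std::vector<double>& weights,
                      std::size_t nScale,
                      std::size_t nPdf,
                      WeightSummary& out) {
  if (nScale == 0 || nPdf == 0)
    return false;
  if (weights.size() < 1 + nScale + nPdf || weights[0] == 0.0)
    return false;

  const double nominal = weights[0];

  // Envelope: extreme ratios over all scale variations.
  double up = weights[1] / nominal;
  double down = up;
  for (std::size_t i = 2; i <= nScale; ++i) {
    const double ratio = weights[i] / nominal;
    up = std::max(up, ratio);
    down = std::min(down, ratio);
  }
  assert(down <= up);

  // RMS of the relative deviation of the PDF variations from the nominal.
  double sumSq = 0.0;
  for (std::size_t i = 1 + nScale; i < 1 + nScale + nPdf; ++i) {
    const double dev = weights[i] / nominal - 1.0;
    sumSq += dev * dev;
  }

  out = WeightSummary{up, down, std::sqrt(sumSq / static_cast<double>(nPdf))};
  return true;
}
```

With the counts quoted above, a caller would pass `nScale = 9` and `nPdf = 103`; the early return covers samples that carry fewer weights than assumed.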

Demo of Relval plots with workflow 556 (TTbar Powheg+Pythia8) would look like this.

PR validation:

Tested with following GEN workflows:

  • 504 (QCD Pt-30 Pythia8) - no LHE or GEN weight
  • 555 (DY+jets aMCatNLO+Pythia8) - has LHE but no GEN (PS) weight
  • 556 (TTbar Powheg+Pythia8) - has both LHE & GEN weights

Since the routine runs for all GEN workflows, it should not raise any exception whether or not LHE products are present.

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-36994/28387

  • This PR adds an extra 64KB to repository

  • Found files with invalid states:

    • Validation/EventGenerator/interface/GenPtcValidationHelper.h:
    • Validation/EventGenerator/src/GenPtcValidationHelper.cc:

@cmsbuild
Contributor

A new Pull Request was created by @SanghyunKo (Sanghyun Ko) for master.

It involves the following packages:

  • Configuration/Generator (generators)
  • Validation/EventGenerator (dqm, generators)

@SiewYan, @mkirsano, @emanueleusai, @ahmad3213, @cmsbuild, @GurpreetSinghChahal, @jfernan2, @Saptaparna, @alberto-sanchez, @pmandrik, @pbo0, @rvenditti can you please review it and eventually sign? Thanks.
@Martin-Grunewald, @missirol, @fabiocos this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@SanghyunKo
Contributor Author

FYI @agrohsje @Dongwoon77 @shimashimarin

@jfernan2
Contributor

@SanghyunKo please add yourself along with your github username in comments in the DQM GEN Validators e-group:
https://e-groups.cern.ch/e-groups/Egroup.do?egroupName=cms-dqm-validation-developers-gen&tab=3
to keep track of the developers
Thanks

@SiewYan
Contributor

SiewYan commented Feb 18, 2022

please test workflow 504, 555, 556

@SanghyunKo
Contributor Author

@jfernan2 Thanks for letting me know, I've added myself to the e-group.

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2f4d7b/22493/summary.html
COMMIT: eaab8d7
CMSSW: CMSSW_12_3_X_2022-02-17-1100/slc7_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/36994/22493/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

The workflows 1001.0, 1000.0, 136.88811, 136.874, 136.8311, 136.793, 136.7611, 136.731, 4.22 have different files in step1_dasquery.log than the ones found in the baseline. You may want to check and retrigger the tests if necessary. You can check it in the "files" directory in the results of the comparisons

@slava77 comparisons for the following workflows were not done due to missing matrix map:

  • /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/data/PR-2f4d7b/138.4_PromptCollisions+RunMinimumBias2021+ALCARECOPROMPTR3+HARVESTDPROMPTR3
  • /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/data/PR-2f4d7b/138.5_ExpressCollisions+RunMinimumBias2021+TIER0EXPRUN3+ALCARECOEXPR3+HARVESTDEXPR3
  • /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/data/PR-2f4d7b/139.001_RunMinimumBias2021+RunMinimumBias2021+HLTDR3_2021+RECODR3_MinBiasOffline+HARVESTD2021MB
  • /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/data/PR-2f4d7b/504.0_QCD_Pt-30_13TeV_pythia8+QCD_Pt-30_13TeV_pythia8+HARVESTGEN
  • /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/data/PR-2f4d7b/555.0_DYTollJets_NLO_Mad_13TeV_py8+DYToll012Jets_5f_NLO_FXFX_Madgraph_LHE_13TeV+Hadronizer_TuneCP5_13TeV_aMCatNLO_FXFX_5f_max2j_max0p_LHE_pythia8+HARVESTGEN2
  • /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/data/PR-2f4d7b/556.0_TTbar_NLO_Pow_13TeV_py8+TTbar_Pow_LHE_13TeV+Hadronizer_TuneCP5_13TeV_powhegEmissionVeto2p_pythia8+HARVESTGEN2

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 12 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3965143
  • DQMHistoTests: Total failures: 19
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 3965101
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 455.331 KiB( 48 files compared)
  • DQMHistoSizes: changed ( 10024.0,... ): 12.002 KiB Generator/LHEWeight
  • DQMHistoSizes: changed ( 10024.0,... ): 11.963 KiB Generator/GenWeight
  • DQMHistoSizes: changed ( 312.0 ): -0.004 KiB MessageLogger/Warnings
  • Checked 204 log files, 45 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

@jfernan2
Contributor

@SanghyunKo I understand all the histograms added:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_12_3_X_2022-02-17-1100+2f4d7b/48440/dqm-histo-comparison-summary.html
are empty since the MC WFs tested:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2f4d7b/22493/runTheMatrix-results/
are Pythia based, hence LHE weights are null or constant.

I wonder if you could add some switch in the code to only produce them when the WF is based on an external LHE generator

@cmsbuild
Contributor

cmsbuild commented Mar 7, 2022

Pull request #36994 was updated. @SiewYan, @mkirsano, @emanueleusai, @ahmad3213, @cmsbuild, @GurpreetSinghChahal, @jfernan2, @Saptaparna, @alberto-sanchez, @pmandrik, @pbo0, @rvenditti can you please check and sign again.

@perrotta
Contributor

perrotta commented Mar 7, 2022

please test workflow 504, 555, 556

@cmsbuild
Contributor

cmsbuild commented Mar 7, 2022

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2f4d7b/22898/summary.html
COMMIT: eaec248
CMSSW: CMSSW_12_3_X_2022-03-06-2300/slc7_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/36994/22898/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

@slava77 comparisons for the following workflows were not done due to missing matrix map:

  • /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/data/PR-2f4d7b/504.0_QCD_Pt-30_13TeV_pythia8+QCD_Pt-30_13TeV_pythia8+HARVESTGEN
  • /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/data/PR-2f4d7b/555.0_DYTollJets_NLO_Mad_13TeV_py8+DYToll012Jets_5f_NLO_FXFX_Madgraph_LHE_13TeV+Hadronizer_TuneCP5_13TeV_aMCatNLO_FXFX_5f_max2j_max0p_LHE_pythia8+HARVESTGEN2
  • /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/data/PR-2f4d7b/556.0_TTbar_NLO_Pow_13TeV_py8+TTbar_Pow_LHE_13TeV+Hadronizer_TuneCP5_13TeV_powhegEmissionVeto2p_pythia8+HARVESTGEN2

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 9 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3987741
  • DQMHistoTests: Total failures: 14
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3987705
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 227.297 KiB( 48 files compared)
  • DQMHistoSizes: changed ( 10024.0,... ): 11.963 KiB Generator/GenWeight
  • Checked 204 log files, 45 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

@jfernan2
Contributor

jfernan2 commented Mar 7, 2022

+1

@Saptaparna
Contributor

Saptaparna commented Mar 7, 2022

+1
from generators (sorry for the delay)

@cmsbuild
Contributor

cmsbuild commented Mar 7, 2022

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@perrotta
Contributor

perrotta commented Mar 7, 2022

+1

Contributor

@perrotta perrotta left a comment

@SanghyunKo I was investigating the possible origin of the comparison errors that show up for wf 135.4 in the recent PR tests for the histogram made by this wgtVal_; they suggest a non-reproducibility issue, possibly due to uninitialized values.
In this line of the code I found a possible culprit: if I am not wrong, this should have been normalized to the first weight in the vector, i.e. index 0. If you normalize to the second element in the vector (index 1), you can get an undefined result whenever the vector has fewer than two elements.
Could you please check at your earliest convenience, and either apply this fix (if you think it is correct) or find and implement the appropriate one? Thank you.

nlogWgt_->Fill(std::log10(weights_.at(idxGenEvtInfo_).size()), weight_);

for (unsigned idx = 0; idx < weights_.at(idxGenEvtInfo_).size(); idx++)
  wgtVal_->Fill(weights_.at(idxGenEvtInfo_)[idx] / weights_.at(idxGenEvtInfo_)[1], weight_);
Contributor

Shouldn't this be

    wgtVal_->Fill(weights_.at(idxGenEvtInfo_)[idx] / weights_.at(idxGenEvtInfo_)[0], weight_);

instead?

Contributor Author

@perrotta this is intentional: the proper normalization of PS weights is done by dividing the weights by the baseline weight, which is located at idx 1 (Twiki). But I can indeed add a guard for the case weights_.at(idxGenEvtInfo_).size()=1, between the existing guards for the size()=0 case and the size()<=idxMax_ case

(filling weights_.at(idxGenEvtInfo_)[0]/weights_.at(idxGenEvtInfo_)[0] for size()=1 case is redundant)

  if (weights_.at(idxGenEvtInfo_).size()<2)
    return;  // no baseline weight in GenEventInfo

  for (unsigned idx = 0; idx < weights_.at(idxGenEvtInfo_).size(); idx++)
    wgtVal_->Fill(weights_.at(idxGenEvtInfo_)[idx] / weights_.at(idxGenEvtInfo_)[1], weight_);
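
The guard proposed above can be exercised outside CMSSW with a small stand-alone sketch. The function name `ratiosToBaseline` is hypothetical, the baseline index defaults to 1 as stated in the comment above, and the extra zero-baseline check is an assumption added here, not part of the proposed fix.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-alone version of the normalization discussed above.
// Returns each weight divided by the baseline weight, or an empty vector
// when there is no usable baseline.
std::vector<double> ratiosToBaseline(const std::vector<double>& weights,
                                     std::size_t baselineIdx = 1) {
  std::vector<double> ratios;

  // Guard: without this, weights[baselineIdx] reads past the end of a
  // vector that holds only the nominal weight.
  if (weights.size() <= baselineIdx)
    return ratios;

  const double baseline = weights[baselineIdx];
  if (baseline == 0.0)  // assumption: skip events with a zero baseline
    return ratios;

  ratios.reserve(weights.size());
  for (double w : weights)
    ratios.push_back(w / baseline);

  assert(ratios.size() == weights.size());
  return ratios;
}
```

A vector holding only the nominal weight yields an empty result, so a caller filling a histogram from it simply fills nothing for that event.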

@perrotta
Contributor

perrotta commented Mar 9, 2022

Thank you @SanghyunKo

Could it happen that there is no such baseline weight, only the nominal one (i.e. weights_.at(idxGenEvtInfo_).size() = 1)?

Did you check whether this was really the origin of the supposedly nonsensical entries in the wgtVal_ histo?

I think we can submit a fix PR with your suggestion, which is reasonable even if it does not fix the issue seen in the PR comparisons. Still, it would be good to run some simple check to verify that it actually fixes it.

@SanghyunKo
Contributor Author

@perrotta would you mind providing some snippet or link to the failing comparison you mentioned? It's not clear to me what you're referring to... If I get it then I can definitely run a quick test for it.

As for the number of weights, there should be both nominal & baseline weight when we have PS weights, but I realized that this isn't always true when we talk about Relvals (though it is mostly true in official samples).

@Dr15Jones
Contributor

The ASAN build is reporting an out-of-bounds memory read coming from GenWeightValidation::analyze

=================================================================
==17313==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x602000164e98 at pc 0x2aaf37f96ed5 bp 0x2aaf3dc2f990 sp 0x2aaf3dc2f988
READ of size 8 at 0x602000164e98 thread T3
    #0 0x2aaf37f96ed4 in GenWeightValidation::analyze(edm::Event const&, edm::EventSetup const&) (/cvmfs/cms-ib.cern.ch/nweek-02723/slc7_amd64_gcc11/cms/cmssw/CMSSW_12_3_ASAN_X_2022-03-09-1100/lib/slc7_amd64_gcc11/pluginValidationEventGenerator_plugins.so+0xffed4)
    #1 0x2aaef9168d47 in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) (/cvmfs/cms-ib.cern.ch/nweek-02723/slc7_amd64_gcc11/cms/cmssw/CMSSW_12_3_ASAN_X_2022-03-09-1100/lib/slc7_amd64_gcc11/libFWCoreFramework.so+0x8cfd47)
[cut]

0x602000164e98 is located 0 bytes to the right of 8-byte region [0x602000164e90,0x602000164e98)
allocated by thread T3 here:
    #0 0x2aaef7f77d07 in operator new(unsigned long) ../../../../libsanitizer/asan/asan_new_delete.cpp:99
    #1 0x2aaef9fdf1ac in void std::vector<std::vector<double, std::allocator<double> >, std::allocator<std::vector<double, std::allocator<double> > > >::_M_realloc_insert<std::vector<double, std::allocator<double> > const&>(__gnu_cxx::__normal_iterator<std::vector<double, std::allocator<double> >*, std::vector<std::vector<double, std::allocator<double> >, std::allocator<std::vector<double, std::allocator<double> > > > >, std::vector<double, std::allocator<double> > const&) (/cvmfs/cms-ib.cern.ch/nweek-02723/slc7_amd64_gcc11/cms/cmssw/CMSSW_12_3_ASAN_X_2022-03-09-1100/external/slc7_amd64_gcc11/lib/libMathCore.so+0xbc1ac)

SUMMARY: AddressSanitizer: heap-buffer-overflow (/cvmfs/cms-ib.cern.ch/nweek-02723/slc7_amd64_gcc11/cms/cmssw/CMSSW_12_3_ASAN_X_2022-03-09-1100/lib/slc7_amd64_gcc11/pluginValidationEventGenerator_plugins.so+0xffed4) in GenWeightValidation::analyze(edm::Event const&, edm::EventSetup const&)
Shadow bytes around the buggy address:
  0x0c0480024980: fa fa 00 fa fa fa fd fa fa fa fa fa fa fa fd fa
  0x0c0480024990: fa fa 00 00 fa fa 00 00 fa fa fd fd fa fa fd fd
  0x0c04800249a0: fa fa fd fa fa fa 00 fa fa fa 00 fa fa fa fd fa
  0x0c04800249b0: fa fa fd fd fa fa fd fa fa fa fd fd fa fa 00 00
  0x0c04800249c0: fa fa fd fd fa fa fd fa fa fa fd fa fa fa fd fd
=>0x0c04800249d0: fa fa 00[fa]fa fa fd fa fa fa fd fd fa fa fd fd
  0x0c04800249e0: fa fa fd fa fa fa fd fd fa fa fd fa fa fa fd fd
  0x0c04800249f0: fa fa fd fd fa fa 00 fa fa fa 00 00 fa fa 00 00
  0x0c0480024a00: fa fa fd fd fa fa fd fd fa fa fa fa fa fa fd fd
  0x0c0480024a10: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c0480024a20: fa fa 00 fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==17313==ABORTING
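
The report above (a read of size 8 located 0 bytes to the right of an 8-byte region) is consistent with reading index 1 of a vector that holds a single double. As a hedged illustration of why such a read can go unnoticed without a sanitizer: `operator[]` performs no bounds check, whereas `std::vector::at` throws `std::out_of_range`. The helper name `checkedRead` below is hypothetical.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: copies v[idx] into `out` and returns true, or returns
// false (leaving `out` untouched) when idx is past the end of the vector.
// std::vector::at performs the bounds check that operator[] omits.
bool checkedRead(const std::vector<double>& v, std::size_t idx, double& out) {
  try {
    out = v.at(idx);
    return true;
  } catch (const std::out_of_range&) {
    return false;
  }
}
```

In a hot loop an explicit size check before the loop is usually preferable to catching an exception per element; the checked accessor is shown only to make the failure mode visible.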

@perrotta
Contributor

perrotta commented Mar 9, 2022

Thank you @Dr15Jones : your observation is perfectly in line with what was discussed earlier on in this thread.

Since @SanghyunKo already proposed a fix, but such a fix was not submitted yet, I shamelessly copied the very same solution proposed by @SanghyunKo in a new PR, #37185, in order to speed up the possible integration in CMSSW in time for 12_3_0_pre6.

Of course, if @SanghyunKo or anyone else finds a more appropriate solution, that PR can be closed and we can move to the new one.

@SanghyunKo
Contributor Author

Thanks @perrotta, and no problem at all, it's my bad. I tried to reproduce the failing DQM comparison but couldn't; in any case, the address sanitizer shows that the fix is needed.
