Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failure in reading back 12_4_HLT_X Run3ScoutingVertex collection from CMSSW_14_0_X #45888

Closed
mmusich opened this issue Sep 5, 2024 · 28 comments

Comments

@mmusich
Copy link
Contributor

mmusich commented Sep 5, 2024

@AdrianoDee reported a failure when setting up Summer24 campaigns (see here for some previous discussion).
When running (as agreed elsewhere) the data generation chain in CMSSW_14_0_X (while running the HLT in CMSSW_12_4_HLT_X) a Fatal Root Error occurs:

----- Begin Fatal Exception 05-Sep-2024 08:21:25 UTC-----------------------
An exception of category 'FileReadError' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 1 stream: 0
   [1] Running path 'RECOSIMoutput_step'
   [2] Prefetching for module PoolOutputModule/'RECOSIMoutput'
   [3] While reading from source std::vector<Run3ScoutingVertex> hltScoutingPrimaryVertexPacker 'primaryVtx' HLT
   [4] Reading branch Run3ScoutingVertexs_hltScoutingPrimaryVertexPacker_primaryVtx_HLT.
   Additional Info:
      [a] Fatal Root Error: @SUB=TBufferFile::CheckByteCount
object of class vector<Run3ScoutingVertex> read too many bytes: 57 instead of 45
----- End Fatal Exception -------------------------------------------------

Suspicion is that Run3ScoutingVertex data format changed in 14_0_X, that is the release PdmV uses for all the steps but HLT. In this specific case, where new data members have been added to the original format.
To be honest at the moment I am baffled by this.
Backwards compatibility of scouting data-formats should be guaranteed as they are treated as any other "RAW" data-tier.
There are dedicated unit tests that ensure we can continue to be able reading in newer releases HLT scouting files produced in 12_4_0 and 13_0_3, see here and here for more details.
A reproducer (in plain CMSSW_12_4_0) is:

wrkdir=${PWD}/test_${RANDOM}

mkdir $wrkdir

release=CMSSW_12_4_0
if [ ! -r ${release}/src ] ;
   then scram p CMSSW ${release} ;
fi

cd ${release}/src
eval `scram runtime -sh`
cd $wrkdir

cmsDriver.py SingleMuPt1_pythia8_cfi  -s GEN,SIM -n 10 --conditions 124X_mcRun3_2022_realistic_v5 --beamspot Run3RoundOptics25ns13TeVLowSigmaZ --datatier GEN-SIM --eventcontent FEVTDEBUG --geometry DB:Extended --era Run3 --relval 25000,100 --fileout file:step1.root

cmsDriver.py step2  -s DIGI:pdigi_valid,L1,DIGI2RAW,HLT:@relval2021 --conditions 124X_mcRun3_2022_realistic_v5 --datatier GEN-SIM-DIGI-RAW -n 10 --eventcontent FEVTDEBUGHLT --geometry DB:Extended --era Run3 --filein  file:step1.root  --fileout file:step2.root

release=CMSSW_14_0_0

if [ ! -r ${release}/src ] ;
   then scram p CMSSW ${release} ;
fi

cd ${release}/src
eval `scram runtime -sh`
cd $wrkdir

cmsDriver.py step3  -s RAW2DIGI,L1Reco,RECO,RECOSIM,PAT,NANO,VALIDATION:@standardValidation+@miniAODValidation,DQM:@standardDQM+@ExtraHLT+@miniAODDQM+@nanoAODDQM --conditions 140X_mcRun3_2022_realistic_v5 --datatier GEN-SIM-RECO,MINIAODSIM,NANOAODSIM,DQMIO -n 10 --eventcontent RECOSIM,MINIAODSIM,NANOEDMAODSIM,DQM --geometry DB:Extended --era Run3 --filein  file:step2.root  --fileout file:step3.root
@cmsbuild
Copy link
Contributor

cmsbuild commented Sep 5, 2024

cms-bot internal usage

@cmsbuild
Copy link
Contributor

cmsbuild commented Sep 5, 2024

A new Issue was created by @mmusich.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@mmusich
Copy link
Contributor Author

mmusich commented Sep 5, 2024

I am not sure if it is relevant, but testing with:

void myScript() {
    // Open the ROOT file
    //TFile *file = TFile::Open("testRun3Scouting_v3_v5_v3_v4_v5_v3_v5_v3_v3_CMSSW_12_4_0.root");
    TFile *file = TFile::Open("step2.root");

    // Check if file is successfully opened
    if (!file || file->IsZombie()) {
        std::cerr << "Error opening file!" << std::endl;
        return;
    }

    // Show streamer information
    file->ShowStreamerInfo();

    // Close the file
    file->Close();
}

either the file out of the test at #45888 (comment) or the dedicated unit test file for CMSSW_12_4_X (here), I am getting different results:
In one case I have:

StreamerInfo for class: edm::Wrapper<vector<Run3ScoutingVertex> >, version=3, checksum=0xfab077ed
  edm::WrapperBase BASE            offset=  0 type= 0
  bool           present         offset=  0 type=18
  vector<Run3ScoutingVertex> obj             offset=  0 type=300 ,stl=1, ctype=61,  (Run3ScoutingVertex)

StreamerInfo for class: Run3ScoutingVertex, version=3, checksum=0x12dc9576
  float          x_              offset=  0 type= 5
  float          y_              offset=  0 type= 5
  float          z_              offset=  0 type= 5
  float          zError_         offset=  0 type= 5
  float          xError_         offset=  0 type= 5
  float          yError_         offset=  0 type= 5
  int            tracksSize_     offset=  0 type= 3
  float          chi2_           offset=  0 type= 5
  int            ndof_           offset=  0 type= 3
  bool           isValidVtx_     offset=  0 type=18

while in the other case (the file coming from the HLT) only:

StreamerInfo for class: edm::Wrapper<vector<Run3ScoutingVertex> >, version=3, checksum=0xfab077ed
  edm::WrapperBase BASE            offset=  0 type= 0
  bool           present         offset=  0 type=18
  vector<Run3ScoutingVertex> obj             offset=  0 type=300 ,stl=1, ctype=61,

@vlimant
Copy link
Contributor

vlimant commented Sep 5, 2024

N.B. changing the split level of the step2 FEVTDEBUGHLT output above to 1 (instead of 0 default), let things to run smoothly

@vlimant
Copy link
Contributor

vlimant commented Sep 5, 2024

do we know how were those testRun3Scouting_*root files created ? I suspect they were split further then what files coming from HLT are ; my 2 cents

@mmusich
Copy link
Contributor Author

mmusich commented Sep 5, 2024

do we know how were those testRun3Scouting_*root files created ? I suspect they were split further then what files coming from HLT are ; my 2 cents

yes, there's a detailed description here.

@vlimant
Copy link
Contributor

vlimant commented Sep 5, 2024

If I read this correctly

process.out = cms.OutputModule("PoolOutputModule",
the split level is the default (= 9) of the output module ; would explain why the test succeeds there, but not in the recipe above.
the read test should be done, IMO, from an actual file that was produced by the HLT. I'm confused why it's not the case, but there must be a reason.

@mmusich
Copy link
Contributor Author

mmusich commented Sep 5, 2024

the split level is the default (= 9)

somewhat irrelevant, but I think the default is 99 (not 9):

desc.addUntracked<int>("splitLevel", 99)->setComment("Default ROOT branch split level in output file.");

I'm confused why it's not the case, but there must be a reason.

the test was created by the core framework group in #41834 (tagging @wddgit)

@mmusich
Copy link
Contributor Author

mmusich commented Sep 5, 2024

assign core, hlt

@cmsbuild
Copy link
Contributor

cmsbuild commented Sep 5, 2024

New categories assigned: core,hlt

@Dr15Jones,@Martin-Grunewald,@mmusich,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

@mmusich
Copy link
Contributor Author

mmusich commented Sep 5, 2024

I'm confused why it's not the case, but there must be a reason.

the main advantage I see, is that you can fully control the values you put in input (and check that you get exactly those back when reading)

@mmusich
Copy link
Contributor Author

mmusich commented Sep 5, 2024

the read test should be done, IMO, from an actual file that was produced by the HLT.

moreover the actual HLT is using the default splitLevel, as the scouting dataset is configured as

process.hltOutputScoutingPF = cms.OutputModule( "PoolOutputModule",
    fileName = cms.untracked.string( "outputScoutingPF.root" ),
    compressionAlgorithm = cms.untracked.string( "ZLIB" ),
    compressionLevel = cms.untracked.int32( 1 ),
    fastCloning = cms.untracked.bool( False ),
    dataset = cms.untracked.PSet(
        filterName = cms.untracked.string( "" ),
        dataTier = cms.untracked.string( "RAW" )
    ),
    SelectEvents = cms.untracked.PSet(  SelectEvents = cms.vstring( 'Dataset_ScoutingPFRun3' ) ),
    outputCommands = cms.untracked.vstring( 'drop *',
      'keep *_hltFEDSelectorL1_*_*',
      'keep *_hltScoutingEgammaPacker_*_*',
      'keep *_hltScoutingMuonPacker_*_*',
      'keep *_hltScoutingPFPacker_*_*',
      'keep *_hltScoutingPrimaryVertexPacker_*_*',
      'keep *_hltScoutingTrackPacker_*_*',
      'keep edmTriggerResults_*_*_*' )
)

splitLevel = cms.untracked.int32(0) is coming from EventContent_cff:

FEVTDEBUGHLTEventContent = cms.PSet(
outputCommands = cms.untracked.vstring('drop *'),
splitLevel = cms.untracked.int32(0),
)

@makortel
Copy link
Contributor

makortel commented Sep 5, 2024

I'm confused why it's not the case, but there must be a reason.

the main advantage I see, is that you can fully control the values you put in input (and check that you get exactly those back when reading)

This is indeed the case. In order to test all the nuances of the ROOT I/O, the inputs are better to be fully controlled rather than coming from the data (or simulation).

The point on the behavior depending on the split level is good though (given we have seen the schema evolution behavior to depend on it), perhaps we need to think of testing both splitlevel 0 and 99 in all of these tests.

@makortel
Copy link
Contributor

makortel commented Sep 5, 2024

I guess the problem wasn't caught by the test_MC_22_setup test, because the HLT step there still fails with

----- Begin Fatal Exception 05-Sep-2024 12:47:53 CEST-----------------------
An exception of category 'ConditionsError' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 1 stream: 0
   [1] Running path 'HLT_PPSMaxTracksPerRP4_v2'
   [2] Calling method for module L1TGlobalProducer/'hltGtStage2ObjectMap'
Exception Message:
 Error L1 menu loaded in via conditions does not match the L1 actually run 1517097079 vs 2016981387. This means that the mapping of the names to the bits may be incorrect. Please check the L1TUtmTriggerMenuRcd record supplied. Unless you know what you are doing, do not simply disable this check via the config as this a major error and the indication of something very wrong
----- End Fatal Exception -------------------------------------------------

and therefore the subsequent steps have never been tested (until now).

@mmusich
Copy link
Contributor Author

mmusich commented Sep 5, 2024

I guess the problem wasn't caught by the test_MC_22_setup test, because the HLT step there still fails with

the reasons of these failures are explained at https://gitlab.cern.ch/cms-ppd/dataset-management/dm-coordination/-/issues/15#note_8410470 and are entirely ascribable to inconsistent settings of the input Global Tags (namely the tag associated to L1TUtmTriggerMenuRcd) by PdmV / AlCa

@makortel
Copy link
Contributor

makortel commented Sep 5, 2024

I guess the problem wasn't caught by the test_MC_22_setup test, because the HLT step there still fails with

the reasons of these failures are explained at https://gitlab.cern.ch/cms-ppd/dataset-management/dm-coordination/-/issues/15#note_8410470 and are entirely ascribable to inconsistent settings of the input Global Tags (namely the tag associated to L1TUtmTriggerMenuRcd) by PdmV / AlCa

But I guess those inconsistencies have been resolved somewhere given the recipe in the description?

@makortel
Copy link
Contributor

makortel commented Sep 5, 2024

I guess the problem is #41348 and can be fixed by including --customise IOPool/Input/fixReading_12_4_X_Files.fixReading_12_4_X_Files to cmsDriver.py.

@mmusich
Copy link
Contributor Author

mmusich commented Sep 5, 2024

But I guess those inconsistencies have been resolved somewhere given the recipe in the description?

The recipe in description works, because of

cmsDriver.py step2  -s DIGI:pdigi_valid,L1,DIGI2RAW,HLT:@relval2021 --conditions 124X_mcRun3_2022_realistic_v5 --datatier GEN-SIM-DIGI-RAW -n 10 --eventcontent FEVTDEBUGHLT --geometry DB:Extended --era Run3 --filein  file:step1.root  --fileout file:step2.root

i.e. L1,DIGI2RAW and HLT are running in the same step with the same GlobalTag (which I would not call a solution).

@makortel
Copy link
Contributor

makortel commented Sep 5, 2024

I tested the recipe in the description, and can confirm adding the --customise IOPool/Input/fixReading_12_4_X_Files.fixReading_12_4_X_Files fixes the problem.

@mmusich
Copy link
Contributor Author

mmusich commented Sep 5, 2024

indeed, just for the record this [1] works fine.

[1]

Click me
wrkdir=${PWD}/test_${RANDOM}

mkdir $wrkdir

release=CMSSW_12_4_0
if [ ! -r ${release}/src ] ;
   then scram p CMSSW ${release} ;
fi

cd ${release}/src
eval `scram runtime -sh`
cd $wrkdir

cmsDriver.py SingleMuPt1_pythia8_cfi  -s GEN,SIM -n 10 --conditions 124X_mcRun3_2022_realistic_v5 --beamspot Run3RoundOptics25ns13TeVLowSigmaZ --datatier GEN-SIM --eventcontent FEVTDEBUG --geometry DB:Ex
tended --era Run3 --relval 25000,100 --fileout file:step1.root

cmsDriver.py step2  -s DIGI:pdigi_valid,L1,DIGI2RAW,HLT:@relval2021 --conditions 124X_mcRun3_2022_realistic_v5 --datatier GEN-SIM-DIGI-RAW -n 10 --eventcontent FEVTDEBUGHLT --geometry DB:Extended --era R
un3 --filein  file:step1.root  --fileout file:step2.root 

release=CMSSW_14_0_0

if [ ! -r ${release}/src ] ;
   then scram p CMSSW ${release} ;
fi

cd ${release}/src
eval `scram runtime -sh`
cd $wrkdir

cmsDriver.py step3  -s RAW2DIGI,L1Reco,RECO,RECOSIM,PAT,NANO,VALIDATION:@standardValidation+@miniAODValidation,DQM:@standardDQM+@ExtraHLT+@miniAODDQM+@nanoAODDQM --conditions 140X_mcRun3_2022_realistic_v
5 --datatier GEN-SIM-RECO,MINIAODSIM,NANOAODSIM,DQMIO -n 10 --eventcontent RECOSIM,MINIAODSIM,NANOEDMAODSIM,DQM --geometry DB:Extended --era Run3 --filein  file:step2.root  --fileout file:step3.root --customise IOPool/Input/fixReading_12_4_X_Files.fixReading_12_4_X_Files

@mmusich
Copy link
Contributor Author

mmusich commented Sep 5, 2024

re-iterating on this:

I guess the problem wasn't caught by the test_MC_22_setup test, because the HLT step there still fails with

This patch

diff --git a/Configuration/AlCa/python/autoCond.py b/Configuration/AlCa/python/autoCond.py
index 01e4c09ddac..96ef7ec02a1 100644
--- a/Configuration/AlCa/python/autoCond.py
+++ b/Configuration/AlCa/python/autoCond.py
@@ -115,6 +115,9 @@ autoCond = autoCondRelValForRun2(autoCond)
 from Configuration.AlCa.autoCondModifiers import autoCondRelValForRun3
 autoCond = autoCondRelValForRun3(autoCond)
 
+from Configuration.AlCa.autoCondModifiers import autoCond2022MCTest
+autoCond = autoCond2022MCTest(autoCond)
+
 # GlobalTag for Run1 data reprocessing, history was carried over to run2 GTs
 autoCond['run1_data']        = autoCond['run2_data']
 
diff --git a/Configuration/AlCa/python/autoCondModifiers.py b/Configuration/AlCa/python/autoCondModifiers.py
index 7523b7b9943..89eaf320bdd 100644
--- a/Configuration/AlCa/python/autoCondModifiers.py
+++ b/Configuration/AlCa/python/autoCondModifiers.py
@@ -142,3 +142,14 @@ def autoCondBSHLLHC13TeV(autoCond):
 
     autoCond.update(GlobalTagBSHLLHC13TeV)
     return autoCond
+
+def autoCond2022MCTest(autoCond):
+
+    GlobalTagFor2022MC = {}
+    L1TUtmTriggerMenuFor2022v12 = ','.join( ['L1Menu_Collisions2022_v1_2_0_xml' , "L1TUtmTriggerMenuRcd", connectionString, "", ""] )
+    for key,val in autoCond.items():
+        if 'phase1_2022_realistic' in key :
+            GlobalTagFor2022MC[key+'_2022v12'] = (autoCond[key],
+                                                  L1TUtmTriggerMenuFor2022v12)
+    autoCond.update(GlobalTagFor2022MC)
+    return autoCond
diff --git a/Configuration/PyReleaseValidation/test/BuildFile.xml b/Configuration/PyReleaseValidation/test/BuildFile.xml
index 23681eb2d06..8abf9ff7ca9 100644
--- a/Configuration/PyReleaseValidation/test/BuildFile.xml
+++ b/Configuration/PyReleaseValidation/test/BuildFile.xml
@@ -22,7 +22,7 @@
 <!-- In CMSSW_12_4_20 the auto:phase1_2022_realistic (pre EE) is 124X_mcRun3_2022_realistic_v12 -->
 <ifarchitecture name="el8_amd64">
   <!-- Run this test for production arch of CMSSW_12_4_21_HLT release only -->
-  <test name="test_MC_22_setup" command="test_mc_setup/test_MC_setup.sh auto:phase1_2022_realistic Run3 2022v14 CMSSW_12_4_21_HLT 124X_mcRun3_2022_realistic_v12 Realistic25ns13p6TeVEarly2022Collision" />
+  <test name="test_MC_22_setup" command="test_mc_setup/test_MC_setup.sh auto:phase1_2022_realistic_2022v12 Run3 2022v12 CMSSW_12_4_21_HLT 124X_mcRun3_2022_realistic_v12 Realistic25ns13p6TeVEarly2022Collision" />
 </ifarchitecture>
 
 <!-- 
diff --git a/Configuration/PyReleaseValidation/test/test_mc_setup/test_MC_setup_gen_sim.sh b/Configuration/PyReleaseValidation/test/test_mc_setup/test_MC_setup_gen_sim.sh
index 1b37bee2e89..95e44d735b7 100755
--- a/Configuration/PyReleaseValidation/test/test_mc_setup/test_MC_setup_gen_sim.sh
+++ b/Configuration/PyReleaseValidation/test/test_mc_setup/test_MC_setup_gen_sim.sh
@@ -31,6 +31,7 @@ cmsDriver.py  --python_filename raw_digi.py --eventcontent RAWSIM \
 --customise Configuration/DataProcessing/Utils.addMonitoring \
 --datatier GEN-SIM-RAW --filein file:step1.root \
 --fileout file:step2.root --conditions $conditions --step DIGI,L1,DIGI2RAW \
+--customise_commands 'process.GlobalTag.DumpStat = cms.untracked.bool(True)' \
 --geometry DB:Extended --era $era --mc -n -1 --no_exec
 
 if [ $? -ne 0 ]; then

exposes the problem. I hesitate to make a PR to "fix" the test as at the moment we're still discussing within TSG / PPD if not to move to yet another menu (v1.5) for re-HLT in 2022 MC. @cms-sw/pdmv-l2 @cms-sw/alca-l2

@makortel
Copy link
Contributor

makortel commented Sep 5, 2024

There are dedicated unit tests that ensure we can continue to be able reading in newer releases HLT scouting files produced in 12_4_0 and 13_0_3, see here and here for more details.

I tested re-producing the 12_4_0 test file with splitlevel set to 0, and with that input the TestRun3ScoutingDataFormats unit test indeed fails with similar symptoms, and the failure gets fixed with the IOPool/Input/fixReading_12_4_X_Files.fixReading_12_4_X_Files customize function.

The point on the behavior depending on the split level is good though (given we have seen the schema evolution behavior to depend on it), perhaps we need to think of testing both splitlevel 0 and 99 in all of these tests.

I think we really need to do this, and am spinning off that work into a separate issue #45931.

@makortel
Copy link
Contributor

makortel commented Sep 5, 2024

+core

@AdrianoDee
Copy link
Contributor

I hesitate to make a PR to "fix" the test as at the moment we're still discussing within TSG / PPD if not nto move to yet another menu (v1.5) for re-HLT in 2022 MC.

Thanks @mmusich, I was waiting too the final decision on the setup to decide how to fix the tests.

@mmusich
Copy link
Contributor Author

mmusich commented Sep 6, 2024

I was waiting too the final decision on the setup to decide how to fix the tests.

It was ultimately decided to have the option of using V1.5 instead: thus I opened #45939. If this is what we'll actually use, all tests in open cycles will need to be updated once we have a release containing it.

@mmusich
Copy link
Contributor Author

mmusich commented Oct 18, 2024

was waiting too the final decision on the setup to decide how to fix the tests.

Now that the situation seems to be relatively stabilized (albeit I am not yet sure the GTs are fully final) I opened:

@mmusich
Copy link
Contributor Author

mmusich commented Oct 25, 2024

+hlt

@cmsbuild
Copy link
Contributor

This issue is fully signed and ready to be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants
@makortel @cmsbuild @vlimant @mmusich @AdrianoDee and others