ProductNotFound issue in T0 HLTMonitor processing jobs starting Run 378985 #44643
A new Issue was created by @saumyaphor4252. @makortel, @smuzaffar, @Dr15Jones, @sextonkennedy, @antoniovilela, @rappoccio can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here
assign dqm
New categories assigned: dqm @rvenditti, @syuvivida, @tjavaid, @nothingface0, @antoniovagnerini you have been requested to review this Pull request/Issue and eventually sign? Thanks
just for the record, the [...]
thus the [...]
We copied a few LSs from run 378981 (LS 455-465) in the playback region which were affected and reproduced this crash at playback; here is its log file. The information about how to reproduce the crash at lxplus can be found here, and the streamers have been copied at this path: [...]
assign core
New categories assigned: core @Dr15Jones, @makortel, @smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks
Dear all, the issue is now becoming a showstopper for T0 operations, with hundreds of paused jobs in the HLTMonitor stream and Prompt Reco processing now stopped starting from Run 378981: https://cms-talk.web.cern.ch/t/promptreco-is-paused-for-run2024b/38673. Can the experts please look into it with high priority? Thanks and regards,
Starting now. |
Thanks Matti. Note that at the Joint Ops meeting, DQM reported that the track extra collection is really not in the file output by the HLT. So the process that really needs to be debugged is the HLT executable, to determine why in a small fraction of the events the HLT does not manage to write out the TrackExtras.
Looking at this reproducer, the [...]
Instrumenting the [...] In the previous event ([...])
@cms-sw/hlt-l2 Is it possible to find out after the fact if the events [...]
type tracking
Maybe the EventAuxiliary processGUID? |
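For illustration, a minimal FWLite sketch of dumping the per-event processGUID to correlate events with the HLT process that produced them. The input file name is a placeholder and a CMSSW environment is assumed; this is just a sketch of the idea, not a tested recipe.

```python
# Dump run/lumi/event together with the processGUID for each event.
# Placeholder file name; run inside a CMSSW environment.
from DataFormats.FWLite import Events

events = Events("HLTMonitor_file.root")
for event in events:
    aux = event.eventAuxiliary()
    print(aux.run(), aux.luminosityBlock(), aux.event(), aux.processGUID())
```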
A recipe to rerun the HLT menu used to process these (kinds of) events would also help in the investigation, even if it would not reproduce this problem exactly.
Did anything change, e.g. in the HLT menu or in how HLT is being run in DAQ, in Run 378985?
Here's a possible recipe (not guaranteed to reproduce):

```bash
#!/bin/bash -ex

# CMSSW_14_0_4

hltGetConfiguration run:378985 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --output full \
  --max-events -1 \
  --input /store/data/Run2024B/EphemeralZeroBias0/RAW/v1/000/378/985/00000/f5f542ca-b93e-46e9-a136-7e9f1740218a.root \
  > hlt.py

cat <<@EOF >> hlt.py
process.hltOutputFull.outputCommands = [
    'keep *',
    'drop *_hltSiPixelDigisLegacy_*_*',
    'drop *_hltSiPixelClustersLegacy_*_*',
    'drop *_hltSiPixelRecHitsFromLegacy_*_*',
    'drop *_hltEcalDigisLegacy_*_*',
    'drop *_hltEcalUncalibRecHitLegacy_*_*',
    'drop *_hltHbherecoLegacy_*_*',
]
@EOF

cmsRun hlt.py &> hlt.log
```

(the customization of [...])
Run 378985 was the second run with stable beams at 13.6 TeV; however, the menu did not change with respect to the first stable-collisions run, 378981. The menu deployed for the first 13.6 TeV collisions (for both runs) is the full p-p physics menu (V1.0) with L1T seeds and HLT paths enabled: /cdaq/physics/Run2024/2e34/v1.0.3/HLT/V2
Hello, this was fixed from run 379067 with a full reset of one GPU on the host (also, possibly, shorter periods on Saturday didn't have this problem when we attempted another fix, but the problem returned quickly).
There is a correlation, it seems.
OK, noted, I take it back. It was my naive assumption that we would keep a homogeneous setup in Run 3, but, if that is not the case, we need a long-term solution.
I made an estimate of the additional bandwidth for option 1 on the current HLT farm. After removing PSetMap writing in [...], and taking into account that we run about 200 FUs with 8 processes each in the current configuration (this will increase by ~20% with new FUs): if we had one process block per stream file of a single process (we write one such file every lumisection), there would be 1600 such files written per stream per lumisection, or almost 69 Hz per stream. 69 Hz * 70 streams * 0.034 MB => 164 MB/s. Probably around 200 MB/s with the new FUs.
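For transparency, a back-of-the-envelope sketch of the arithmetic above. The ~23.3 s lumisection length is an assumption on my part; the other numbers are the ones quoted in the comment.

```python
# Back-of-the-envelope estimate of the extra bandwidth for option 1.
LS_SECONDS = 23.31     # assumed lumisection length
FUS = 200              # FUs in the current configuration
PROCESSES_PER_FU = 8   # cmsRun processes per FU
STREAMS = 70           # output streams
BLOCK_MB = 0.034       # size of one serialized process block, MB

files_per_stream_per_ls = FUS * PROCESSES_PER_FU    # 1600
rate_hz = files_per_stream_per_ls / LS_SECONDS      # ~68.6 Hz per stream
total_mb_s = rate_hz * STREAMS * BLOCK_MB           # ~163 MB/s
print(f"{rate_hz:.1f} Hz per stream -> {total_mb_s:.0f} MB/s in total")
```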
Thanks for the feedback. With @Dr15Jones and @wddgit we started to look into the details of "option 2" (and we will continue on this route unless decided otherwise, or we hit a blocker). We came to the conclusion that the Adler32 checksum stored in the [...] would do the necessary job to disambiguate the different Init messages at the granularity needed for the framework metadata. (For reference, that checksum corresponds to the serialized data of SendJobHeader, which includes IOPool/Streamer/src/StreamSerializer.cc, lines 67 to 69, at e427713.)

We have two questions to DAQ at this stage: [...]
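As an illustration of the disambiguation idea, here is a minimal sketch using Python's zlib.adler32. The byte strings are placeholders, not the actual CMSSW serialization, and how the checksum is stored is as discussed above.

```python
# Illustration only: equal Adler32 checksums over the serialized payload
# identify duplicate Init messages; different checksums mark process
# configurations that must be kept distinct.
import zlib

serialized_a = b"placeholder: serialized SendJobHeader from process A"
serialized_b = b"placeholder: serialized SendJobHeader from process B"

checksum_a = zlib.adler32(serialized_a)
checksum_b = zlib.adler32(serialized_b)

if checksum_a == checksum_b:
    print("same payload: the Init messages can be deduplicated")
else:
    print(f"distinct payloads: keep both ({checksum_a:#010x} vs {checksum_b:#010x})")
```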
This will have to be discussed more widely in the DAQ group since one part of the merging chain (all after the FU) is not maintained by me and we should agree on the approach (and feasibility).
Yes, INI chunks would definitely come before any events, as now.
This workaround is in [...]
Ok. Can you tell us anything about the timescale for a decision? I guess that if this approach were deemed unfeasible in the end, we'd go with option 1 ("process-level metadata section in the event data files").
Ok. I think this should also be fine from the framework perspective.
Typically we discuss such topics at our weekly meetings (Monday morning). |
assign daq
I discussed this with @GuillelmoGomezCeballos and another complication came up. Writing on a particular RUBU happens as soon as the lumisection is closed locally, before all other RUBUs have closed that lumisection; in fact, it can even happen before all FUs have provided their INI files, because the system is asynchronous across the participating nodes. So there is a chance to miss an INI file and still end up with events from the corresponding process written into the streamer file, which would lead to the problem we had over the last few days. While we could still consider (contrary to what I wrote before) writing INI files in the middle of streamer files, this would complicate synchronization between the writing processes, and we would like to avoid that. Therefore we prefer option 1.
Can't the merger prepend the necessary INI files to the sparse file once all the data has been accounted for?
What if you added the INI file info at the end of the streamer file? If the last 4 bytes of the file give the size the INI takes up at the end of the file, we could read the last bytes, use that to jump to the INI section, read that, and then start reading the Event data from the front of the file. That way you do not have to wait for all INIs before starting to write the Event data.
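A minimal sketch of the reading side of that scheme; the layout, endianness, and names here are hypothetical, just to make the idea concrete.

```python
# Hypothetical reader for a streamer file whose INI section is appended at
# the end, with the final 4 bytes giving the INI section size.
import os
import struct

def read_trailer_ini(path):
    with open(path, "rb") as f:
        f.seek(-4, os.SEEK_END)
        (ini_size,) = struct.unpack("<I", f.read(4))  # assume little-endian
        f.seek(-(4 + ini_size), os.SEEK_END)          # jump to the INI section
        ini_blob = f.read(ini_size)
    return ini_blob  # event data still starts at offset 0 in the file
```

The writer would append the INI data and the 4-byte length only once all INIs are in hand, without ever moving the event data already written at the front.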
I'm not an expert on this, but I think that, while the sparse file can have gaps, the start of (any kind of) file can't be moved, and it can be truncated only at the end.
That seems interesting, e.g. for the macro-merging stage (which currently deals with bookkeeping only).
From what I see in the long thread, there is already an option which involves not making changes downstream. Given all the trouble we are having elsewhere, and the fact that other changes are likely to happen, I don't see how we could make things more complicated. Therefore, let's use the so-called option 1 and avoid adding further files and changes involving the file system.
This is not a technical evaluation of the different solutions, it's just an "I'm busy, let somebody else do the work" argument. |
@smorovic About the additional-bandwidth estimate for "option 1" (#44643 (comment)): how would the picture change for heavy-ion data taking? Would there be, e.g., fewer streams (output files) per job per lumi compared to pp?
I checked one of the HI menus in 2023 [...]. So, even with more streams, it is expected to add 2 to 3 times less overhead than pp (less than 60 MB/s).
+heterogeneous
There is nothing left related to [...]. For [...]
Reporting the T0 processing error in jobs for the HLTMonitor stream for Run 378985, detailed in
https://cms-talk.web.cern.ch/t/express-paused-jobs-run2024b-productnotfound-error/38544
The tarball for the PSet is available at [...]
Somewhat similar symptoms also seem to be present for Run 378993, with error [...]
FYI @cms-sw/dqm-l2: may also be relevant to the online HLT DQM client crashes starting in Run 378981, reported at the DRM today:
https://cmsweb.cern.ch/dqm/dqm-square/api?what=get_logs&id=dqm-source-state-run378981-hostdqmfu-c2b03-45-01-pid3057526&db=production