ProductNotFound issue in T0 HLTMonitor processing jobs starting Run 378985 #44643
A new Issue was created by @saumyaphor4252. @makortel, @smuzaffar, @Dr15Jones, @sextonkennedy, @antoniovilela, @rappoccio can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here
assign dqm
New categories assigned: dqm @rvenditti, @syuvivida, @tjavaid, @nothingface0, @antoniovagnerini you have been requested to review this Pull request/Issue and eventually sign? Thanks
just for the record, the [...]
thus the [...]
We copied a few LSs from run 378981 (LS 455-465) in the playback region which were affected and reproduced this crash at playback; here is its log file. The information about how to reproduce the crash at lxplus can be found here, and the streamers have been copied at this path: [...]
assign core
New categories assigned: core @Dr15Jones, @makortel, @smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks
Dear all, the issue is now becoming a showstopper for T0 operations, with hundreds of paused jobs in the HLTMonitor stream and Prompt Reco processing now stopped starting from Run 378981: https://cms-talk.web.cern.ch/t/promptreco-is-paused-for-run2024b/38673. Can the experts please look into it with high priority? Thanks and regards,
Starting now. |
Thanks Matti. Note that at the Joint Ops meeting, DQM reported that the track extra collection is really not in the file output by the HLT. So the process that really needs to be debugged is the HLT executable, to determine why in a small fraction of the events the HLT does not manage to write out the TrackExtras.
Looking at this reproducer, the [...]
Instrumenting the [...] In the previous event ([...])
@cms-sw/hlt-l2 Is it possible to find out after the fact if the events [...]
type tracking
Maybe the EventAuxiliary processGUID? |
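For illustration, a minimal FWLite sketch of dumping the per-event processGUID to correlate events with the HLT process that produced them. The input file name is a placeholder and a CMSSW environment is assumed; this is just a sketch of the idea, not a tested recipe.

```python
# Dump run/lumi/event together with the processGUID for each event.
# Placeholder file name; run inside a CMSSW environment.
from DataFormats.FWLite import Events

events = Events("HLTMonitor_file.root")
for event in events:
    aux = event.eventAuxiliary()
    print(aux.run(), aux.luminosityBlock(), aux.event(), aux.processGUID())
```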
A recipe to rerun the HLT menu used to process these (kinds of) events would also help in the investigation, even if it would not reproduce this problem exactly.
Did anything change, e.g. in the HLT menu or in how HLT is being run in DAQ, in Run 378985?
Here's a possible recipe (not guaranteed to reproduce):

```bash
#!/bin/bash -ex

# CMSSW_14_0_4

hltGetConfiguration run:378985 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --output full \
  --max-events -1 \
  --input /store/data/Run2024B/EphemeralZeroBias0/RAW/v1/000/378/985/00000/f5f542ca-b93e-46e9-a136-7e9f1740218a.root \
  > hlt.py

cat <<@EOF >> hlt.py
process.hltOutputFull.outputCommands = [
    'keep *',
    'drop *_hltSiPixelDigisLegacy_*_*',
    'drop *_hltSiPixelClustersLegacy_*_*',
    'drop *_hltSiPixelRecHitsFromLegacy_*_*',
    'drop *_hltEcalDigisLegacy_*_*',
    'drop *_hltEcalUncalibRecHitLegacy_*_*',
    'drop *_hltHbherecoLegacy_*_*',
]
@EOF

cmsRun hlt.py &> hlt.log
```

(the customization of [...])
Run 378985 was the second run with stable beams at 13.6 TeV; however, the menu did not change with respect to the first stable-collisions run, 378981. The menu deployed for the first 13.6 TeV collisions (for both runs) is the full p-p physics menu (V1.0) with L1T seeds and HLT paths enabled: /cdaq/physics/Run2024/2e34/v1.0.3/HLT/V2
Hello, this was fixed from run 379067 with a full reset of one GPU on the host (also, possibly, shorter periods on Saturday didn't have this problem when we attempted another fix, but the problem returned quickly).
There is a correlation, it seems.
OK, noted, I take it back. It was my naive assumption that we would keep a homogeneous setup in Run 3, but, if that is not the case, we need a long-term solution.
I made an estimate of the additional bandwidth for option 1 on the current HLT farm. After removing PSetMap writing in [...], and taking into account that we run about 200 FUs with 8 processes each in the current configuration (this will increase by ~20% with new FUs): if we had one process block per stream file of a single process (we write one such file every lumisection), there would be 1600 such files written per stream per lumisection, or almost 69 Hz per stream. 69 Hz * 70 streams * 0.034 MB => 164 MB/s. Probably around 200 MB/s with the new FUs.
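For transparency, a back-of-the-envelope sketch of the arithmetic above. The ~23.3 s lumisection length is an assumption on my part; the other numbers are the ones quoted in the comment.

```python
# Back-of-the-envelope estimate of the extra bandwidth for option 1.
LS_SECONDS = 23.31     # assumed lumisection length
FUS = 200              # FUs in the current configuration
PROCESSES_PER_FU = 8   # cmsRun processes per FU
STREAMS = 70           # output streams
BLOCK_MB = 0.034       # size of one serialized process block, MB

files_per_stream_per_ls = FUS * PROCESSES_PER_FU    # 1600
rate_hz = files_per_stream_per_ls / LS_SECONDS      # ~68.6 Hz per stream
total_mb_s = rate_hz * STREAMS * BLOCK_MB           # ~163 MB/s
print(f"{rate_hz:.1f} Hz per stream -> {total_mb_s:.0f} MB/s in total")
```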
Thanks for the feedback. With @Dr15Jones and @wddgit we started to look into the details of "option 2" (and we will continue on this route unless decided otherwise, or we hit a blocker). We came to the conclusion that the Adler32 checksum stored in the [...] would do the necessary job to disambiguate the different Init messages at the granularity needed for the framework metadata. (For reference, that checksum corresponds to the serialized data of SendJobHeader, which includes IOPool/Streamer/src/StreamSerializer.cc, lines 67 to 69, at e427713.)

We have two questions to DAQ at this stage: [...]
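As an illustration of the disambiguation idea, here is a minimal sketch using Python's zlib.adler32. The byte strings are placeholders, not the actual CMSSW serialization, and how the checksum is stored is as discussed above.

```python
# Illustration only: equal Adler32 checksums over the serialized payload
# identify duplicate Init messages; different checksums mark process
# configurations that must be kept distinct.
import zlib

serialized_a = b"placeholder: serialized SendJobHeader from process A"
serialized_b = b"placeholder: serialized SendJobHeader from process B"

checksum_a = zlib.adler32(serialized_a)
checksum_b = zlib.adler32(serialized_b)

if checksum_a == checksum_b:
    print("same payload: the Init messages can be deduplicated")
else:
    print(f"distinct payloads: keep both ({checksum_a:#010x} vs {checksum_b:#010x})")
```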
This will have to be discussed more widely in the DAQ group since one part of the merging chain (all after the FU) is not maintained by me and we should agree on the approach (and feasibility).
Yes, INI chunks would definitely come before any events, as now.
This workaround is in [...]
Ok. Can you tell us anything about the timescale for a decision? I guess that if this approach were deemed unfeasible in the end, we'd go with option 1 ("process-level metadata section in the event data files").
Ok. I think this should also be fine from the framework perspective.
Typically we discuss such topics at our weekly meetings (Monday morning). |
assign daq
I discussed this with @GuillelmoGomezCeballos and another complication came up. Writing on a particular RUBU happens as soon as the lumisection is closed locally, before all other RUBUs have closed that lumisection; in fact, it can even happen before all FUs have provided their INI files, because the system is asynchronous across the participating nodes. So there is a chance to miss an INI file and still end up with events from the corresponding process written into the streamer file, which would lead to the problem we had over the last few days. While we could still consider (contrary to what I wrote before) writing INI files in the middle of streamer files, this would complicate synchronization between the writing processes, and we would like to avoid that. Therefore we prefer option 1.
Can't the merger prepend the necessary INI files to the sparse file once all the data has been accounted for?
What if you added the INI file info at the end of the streamer file? If the last 4 bytes of the file give the size the INI takes up at the end of the file, we could read the last bytes, use that to jump to the INI section, read that, and then start reading the Event data from the front of the file. That way you do not have to wait for all INIs before starting to write the Event data.
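A minimal sketch of the reading side of that scheme; the layout, endianness, and names here are hypothetical, just to make the idea concrete.

```python
# Hypothetical reader for a streamer file whose INI section is appended at
# the end, with the final 4 bytes giving the INI section size.
import os
import struct

def read_trailer_ini(path):
    with open(path, "rb") as f:
        f.seek(-4, os.SEEK_END)
        (ini_size,) = struct.unpack("<I", f.read(4))  # assume little-endian
        f.seek(-(4 + ini_size), os.SEEK_END)          # jump to the INI section
        ini_blob = f.read(ini_size)
    return ini_blob  # event data still starts at offset 0 in the file
```

The writer would append the INI data and the 4-byte length only once all INIs are in hand, without ever moving the event data already written at the front.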
I'm not an expert on this, but I think that, while the sparse file can have gaps, the start of (any kind of) file can't be moved, and it can be truncated only at the end.
That seems interesting, e.g. for the macro-merging stage (which currently deals with bookkeeping only).
From what I see in the long thread, there is already an option which involves not making changes downstream. Given all the trouble we are having elsewhere, and the fact that other changes are likely to happen, I don't see how we could make things more complicated. Therefore, let's use the so-called option 1 and avoid adding further files and changes involving the file system.
This is not a technical evaluation of the different solutions, it's just an "I'm busy, let somebody else do the work" argument. |
@smorovic About the additional-bandwidth estimate for "option 1" (#44643 (comment)): how would the picture change for heavy-ion data taking? Would there be, e.g., fewer streams (output files) per job per lumi compared to pp?
I checked one of the HI menus in 2023 [...]. So, even with more streams, it is expected to add 2 to 3 times less overhead than pp (less than 60 MB/s).
+heterogeneous
There is nothing left related to [...]. For [...]
Reporting the T0 processing error in jobs for the HLTMonitor stream for Run 378985, detailed in
https://cms-talk.web.cern.ch/t/express-paused-jobs-run2024b-productnotfound-error/38544
The tarball for the PSet is available at [...]
Somewhat similar symptoms also seem to be present for Run 378993, with error [...]
FYI @cms-sw/dqm-l2: may also be relevant to the online HLT DQM client crashes starting in Run 378981, reported at the DRM today:
https://cmsweb.cern.ch/dqm/dqm-square/api?what=get_logs&id=dqm-source-state-run378981-hostdqmfu-c2b03-45-01-pid3057526&db=production