Fatal Exception
in Prompt Reco of Run 367232, dataset JetMET0
#41645
Comments
assign dqm, l1 |
New categories assigned: dqm,l1 @tjavaid,@epalencia,@micsucmed,@nothingface0,@rvenditti,@emanueleusai,@syuvivida,@aloeliger,@cecilecaillol,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks |
A new Issue was created by @mmusich Marco Musich. @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
with a bit of offline diagnosis, the crash occurs here: cmssw/DQM/L1TMonitor/src/L1TObjectsTiming.cc Line 819 in 9556ab9
|
This apparently happens because in that event ( |
There is now a second Prompt Reco paused job with same symptoms |
urgent |
Also, the second crash is reproducible under CMSSW_13_0_5_patch1 with the following configuration:
import FWCore.ParameterSet.Config as cms
import pickle
with open('PSet.pkl', 'rb') as handle:
    process = pickle.load(handle)
process.options.numberOfThreads = 1
process.source.skipEvents = cms.untracked.uint32(680)
(taking the rest of the configuration here) The exception message is:
----- Begin Fatal Exception 13-May-2023 18:36:50 CEST-----------------------
An exception of category 'StdException' occurred while
[0] Processing Event run: 367315 lumi: 40 event: 94344670 stream: 0
[1] Running path 'dqmoffline_1_step'
[2] Calling method for module L1TObjectsTiming/'l1tObjectsTiming'
Exception Message:
A std::exception was thrown.
vector::_M_range_check: __n (which is 18446744073709551605) >= this->size() (which is 5)
----- End Fatal Exception ------------------------------------------------- |
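The index in the message above, 18446744073709551605, equals 2^64 - 11. That is the value a signed index of -11 takes when it is converted to the 64-bit unsigned size type that `std::vector::at` expects. The following is a minimal standalone sketch of that mechanism. It assumes (without proof from the log) that the module computed a negative signed index; `tryAccess` and `wrapped` are hypothetical helpers for illustration and are not part of CMSSW.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: converts a signed index to the unsigned type that
// std::vector::at expects and returns the exception message, if any.
std::string tryAccess(const std::vector<int>& v, int signedIndex) {
  try {
    (void)v.at(static_cast<std::size_t>(signedIndex));
  } catch (const std::out_of_range& e) {
    return e.what();  // with libstdc++ this starts with "vector::_M_range_check"
  }
  return std::string();  // empty string: the access was in range
}

// Hypothetical helper: the 64-bit unsigned value a signed index wraps to.
inline std::uint64_t wrapped(int signedIndex) {
  return static_cast<std::uint64_t>(static_cast<std::int64_t>(signedIndex));
}
```

Under this assumption, a vector of five histograms accessed with index -11 reproduces the same kind of range-check failure, while any index from 0 to 4 is accepted.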
Technically this change circumvents the crashes, though it would be good if the L1T / DQM experts find the root cause.
diff --git a/DQM/L1TMonitor/interface/L1TObjectsTiming.h b/DQM/L1TMonitor/interface/L1TObjectsTiming.h
index d39f6a4fe96..42b3ff4616e 100644
--- a/DQM/L1TMonitor/interface/L1TObjectsTiming.h
+++ b/DQM/L1TMonitor/interface/L1TObjectsTiming.h
@@ -38,6 +38,8 @@ protected:
void dqmBeginRun(const edm::Run&, const edm::EventSetup&) override;
void bookHistograms(DQMStore::IBooker&, const edm::Run&, const edm::EventSetup&) override;
void analyze(const edm::Event&, const edm::EventSetup&) override;
+ template <typename T>
+ bool checkBXCollection(const T*&);
private:
edm::EDGetTokenT<l1t::MuonBxCollection> ugmtMuonToken_;
diff --git a/DQM/L1TMonitor/src/L1TObjectsTiming.cc b/DQM/L1TMonitor/src/L1TObjectsTiming.cc
index c8edfe65685..bb6cfefa67f 100644
--- a/DQM/L1TMonitor/src/L1TObjectsTiming.cc
+++ b/DQM/L1TMonitor/src/L1TObjectsTiming.cc
@@ -1,4 +1,5 @@
#include "DQM/L1TMonitor/interface/L1TObjectsTiming.h"
+#include "FWCore/Utilities/interface/TypeDemangler.h"
L1TObjectsTiming::L1TObjectsTiming(const edm::ParameterSet& ps)
: ugmtMuonToken_(consumes<l1t::MuonBxCollection>(ps.getParameter<edm::InputTag>("muonProducer"))),
@@ -783,25 +784,44 @@ void L1TObjectsTiming::bookHistograms(DQMStore::IBooker& ibooker, const edm::Run
}
}
+template <typename T>
+bool L1TObjectsTiming::checkBXCollection(const T*& BXCollection) {
+ // check that all the BX collections do not exceed the expected range
+ if (BXCollection->getLastBX() - BXCollection->getFirstBX() > int(bxrange_ - 1)) {
+ edm::LogError("L1TObjectsTiming") << " Unexpected bunch crossing range in "
+ << edm::typeDemangle(typeid(BXCollection).name())
+ << " lastBX() = " << BXCollection->getLastBX()
+ << " - firstBX() = " << BXCollection->getFirstBX();
+ return false;
+ } else {
+ return true;
+ }
+}
+
void L1TObjectsTiming::analyze(const edm::Event& e, const edm::EventSetup& c) {
if (verbose_)
edm::LogInfo("L1TObjectsTiming") << "L1TObjectsTiming: analyze..." << std::endl;
// Muon Collection
- edm::Handle<l1t::MuonBxCollection> MuonBxCollection;
- e.getByToken(ugmtMuonToken_, MuonBxCollection);
+ const l1t::MuonBxCollection* MuonBxCollection = &e.get(ugmtMuonToken_);
+ if (!checkBXCollection(MuonBxCollection))
+ return;
// Jet Collection
- edm::Handle<l1t::JetBxCollection> JetBxCollection;
- e.getByToken(stage2CaloLayer2JetToken_, JetBxCollection);
+ const l1t::JetBxCollection* JetBxCollection = &e.get(stage2CaloLayer2JetToken_);
+ if (!checkBXCollection(JetBxCollection))
+ return;
// EGamma Collection
- edm::Handle<l1t::EGammaBxCollection> EGammaBxCollection;
- e.getByToken(stage2CaloLayer2EGammaToken_, EGammaBxCollection);
+ const l1t::EGammaBxCollection* EGammaBxCollection = &e.get(stage2CaloLayer2EGammaToken_);
+ if (!checkBXCollection(EGammaBxCollection))
+ return;
// Tau Collection
- edm::Handle<l1t::TauBxCollection> TauBxCollection;
- e.getByToken(stage2CaloLayer2TauToken_, TauBxCollection);
+ const l1t::TauBxCollection* TauBxCollection = &e.get(stage2CaloLayer2TauToken_);
+ if (!checkBXCollection(TauBxCollection))
+ return;
// EtSum Collection
- edm::Handle<l1t::EtSumBxCollection> EtSumBxCollection;
- e.getByToken(stage2CaloLayer2EtSumToken_, EtSumBxCollection);
+ const l1t::EtSumBxCollection* EtSumBxCollection = &e.get(stage2CaloLayer2EtSumToken_);
+ if (!checkBXCollection(EtSumBxCollection))
+ return;
// Open uGT readout record
edm::Handle<GlobalAlgBlkBxCollection> uGtAlgs; |
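The range check introduced by the diff can be tried outside of CMSSW with a small stand-in type. This is only a sketch: `MiniBXCollection` and `spansExpectedRange` are hypothetical names, and the stand-in exposes nothing beyond the two accessors the check uses.

```cpp
#include <cassert>

// Hypothetical stand-in exposing only the two accessors the check needs.
struct MiniBXCollection {
  int firstBX;
  int lastBX;
  int getFirstBX() const { return firstBX; }
  int getLastBX() const { return lastBX; }
};

// True when the collection spans at most `bxrange` bunch crossings.
template <typename T>
bool spansExpectedRange(const T& coll, unsigned int bxrange) {
  // Compare in signed arithmetic; converting bxrange to int before
  // subtracting avoids unsigned wraparound when bxrange is 0.
  return coll.getLastBX() - coll.getFirstBX() <= static_cast<int>(bxrange) - 1;
}
```

With a configured range of 5, a collection spanning BX -2 to 2 passes, while one spanning -10 to 11 (the case reported later in this thread) is rejected.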
@mmusich I'll treat this as a priority tomorrow. |
Okay, a quick 15 minute dissection shows that the crashing bxvector here is (at least initially) the muon bxvector. It has 22 BX's in it (-10 to 11) (instead of the expected 5), and is quite (suspiciously?) full in general:
It is also worth noting under this set-up, we do get a previous event, which shows the proper 5 BX's:
My knee-jerk reaction seeing this is that it is data corruption, perhaps related to the ongoing uGT packer/unpacker problem, but I have no definitive proof of that. The exact crash of course occurs here, cmssw/DQM/L1TMonitor/src/L1TObjectsTiming.cc Lines 818 to 819 in e2d3811
It finds some theoretically valid muon in BX 7:
and then comes up with an "index" of 9 for it in the vector of histograms in that arcane equation there. It then tries to access cmssw/DQM/L1TMonitor/src/L1TObjectsTiming.cc Lines 122 to 131 in e2d3811
coincidentally, cmssw/DQM/L1TMonitor/src/L1TObjectsTiming.cc Lines 3 to 34 in e2d3811
instead of being some
And when it tries to access the location defined by a bxrange of 5, at index 9, you get the usual C++ vector out-of-range failure. This will therefore crash (and crash uninformatively) on any BXVector with BXs outside of the expected 5. The collection these muons are loaded from is defined by cmssw/DQM/L1TMonitor/python/L1TObjectsTiming_cfi.py Lines 4 to 7 in ecf1aae
which is pretty much directly the gtStage2 digis. I could try to trace it back further, but that is pretty much just the raw to digi step at that point: cmssw/EventFilter/L1TRawToDigi/python/gtStage2Digis_cfi.py Lines 3 to 8 in ecf1aae
(which is some evidence for corrupted data) |
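The numbers quoted in this comment (a muon in BX 7 mapped to index 9 with a bxrange of 5) are consistent with an index of the form `bx + bxrange / 2`. The sketch below assumes that mapping purely for illustration. It is not copied from `L1TObjectsTiming.cc`, and `bxToIndex` is a hypothetical name. Instead of letting the vector access throw, it returns an empty optional for a BX outside the booked window.

```cpp
#include <cassert>
#include <optional>

// Assumed mapping for an odd bxrange centred on BX 0: BX -2..2 -> index 0..4.
std::optional<unsigned int> bxToIndex(int bx, unsigned int bxrange) {
  const int index = bx + static_cast<int>(bxrange) / 2;
  if (index < 0 || index >= static_cast<int>(bxrange)) {
    return std::nullopt;  // outside the booked histograms; caller should skip
  }
  return static_cast<unsigned int>(index);
}
```

A caller would then fill a histogram only when the returned optional holds a value, and could log the offending BX otherwise.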
I think so. Incidentally that's (almost) what I was proposing at #41645 (comment). Though I (personally) wouldn't crash the entire Prompt Reco because of a mismatch with the expected data format in a DQM module. Emitting a |
@mmusich forgive my inexperience on this but are I'm not sure we want to continue computation on some event we know has corrupted uGT data. We're just asking to crash somewhere else at that point, or continuing processing on junk. |
this is not what happens at Tier-0, and in general it's not a good policy, for the reasons explained in the thread at #41512. |
FWIW I agree that it would be better to consider the event as bad and skip it.
It is possible to throw a cms exception and configure the job to skip the event. However it's not obvious how to avoid writing a partially reconstructed event to the output files.
Maybe the Core Software group has some suggestions.
.A
|
Well, the L1 packer/unpacker pretty clearly needs some robustness upgrades in any case, but that is a longer term proposition. I can handle the immediate fallout and solution (logging an error and returning out, or throwing a defined exception) in any way the software and computing experts deem most appropriate. |
@cms-sw/core-l2
please advise, also in view of #41645 (comment) |
The problem with this approach is that the main consumer of the L1T information is the HLT, and the HLT will not be re-run during or after offline processing. So the information that the HLT used to route the event through the various paths is likely wrong, and the event should be dropped, because it cannot be properly accounted for in the purity/efficiency trigger measurements. |
We agree, but then planning an exception in a DQM module that runs in Prompt (2 days after data is taken) is less than useless. |
Our approach so far has been to not throw exceptions if the processing can otherwise continue. To me the best option would be for the unpacker to produce a data product that in some way conveys the information that the raw data was corrupt. Then everything downstream can easily ignore it if necessary, and e.g. HLT could even have a filter rejecting the event. While we do have a facility to "skip events for which specific exceptions are thrown", as discussed in #41512, it has not been battle-tested, and generally exceptions should not be used for control flow. In principle process.options.SkipEvent = cms.untracked.vstring("<exception category>") should do the job. Based on some experimentation, if the OutputModule consumes the data product from the module that threw the exception, the OutputModule seems to automatically skip the event. If there is no data dependence chain from the throwing module to the OutputModule (which I believe is the case here given the exception coming from a DQM module), the OutputModule needs to be somehow instructed to act based on the Path where the module is. This could be |
Well, it isn't a fast solution, but I can take a look at the unpacker and see if there's a way to provide proper null sets out of the unpacker if it doesn't pass a few basic quality tests. |
Since this is corrupt data of just the muons in the GT record, I would agree that the best strategy would be to have the GT unpacker detect the error, report it (in a place where a human will eventually notice it - is anyone really checking the LogErrors in the certification workflow?), and produce an empty GT muon collection (no need to fail the other non-muon triggers because of this) |
It may be just the muons. The muons were simply the first thing that crashed. I can check other collections for corruption too. |
@makortel I didn't have a lot of time to think about this yesterday, but could I borrow your judgement on how to go about unpacker modifications for this? If I go through the various GT unpackers and add a step into the process where it checks its own output for, say, having 5 BX's as a check on whether the data is corrupted, I could conceivably log an error and force it to instead output some empty BXVector that is unlikely to disturb anything downstream of it, but I'm not sure, without looking at the log errors by eye, how one catches this and differentiates this event as a problem, instead of an event with some genuinely empty vector.
I agree with @gpetruc that someone really does need to catch this but without disturbing other elements of the reconstruction, but I am concerned that simply outputting empty collections may help to bury the problem rather than catching it, alerting, and gracefully continuing.
I entertained the idea of some sort of module-in-the-middle style solution that would run after the unpackers in the RawToDigi sequence to inspect and catch any potential data corruption and insert a bit or something into the output that could be used to flag corruption in the unpacker so that other downstream clients could catch this, or HLT could filter on it, but I would imagine this is more trouble than it is worth, involves sensitive modifications to everything reliant on the RawToDigi step, and is also not truly a solution to the issue of some downstream something attempting to process on bad data. I guess I would also be terrified of potentially introducing a mistaken 100% removal filter into the process because of bad configuration of this.
The existing muon unpacker responsible for GT muons currently shows some corruption checks that are designed to throw exceptions: cmssw/EventFilter/L1TRawToDigi/plugins/implementations_stage2/MuonUnpacker.cc Lines 13 to 58 in 9642389
... but from the sounds of it, this is pretty off the table at this point, having made it to this stage? |
How is data corruption handled in unpackers of other sub-detectors? (bringing in @cms-sw/reconstruction-l2) |
After discussing with @Dr15Jones we came to the conclusion that from the framework point of view using "empty containers" is not a good way to communicate processing failures, because there is no way to distinguish that from the container being empty for "legitimate reasons". It would be better if the consuming modules would be able to ask questions like "is the input data product valid" and "if the input data is not valid, why is it not". In principle the If there is interest in the following kind of behavior, we could look into evolving the
In the meantime, similar behavior can to a large degree be accomplished by
|
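One way to picture the distinction drawn in the comment above, between a container that is empty "for legitimate reasons" and a product that is not valid, is a small wrapper that carries either the payload or a reason for its absence. This is only an illustrative sketch with hypothetical names (`MaybeProduct`, `makeValid`, `makeInvalid`); it does not represent the CMSSW framework API or the evolution being proposed.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical wrapper: holds a payload, or the reason why there is none.
template <typename T>
class MaybeProduct {
public:
  static MaybeProduct makeValid(T value) {
    MaybeProduct p;
    p.value_ = std::move(value);
    return p;
  }
  static MaybeProduct makeInvalid(std::string reason) {
    MaybeProduct p;
    p.reason_ = std::move(reason);
    return p;
  }
  // "Is the input data product valid?"
  bool isValid() const { return value_.has_value(); }
  // Throws std::bad_optional_access when the product is not valid.
  const T& get() const { return value_.value(); }
  // "If the input data is not valid, why is it not?"
  const std::string& whyInvalid() const { return reason_; }

private:
  MaybeProduct() = default;
  std::optional<T> value_;
  std::string reason_;
};
```

With such a type, a valid but empty collection and an absent collection are distinguishable: the former reports `isValid()` as true with an empty payload, the latter reports false together with a reason string.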
Okay. Then, if I am understanding correctly, L1 needs to:
In the long run the first set of things does seem ideal to me for this sort of situation, but I would like to help with the immediate solution first. |
@mmusich @makortel Following what I think was the desired short term solution, I have created a filter for the GT digis that will check for corruption (in this case, right now this is only defined as having output BX vectors with size different than a configured size), and will also attempt to produce either an empty BXvector if corruption is detected, or an identical BX vector in the case that it is not. The EDFilter will return false when any corruption of this kind is detected, and true whenever there is no corruption. @makortel my understanding is that this filter will then need to be inserted after the gt unpacking step, and anything reliant on the GT digis will need to instead be told to get their information from this? I will attempt to test this on the current problem recipe in any case. |
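The filter logic described in this comment can be sketched in isolation as follows. This is a hedged illustration and not the actual EDFilter: `MiniBXVector` is a hypothetical stand-in for a BX-indexed collection, `filterCorrupt` is a hypothetical name, and "corruption" is reduced to the single criterion mentioned above, a number of BXs different from the configured size.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a BX-indexed collection: BX number -> objects.
using MiniBXVector = std::map<int, std::vector<int>>;

// Returns {passed, output}: `true` and a copy of the input when the number
// of BXs matches the configured size, otherwise `false` and an empty
// collection.
std::pair<bool, MiniBXVector> filterCorrupt(const MiniBXVector& input,
                                            std::size_t expectedBXs) {
  if (input.size() != expectedBXs) {
    return {false, MiniBXVector{}};
  }
  return {true, input};
}
```

Downstream consumers would read the returned collection rather than the raw unpacker output, and a Path containing the filter would stop at the `false` result.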
Hi @aloeliger,
So, wouldn't it be enough to have an |
The use case for the "copied-over-or-dummy" product is modules consuming the data product that (a) are in Tasks (i.e. unscheduled), or are in Paths but (for whatever reason) are not controlled by the said filter, (b) cannot handle a missing input data product without throwing an exception, and (c) should nevertheless not throw the missing data product exception (i.e. emulating the behavior outlined in #41645 (comment)). |
Correct. Although from the framework perspective it would be better if the actual producer of GT digis would not produce the data product in case of the error, because then the error condition is more clear for all consumers of the data product. Do I assume correctly that the GT digis are not stored as such on any data tier? |
Fwiw, HLT currently stores them in the output of certain data streams (looking for |
Thanks. I suppose in the HLT context the affected OutputModules are already configured such that when the aforementioned EDFilter is added to the Sequence containing I also see them being stored as part of
and HLTDebugFEVT
I was asking because for any non-temporary storage of these data products it would be cleanest to record the failure by storing the "not-produced" product, and was wondering how much of this use case needs to be covered. |
Does anyone have any feedback on this potential evolution of |
For whatever my relatively inexpert opinion is worth, I think it's a good idea for exactly these cases. This event crashes, and crashes for a good reason. However, we should be able to crash events without crashing entire processes. |
In my understanding, it is. But I'll add that I still have to study this issue, and it's not clear to me what the exact changes to the HLT menu would be because of it (esp. when it comes to the "copied-over-or-dummy product"). The "aforementioned EDFilter" sounds similar in purpose to the workaround tested in [1]. On top of the general issue, it would be useful (to me) to understand better some aspects that are specific to the L1T unpacker.
Right, thanks for pointing this out. |
@mmusich is the file with this error still available to test the new filter on? |
I should add that the set of consuming modules that one needs to worry about includes all modules in EndPaths (whose execution is not controlled by |
@mmusich I just want to ping you again. The original method for recreating the error seems to no longer work because it can no longer find the error file on eos. Is this still available for testing the short term solution? |
It seems you are not following the cmsTalk thread linked in the original issue description. |
My apologies for my absence, I was at US CMS last week. Thanks for providing this. I will start trying to test with the stored tarballs ASAP. |
@mmusich Apologies, but I am still running into issues trying to recreate the error so I can test my solution. The tarball does not include the file that the original parameter set/process runs on. Is the file |
@aloeliger you'll have to see to reproduce with what is there or ask @germanfgv for clarifications. |
apparently it's only on tape:
should be trivial to make a rucio rule to get it back though. |
The rucio rule should be trivial... but whether I have enough good will at Wisconsin left to get them to put this onto their disk after the stunts I've pulled is another story altogether. I'll see what I can do. |
I take it back... the block that contains this is |
Is anybody able to test if the fix in #42214 can solve this issue? |
Apparently it's not straightforward to get back the RAW data for the Tier0 job that failed, see #41645 (comment) . If we are OK in getting a 2.5TB block at CERN, should be straightforward to check. |
It may solve the immediate problem, but the point remains the vector shouldn't have existed in the first place. I think there is some evidence at this point that there is something going wrong in one of the muon unpackers. I would be hesitant to insert this solution and call anything "fixed" |
Dear all,
there is one job failing Prompt Reco for run 367232, dataset JetMET0 with a Fatal Exception as described in https://cms-talk.web.cern.ch/t/fatal-exception-in-prompt-reco-of-run-367232-datatset-jetmet0/23996
The crash seems to originate from the module L1TObjectsTiming:
The exception is reproducible on a lxplus8 node under CMSSW_13_0_5_patch2 (el8_amd64_gcc11).
Full logs and PSet.py can be found at https://eoscmsweb.cern.ch/eos/cms/store/logs/prod/recent/PromptReco/PromptReco_Run367232_JetMET0/Reco/vocms014.cern.ch-415905-3-log.tar.gz
With this modified PSet.py file the crash occurs immediately:
It should be noted that the crash is preceded by these warnings (perhaps related):