Crashes in muon HLT reconstruction (reco::TrackExtra product not found)
#39064
A new Issue was created by @missirol Marino Missiroli. @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign hlt |
New categories assigned: hlt @missirol,@Martin-Grunewald you have been requested to review this Pull request/Issue and eventually sign? Thanks |
The new error is at least somewhat illuminating since it comes from a different CMSSW module as before, one I am more familiar with. Looking at the code, the only place where a |
For completeness, below are the details to rerun the relevant HLT menu on the problematic events, even though the 3 HLT crashes do not seem to be reproducible offline.
I ran the recipe for Run-357442 by adding |
Ok, so I ran my earlier test in a flawed way that led the ProductIDs to shift. Being more careful I see |
Thanks for having a look, Matti. Trying to find where the output collections of
A non-expert look inside suggests that the trackExtra ref is always set for the output tracks (I might be wrong):
Also, I don't know enough to really understand the error message. Based on the latter, can one say that the collection |
The exception in cmssw/DataFormats/Common/src/RefCore.cc Lines 125 to 131 in 2af4be6
which is called from RefCore::getProductPtr() cmssw/DataFormats/Common/src/RefCore.cc Lines 72 to 75 in 2af4be6
or from RefCore::getThinnedProductPtr() cmssw/DataFormats/Common/src/RefCore.cc Lines 103 to 107 in 2af4be6
Since the TrackExtraRef is a Ref to a std::vector<reco::TrackExtra>, the getThinnedProductPtr() is actually the one being called (even if there is no thinning going on). The code goes to check the existence of a thinned product if a non-thinned one is not found in cmssw/DataFormats/Common/interface/RefItemGet.h Lines 68 to 76 in 2af4be6
The tryToGetProductWithCoreFromRef() ends up calling cmssw/DataFormats/Common/src/RefCore.cc Line 92 in 2af4be6
(as in RefCore::getProductPtr() ).
The cmssw/FWCore/Framework/src/EventPrincipal.cc Line 288 in 2af4be6
and cmssw/FWCore/Framework/src/EventPrincipal.cc Line 254 in 2af4be6
where this case could happen only if there is no product at all with the given ID, or the producing module was run but did not produce the product (both of which should be reproducible, and should not be the case here). I was first wondering why exactly the code ends up searching the
The conditions leading to the |
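The lookup order described in this comment (try the non-thinned product first, then check for a thinned one, and throw only if neither exists) can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyLookup`, its two maps, and the integer IDs are hypothetical stand-ins and are not the actual `RefCore`, `RefItemGet.h`, or `EventPrincipal` code.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical, simplified lookup tables keyed by an integer product ID.
// This is NOT the CMSSW implementation; it only mirrors the order of checks.
struct ToyLookup {
  std::map<int, std::string> direct;   // products stored under their own ID
  std::map<int, std::string> thinned;  // thinned products, keyed by the parent ID

  // Try the non-thinned product first, fall back to a thinned one,
  // and throw only when neither is available.
  const std::string& resolve(int id) const {
    std::map<int, std::string>::const_iterator it = direct.find(id);
    if (it != direct.end()) return it->second;
    it = thinned.find(id);
    if (it != thinned.end()) return it->second;
    throw std::runtime_error("product not found for ID " + std::to_string(id));
  }
};
```

In this toy, the exception is reached only after both lookups fail, which matches the observation that the thinned path is taken even when no thinning is involved.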
assign core |
New categories assigned: core @Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks |
@Dr15Jones and I discovered a "logical race condition" in the framework that would cause symptoms like this (but we can't tell if it is really causing these problems). The cmssw/RecoMuon/L2MuonProducer/plugins/L2MuonProducer.cc Lines 114 to 117 in 2af4be6
and Track objects hold Refs to the TrackExtra . Downstream modules consume only the Track collection. The order of produces() declarations dictates the order in which Event::commit_aux() moves the products into the EventPrincipal (and to the corresponding ProductResolvers) after the EDProducer::produce() has successfully finished. When a scheduled module puts a product in a ProductResolver, the consumers of that product become eligible to run (unless some other product they depend on has not yet been produced) cmssw/FWCore/Framework/src/ProductResolvers.cc Lines 433 to 439 in 2af4be6
This means that the following can happen
A quick workaround (which I'm going to prepare) is to declare first the production of
We will need some time to think about a more general solution (I'd guess the
As far as I can tell, unscheduled modules are not affected, because for them the product insertion into ProductResolver does not impact module scheduling. Instead, upon prefetch a task that releases all the modules consuming that product is inserted into the cmssw/FWCore/Framework/src/ProductResolvers.cc Lines 468 to 515 in 2af4be6
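The ordering hazard described in this comment can be sketched with a deterministic toy model. It assumes a simplified event store in which putting a product immediately releases the consumers waiting on it; `ToyEvent`, `consumerSeesExtras`, and the labels `"tracks"` and `"trackExtras"` are hypothetical names chosen for illustration and are not CMSSW framework code. In the real framework the consumer runs concurrently, so the failure is only occasional; the toy runs the consumer at the worst possible moment to make the effect visible.

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

// Toy stand-in for an event's product store; NOT the CMSSW EventPrincipal.
struct ToyEvent {
  std::map<std::string, int> products;  // product label -> dummy payload
  // Callbacks released as soon as the product with the given label is put.
  std::map<std::string, std::vector<std::function<void(const ToyEvent&)>>> waiters;

  void put(const std::string& label, int payload) {
    products[label] = payload;
    auto it = waiters.find(label);
    if (it == waiters.end()) return;
    for (auto& callback : it->second) callback(*this);
  }
};

// Commits the producer's outputs one by one in the given declaration order.
// Returns whether a consumer that waits only on "tracks" could already find
// "trackExtras" at the moment it was released.
bool consumerSeesExtras(const std::vector<std::string>& declarationOrder) {
  ToyEvent event;
  bool extrasFound = false;
  event.waiters["tracks"].push_back([&extrasFound](const ToyEvent& e) {
    // The consumer follows the reference from a track to its extra.
    extrasFound = e.products.count("trackExtras") > 0;
  });
  for (const auto& label : declarationOrder) event.put(label, 1);
  return extrasFound;
}
```

Calling `consumerSeesExtras({"tracks", "trackExtras"})` returns `false` in this toy, while `consumerSeesExtras({"trackExtras", "tracks"})` returns `true`: committing the referenced product before the product that refers to it closes the window.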
This change works around a rare scheduling bug in the framework when these modules are run as scheduled, see cms-sw#39064
The PR is here #39201. I limited it to modules that I easily saw were being used in the HLT. |
I was able to reproduce the exception with the "Run-357442" recipe by adding a 1-second sleep between the iterations of the loop cmssw/FWCore/Framework/src/Event.cc Lines 220 to 229 in d21a67b
for the hltL2Muons module (running with 1 stream and 4 threads).
I tested the mitigation in #39201 with this setup, and the test job succeeded. Of course this test does not imply that there couldn't be some other module(s), or other pairs of similar data products that could cause this problem to show up. |
I think I have a proper fix in #39245. At least it works with the "reproducer" in #39064 (comment) . |
+hlt Thanks @makortel |
+core |
This issue is fully signed and ready to be closed. |
Over the last few weeks, HLT suffered 3 online crashes coming from the HLT-muon reconstruction; the first 2 errors are almost identical, while the 3rd one is rather similar to the first 2, but comes from a different producer. The error messages are given below [*].
So far, no one has been able to reproduce any of these errors locally with the relevant error-stream files.
I open an issue to keep track of this, and to ask experts if the error messages suggest to them anything about what might be going wrong.
This config file should be representative of the HLT menu used online during the crashes (representative at least for what concerns the sequences that contain the problematic modules).
FYI: @JanFSchulte @khaosmos93 (Muon-HLT contacts), @silviodonato @Martin-Grunewald @fwyzard
[*]
(CMSSW_12_4_3, Jul 29th):
(CMSSW_12_4_3, Aug 1st):
(CMSSW_12_4_6, Aug 14th):