HLT crash in run-367906 (sistrip::FEDBuffer::findChannels()) #41786
A new Issue was created by @missirol (Marino Missiroli). @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign hlt (I'll let others assign to other groups, if needed.) |
New categories assigned: hlt @missirol,@Martin-Grunewald you have been requested to review this Pull request/Issue and eventually sign? Thanks |
This is another instance of recent HLT crashes that I can't reproduce offline (see for example #40174, #41741 and #41742). This time I can also include the full log of the CMSSW job that crashed (see [1]), but I don't know if that helps.
@smorovic, is it possible to draw any conclusions comparing the log of the CMSSW job [1] and the content of the error-stream files [2]? [1] old_hlt_run367906_pid2793118.log [2] |
Event IDs in two raw files:
Last message in the log is from one of the previous events (file):
Timestamps of the last few files appearing locally at hltd for that process (last 3):
However, this looks OK. The last two files opened by the process were also saved; older ones were already handled and closed. For the crash, there is no information about the event ID (it is known only for an Exception). |
assign reconstruction FYI @cms-sw/tracking-pog-l2 |
New categories assigned: reconstruction @mandrenguyen,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks |
Possibly incidental, but there are two other threads in [...] |
So threads 36 and 6 (the crashing one) are operating on the same object: cmssw/RecoTracker/MeasurementDet/src/TkMeasurementDetSet.h Lines 207 to 211 in 1bce7ad
cmssw/RecoTracker/MeasurementDet/src/TkMeasurementDetSet.h Lines 230 to 241 in 1bce7ad
cmssw/RecoTracker/MeasurementDet/src/TkMeasurementDetSet.h Lines 250 to 256 in 1bce7ad
I'm assuming the [...]. On a cursory look the [...]. Another possible thread-safety problem is in cmssw/DataFormats/Common/interface/DetSetVectorNew.h Lines 634 to 649 in 1bce7ad
Here m_getter is defined as [...], but in practice it is used as a pointer to Getter, which is defined as [...], and LazyGetter<T>::fill() is not declared const! cmssw/DataFormats/Common/interface/DetSetVectorNew.h Lines 608 to 614 in 1bce7ad
So if the concrete LazyGetter<T>::fill() is not thread-safe, it could cause problems. In this case the concrete LazyGetter<T> is ClusterFiller (which I haven't digested yet). Note that despite all I wrote above, I can't tell from the stack trace whether the problem is really in thread safety or something else. |
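To make the suspected failure mode concrete, here is a minimal, self-contained sketch of the pattern described above, with hypothetical names (LazyGetter, Holder, fill() are stand-ins, not the actual CMSSW classes): an unsynchronized, non-const fill() reached through a const-looking interface.

```cpp
#include <thread>
#include <vector>

// Hypothetical stand-in for the lazy getter discussed above; the real
// CMSSW types (dslv::LazyGetter, ClusterFiller) are more involved.
struct LazyGetter {
  std::vector<int> cache;  // lazily filled, shared across threads
  bool filled = false;     // plain bool: no synchronization at all

  // Non-const and not thread-safe: two threads can both observe
  // filled == false and then mutate 'cache' concurrently.
  void fill() {
    if (!filled) {
      cache.assign(100, 42);  // data race if entered by two threads
      filled = true;
    }
  }
};

struct Holder {
  LazyGetter* m_getter = nullptr;  // set once, then shared by all threads

  // Const member function, so callers may assume concurrent calls are
  // safe; but in a const method only the *pointer* is const, not the
  // pointee, so the non-const fill() is still reachable.
  const std::vector<int>& data() const {
    m_getter->fill();  // mutation hiding behind a const interface
    return m_getter->cache;
  }
};

int main() {
  LazyGetter g;
  Holder h{&g};
  std::thread t1([&] { h.data(); });  // ThreadSanitizer would flag this
  std::thread t2([&] { h.data(); });
  t1.join();
  t2.join();
}
```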
This part is now addressed in #41853. It helped me reach the conclusion that the [...] looks like it would be thread safe. |
The race condition mentioned above is fixed in #41872. I'm not convinced, though, that it is the full cause of the crash. Ideally the race condition would only lead to [...] |
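For illustration, the textbook way to close this kind of race is to serialize the lazy fill, e.g. with std::call_once; a minimal sketch with hypothetical names follows (the actual change in #41872 may be structured quite differently).

```cpp
#include <mutex>
#include <thread>
#include <vector>

struct LazyGetter {
  std::vector<int> cache;
  std::once_flag onceFlag;  // coordinates the one-time fill

  const std::vector<int>& get() {
    // Exactly one thread runs the lambda; the others block until it
    // finishes, and its writes are guaranteed visible to them afterwards.
    std::call_once(onceFlag, [this] { cache.assign(100, 42); });
    return cache;
  }
};

int main() {
  LazyGetter g;
  std::thread t1([&] { g.get(); });  // safe: the fill happens once
  std::thread t2([&] { g.get(); });
  t1.join();
  t2.join();
}
```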
Thanks for the suggested fix, @makortel ! |
@missirol Do you want it backported to 13_0_X? (since it is unclear whether it plays a role in the crash) |
If it's clear that it is a fix (even partial), I would be in favor of backporting it, since we will still use 13_0_X online for a while. If it helps, I can prepare the backports. |
Thanks, I'll prepare the backports after the review of #41872 completes (in the current form it is easily cherry-pickable). |
As long as we're looking at DetSetNew: we've been getting DetSetNew assertion failures on aarch64 with some frequency,
from here: cmssw/DataFormats/Common/interface/DetSetNew.h Lines 84 to 88 in 1bce7ad
The test at line 85 looks to be wrong: it uses a bitwise OR instead of a logical one, and m_offset is initialized to -1. There's probably also a race condition, but I haven't stared at it long enough yet. Stack trace:
|
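For illustration, here is a hypothetical snippet (not the actual DetSetNew.h code) of why mixing a bitwise `|` with a member initialized to -1 is treacherous: `>=` binds tighter than `|`, and -1 has all bits set, so the OR can be unconditionally true.

```cpp
#include <iostream>

struct Span {
  int m_offset = -1;  // -1 is the "not yet assigned" sentinel
  int m_size = 0;

  // Buggy: '>=' binds tighter than '|', so this parses as
  // m_offset | (m_size >= 0). With m_offset == -1 (all bits set in
  // two's complement) the bitwise OR is nonzero, hence always "true".
  bool okBuggy() const { return m_offset | m_size >= 0; }

  // A logical test of the same two conditions, stating the intent.
  bool okLogical() const { return m_offset >= 0 && m_size >= 0; }
};

int main() {
  Span s;  // freshly constructed: offset still -1, clearly not valid
  std::cout << s.okBuggy() << '\n';    // prints 1: spurious pass
  std::cout << s.okLogical() << '\n';  // prints 0: correctly fails
}
```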
I agree (especially on the [...]).
At least the code has cmssw/RecoTracker/MeasurementDet/plugins/TkStripMeasurementDet.cc Lines 39 to 43 in 8617c80
which ends up calling [...]
which is part of the race condition I'm trying to fix in #41872 (assuming the stack trace is from an HLT job that does the on-demand strip unpacking and clustering; if not, the cause is likely something else) |
I think this is the case, as the config had:
process.hltSiStripRawToClustersFacility = cms.EDProducer( "SiStripClusterizerFromRaw",
    onDemand = cms.bool( True ),
    [..] |
I meant Dan's stack trace on the assertion failure on aarch64 (sorry for being unclear). |
Reporting another HLT crash which may be related to this issue. |
Reporting another HLT crash which may be related to this issue. |
Extracting more of the stack trace from #41786 (comment): |
Only one thread was in [...]. On the other hand, this observation supports my earlier hunch about the race condition in [...] |
@Dr15Jones pointed out that after #41872 the [...] cmssw/RecoTracker/MeasurementDet/src/TkMeasurementDetSet.h Lines 240 to 241 in e18c96d |
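As an aside on the general pattern at stake here, a hedged sketch (hypothetical names, simplified to a single writer) of why a "data first, flag second" publication needs acquire/release ordering on the flag rather than a plain bool:

```cpp
#include <atomic>
#include <thread>
#include <vector>

struct DetSet {
  std::vector<int> clusters;       // lazily produced payload
  std::atomic<bool> ready{false};  // publication flag

  void fillOnce() {
    clusters.assign(16, 1);                        // write the data first...
    ready.store(true, std::memory_order_release);  // ...then publish it
  }

  const std::vector<int>* tryGet() const {
    // The acquire load pairs with the release store above: a reader that
    // observes ready == true is guaranteed to also see the cluster writes.
    if (ready.load(std::memory_order_acquire))
      return &clusters;
    return nullptr;  // not filled yet; caller must fall back or retry
  }
};

int main() {
  DetSet ds;
  std::thread writer([&] { ds.fillOnce(); });
  std::thread reader([&] { ds.tryGet(); });  // no torn or unordered reads
  writer.join();
  reader.join();
}
```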
Fix proposed in #41936 (to be backported to 13_0_X as well) |
The fixes in #41872 and #41936 were integrated and backported, and after HLT deployed [...] |
type tracking |
Thanks @missirol for reporting the new stack trace. I didn't see any obviously related activity in the other threads. I suppose further investigation should focus on the contents of the [...] |
@dan131riley, would it be useful to backport #42194 to 13_0_X? |
That PR is entirely about reducing false positives; it wouldn't help with the HLT crashes. |
Naive question: are there circumstances where the FEDRawDataCollection could get released while the event is still in progress? Currently the on-demand getter holds a reference to the FEDRawDataCollection--should it be keeping a Handle to the FEDRawDataCollection instead? |
@dan131riley it is possible to tell the framework to delete a data product early. See [...]. If FEDRawDataCollection is marked for delete early, one must also specify any data products which reference it (say by holding pointers to it, or even [...]) |
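To illustrate the lifetime question in generic C++ terms (hypothetical types; RawCollection stands in for FEDRawDataCollection, and this is not the framework's actual mechanism): caching a plain reference is only safe while the owner keeps the product alive.

```cpp
#include <iostream>
#include <memory>
#include <vector>

using RawCollection = std::vector<int>;  // stand-in for FEDRawDataCollection

struct OnDemandGetter {
  const RawCollection& raw;  // non-owning: valid only while the owner lives
  explicit OnDemandGetter(const RawCollection& r) : raw(r) {}
  int first() const { return raw.at(0); }  // UB if 'raw' was destroyed
};

int main() {
  auto owner = std::make_unique<RawCollection>(RawCollection{7, 8, 9});
  OnDemandGetter getter(*owner);

  std::cout << getter.first() << '\n';  // fine: the product is still alive

  owner.reset();  // simulate the product being deleted early
  // getter.first();  // would now read through a dangling reference
}
```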
As far as I can see from a recent configuration (attached: hlt.py.gz), HLT does not perform any early deletion. |
Thanks, that all makes sense. I'm having trouble constructing scenarios that could account for the crashes in [...] |
Adding a belated summary of recent online crashes which might be related to this issue. All the runs below are 2023 pp-collision runs after run-369870. The CMSSW release used in these runs was [...]. Legend: run number, [total number of online crashes], number of crashes possibly related to this issue (based on my naive reading of the attached stack traces).
[*] Recipe tested on [...] |
If all these crashes have only appeared since CMSSW_13_0_9, maybe #42033 is related? |
I doubt it, since the first report is from May 28th ([...]). |
Ah OK, thanks for pointing this out. |
This type of crash didn't happen at all in 2024. Should we consider closing this issue? |
In run-367906 (pp collisions), DAQ reported 1 CMSSW crash at HLT (release: CMSSW_13_0_6) [link to HLT elog]. The stack trace is attached (f3mon_run367906.txt). A piece of the stack trace which is possibly relevant is in [1].
The corresponding error-stream files are available, but first attempts to reproduce the crashes offline failed (tried on a "Hilton" HLT node).
The recipe used for those failed attempts is adapted in [2] to be valid for lxplus and lxplus-gpu.
FYI: @cms-sw/hlt-l2 @silviodonato @fwyzard @mzarucki @trtomei
[1]
[2]