HLT crashes in GPU and CPU in collision runs #38453
Comments
A new Issue was created by @swagata87 Swagata Mukherjee. @Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign hlt, reconstruction |
New categories assigned: hlt,reconstruction @jpata,@missirol,@clacaputo,@Martin-Grunewald you have been requested to review this Pull request/Issue and eventually sign? Thanks |
@swagata87 could you provide the full stack traces for the job that failed with the segmentation violations? |
Three examples are pasted below:
The full list is here:
|
Dear tracker DPG (@cms-sw/trk-dpg-l2), general instructions to set up a CMSSW area on the GPU nodes online are here. The HLT configuration file is: https://swmukher.web.cern.ch/swmukher/hlt_v5.py. I have copied one, in case it is useful. Then, at the end, the following block was added:
Let me know if something was unclear. |
@swagata87 thank you for providing these instructions! @tsusa you can use the online GPU machines to reproduce the issue:
ssh gpu-c2a02-39-01.cms
mkdir -p /data/$USER
cd /data/$USER
source /data/cmssw/cmsset_default.sh
cmsrel CMSSW_12_3_5
cd CMSSW_12_3_5
mkdir run
cd run
cp ~hltpro/error/hlt_error_run353941.py .
cmsRun hlt_error_run353941.py
In my test the problem did not happen every time; I had to run the job a few times before it crashed:
It eventually crashed, though I'm not 100% sure if it was due to the same problem :-/ |
Yes, looks like the same crash:
|
As a guess, I think the problem is that an extremely large amount of data is being requested to be copied, which leads to a memory overwrite into a protected memory space. This is just based on what edm::Event::emplaceImpl is doing, which is basically calling cmssw/DataFormats/SiPixelRawData/interface/SiPixelErrorsSoA.h, lines 13 to 14 at 6d2f660
|
So cms::cuda::SimpleVector does not initialize any of its member data in its constructor.
If the first call to SiPixelDigiErrorsSoAFromCUDA::acquire hits this condition (cmssw/EventFilter/SiPixelRawToDigi/plugins/SiPixelDigiErrorsSoAFromCUDA.cc, lines 54 to 55 at d573dd2),
then this call in produce
will just copy a random number of bytes from a random memory address. |
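As an editorial illustration of the failure mode described above, here is a hedged sketch using simplified stand-in types (ErrorVectorSketch and copyErrorsOut are invented for this example; they are not the real cms::cuda::SimpleVector or the plugin code):

// Simplified stand-in, not the CMSSW classes: a default-constructed object whose
// constructor leaves its members untouched must not be trusted before it has been
// explicitly (re)initialized.
#include <cstdio>
#include <cstring>
#include <vector>

template <typename T>
struct ErrorVectorSketch {
  ErrorVectorSketch() = default;  // members stay indeterminate, as described above
  int m_size;
  int m_capacity;
  T* m_data;
  int size() const { return m_size; }
  T const* data() const { return m_data; }
};

// Stand-in for the copy done in produce(): it trusts size() and data() blindly.
template <typename T>
void copyErrorsOut(ErrorVectorSketch<T> const& errors, std::vector<T>& out) {
  out.resize(errors.size());                                          // random size if uninitialized
  std::memcpy(out.data(), errors.data(), errors.size() * sizeof(T));  // random source address
}

int main() {
  ErrorVectorSketch<unsigned int> error_;  // never filled, e.g. because acquire() bailed out early
  std::vector<unsigned int> host;
  // copyErrorsOut(error_, host);  // undefined behavior, left commented out on purpose
  std::printf("an uninitialized error_ would make the copy above read garbage\n");
  return 0;
}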
@Dr15Jones thanks for investigating the issue.
This is intended, because a
A minimal fix could be:
diff --git a/EventFilter/SiPixelRawToDigi/plugins/SiPixelDigiErrorsSoAFromCUDA.cc b/EventFilter/SiPixelRawToDigi/plugins/SiPixelDigiErrorsSoAFromCUDA.cc
index 4037b4d5061..554f1425cef 100644
--- a/EventFilter/SiPixelRawToDigi/plugins/SiPixelDigiErrorsSoAFromCUDA.cc
+++ b/EventFilter/SiPixelRawToDigi/plugins/SiPixelDigiErrorsSoAFromCUDA.cc
@@ -28,7 +28,7 @@ private:
edm::EDPutTokenT<SiPixelErrorsSoA> digiErrorPutToken_;
cms::cuda::host::unique_ptr<SiPixelErrorCompact[]> data_;
- cms::cuda::SimpleVector<SiPixelErrorCompact> error_;
+ cms::cuda::SimpleVector<SiPixelErrorCompact> error_ = cms::cuda::make_SimpleVector<SiPixelErrorCompact>(0, nullptr);
const SiPixelFormatterErrors* formatterErrors_ = nullptr;
};
With it I have been able to run over 20 times on the same input as before without triggering any errors. |
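As a side note on the pattern this fix uses (a sketch under assumptions, with simplified stand-in names, not the actual CMSSW code): a default member initializer built from a make_*-style factory gives the member a well-defined empty state, so a consumer that copies size() elements copies nothing when acquire() returned early.

#include <cassert>
#include <cstddef>

// Simplified stand-ins, not the real cms::cuda types.
template <typename T>
struct ErrorVectorSketch {
  int m_size;
  int m_capacity;
  T* m_data;
  constexpr int size() const { return m_size; }
};

// Analogous in spirit to cms::cuda::make_SimpleVector(capacity, data).
template <typename T>
constexpr ErrorVectorSketch<T> makeErrorVectorSketch(int capacity, T* data) {
  return {0, capacity, data};
}

class ProducerSketch {
  // The pattern of the fix: error_ is empty and non-dangling from construction on,
  // even if no event ever fills it.
  ErrorVectorSketch<unsigned int> error_ = makeErrorVectorSketch<unsigned int>(0, nullptr);

public:
  std::size_t bytesToCopy() const { return error_.size() * sizeof(unsigned int); }
};

int main() {
  ProducerSketch p;
  assert(p.bytesToCopy() == 0);  // a later copy would transfer zero bytes, which is harmless
  return 0;
}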
PRs with this fix: |
Hm, looks like I am late to the party... but, if it's any help, here are instructions for the error seen in Run 353744 (AFAICT you have been testing with Run 353941). Running in Hilton this time:
I also see the same problem, it crashes only every once in a while. It's probably the same bug, but I add it here for completeness. |
I also have here the other crash, this one is fully reproducible:
It will always crash on the 52nd event,
PS: it's not needed to run on Hilton at all, I was running in offline-like mode. |
@trtomei could you clarify
Running online, I have not been able to reproduce the error using the |
@fwyzard To clarify:
Maybe sit together with me tomorrow and we solve this. |
Is this issue still relevant? |
Actually, yesterday we had a crash which looks like the one discussed here.
Run number: 360224
|
The files in ROOT format and the HLT configuration are in:
|
@cms-sw/tracking-pog-l2 In this issue, one HLT crash is not yet solved, and I would say we need help from tracking experts in order to find a fix. The crash is reproducible offline (see #38453 (comment)); it comes from the (HLT) pixel reconstruction, and it only happens on CPU, not on GPU (from what we have seen so far). Removing some |
I have a vague recollection of a comment from @VinInn saying that we should simply remove the assert. I think now it's OK to have ntuplets with 5 hits, so an alternative could be to change the condition to allow up to 5 hits. |
At least, removing the assert [1] avoids the crash. And just for my understanding: is it expected that, for the same event, we do not see an ntuplet with size=5 on GPU? [1] https://github.com/cms-sw/cmssw/blob/CMSSW_12_4_10_patch2/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h#L293 |
It does not happen on GPU because asserts are removed. |
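For context, this is standard C++ behavior rather than anything CMSSW-specific: assert() expands to nothing when NDEBUG is defined, so (assuming the GPU code path is built that way, as the comment above suggests) only the build with active asserts aborts. A minimal sketch:

// Compile once with `g++ check.cc` and once with `g++ -DNDEBUG check.cc` to see the
// same out-of-range value abort in one case and pass silently in the other.
#include <cassert>
#include <cstdio>

int main() {
  int ntupletSize = 5;  // hypothetical value violating the old bound of 4
#ifdef NDEBUG
  std::printf("NDEBUG defined: assert() is compiled out, execution continues\n");
#else
  std::printf("NDEBUG not defined: the next line aborts the program\n");
#endif
  assert(ntupletSize <= 4);
  std::printf("reached only when the assert is disabled (or the condition holds)\n");
  return 0;
}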
Okay, thanks, but still: I tried to just print the ntuplet size while running on GPU, and I didn't see a size=5. |
Thanks for having a look. I checked that (unsurprisingly) the HLT runs fine on these 'error events', for both CPU and GPU, after changing the 4 to a 5 in the asserts, so in the meantime I'll open PRs with that change to gain time. |
@cms-sw/hlt-l2 (now speaking with the ORM hat, in order to better coordinate the creation of the next patch releases):
|
Yes, that is my understanding.
There are two more issues, but those crashes have been rare: #39568 , which ECAL has promised to look into, and #38651, which might somehow have been a glitch (seen only once). FOG (@trtomei) can tell us if there are any new online crashes without a CMSSW issue. |
+hlt |
So the tuplet in question is joining layer-pairs 0,3,10,7,12, i.e. all 6 layers: BPIX1,2,3 and FPIX1,2,3. How can I run hlt_for_debug.py on GPU and NOT on CPU?
Anyhow, if we "observe" sextuplets we need to allow sextuplets in the code... so the fix of the asserts is OK (the arrays were already over-dimensioned). |
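To make the "over-dimensioned arrays" remark concrete, here is a hedged sketch with a simplified stand-in (FixedVecSketch is not the real container holding tmpNtuplet, and the capacity of 6 is only an assumption for illustration): the asserts guard a logical bound, so raising it from 4 to 5 is safe as long as it stays within the fixed physical capacity.

#include <cassert>

// Simplified fixed-capacity container in the spirit of the one used for tmpNtuplet.
template <typename T, int MAXSIZE>
struct FixedVecSketch {
  T m_data[MAXSIZE];
  int m_size = 0;
  int push_back_unsafe(T const& value) {
    m_data[m_size] = value;  // no bounds check, mirroring push_back_unsafe in the real code
    return m_size++;
  }
  int size() const { return m_size; }
  static constexpr int capacity() { return MAXSIZE; }
};

int main() {
  using Ntuplet = FixedVecSketch<unsigned int, 6>;  // assumed capacity, for illustration only
  static_assert(Ntuplet::capacity() >= 5, "relaxed bound still fits in the storage");
  Ntuplet tmpNtuplet;
  for (unsigned int doubletId = 0; doubletId < 5; ++doubletId)
    tmpNtuplet.push_back_unsafe(doubletId);
  assert(tmpNtuplet.size() <= 5);  // the relaxed logical check
  return 0;
}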
The sextuplet is on GPU as well |
In case you are interested, here are the coordinates of the hits:
|
Looks like this was already solved. I add one comment for documentation purposes. The complication comes from the fact that the HLT menu includes 2 prescaled triggers that run the pixel CPU-only reco (which is why we saw the crash online). To ensure that only the pixel GPU reco is running, one solution is to remove them, but that's tricky to do starting from the full menu [1]; alternatively, one can just run 1 appropriate Path instead of the full menu (most times, this is enough for a reproducer) [2]. In the future, we/HLT should maybe try to build 'minimal' reproducers, e.g. not using the full menu if that's not needed.
[1] Add at the end of the configuration:
del process.DQM_PixelReconstruction_v4
del process.AlCa_PFJet40_CPUOnly_v1
del process.HLT_PFJet40_GPUvsCPU_v1
process.hltMuonTriggerResultsFilter.triggerConditions = ['FALSE']
del process.PrescaleService
del process.DQMHistograms
dpaths = [foo for foo in process.paths_() if foo.startswith('Dataset_')]
for foo in dpaths: process.__delattr__(foo)
fpaths = [foo for foo in process.finalpaths_()]
for foo in fpaths: process.__delattr__(foo)
[2] In this case, it could have been:
hltGetConfiguration run:360224 \
--data \
--no-prescale \
--no-output \
--globaltag 124X_dataRun3_HLT_v4 \
--paths AlCa_PFJet40_v* \
--max-events -1 \
--input file:run360224_ls0081_file1.root \
> hlt.py
cat <<@EOF >> hlt.py
process.MessageLogger.cerr.FwkReport.limit = 1000
process.options.numberOfThreads = 1
process.options.accelerators = ['cpu']
@EOF
cmsRun hlt.py &> hlt.log |
If it's not too much trouble to explain, I would be interested to know how to extract the information on layer pairs and r-z coordinates for a given candidate. I see the crash at |
|
I saw the printout twice, so I added the ifdef part. |
btw the method |
Thanks a lot for the info. |
(This issue is solved; the rest below is just me trying to learn things.) With Vincenzo's diff, I get what he wrote: same sextuplet on CPU and GPU. In my previous attempts, I had additional printouts in GPUCACell::find_ntuplets. At least now I see what I was doing differently.
[*] (yes, most of these printouts are pointless)
diff --git a/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h b/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h
index 4ec7069ac8e..bfefdf7ccd6 100644
--- a/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h
+++ b/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h
@@ -290,15 +290,37 @@ public:
auto doubletId = this - cells;
tmpNtuplet.push_back_unsafe(doubletId);
- assert(tmpNtuplet.size() <= 4);
+ assert(tmpNtuplet.size() <= 5);
+ if (tmpNtuplet.size()>4) {
+#ifdef __CUDACC__
+ printf("GPU ");
+#else
+ printf("CPU ");
+#endif
+ for (auto c : tmpNtuplet) printf("%d,",cells[c].theLayerPairId_);
+ printf(" r/z: ");
+ for (auto c : tmpNtuplet) printf("%f/%f,", cells[c].theInnerR,cells[c].theInnerZ);
+ auto c = tmpNtuplet[tmpNtuplet.size()-1]; printf("%f/%f,",cells[c].outer_r(hh),cells[c].outer_z(hh));
+ printf("\n");
+ }
bool last = true;
for (unsigned int otherCell : outerNeighbors()) {
if (cells[otherCell].isKilled())
continue; // killed by earlyFishbone
last = false;
+#ifdef __CUDACC__
+ printf("GPU1 tmpNtuplet.size = %d\n", tmpNtuplet.size());
+#else
+ printf("CPU1 tmpNtuplet.size = %d\n", tmpNtuplet.size());
+#endif
cells[otherCell].find_ntuplets<DEPTH - 1>(
hh, cells, cellTracks, foundNtuplets, apc, quality, tmpNtuplet, minHitsPerNtuplet, startAt0);
+#ifdef __CUDACC__
+ printf("GPU2 tmpNtuplet.size = %d\n", tmpNtuplet.size());
+#else
+ printf("CPU2 tmpNtuplet.size = %d\n", tmpNtuplet.size());
+#endif
}
if (last) { // if long enough save...
if ((unsigned int)(tmpNtuplet.size()) >= minHitsPerNtuplet - 1) {
@@ -331,7 +353,12 @@ public:
}
}
tmpNtuplet.pop_back();
- assert(tmpNtuplet.size() < 4);
+ assert(tmpNtuplet.size() < 5);
+#ifdef __CUDACC__
+ printf("GPU3 tmpNtuplet.size = %d\n", tmpNtuplet.size());
+#else
+ printf("CPU3 tmpNtuplet.size = %d\n", tmpNtuplet.size());
+#endif
}
// Cell status management |
In my previous attempts, I had additional printouts in GPUCACell::find_ntuplets, and in that case I couldn't see the sextuplet on GPU. I think this is somewhat reproducible: I ran 30 times with this diff [*] and I could see the sextuplet on GPU in the printouts only 2 times (on CPU, I saw it 10 times out of 10).
This is surprising, as we do not expect GPU vs CPU differences at this point of processing.
Will try to investigate more.
|
@missirol
|
btw: printf from GPU is not guaranteed to appear if there are too many. |
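A small aside on this caveat (standard CUDA runtime behavior, not CMSSW-specific): device-side printf writes into a fixed-size FIFO that is only flushed at synchronization points, and output is silently dropped once the buffer is full, so a heavily instrumented kernel can appear to "miss" printouts. A minimal sketch of enlarging that buffer before the first kernel launch:

// chatty.cu -- compile with `nvcc chatty.cu`
#include <cstdio>
#include <cuda_runtime.h>

__global__ void chatty() {
  printf("thread %d says hello\n", threadIdx.x);
}

int main() {
  // Grow the device printf FIFO (the default is about 1 MB) so that large amounts
  // of in-kernel output are less likely to be dropped.
  cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 64ull * 1024 * 1024);

  chatty<<<1, 32>>>();
  // Device printf output is flushed at points like this synchronization.
  cudaDeviceSynchronize();
  return 0;
}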
Sorry for the trouble, then. I tested on
Thanks, I didn't know that; it might explain what I (didn't) see.
[*]
https_proxy=http://cmsproxy.cms:3128 \
hltGetConfiguration run:360224 \
--data \
--no-prescale \
--no-output \
--globaltag 124X_dataRun3_HLT_v4 \
--paths AlCa_PFJet40_v* \
--max-events -1 \
--input file:run360224_ls0081_file1.root \
> hlt.py
cat <<@EOF >> hlt.py
process.MessageLogger.cerr.FwkReport.limit = 1000
process.options.numberOfThreads = 1
#process.options.accelerators = ['cpu']
@EOF
cmsRun hlt.py &> hlt.log |
I think this is indeed the explanation [*]. Case closed, and sorry again for the noise.
[*] I checked this by keeping the large number of printouts, but also adding
#ifdef __CUDACC__
if (tmpNtuplet.size() > 4) {
__trap();
}
#endif
and the program crashed 10/10 times on GPU (running only on the event in question), meaning each time there was a sextuplet on GPU. |
@swagata87 @missirol can this issue be considered concluded, and therefore closed? |
In my understanding, yes (I signed it). Swagata can confirm and close. |
yes, I am closing this issue. Thanks everyone! |
Dear experts,
During the week of June 13-20, the following 3 types of HLT crashes happened in collision runs. HLT was using CMSSW_12_3_5.
type 1
This crash happened on June 13th, during stable beams, collision at 900 GeV. Run number: 353709. The crash happened in a CPU(fu-c2a05-35-01). Elog: http://cmsonline.cern.ch/cms-elog/1143438. Full crash report: https://swmukher.web.cern.ch/swmukher/hltcrash_June13_StableBeam.txt
type 2
This type of crash happened on GPUs (for example, fu-c2a02-35-01). It happened during collision runs when no real collisions were ongoing: on June 14th (run 353744, Pixel subdetector was out) and on June 18th (runs 353932, 353935, 353941, Pixel and tracker subdetectors were out).
type 3
This crash happened on fu-c2a02-39-01 (GPU), in collision run 353941 (Pixel and tracker subdetectors were out); no real collision was ongoing.
The reasons for crashes (2) and (3) might even be related.
Relevant elog on (2) and (3): http://cmsonline.cern.ch/cms-elog/1143515
Regards,
Swagata, as HLT DOC during June 13-20.