GPU Tests - pixel crashes on GPU Hilton #34659
Comments
A new Issue was created by @tsusa Tatjana Susa. @Dr15Jones, @perrotta, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign heterogeneous, reconstruction, hlt |
Is this the CMSONS-13106? (I'm not authorized to view that ticket) Is the error fully reproducible? (I guess so but want to make sure) The
tells that there are likely two exceptions in flight (which leads to call to
On the other hand, given the CUDA error message Given the statement of "data without pixel" on the slides, I'd concur with @slava77 (#34659 (comment)) that some protection is missing. I wonder what the printout
would say. |
based on https://cmsoms.cern.ch/cms/runs/report?cms_run=343762&cms_run_sequence=GLOBAL-RUN |
Just to add explicitly, I will improve this part of |
Hi @makortel , this is the log from the Hilton machine when we first experienced the crash. |
@makortel, here is a printout Using host libthread_db library "/lib64/libthread_db.so.1". Thread 20 "cmsRun" hit Catchpoint 1 (exception thrown), 0x00007ffff59de32e in __cxxabiv1::__cxa_throw (obj=0x7fff09213080, tinfo=0x7ffff7854398 , |
Two additional tests were done:
|
The log and the gdb look a bit different than I expected, so I'm taking a look (I'm also able to reproduce).
This observation makes me wonder if 11_3_3 could be missing some pixel development that is in 12_0_X that could make a difference? |
I added a |
Running with
This is awfully similar to cms-patatrack#306 (and cms-patatrack/pixeltrack-standalone#188), but here it already occurs in a single-process run with a single EDM stream. |
On cmssw/RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorOnGPU.cc Lines 231 to 232 in bec73cc
earlier, right after
? |
I still don't understand why any
does not catch the error. But adding it into the destructor of CAHitNtupletGeneratorKernelsGPU
does. Just by poking around a bit I found out that without the zero hits protection it is specifically
that causes the "illegal memory access" error. |
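For readers following along, the "zero hits protection" discussed above can be pictured as an early return before any device work is scheduled. The following is only a minimal host-side sketch in plain C++, not the code from the actual fix: `LaunchPlan` and `planLaunch` are hypothetical names introduced for illustration, and no CUDA call is made.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical helper (not part of CMSSW): describes whether a kernel
// should be launched and with how many blocks.
struct LaunchPlan {
  bool launched;         // false when the launch is skipped entirely
  std::uint32_t blocks;  // number of blocks needed to cover all hits
};

// Decide the launch configuration for nHits items.
// With zero hits nothing is launched, so no kernel can touch
// buffers that were never filled for an event without pixel data.
inline LaunchPlan planLaunch(std::uint32_t nHits, std::uint32_t threadsPerBlock) {
  if (threadsPerBlock == 0) {
    throw std::invalid_argument("threadsPerBlock must be positive");
  }
  if (nHits == 0) {
    return {false, 0};  // the zero-hits guard: skip the device work
  }
  // Round up so that every hit is covered by some thread.
  const std::uint32_t blocks = (nHits + threadsPerBlock - 1) / threadsPerBlock;
  return {true, blocks};
}
```

The point of the guard is that an empty event is treated as a normal case rather than left to the kernels, which in this issue appear to assume at least one hit.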
To be even more specific, the following piece appears to be causing the error (when not including the zero hit check) cmssw/RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorKernels.cu Lines 261 to 267 in e09e497
This piece of code was changed in 12_0_X in #33371. |
Dear all, I have just summarised our (TSG FOG) investigations into the different Pixel, HCAL and ECAL crashes that we were seeing related to the missing protections for when a subsystem is not included in the run: #34197 Concerning the Pixel PR #34684, as indicated (and reported in today's daily run meeting) we have tested it by pointing Hilton to a local CMSSW 11_3_3 install, with the PR included on top. We did not see any crashes when using the PixelOnly menu (e-log) when running over data from run 343762 which excludes ECAL, HCAL and Pixel. Best regards, |
Can this issue get closed now, then? |
+heterogeneous |
Dear all, From the FOG side, I would like to report that we have tested the full GPU menu in CMSSW_11_3_4 in run 344449 with Pixel, ECAL and HCAL out of the run and we saw no issues (as reported in this e-log and today's Daily Run meeting just now). This confirms that the updated protections are working well. Best regards, |
@cms-sw/hlt-l2 @cms-sw/reconstruction-l2 please sign |
What is there to sign? This is not a PR. ?? |
that's an issue, when it's resolved it needs to be signed in order to be closed. |
+1 |
+reconstruction |
This issue is fully signed and ready to be closed. |
Following a crash mentioned in [1], slide 5, we ran
cmsDriver.py step3 --conditions auto:run3_hlt -s RAW2DIGI:RawToDigi_pixelOnly,RECO:reconstruction_pixelTrackingOnly,DQM:@pixelTrackingOnlyDQM --process reRECO --data --era Run3 --eventcontent RECO,MINIAOD,DQM --hltProcess reHLT --procModifiers pixelNtupletFit,gpu --scenario pp --datatier RECO,MINIAOD,DQMIO --filein file:/eos/cms/store/group/dpg_trigger/comm_trigger/TriggerStudiesGroup/FOG/CRUZET_2021_data/run_343762.root
It crashes at [2] (&inputDataWrapped is != 0 at that point) with an error
terminate called after throwing an instance of 'std::runtime_error'
what():
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_11_3_3-slc7_amd64_gcc900/build/CMSSW_11_3_3-build/tmp/BUILDROOT/402e2a5eeeb9630ea9f5469bb50cc947/opt/cmssw/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_3/src/HeterogeneousCore/CUDACore/src/ScopedContext.cc, line 86:
cudaCheck(cudaStreamAddCallback(stream, cudaScopedContextCallback, new CallbackData{waitingTaskHolder_, device}, 0));
cudaErrorIllegalAddress: an illegal memory access was encountered
@makortel, could you please have a look?
[1] https://indico.cern.ch/event/1062405/contributions/4468133/attachments/2288269/3889700/HLTReport_CRuZeT_RunOrganization_27.07.2021_Zarucki.pdf
[2] https://github.com/cms-sw/cmssw/blob/master/RecoLocalTracker/SiPixelRecHits/plugins/SiPixelRecHitFromCUDA.cc#L73
@czangela
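A note on why the error is reported from `cudaStreamAddCallback` rather than from the code that performed the bad access: CUDA work is queued asynchronously, so a failure is generally reported by a later API call on the same context. The sketch below is a toy model in plain C++ and not the CUDA API; `ToyStream`, `enqueue` and `check` are hypothetical names used only to illustrate deferred, sticky error reporting.

```cpp
#include <functional>
#include <optional>
#include <queue>
#include <string>
#include <utility>

// Toy model, not the CUDA API: work is queued and only executed later,
// so a failure is observed by whichever call drains the queue next.
class ToyStream {
public:
  // Enqueue work and return immediately without running it,
  // mimicking an asynchronous kernel launch.
  void enqueue(std::function<bool()> work) { pending_.push(std::move(work)); }

  // Run all pending work. The first failure is stored and returned by
  // this and every later call; `where` labels the reporting call site.
  std::optional<std::string> check(const std::string& where) {
    while (!pending_.empty()) {
      const bool ok = pending_.front()();
      pending_.pop();
      if (!ok && !error_) {
        error_ = "illegal access, reported at " + where;
      }
    }
    return error_;
  }

private:
  std::queue<std::function<bool()>> pending_;
  std::optional<std::string> error_;  // sticky once set
};
```

In this model the call site named in the error message is the one that happened to check next, which is why the file and line in the log above need not point at the faulty kernel.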