Pixel track fitting and SiPixelRecHitFromCUDA crashes during online GPU tests (CMSSW_11_3_4) #34831
A new Issue was created by @mzarucki Mateusz Zarucki. @Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
Did any of the crashes happen running with GPUs, or did they all occur without any GPUs involved? Is the data from the corresponding runs, lumisections and possibly events available? Could you prepare some instructions for reproducing the problem? |
assign reconstruction, heterogeneous |
The first ELOG reports that
The call to
To investigate the crash, as usual we would need the full instructions to reproduce it:
|
Hi @fwyzard,
Correction: the Pixel track fitting crashes occurred on CPUs only (
There might be an issue with the HLT_Pixel path, which was added to run Pixel reconstruction (CMSHLT-2157) but does not filter on anything:
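For illustration, a pass-through path of this kind runs the pixel reconstruction modules but applies no selection. A minimal sketch of such a path in CMSSW configuration syntax, with hypothetical module and path names (not the actual menu contents):

```python
import FWCore.ParameterSet.Config as cms

process = cms.Process("HLTX")

# Hypothetical stand-ins for the pixel local reconstruction and track fitting
# modules; the real menu schedules the full pixel reconstruction sequence.
process.hltPixelLocalReco = cms.EDProducer("HypotheticalPixelLocalReco")
process.hltPixelTracks    = cms.EDProducer("HypotheticalPixelTrackProducer")

# A pass-through path: it runs the reconstruction modules but contains no
# filter module, so it never rejects an event.
process.HLT_Pixel_v1 = cms.Path(process.hltPixelLocalReco + process.hltPixelTracks)
```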
The HLT menu name is listed under [1] above. I will attach the full python config to this ticket.
We will work on a recipe to reproduce the errors over the same input data (we have contacted our DAQ colleagues to save it locally). We will keep you updated. Best, |
The menu Note: The only difference wrt. Best, |
Hi again, The .raw data could not be copied locally (it seems this is only possible during the run) and the repacked RAW .root files will need a bit of time to become accessible via Rucio. Nevertheless, I tried to reproduce the crashes using an older set of data from run 343387 (stored on EOS via our Rucio rule), with the following simplified recipe (which includes the input file config), to be run on a GPU machine (see TriggerDevelopmentWithGPUs):
Note: here the number of threads/streams is left unset (see the sketch below for setting them explicitly). After running for a while, I do see crashes, but of a different nature:
Revisiting F3Mon, I do see this crash on the GPU nodes (added to the e-log), so the initial report that there is only one type of crash is incorrect. Apologies for overlooking this. To summarise: we saw this PixelRecHitFromCUDA crash only on the GPU nodes and the initial Pixel track fitting crash on the CPU nodes. I have updated the above responses to avoid confusion. Best, |
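Regarding the note above about the number of threads and streams being left unset: in a standalone cmsRun job these are usually pinned explicitly in the process options. A minimal sketch (the process name and the values are arbitrary examples, not taken from the actual menu):

```python
import FWCore.ParameterSet.Config as cms

process = cms.Process("HLTX")  # stand-in for the dumped HLT menu

# Pin the number of framework threads and concurrent event streams explicitly;
# when left unset, cmsRun falls back to its defaults.
process.options.numberOfThreads = cms.untracked.uint32(8)
process.options.numberOfStreams = cms.untracked.uint32(8)
```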
As far as I know, the pixel track reconstruction has not been designed to run at 0 tesla. My first suggestion would be to switch it off, and turn it on only once the magnetic field is nominal. |
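For illustration, one way to switch it off would be to drop the pixel path from the HLT configuration before running. A rough sketch, assuming the menu defines a process.schedule and using a hypothetical path name:

```python
import FWCore.ParameterSet.Config as cms

def removePixelPath(process):
    """Drop the pass-through pixel path while running at 0 T.
    'HLT_Pixel_v1' is a hypothetical name; adapt it to the actual menu."""
    if hasattr(process, 'HLT_Pixel_v1'):
        # take the path out of the schedule first, then remove it from the process
        if process.schedule is not None and process.HLT_Pixel_v1 in process.schedule:
            process.schedule.remove(process.HLT_Pixel_v1)
        del process.HLT_Pixel_v1
    return process
```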
After copying the input file locally¹ and updating the configuration to use it, the GPU error is consistently reproducible:
On the other hand, I have not been able to reproduce the CPU-only crash
|
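Pointing the configuration at a locally copied input file typically amounts to overriding the source file list. A minimal sketch, assuming the copy is a repacked RAW .root file read with a PoolSource, and using a hypothetical local path:

```python
import FWCore.ParameterSet.Config as cms

process = cms.Process("HLTX")  # stand-in; in practice this is the dumped HLT menu

# Point the source at the local copy of the input data
# (file name and path are hypothetical).
process.source = cms.Source("PoolSource",
    fileNames = cms.untracked.vstring('file:/data/local/run343387_repacked_RAW.root')
)
```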
Dear all, Today we performed a repeated test of the HLT GPU menu (e-log), with the intent of reproducing the above crashes. During the test, in run 344676, we were able to see only the GPU crash (e-log). Considering that the GPU menu is identical in content to the cosmics menu, the decision was made to keep running with the cosmics GPU menu. Roughly an hour into run 344679, after the formal GPU tests, we saw the CPU crash (e-log) that we were trying to reproduce. Since DAQ enabled the error stream this morning, we were able to access the data from the crashes. The .raw files from the GPU crash (recipe above) have been saved in Here is a recipe to reproduce the CPU crash (which can be done on a GPU node):
where the HLT config file has been modified to take the error stream data as input. Best regards, |
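For context, one way to feed the saved error-stream .raw files to cmsRun is to replace the source in the dumped menu. A minimal sketch, assuming the DAQ FRD stream source is used; the source type, its parameters and the file names below are assumptions and may need adapting:

```python
import FWCore.ParameterSet.Config as cms

process = cms.Process("HLTX")  # stand-in; in practice this is the modified HLT menu

# Read the error-stream .raw (FRD) files directly; the file names are
# hypothetical, and additional source parameters may be required.
process.source = cms.Source("FRDStreamSource",
    fileNames = cms.untracked.vstring(
        'file:/data/error_stream/run344679/run344679_ls0001_index000000.raw'
    )
)
```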
Note that the two crashes seem unrelated:
|
As mentioned in this e-log: http://cmsonline.cern.ch/cms-elog/1122928, we attempted to reproduce the error during run 344675, but nothing occurred during the 24-minute run. When we included DT in the next run (344676), the error (GPU error only) occurred within 11 minutes. |
@cms-sw/trk-dpg-l2 @fwyzard |
No, not that I know of. |
the case with Will this need to be backported to 11_3_X and 12_0_X, or is 12_0_X enough? |
12_0_X is enough
|
This one has been fixed, right? |
"fixed" yes. |
+heterogneous |
+reconstruction |
@cmsbuild please close (typo in the heterogeneous comment prevented a full sig) |
+heterogeneous Just for the record... we should have picked a word easier to spell :-/ |
This issue is fully signed and ready to be closed. |
Dear all,
During our online GPU tests in CMSSW_11_3_4 (e.g. run 34555 with Pixel, ECAL and HCAL in global, and with 2 GPU + 2 CPU FUs in the DAQ configuration), with a GPU menu that includes pixel reconstruction (CMSHLT-2157, [1]), we saw the following crashes (e-log):
We also saw these errors late (after 47 LS) into run 344558 (2 GPU + 2 CPU FUs) and relatively early into run 344560 (standard DAQ configuration with all FUs in).
What is rather strange is that these crashes were seen in standard CPU FUs (e.g. fu-c2a02-27-02, fu-c2a02-45-02, fu-c2a01-07-03). We did not see any crashes during our previous test (e-log) in CMSSW_11_3_2 (run 343991). We also did not see any issues when testing on Hilton in CMSSW_11_3_4.
Best regards,
Mateusz on behalf of TSG FOG
[1] /cdaq/cosmic/commissioning2021/CRUZET/Cosmics_GPU/V2