-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ECAL - Add integrity checks for strip and xtal ids to GPU unpacker #41977
ECAL - Add integrity checks for strip and xtal ids to GPU unpacker #41977
Conversation
type ecal |
enable gpu |
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-41977/35939
|
A new Pull Request was created by @thomreis (Thomas Reis) for master. It involves the following packages:
@cmsbuild, @mandrenguyen, @clacaputo can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
Should it be |
The CPU unpacker drops all following channels in the tower from the moment an invalid strip or crystal id is found. So |
please test |
Mhm... then, the whole block should stop processing, not just the thread
that encountered the error ?
|
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-3cd2c7/33192/summary.html Comparison SummarySummary:
GPU Comparison SummarySummary:
|
Yes, that would probably mimic the CPU behaviour best but it would make things more complicated. It also does not guarantee that the CPU and GPU digis match because the already unpacked crystals are kept in the CPU unpacker and only all crystals after the invalid one are dropped. |
Naively (I am not familiar with the code) a simple way would be to use a If we want to reproduce the cpu approach, where all crystals after the first failure are discarded, there could be a second pass that does that in case of failure. |
Given that the number of threads is 32 and a tower has 25 channels, of which some may even be zero suppressed, there should be only one pass of the loop in the current configuration. However, if the shared variable contains the offset in the data array for which the invalid channel was found all threads working on higher offsets could be stopped before writing the digis to the output memory block. If I am not wrong this would require a |
Yes. |
I do not think this PR has an effect on pixelTracks. |
Hi @thomreis , thanks for you reply. About this:
Do you expect to implement this? |
Yes but I will not be able to get to it in the next 2 weeks, unfortunately. So if we want to avoid the crashes as soon as possible I would suggest to merge this PR as it is and make an issue for me as a reminder that there is some work left to do. |
@cms-sw/reconstruction-l2 , kind ping about #41977 (comment). |
+reconstruction |
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2) |
+1
|
Sure, I opened #42090. |
PR description:
This PR adds integrity checks for strip ID and xtal ID to the GPU unpacker similar to the ones that exist in the CPU unpacker. This avoids crashes like the ones reported in #39568
For events with data corruption in a tower a difference in the number of unpacked digis can occur between the CPU unpacker and the GPU one because the former stops unpacking the tower if an inconsistency is detected in one channel, whereas the later unpacks the channels in parallel and an inconsistency in one channel does not affect the unpacking in other threads.
In many cases, however, an integrity problem in the raw data affects most of the channels in the tower. From the three instances reported in #39568 this year only one resulted in a difference of one digi being produced after the fix.
PR validation:
/store/relval/CMSSW_13_0_0/RelValTTbarToDilepton_14TeV/GEN-SIM-DIGI-RAW/130X_mcRun3_2022_realistic_v2-v3/00000/