ECAL - Add integrity checks for strip and xtal ids to GPU unpacker #41977

Merged (1 commit) on Jun 26, 2023

thomreis
Contributor

PR description:

This PR adds integrity checks for the strip ID and xtal ID to the GPU unpacker, similar to the ones that exist in the CPU unpacker. This avoids crashes like the ones reported in #39568.

For events with data corruption in a tower, the number of unpacked digis can differ between the CPU unpacker and the GPU one. The former stops unpacking the tower if an inconsistency is detected in one channel, whereas the latter unpacks the channels in parallel, so an inconsistency in one channel does not affect the unpacking in other threads.
In many cases, however, an integrity problem in the raw data affects most of the channels in the tower. Of the three instances reported in #39568 this year, only one resulted in a difference of one digi being produced after the fix.
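
The difference in digi counts described above can be illustrated with a small C++ sketch. It is not the unpacker code: the `Channel` struct, the function names, and the assumed valid ID range of 1..5 (a 5x5 tower) are hypothetical simplifications, and a plain loop stands in for the parallel GPU threads.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified channel header: only the two IDs that are checked.
struct Channel {
  int stripId;
  int xtalId;
};

// Assumption for this sketch: valid strip and xtal IDs lie in 1..5.
inline bool idsValid(const Channel& ch) {
  return ch.stripId >= 1 && ch.stripId <= 5 && ch.xtalId >= 1 && ch.xtalId <= 5;
}

// CPU-like behaviour: channels are unpacked in order and the first invalid
// channel ends the unpacking of the whole tower.
inline std::size_t countDigisSequential(const std::vector<Channel>& tower) {
  std::size_t n = 0;
  for (const Channel& ch : tower) {
    if (!idsValid(ch))
      break;
    ++n;
  }
  return n;
}

// GPU-like behaviour: each channel is handled independently, so an invalid
// channel only suppresses its own digi. The loop emulates parallel threads.
inline std::size_t countDigisIndependent(const std::vector<Channel>& tower) {
  std::size_t n = 0;
  for (const Channel& ch : tower) {
    if (idsValid(ch))
      ++n;
  }
  return n;
}
```

For a tower whose fourth channel is corrupted but whose fifth is intact, the sequential count stops at three digis while the independent count yields four, which is the kind of one-digi difference mentioned above. For uncorrupted data both counts agree.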

PR validation:

  • Does not crash anymore on runs 367771, 368547, 368724 set up with the recipe provided in #39568.
  • Passes matrix tests 12434.512, 12434.513, and 12434.514
  • No differences seen between CPU and GPU digis on 7750 events from /store/relval/CMSSW_13_0_0/RelValTTbarToDilepton_14TeV/GEN-SIM-DIGI-RAW/130X_mcRun3_2022_realistic_v2-v3/00000/

@thomreis
Contributor Author

type ecal

@cmsbuild cmsbuild added the ecal label Jun 15, 2023
@thomreis
Contributor Author

enable gpu

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-41977/35939

  • This PR adds an extra 20KB to repository

@cmsbuild
Contributor

A new Pull Request was created by @thomreis (Thomas Reis) for master.

It involves the following packages:

  • EventFilter/EcalRawToDigi (reconstruction)

@cmsbuild, @mandrenguyen, @clacaputo can you please review it and eventually sign? Thanks.
@rchatter, @argiro, @Martin-Grunewald, @missirol, @thomreis, @wang0jin this is something you requested to watch as well.
@perrotta, @dpiparo, @rappoccio you are the release manager for this.

cms-bot commands are listed here

@fwyzard
Contributor

fwyzard commented Jun 15, 2023

Should it be continue rather than break, to process the following entries?

@thomreis
Contributor Author

thomreis commented Jun 15, 2023

The CPU unpacker drops all following channels in the tower from the moment an invalid strip or crystal id is found. So break should be OK.

@thomreis
Contributor Author

please test

@fwyzard
Contributor

fwyzard commented Jun 15, 2023 via email

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-3cd2c7/33192/summary.html
COMMIT: 3991ec5
CMSSW: CMSSW_13_2_X_2023-06-15-2300/el8_amd64_gcc11
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/41977/33192/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially removed 8 lines from the logs
  • Reco comparison results: 9 differences found in the comparisons
  • DQMHistoTests: Total files compared: 48
  • DQMHistoTests: Total histograms compared: 3196062
  • DQMHistoTests: Total failures: 7
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3196033
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
  • Checked 207 log files, 159 edm output root files, 48 DQM output files
  • TriggerResults: no differences found

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 8 differences found in the comparisons
  • DQMHistoTests: Total files compared: 3
  • DQMHistoTests: Total histograms compared: 40086
  • DQMHistoTests: Total failures: 405
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 39681
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 2 files compared)
  • Checked 8 log files, 6 edm output root files, 3 DQM output files
  • TriggerResults: no differences found

@thomreis
Contributor Author

Yes, that would probably mimic the CPU behaviour best, but it would make things more complicated. It also does not guarantee that the CPU and GPU digis match, because the CPU unpacker keeps the crystals it has already unpacked and drops only the crystals after the invalid one.
What would be the best way to break the loop for all threads?

@fwyzard
Contributor

fwyzard commented Jun 16, 2023

Naively (I am not familiar with the code), a simple way would be to use a __shared__ variable to signal to all threads within a block whether they should stop or continue after each crystal.

If we want to reproduce the CPU approach, where all crystals after the first failure are discarded, there could be a second pass that does that in case of failure.

@thomreis
Contributor Author

Given that the number of threads is 32 and a tower has 25 channels, of which some may even be zero suppressed, there should be only one pass of the loop in the current configuration.
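
The single-pass argument can be checked with a short C++ sketch. It assumes the kernel walks the channels with a block-stride loop of the form shown in the comment, which is an assumption about the loop shape rather than a quote of the unpacker code; the function names are hypothetical.

```cpp
#include <algorithm>

// Number of iterations one thread executes in an assumed block-stride loop:
//   for (int ich = tid; ich < nChannels; ich += nThreads) { ... }
inline int iterationsForThread(int tid, int nThreads, int nChannels) {
  int n = 0;
  for (int ich = tid; ich < nChannels; ich += nThreads)
    ++n;
  return n;
}

// Largest iteration count over all threads of the block.
inline int maxIterations(int nThreads, int nChannels) {
  int m = 0;
  for (int tid = 0; tid < nThreads; ++tid)
    m = std::max(m, iterationsForThread(tid, nThreads, nChannels));
  return m;
}
```

With 32 threads and at most 25 channels, `maxIterations(32, 25)` is 1, so under this assumption a thread that hits an invalid channel has no later iteration of its own to skip, and break and continue behave the same. A second pass would only appear if the channel count exceeded the thread count.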

However, if the shared variable contains the offset in the data array for which the invalid channel was found all threads working on higher offsets could be stopped before writing the digis to the output memory block. If I am not wrong this would require a __syncthreads() though.
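
A rough sketch of this idea follows, written as a sequential C++ emulation rather than as a CUDA kernel. All names are hypothetical, the valid ID range of 1..5 is an assumption, and the channel index is used as a stand-in for the offset in the data array. In a real kernel the first loop would be an atomic minimum on a __shared__ variable and the comment between the loops would be a __syncthreads() call.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical, simplified channel header.
struct ChannelIds {
  int stripId;
  int xtalId;
};

// Assumption for this sketch: valid strip and xtal IDs lie in 1..5.
inline bool hasValidIds(const ChannelIds& ch) {
  return ch.stripId >= 1 && ch.stripId <= 5 && ch.xtalId >= 1 && ch.xtalId <= 5;
}

// Returns the indices of the channels whose digis would be written.
inline std::vector<std::size_t> unpackWithSharedOffset(const std::vector<ChannelIds>& tower) {
  // Stand-in for the __shared__ variable, initialised to "no invalid channel".
  std::size_t firstInvalid = std::numeric_limits<std::size_t>::max();

  // Phase 1: every emulated thread checks its own channel and keeps the
  // lowest invalid offset (an atomic minimum in a real kernel).
  for (std::size_t tid = 0; tid < tower.size(); ++tid) {
    if (!hasValidIds(tower[tid]))
      firstInvalid = std::min(firstInvalid, tid);
  }

  // A block-wide barrier would go here, so that every thread sees the final
  // value of firstInvalid before deciding whether to write.

  // Phase 2: only threads below the first invalid offset write their digi.
  std::vector<std::size_t> written;
  for (std::size_t tid = 0; tid < tower.size(); ++tid) {
    if (tid < firstInvalid)
      written.push_back(tid);
  }
  return written;
}
```

For a tower whose third channel is invalid, only channels 0 and 1 are written, which would match the CPU behaviour of keeping the channels before the first failure and dropping everything after it.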

@fwyzard
Contributor

fwyzard commented Jun 16, 2023

> If I am not wrong this would require a __syncthreads() though.

Yes.

@clacaputo
Contributor

Hi @thomreis, I see some minor reco differences in 12434.587, for example in pixelTracks__RECO_obj_p:

[Plot: c_log10recoTracks_pixelTracks__RECO_obj_p]

Is it something expected?

More differences here: link

@thomreis
Contributor Author

I do not think this PR has an effect on pixelTracks.
Moreover, I do not think that, for an MC workflow, the if statements that break out of the loop will ever be true, because MC should not contain corrupted raw data.

@clacaputo
Contributor

Hi @thomreis, thanks for your reply.

About this:

> However, if the shared variable contains the offset in the data array for which the invalid channel was found all threads working on higher offsets could be stopped before writing the digis to the output memory block. If I am not wrong this would require a __syncthreads() though.

Do you expect to implement this?

@thomreis
Contributor Author

Yes, but unfortunately I will not be able to get to it in the next 2 weeks. So if we want to avoid the crashes as soon as possible, I would suggest merging this PR as it is and opening an issue for me as a reminder that there is some work left to do.

@missirol
Contributor

This fix is already an improvement, and a potential solution to #39568. I checked with @fwyzard, and I think he agrees with integrating this now, and improving it in a follow-up PR. I can open the issue to keep track of this follow-up.

@cms-sw/reconstruction-l2, do you agree?

@missirol
Contributor

@cms-sw/reconstruction-l2 , kind ping about #41977 (comment).

@clacaputo
Contributor

+reconstruction

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@perrotta
Contributor

+1

@missirol
Contributor

> will you take care of opening the issue to keep track of #41977 (comment) ?

Sure, I opened #42090.

@thomreis thomreis deleted the ecal-gpu-unpacker-integrity-checks branch July 6, 2023 12:36
6 participants