Fix overflow in ECAL GPU unpacking [12.5.x] #39618

Merged

fwyzard
Contributor

@fwyzard fwyzard commented Oct 4, 2022

PR description:

Avoid a possible overflow in the ECAL GPU unpacking, which could lead to an out-of-bounds read from invalid memory.

The function find_next_tower_block() starts from the next memory location and keeps reading until it finds a valid payload or reaches the given trailer. However, there was no check that the initial value is not already at or beyond the trailer; in that case the loop keeps moving forward until it reaches an invalid memory address.

This condition has been observed online, for example with the current tower block pointer 8 bytes past the trailer:

block (14,0,0) thread (0,0,0)
find_next_tower_block(current_tower_block = 0x7fa96ba042c8, trailer = 0x7fa96ba042c0, ...)

resulting in

Begin processing the 951st record. Run 359699, Event 317644311, LumiSection 218 on stream 0 at 05-Oct-2022 00:55:27.426 CEST

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x7fff320ad230 (UnpackGPU.cu:57 in _ZN4ecal3raw21find_next_tower_blockERPKmS2_jj inlined from UnpackGPU.cu:252)

Thread 1 "cmsRun" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 854, grid 25544, block (14,0,0), thread (30,0,0), device 0, sm 28, warp 0, lane 30]
ecal::raw::kernel_unpack_test<32><<<(54,1,1),(32,1,1)>>> ()
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_4_9-el8_amd64_gcc10/build/CMSSW_12_4_9-build/tmp/BUILDROOT/dc6747a684df926e1faea7ef7c301e1a/opt/cmssw/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_9/src/EventFilter/EcalRawToDigi/plugins/UnpackGPU.cu:57 in _ZN4ecal3raw21find_next_tower_blockERPKmS2_jj inlined from UnpackGPU.cu:252

These changes will stop the loop if the initial value is at or beyond the trailer.
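
As an illustration only, the guard described above could look like the host-side C++ sketch below. It is not the code from UnpackGPU.cu: the function name `find_next_tower_block_sketch`, the helper `is_valid_block`, the `expected_tag` parameter, and the validity rule are all hypothetical, and only the two pointer parameters (named as in the debugger output above) follow the PR description. The same pointer logic would apply inside a CUDA device function.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the real payload validity check,
// which is not shown in this PR description.
inline bool is_valid_block(std::uint64_t word, std::uint32_t expected_tag) {
  return (word >> 32) == expected_tag;
}

// Sketch (assumption, not the actual CMSSW implementation):
// advance current_tower_block to the next valid word located
// strictly before trailer. Returns false when no such word exists.
inline bool find_next_tower_block_sketch(std::uint64_t const*& current_tower_block,
                                         std::uint64_t const* trailer,
                                         std::uint32_t expected_tag) {
  // Guard: if the starting position is already at or beyond the
  // trailer, stop immediately and leave the pointer untouched.
  if (current_tower_block >= trailer) {
    return false;
  }
  // Start from the next location and never dereference a pointer
  // that is at or beyond the trailer.
  for (++current_tower_block; current_tower_block < trailer; ++current_tower_block) {
    if (is_valid_block(*current_tower_block, expected_tag)) {
      return true;
    }
  }
  return false;
}
```

With a starting pointer one word past the trailer, as in the case observed online, this sketch returns false without reading any memory.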

PR validation:

With these changes the ECAL unpacker runs successfully through 10'000 events from the error stream.

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

Backport of #39617 to 12.5.x for data taking.

@cmsbuild
Contributor

cmsbuild commented Oct 4, 2022

A new Pull Request was created by @fwyzard (Andrea Bocci) for CMSSW_12_5_X.

It involves the following packages:

  • EventFilter/EcalRawToDigi (reconstruction)

@cmsbuild, @mandrenguyen, @clacaputo can you please review it and eventually sign? Thanks.
@rchatter, @argiro, @Martin-Grunewald, @missirol, @thomreis, @simonepigazzini this is something you requested to watch as well.
@perrotta, @dpiparo, @rappoccio you are the release manager for this.

cms-bot commands are listed here

@fwyzard
Contributor Author

fwyzard commented Oct 4, 2022

enable gpu

@fwyzard
Contributor Author

fwyzard commented Oct 4, 2022

please test

@cmsbuild
Contributor

cmsbuild commented Oct 5, 2022

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-ec356f/28004/summary.html
COMMIT: 77f22f1
CMSSW: CMSSW_12_5_X_2022-10-04-1100/el8_amd64_gcc10
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/39618/28004/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 51
  • DQMHistoTests: Total histograms compared: 3699454
  • DQMHistoTests: Total failures: 2
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3699430
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 50 files compared)
  • Checked 212 log files, 49 edm output root files, 51 DQM output files
  • TriggerResults: no differences found

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • Reco comparison had 3 failed jobs
  • DQMHistoTests: Total files compared: 4
  • DQMHistoTests: Total histograms compared: 19876
  • DQMHistoTests: Total failures: 8
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 19868
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
  • Checked 12 log files, 9 edm output root files, 4 DQM output files
  • TriggerResults: no differences found

@clacaputo
Contributor

+reconstruction

@cmsbuild
Contributor

cmsbuild commented Oct 6, 2022

This pull request is fully signed and it will be integrated in one of the next CMSSW_12_5_X IBs (tests are also fine) and once validation in the development release cycle CMSSW_12_6_X is complete. This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@perrotta
Contributor

perrotta commented Oct 7, 2022

backport of #39617

@perrotta
Contributor

perrotta commented Oct 7, 2022

+1

@missirol
Contributor

missirol commented Oct 8, 2022

backport of #39617

@fwyzard fwyzard deleted the fix_ECAL_GPU_unpacking_overflow branch October 8, 2022 16:04