Fix overflow in ECAL GPU unpacking #39617

Merged: 1 commit, Oct 6, 2022

@fwyzard fwyzard commented Oct 4, 2022

PR description:

Avoid a possible overflow in the ECAL GPU unpacking, which could lead to an out-of-bounds read from invalid memory.

The function find_next_tower_block() starts from the next memory location and keeps reading until it finds a valid payload or reaches the given trailer. However, there was no check that the initial value is not already at or beyond the trailer; in that case the loop keeps moving forward until it reaches an invalid memory address.
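
To make the failure mode concrete, here is a minimal host-side sketch, not the actual CMSSW code. It models memory as a vector and positions as word indices, so the overrun can be observed without really reading out of bounds. It assumes the loop terminates only on an equality test against the trailer, which the description suggests but does not state; `scan_unguarded` and the `wanted` marker are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Index-based model of the scan (hypothetical, for illustration only).
// `mem` stands for the mapped memory, `current` and `trailer` are word indices,
// and `wanted` is a placeholder for "this word is a valid payload".
// Returns the index at which the scan stops; a result equal to mem.size()
// means the scan ran off the end of the mapped memory.
inline std::size_t scan_unguarded(std::vector<std::uint64_t> const& mem,
                                  std::size_t current,
                                  std::size_t trailer,
                                  std::uint64_t wanted) {
  assert(trailer <= mem.size());
  std::size_t pos = current + 1;  // start from the next memory location
  while (pos != trailer) {        // assumed: equality test only, no bounds check
    if (pos >= mem.size())
      return mem.size();          // on the GPU this would be the illegal address
    if (mem[pos] == wanted)
      return pos;                 // found a valid payload
    ++pos;
  }
  return trailer;                 // reached the trailer
}
```

With eight zero words and the trailer at index 3, a scan starting at index 0 stops at the trailer, while a scan starting at index 3 or 4 steps past it and only stops at the end of the modelled memory.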

This condition has been observed online, for example

block (14,0,0) thread (0,0,0)
find_next_tower_block(current_tower_block = 0x7fa96ba042c8, trailer = 0x7fa96ba042c0, ...)

resulting in

Begin processing the 951st record. Run 359699, Event 317644311, LumiSection 218 on stream 0 at 05-Oct-2022 00:55:27.426 CEST

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x7fff320ad230 (UnpackGPU.cu:57 in _ZN4ecal3raw21find_next_tower_blockERPKmS2_jj inlined from UnpackGPU.cu:252)

Thread 1 "cmsRun" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 854, grid 25544, block (14,0,0), thread (30,0,0), device 0, sm 28, warp 0, lane 30]
ecal::raw::kernel_unpack_test<32><<<(54,1,1),(32,1,1)>>> ()
    at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_4_9-el8_amd64_gcc10/build/CMSSW_12_4_9-build/tmp/BUILDROOT/dc6747a684df926e1faea7ef7c301e1a/opt/cmssw/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_9/src/EventFilter/EcalRawToDigi/plugins/UnpackGPU.cu:57 in _ZN4ecal3raw21find_next_tower_blockERPKmS2_jj inlined from UnpackGPU.cu:252
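
As a quick check of the two addresses in the example: the mangled symbol in the trace takes `PKm` parameters, i.e. pointers to `const unsigned long`, so the buffer is presumably walked in 64-bit words. Under that assumption, `current_tower_block` was exactly one word past `trailer`. `words_past_trailer` is a hypothetical helper, used here only for the arithmetic.

```cpp
#include <cassert>
#include <cstdint>

// Number of whole 64-bit words by which `current` lies past `trailer`
// (0 if it is exactly at the trailer). Addresses are passed as integers.
inline std::uint64_t words_past_trailer(std::uint64_t current, std::uint64_t trailer) {
  assert(current >= trailer);  // precondition: only meaningful at or beyond the trailer
  return (current - trailer) / sizeof(std::uint64_t);
}
```

For the values reported by the debugger, `words_past_trailer(0x7fa96ba042c8, 0x7fa96ba042c0)` gives 1, since the addresses differ by 8 bytes.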

These changes will stop the loop if the initial value is at or beyond the trailer.
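
A minimal sketch of such a guard, written as host-side C++ so that it can be run without a GPU. This is not the code from the PR: `find_next_guarded`, `is_nonzero` and `demo_search` are hypothetical, and the real validity check (which, judging by the mangled signature, also takes two 32-bit arguments) is replaced by a placeholder predicate.

```cpp
#include <cassert>
#include <cstdint>

// Placeholder for the real "valid payload" check (assumption: any nonzero word).
inline bool is_nonzero(std::uint64_t word) { return word != 0; }

// Advance `current` to the next word accepted by `is_valid`, without ever
// dereferencing a location at or beyond `trailer`.
// Returns true if such a word was found, false otherwise.
inline bool find_next_guarded(std::uint64_t const*& current,
                              std::uint64_t const* trailer,
                              bool (*is_valid)(std::uint64_t)) {
  assert(current != nullptr && trailer != nullptr);
  if (current >= trailer)
    return false;  // guard: the initial value is already at or beyond the trailer
  for (++current; current < trailer; ++current) {
    if (is_valid(*current))
      return true;  // found a valid payload before the trailer
  }
  return false;     // reached the trailer without finding one
}

// Small demonstration over a 6-word buffer whose trailer is word 4.
// Returns the index of the word found, or -1 if nothing was found.
inline int demo_search(int start_index) {
  static std::uint64_t const buffer[6] = {0, 0, 7, 0, 0, 0};  // only word 2 is "valid"
  std::uint64_t const* trailer = buffer + 4;
  std::uint64_t const* current = buffer + start_index;
  return find_next_guarded(current, trailer, is_nonzero)
             ? static_cast<int>(current - buffer)
             : -1;
}
```

Here `demo_search(0)` finds word 2, while `demo_search(4)` (at the trailer) and `demo_search(5)` (one word past it, as in the observed case) return -1 without reading anything.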

PR validation:

With these changes the ECAL unpacker runs successfully through 10'000 events from the error stream.

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

To be backported to 12.4.x and 12.5.x for data taking.

fwyzard commented Oct 4, 2022

enable gpu

fwyzard commented Oct 4, 2022

please test

fwyzard commented Oct 4, 2022

@thomreis FYI

cmsbuild commented Oct 4, 2022

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-39617/32414

  • This PR adds an extra 16KB to repository

cmsbuild commented Oct 4, 2022

A new Pull Request was created by @fwyzard (Andrea Bocci) for master.

It involves the following packages:

  • EventFilter/EcalRawToDigi (reconstruction)

@mandrenguyen, @clacaputo can you please review it and eventually sign? Thanks.
@rchatter, @argiro, @Martin-Grunewald, @missirol, @thomreis, @simonepigazzini this is something you requested to watch as well.
@perrotta, @dpiparo, @rappoccio you are the release manager for this.

cms-bot commands are listed here

cmsbuild commented Oct 5, 2022

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-7d83b7/28006/summary.html
COMMIT: 77f22f1
CMSSW: CMSSW_12_6_X_2022-10-04-1100/el8_amd64_gcc10
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/39617/28006/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

@slava77 comparisons for the following workflows were not done due to missing matrix map:

  • /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/data/PR-7d83b7/41834.0_TTbar_14TeV+2026D94+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14+DigiTrigger+RecoGlobal+HARVESTGlobal

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3432650
  • DQMHistoTests: Total failures: 0
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3432628
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
  • Checked 204 log files, 49 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 4
  • DQMHistoTests: Total histograms compared: 19876
  • DQMHistoTests: Total failures: 8
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 19868
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
  • Checked 12 log files, 9 edm output root files, 4 DQM output files
  • TriggerResults: no differences found

thomreis commented Oct 5, 2022

Thanks for the fix @fwyzard

fwyzard commented Oct 5, 2022

By the way, the code does look like it should be simplified and cleaned up a bit... but the minimal fix should be enough (?) for now.

thomreis commented Oct 5, 2022

I am currently working on unpacking the auxiliary collections and after that we wanted to start the Alpaka migration. Code simplification and cleanup could be done before or after the second task, depending on preference.

@clacaputo

+reconstruction

  • no reco changes

cmsbuild commented Oct 6, 2022

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

perrotta commented Oct 6, 2022

+1

@cmsbuild cmsbuild merged commit e120f54 into cms-sw:master Oct 6, 2022
@fwyzard fwyzard deleted the fix_ECAL_GPU_unpacking_overflow branch October 8, 2022 16:04
fwyzard commented Oct 8, 2022

> I am currently working on unpacking the auxiliary collections and after that we wanted to start the Alpaka migration. Code simplification and cleanup could be done before or after the second task, depending on preference.

I'd do the clean up after the migration.
