Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for SiPixelRecHitFromCUDA crash during online GPU tests #35229

Merged

Conversation

czangela
Copy link
Contributor

@czangela czangela commented Sep 10, 2021

PR description:

These changes address the GPU error brought up in #34831 comment. [1]

The was error apparently was due to a digi at row 0 col 0, and the adc value not being (correctly) set in the packing, and then when checking in SiPixelDigisClustersFromSoA.cc#137 that one digi was skipped, so the associated cluster never got created, so later one ends up with one more rechit than actual clusters in that one module.

This is the actual digi being lost:
Digi index 754; clusid 8; rawIdArr 344545284; adc 25661, pdigi: 00000000000000000000000000000000

The condition digis.pdigi(i) == 0 was initially meant to check whether the digi was left uninitialized from the RawToDigi_kernel. But the problem is that a packed digi can be all zeros. For checking uninitialized digis, this condition is changed to digis.rawIdArr(i) == 0.

Other than that, to filter noisy/dead pixel digis the condition digis.adc(i) == 0 was added well, based on the treating of noisy/dead pixels in the calibDigis kernel.

[1]

cmsRun: /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_11_3_4-slc7_amd64_gcc900/build/CMSSW_11_3_4-build/tmp/BUILDROOT/58d469321d6ecb0427dd1e9f6c3703d5/opt/cmssw/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_4/src/RecoLocalTracker/SiPixelRecHits/plugins/SiPixelRecHitFromCUDA.cc:143: 
virtual void SiPixelRecHitFromCUDA::produce(edm::Event&, const edm::EventSetup&): 
Assertion `nhits <= dsv.size()' failed.


A fatal system signal has occurred: abort signal
The following is the call stack containing the origin of the signal.

...

Module: SiPixelRecHitFromCUDA:hltSiPixelRecHits@cuda (crashed)

PR validation:

Atm it was only checked that the recipe described in #34831 does not crash in 11_3_4.

if this PR is a backport please specify the original PR and why you need to backport that PR:

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-35229/25186

  • This PR adds an extra 12KB to repository

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @czangela for master.

It involves the following packages:

  • RecoLocalTracker/SiPixelClusterizer (reconstruction)

@jpata, @cmsbuild, @slava77 can you please review it and eventually sign? Thanks.
@mtosi, @felicepantaleo, @GiacomoSguazzoni, @JanFSchulte, @rovere, @VinInn, @OzAmram, @ferencek, @dkotlins, @gpetruc, @mmusich, @threus, @tvami this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@slava77
Copy link
Contributor

slava77 commented Sep 10, 2021

enable gpu

@slava77
Copy link
Contributor

slava77 commented Sep 10, 2021

abort test

@slava77
Copy link
Contributor

slava77 commented Sep 10, 2021

@cmsbuild please test

@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-8b6b98/18496/summary.html
COMMIT: 4ebe06e
CMSSW: CMSSW_12_1_X_2021-09-10-1100/slc7_amd64_gcc900
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/35229/18496/install.sh to create a dev area with all the needed externals and cmssw changes.

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 4
  • DQMHistoTests: Total histograms compared: 19735
  • DQMHistoTests: Total failures: 6
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 19729
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
  • Checked 12 log files, 9 edm output root files, 4 DQM output files
  • TriggerResults: no differences found

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 3 differences found in the comparisons
  • DQMHistoTests: Total files compared: 39
  • DQMHistoTests: Total histograms compared: 3001001
  • DQMHistoTests: Total failures: 5
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 3000973
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.004 KiB( 38 files compared)
  • DQMHistoSizes: changed ( 312.0 ): 0.004 KiB MessageLogger/Warnings
  • Checked 165 log files, 37 edm output root files, 39 DQM output files
  • TriggerResults: no differences found

@tvami
Copy link
Contributor

tvami commented Sep 10, 2021

type bug-fix

@slava77
Copy link
Contributor

slava77 commented Sep 11, 2021

Probably an extra look at packing on the GPU is required and maybe a fix as well.

shouldn't the fix be done upstream?

What was the reason that a non-zero adc was computed for this digi while the packed value is 0?
Is it meaningful to write out a digi with a packed value 0 in the output digi collection?

Unless it's clear that this behavior is expected by design, if this is somehow time critical and can not be fixed upstream now, please add a LogWarning message for the pdigi==0 && adc != 0.

@tsusa
Copy link
Contributor

tsusa commented Sep 14, 2021

Probably an extra look at packing on the GPU is required and maybe a fix as well.

shouldn't the fix be done upstream?

What was the reason that a non-zero adc was computed for this digi while the packed value is 0?
Is it meaningful to write out a digi with a packed value 0 in the output digi collection?

Unless it's clear that this behavior is expected by design, if this is somehow time critical and can not be fixed upstream now, please add a LogWarning message for the pdigi==0 && adc != 0

@slava77 this was meant to be a quick fix just to prevent a crash. We are working on the proper (upstream) fix

@czangela czangela force-pushed the gpu_crash_34831_fix_2021_09_10_release12_1_X branch from 4ebe06e to f6d32fa Compare September 15, 2021 11:40
@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-35229/25297

  • This PR adds an extra 20KB to repository

@cmsbuild
Copy link
Contributor

Pull request #35229 was updated. @makortel, @jpata, @cmsbuild, @fwyzard, @slava77 can you please check and sign again.

@tsusa
Copy link
Contributor

tsusa commented Sep 16, 2021

@cmsbuild please test

@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-8b6b98/18653/summary.html
COMMIT: 1444e50
CMSSW: CMSSW_12_1_X_2021-09-15-2300/slc7_amd64_gcc900
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/35229/18653/install.sh to create a dev area with all the needed externals and cmssw changes.

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 4
  • DQMHistoTests: Total histograms compared: 19735
  • DQMHistoTests: Total failures: 6
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 19729
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
  • Checked 12 log files, 9 edm output root files, 4 DQM output files
  • TriggerResults: no differences found

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 3 differences found in the comparisons
  • DQMHistoTests: Total files compared: 39
  • DQMHistoTests: Total histograms compared: 3000833
  • DQMHistoTests: Total failures: 5
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 3000805
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.004 KiB( 38 files compared)
  • DQMHistoSizes: changed ( 312.0 ): 0.004 KiB MessageLogger/Warnings
  • Checked 165 log files, 37 edm output root files, 39 DQM output files
  • TriggerResults: no differences found

@slava77
Copy link
Contributor

slava77 commented Sep 16, 2021

+reconstruction

for #35229 1444e50

  • code changes are in line with the PR follow up review and discussion
    • the PR description is a bit outdated; @czangela please update the text there.
  • jenkins tests pass and comparisons with the baseline show no relevant differences (the small diffs in the GPU pixel tracking wf 10824.502 are more likely from non-repeatability of the workflow results in vertexing)

@fwyzard
Copy link
Contributor

fwyzard commented Sep 16, 2021

Sorry for the delay, it took me a wile to put together an online-like HLT menu for CMSSW 12.0.x/12.1.x.

I can confirm that this PR fixes the crash that was observed in run 344676 .

@czangela, could you prepare a backport to CMSSW 12.0.x, please ?

@fwyzard
Copy link
Contributor

fwyzard commented Sep 16, 2021

+heterogeneous

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@perrotta
Copy link
Contributor

+1

  • bug fix

@czangela
Copy link
Contributor Author

I updated the description and added the backport to 12_0_X. @slava77 @fwyzard

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants