HCAL local reconstruction in Alpaka [14.0.x] #45324

Merged: 1 commit into cms-sw:CMSSW_14_0_X on Jul 3, 2024

fwyzard
Contributor

@fwyzard fwyzard commented Jun 27, 2024

PR description:

First implementation of MAHI in alpaka, running on CPU and GPU

  • reuse the minimisation code introduced by ECAL multifit
  • implement and use atomicMaxPair function
  • implement HCAL portable conditions
  • implement HCAL digis portable SoA
  • update the HCAL offline reconstruction using the alpaka modifier
  • HLT customisation for alpaka-based HCAL reconstruction, skipping the legacy conversion for HCAL PF clusters
  • various fixes from the code review
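
For readers unfamiliar with the atomicMaxPair item above: the general idea of an atomic maximum over a pair can be sketched in plain C++ by packing the pair into one 64-bit word and updating it with a compare-and-swap loop. This is only an illustration under stated assumptions, not the code in this PR. The names `packPair`, `unpackPair` and `atomicMaxPairSketch`, the (index, value) layout, and the use of `std::atomic` are all hypothetical; the function in this PR is written for alpaka and may differ in signature and semantics.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <utility>

static_assert(sizeof(float) == sizeof(std::uint32_t), "sketch assumes 32-bit float");

// Hypothetical helper: pack an (index, value) pair into a single 64-bit word.
inline std::uint64_t packPair(std::uint32_t index, float value) {
  std::uint32_t bits;
  std::memcpy(&bits, &value, sizeof(bits));
  return (static_cast<std::uint64_t>(index) << 32) | bits;
}

// Hypothetical helper: inverse of packPair.
inline std::pair<std::uint32_t, float> unpackPair(std::uint64_t word) {
  const std::uint32_t bits = static_cast<std::uint32_t>(word & 0xFFFFFFFFu);
  float value;
  std::memcpy(&value, &bits, sizeof(value));
  return {static_cast<std::uint32_t>(word >> 32), value};
}

// Store (index, value) in 'slot' only if 'value' is larger than the stored value.
inline void atomicMaxPairSketch(std::atomic<std::uint64_t>& slot, std::uint32_t index, float value) {
  const std::uint64_t candidate = packPair(index, value);
  std::uint64_t current = slot.load();
  // On failure compare_exchange_weak reloads 'current', so the comparison
  // is repeated against the latest stored pair.
  while (unpackPair(current).second < value && !slot.compare_exchange_weak(current, candidate)) {
  }
}

// Small usage example: the pair with the largest value wins.
inline void atomicMaxPairDemo() {
  std::atomic<std::uint64_t> slot{packPair(0, -1.0f)};
  atomicMaxPairSketch(slot, 3, 2.5f);
  atomicMaxPairSketch(slot, 7, 1.0f);
  assert(unpackPair(slot.load()).first == 3);
}
```

Packing both members into one word lets a single atomic operation update the value and its associated index together, so other threads never observe a mismatched pair.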

Quoting @kakwok

This PR ports the CUDA implementation of HCAL Local Reconstruction (MAHI) to Alpaka. The custom SoA data structures used in CUDA for HCAL condition data and rechits are replaced with PortableCollection. The current Alpaka implementation aims at reproducing the results from the CUDA implementation; no algorithmic changes are made.

There are 4 main pieces involved in the migration:
* digiConverter (hcalDigisProducerPortable): Convert CPU digis into SoA format
* Produce HCAL condition data in SoA format (Multiple producers)
* Mahi kernels (Mahi.dev.cc)
* Convert rechits from SoA to legacy format (HcalRecHitSoAToLegacy)
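
To make the "convert CPU digis into SoA format" step more concrete, here is a minimal standalone sketch of an array-of-structures to structure-of-arrays conversion. It is an illustration only: `DigiAoS`, `DigiSoA` and `toSoA` are hypothetical names, and the layout (one id column plus a fixed-stride sample column) is an assumption. The PR itself uses PortableCollection, whose real layout is not reproduced here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-channel digi in array-of-structures (AoS) form.
struct DigiAoS {
  std::uint32_t id;
  std::vector<std::uint16_t> samples;
};

// Hypothetical structure-of-arrays (SoA) form: one contiguous column per field.
struct DigiSoA {
  std::size_t stride = 0;             // number of sample slots per channel
  std::vector<std::uint32_t> ids;     // one entry per channel
  std::vector<std::uint16_t> samples; // ids.size() * stride entries, zero padded
};

// Copy every digi into the columns; channels with fewer than 'stride'
// samples are padded with zeros, extra samples are dropped.
inline DigiSoA toSoA(const std::vector<DigiAoS>& digis, std::size_t stride) {
  assert(stride > 0);
  DigiSoA soa;
  soa.stride = stride;
  soa.ids.reserve(digis.size());
  soa.samples.assign(digis.size() * stride, 0);
  for (std::size_t i = 0; i < digis.size(); ++i) {
    soa.ids.push_back(digis[i].id);
    const std::size_t n = std::min(stride, digis[i].samples.size());
    for (std::size_t j = 0; j < n; ++j) {
      soa.samples[i * stride + j] = digis[i].samples[j];
    }
  }
  return soa;
}
```

With contiguous columns, consecutive threads that each handle one channel read neighbouring memory locations, which is the usual motivation for an SoA layout on GPUs.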

More details on code design are presented in the recent HLT GPU development meetings:
https://indico.cern.ch/event/1350955/
https://indico.cern.ch/event/1350953/
https://indico.cern.ch/event/1350952/
https://indico.cern.ch/event/1230377/
https://indico.cern.ch/event/1230374/

Note: this PR requires #45277 and #45278.

PR validation:

Running the HLT menu v1.3 over 100k L1-accepted events gives consistent results:

  • on CPU with the legacy HCAL code;
  • on GPU with the CUDA HCAL code;
  • on GPU with the alpaka HCAL code;
  • on CPU with the alpaka HCAL code.

Very minor discrepancies in the alpaka CPU vs alpaka GPU results (< 10⁻⁴) are being investigated, and may be addressed in a follow-up PR.
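
As a rough sketch of how a CPU vs GPU discrepancy of this size could be quantified, the function below computes the largest relative difference between two sets of values. It assumes the 10⁻⁴ figure is read as a relative difference, which the description does not state, and `maxRelativeDifference` is a hypothetical helper, not part of this PR or of the validation that was actually run.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: largest relative difference between two equally sized
// sets of values (for example rechit energies from two backends).
inline double maxRelativeDifference(const std::vector<float>& a, const std::vector<float>& b) {
  assert(a.size() == b.size());
  double worst = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    const double x = a[i];
    const double y = b[i];
    const double scale = std::max(std::fabs(x), std::fabs(y));
    if (scale == 0.0) {
      continue;  // both values are exactly zero: no difference
    }
    worst = std::max(worst, std::fabs(x - y) / scale);
  }
  return worst;
}
```

A check such as `maxRelativeDifference(cpu, gpu) < 1e-4` would then express the tolerance quoted above, under the relative-difference assumption.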

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

Backport of #44910 to CMSSW 14.0.x for data taking.

  - first implementation of MAHI in alpaka, running on CPU and GPU
  - implement and use atomicMaxPair function
  - implement HCAL portable conditions
  - implement HCAL digis portable SoA
  - update the HCAL offline reconstruction using the alpaka modifier
  - HLT customisation for alpaka-based HCAL reconstruction,
    skipping the legacy conversion for HCAL PF clusters
  - various fixes from the code review
@fwyzard
Contributor Author

fwyzard commented Jun 27, 2024

backport #44910

@fwyzard
Contributor Author

fwyzard commented Jun 27, 2024

type hcal

@fwyzard
Contributor Author

fwyzard commented Jun 27, 2024

enable gpu

@cmsbuild cmsbuild added this to the CMSSW_14_0_X milestone Jun 27, 2024
@fwyzard
Contributor Author

fwyzard commented Jun 27, 2024

please test

@cmsbuild
Contributor

cmsbuild commented Jun 27, 2024

A new Pull Request was created by @fwyzard for CMSSW_14_0_X.

It involves the following packages:

  • CondFormats/DataRecord (alca, db)
  • CondFormats/HcalObjects (alca, db)
  • Configuration/PyReleaseValidation (pdmv, upgrade)
  • DQM/HcalTasks (dqm)
  • DataFormats/Common (core)
  • DataFormats/HcalDigi (simulation)
  • EventFilter/HcalRawToDigi (reconstruction)
  • HLTrigger/Configuration (hlt)
  • HeterogeneousCore/AlpakaInterface (heterogeneous)
  • RecoLocalCalo/Configuration (reconstruction)
  • RecoLocalCalo/HcalRecAlgos (reconstruction)
  • RecoLocalCalo/HcalRecProducers (reconstruction)

@AdrianoDee, @Dr15Jones, @Martin-Grunewald, @antoniovagnerini, @civanch, @consuegs, @francescobrivio, @fwyzard, @jfernan2, @kskovpen, @makortel, @mandrenguyen, @mdhildreth, @miquork, @mmusich, @nothingface0, @perrotta, @rvenditti, @saumyaphor4252, @smuzaffar, @srimanob, @subirsarkar, @sunilUIET, @syuvivida, @tjavaid can you please review it and eventually sign? Thanks.
@DryRun, @JanChyczynski, @Martin-Grunewald, @PonIlya, @ReyerBand, @abdoulline, @apsallid, @argiro, @bsunanda, @fabiocos, @makortel, @mariadalfonso, @missirol, @mmusich, @rchatter, @rovere, @rsreds, @sameasy, @seemasharmafnal, @silviodonato, @slomeo, @thomreis, @tocheng, @wang0jin, @wddgit, @youyingli, @yuanchao this is something you requested to watch as well.
@antoniovilela, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@cmsbuild
Contributor

cmsbuild commented Jun 27, 2024

cms-bot internal usage

@fwyzard
Contributor Author

fwyzard commented Jun 27, 2024

+heterogeneous

@mmusich
Contributor

mmusich commented Jun 27, 2024

@fwyzard do you mind changing the title of the PR to make it clear it's targeting the 14.0.X branch?

@fwyzard fwyzard changed the title HCAL local reconstruction in Alpaka HCAL local reconstruction in Alpaka [14.0.x] Jun 27, 2024
@fwyzard
Contributor Author

fwyzard commented Jun 27, 2024

oops, sorry... done

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-7de9f8/40114/summary.html
COMMIT: 525a14a
CMSSW: CMSSW_14_0_X_2024-06-26-2300/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/45324/40114/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 3
  • DQMHistoTests: Total histograms compared: 39744
  • DQMHistoTests: Total failures: 21
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 39723
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 2 files compared)
  • Checked 8 log files, 10 edm output root files, 3 DQM output files
  • TriggerResults: no differences found

@tjavaid

tjavaid commented Jul 1, 2024

+1

@fwyzard
Contributor Author

fwyzard commented Jul 2, 2024

@perrotta may I ask you to sign this PR, like you signed the 14.1.x version (#44910 (comment)) ?
thanks !

@perrotta
Contributor

perrotta commented Jul 2, 2024

@perrotta may I ask you to sign this PR, like you signed the 14.1.x version (#44910 (comment)) ?
thanks !

@fwyzard I usually wait for the master version to be merged and tested in the IBs: if there are no issues with those tests, I see no other reason not to sign this backport as well. If alca/db signatures are needed even before the master version is merged, just let me know: it can be done, but let's discuss it later on at the ORP.

@Martin-Grunewald
Contributor

type hlt-int

@Martin-Grunewald
Contributor

Martin-Grunewald commented Jul 3, 2024

@cms-sw/alca-l2 @cms-sw/core-l2 @cms-sw/db-l2 @cms-sw/pdmv-l2 - Please have a look at this backport PR and sign (the corresponding PR is in the master since CMSSW_14_1_X_2024-07-02-2300). Thank you!

@perrotta
Contributor

perrotta commented Jul 3, 2024

+1

  • Quite a few errors in the CMSSW_14_1_X_2024-07-02-1100 IB where the master version of this PR was merged, but none of them seems to be related to this PR. Having this PR merged in at least one cycle of the 14_0_X IBs before cutting a new release will definitely rule out the possibility that any of those IB faults are due to it.

@antoniovilela
Contributor

+1

  • Quite a few errors in the CMSSW_14_1_X_2024-07-02-1100 IB where the master version of this PR was merged, but none of them seems to be related to this PR. Having this PR merged in at least one cycle of the 14_0_X IBs before cutting a new release will definitely rule out the possibility that any of those IB faults are due to it.

Thank you Andrea. Agreed.

@antoniovilela
Contributor

@cms-sw/alca-l2 @cms-sw/core-l2 @cms-sw/db-l2 @cms-sw/pdmv-l2 - Please have a look at this backport PR and sign (the corresponding PR is in the master since CMSSW_14_1_X_2024-07-02-2300). Thank you!

@smuzaffar
CC'ing Shahzad for Core.

@smuzaffar
Contributor

+core

This is a backport of #44910. The core changes in DataFormats/Common/src/classes_def.xml, i.e. adding a new dictionary for edm::StdArray<unsigned short, 11>, look good.

@antoniovilela
Contributor

+1

@antoniovilela
Contributor

merge

@cmsbuild cmsbuild merged commit 335ab6c into cms-sw:CMSSW_14_0_X Jul 3, 2024
13 checks passed