Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[14_0_X] Workaround to produce exactly same data products in Serial and CUDA backends in Alpaka modules possibly used at HLT #44699

Conversation

makortel
Copy link
Contributor

PR description:

This PR is a backport of #44698. Original description

This PR is a temporary workaround for the issue discussed in #44643. In short, with an HLT menu that uses Alpaka modules, when some HLT processes use GPU and some do not, the fact that the different backends of Alpaka modules produce different transient data products cause the ProductIDs to be different between the CPU-only and CPU+GPU processes, and presently there is not enough metadata available in the final streamer file for the framework to keep properly track of the various indices, that leads to de-referencing of edm::Ref to fail in a subsequent job (for longer description see #44643 (comment)).

This PR works around the problem by making a subset(*) of the Alpaka EDProducers on the CPU serial backend, for each "device-side data product" they produce (that in reality are the "host-side data products" directly) they register the production of the corresponding CUDA backend data product. This hack makes the CPU-serial and CUDA backend EDProducers to register the production of exactly the same data products, that leads to equal ProductIDs for the same data products between CPU-only and CPU+GPU jobs, circumventing the problem.

(*) ECAL, PF, and phase1 Pixel EDProducers, that are either being used in the present HLT menu, or are planned to be used in the near future

The use of strings for the CUDA backend data products is ugly, but avoids the need to have a compile-time dependence on the CUDA backend code in the CPU serial backend code (I actually tried that first, but ran into compilation issues; probably our present build rules prevent the use of Alpaka CUDA backend types in the CPU serial backend code). At runtime the added explicit CUDA dependence will likely break on platforms that do not support CUDA (all our code directly depending on CUDA would be broken anyway, so the temporary loss of functionality seems acceptable).

This PR is intended to be reverted when the necessary metadata to deal with the different ProductIDs in different HLT processes (CPU-only vs. CPU+GPU) gets propagated to the framework), whose details are being discussed in #44643 .

PR validation:

See #44698

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

Backport of #44698

…ackends in Alpaka modules possibly used at HLT

To be reverted in the near future after the necessary metadata to deal
with different ProductIDs in different HLT processes (CPU-only vs.
CPU+GPU) gets propagated to the framework.
@cmsbuild
Copy link
Contributor

cmsbuild commented Apr 10, 2024

A new Pull Request was created by @makortel for CMSSW_14_0_X.

It involves the following packages:

  • EventFilter/EcalRawToDigi (reconstruction)
  • HeterogeneousCore/AlpakaCore (heterogeneous)
  • RecoLocalCalo/EcalRecProducers (reconstruction)
  • RecoLocalTracker/SiPixelClusterizer (reconstruction)
  • RecoLocalTracker/SiPixelRecHits (reconstruction)
  • RecoParticleFlow/PFClusterProducer (reconstruction)
  • RecoParticleFlow/PFRecHitProducer (reconstruction)
  • RecoTracker/PixelSeeding (reconstruction)
  • RecoTracker/PixelVertexFinding (reconstruction)
  • RecoVertex/BeamSpotProducer (alca, reconstruction)

@perrotta, @fwyzard, @mandrenguyen, @consuegs, @cmsbuild, @jfernan2, @saumyaphor4252, @makortel can you please review it and eventually sign? Thanks.
@dkotlins, @mtosi, @tvami, @JanFSchulte, @GiacomoSguazzoni, @threus, @youyingli, @thomreis, @francescobrivio, @rovere, @sameasy, @ferencek, @tsusa, @missirol, @argiro, @VourMa, @ReyerBand, @felicepantaleo, @apsallid, @fabiocos, @Martin-Grunewald, @VinInn, @wang0jin, @gpetruc, @rchatter, @lgray, @mmarionncern, @dgulhan, @mmusich, @hatakeyamak, @yuanchao, @rsreds, @seemasharmafnal, @mroguljic, @tocheng this is something you requested to watch as well.
@antoniovilela, @sextonkennedy, @rappoccio you are the release manager for this.

cms-bot commands are listed here

@cmsbuild
Copy link
Contributor

cmsbuild commented Apr 10, 2024

cms-bot internal usage

@makortel
Copy link
Contributor Author

enable gpu

@makortel
Copy link
Contributor Author

@cmsbuild, please test

@makortel
Copy link
Contributor Author

urgent

@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-69312b/38763/summary.html
COMMIT: a77f7cd
CMSSW: CMSSW_14_0_X_2024-04-10-2300/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/44699/38763/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially added 84 lines to the logs
  • Reco comparison results: 55 differences found in the comparisons
  • DQMHistoTests: Total files compared: 48
  • DQMHistoTests: Total histograms compared: 3308451
  • DQMHistoTests: Total failures: 9
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3308422
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
  • Checked 202 log files, 165 edm output root files, 48 DQM output files
  • TriggerResults: no differences found

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 32 differences found in the comparisons
  • DQMHistoTests: Total files compared: 3
  • DQMHistoTests: Total histograms compared: 39740
  • DQMHistoTests: Total failures: 1093
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 38647
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 2 files compared)
  • Checked 8 log files, 10 edm output root files, 3 DQM output files
  • TriggerResults: no differences found

@jfernan2
Copy link
Contributor

+1

@perrotta
Copy link
Contributor

backport of #44698

@perrotta
Copy link
Contributor

+alca

@fwyzard
Copy link
Contributor

fwyzard commented Apr 11, 2024

+heterogeneous

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next CMSSW_14_0_X IBs (tests are also fine) and once validation in the development release cycle CMSSW_14_1_X is complete. This pull request will now be reviewed by the release team before it's merged. @sextonkennedy, @antoniovilela, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@makortel
Copy link
Contributor Author

@cms-sw/orp-l2 I believe this PR would satisfy the constraints of a patch release

@antoniovilela
Copy link
Contributor

@cms-sw/orp-l2 I believe this PR would satisfy the constraints of a patch release

Many thanks Matti. I was about to ask you this.

@antoniovilela
Copy link
Contributor

+1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants