Fix PFClusterSoAProducer to read a device collection #46830

fwyzard
Contributor

@fwyzard fwyzard commented Nov 30, 2024

PR description:

Fix PFClusterSoAProducer to read a device collection instead of a host collection, when running on a GPU backend.

Note: this is a quick workaround to let the device code use the device collection while still being able to access the actual number of PF rechits on the host side. It should be replaced with a better and more general implementation, and the use of the host collection should be removed.

PR validation:

Full 2024 HLT menu works with these changes, both on CPU and on GPU.

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

May be backported to 14.2.x or earlier if there is interest.

@fwyzard
Contributor Author

fwyzard commented Nov 30, 2024

enable gpu

@fwyzard
Contributor Author

fwyzard commented Nov 30, 2024

please test

@cmsbuild
Contributor

cmsbuild commented Nov 30, 2024

cms-bot internal usage

@cmsbuild
Contributor

A new Pull Request was created by @fwyzard for master.

It involves the following packages:

  • RecoParticleFlow/PFClusterProducer (reconstruction)

@jfernan2, @mandrenguyen can you please review it and eventually sign? Thanks.
@felicepantaleo, @hatakeyamak, @lgray, @missirol, @mmarionncern, @rovere, @sameasy, @seemasharmafnal this is something you requested to watch as well.
@antoniovilela, @mandrenguyen, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@fwyzard
Contributor Author

fwyzard commented Nov 30, 2024

please test

@cmsbuild
Contributor

Pull request #46830 was updated. @jfernan2, @mandrenguyen can you please check and sign again.

@fwyzard
Contributor Author

fwyzard commented Nov 30, 2024

please hold

@fwyzard
Contributor Author

fwyzard commented Nov 30, 2024

OK, the proposed fix cannot work, because we need to know the number of rechits on the host:

      if (pfRecHits->metadata().size() != 0)
        nRH = pfRecHits->size();
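
To illustrate why this read is a problem for a device collection, here is a minimal host-only sketch. It assumes, as the snippet above suggests, that `metadata().size()` is the allocated capacity known to the host, while `size()` is a scalar stored alongside the data. `MockDeviceRecHits`, `copyScalarToHost` and `hostSideSize` are hypothetical names, not CMSSW or alpaka APIs, and the `std::memcpy` only stands in for an explicit device-to-host copy.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a device-resident collection: `capacity` is known
// on the host, while `buffer` models memory the host must not dereference.
struct MockDeviceRecHits {
  int32_t capacity = 0;         // plays the role of metadata().size()
  std::vector<int32_t> buffer;  // buffer[0] plays the role of the size() scalar
};

// Stand-in for an explicit device-to-host copy of a single scalar.
int32_t copyScalarToHost(const int32_t* devicePtr) {
  int32_t hostValue = 0;
  std::memcpy(&hostValue, devicePtr, sizeof(hostValue));
  return hostValue;
}

// Host-side element count: skip the copy entirely when nothing was allocated.
int32_t hostSideSize(const MockDeviceRecHits& hits) {
  if (hits.capacity == 0)
    return 0;
  assert(!hits.buffer.empty());
  return copyScalarToHost(hits.buffer.data());
}
```

The point of the sketch is only that the element count has to reach the host through an explicit copy (or through a host-side copy of the collection) before host code can use it.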

@fwyzard
Contributor Author

fwyzard commented Nov 30, 2024

I guess this is a common enough pattern that we should find a general solution 🤔

@cmsbuild
Contributor

-1

Failed Tests: RelVals-GPU
Size: This PR adds an extra 32KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-0aff9f/43166/summary.html
COMMIT: c0252e9
CMSSW: CMSSW_15_0_X_2024-11-29-2300/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/46830/43166/install.sh to create a dev area with all the needed externals and cmssw changes.

RelVals-GPU

  • 12834.423: 12834.423_TTbar_14TeV+2024_Patatrack_HCALOnlyGPUandAlpaka_Validation/step3_TTbar_14TeV+2024_Patatrack_HCALOnlyGPUandAlpaka_Validation.log
  • 12834.422: 12834.422_TTbar_14TeV+2024_Patatrack_HCALOnlyAlpaka_Validation/step3_TTbar_14TeV+2024_Patatrack_HCALOnlyAlpaka_Validation.log

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 10 differences found in the comparisons
  • DQMHistoTests: Total files compared: 46
  • DQMHistoTests: Total histograms compared: 3484682
  • DQMHistoTests: Total failures: 521
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3484141
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 45 files compared)
  • Checked 202 log files, 172 edm output root files, 46 DQM output files
  • TriggerResults: found differences in 1 / 44 workflows

@fwyzard
Contributor Author

fwyzard commented Nov 30, 2024

please test

@fwyzard
Contributor Author

fwyzard commented Nov 30, 2024

please unhold

@fwyzard
Contributor Author

fwyzard commented Nov 30, 2024

assign heterogeneous

@fwyzard
Contributor Author

fwyzard commented Nov 30, 2024

@makortel what do you think ?

@cmsbuild
Contributor

New categories assigned: heterogeneous

@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Contributor

Pull request #46830 was updated. @fwyzard, @jfernan2, @makortel, @mandrenguyen can you please check and sign again.

@cmsbuild
Contributor

+1

Size: This PR adds an extra 32KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-0aff9f/43169/summary.html
COMMIT: 2bba89f
CMSSW: CMSSW_15_0_X_2024-11-29-2300/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/46830/43169/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially added 2 lines to the logs
  • Reco comparison results: 71 differences found in the comparisons
  • DQMHistoTests: Total files compared: 46
  • DQMHistoTests: Total histograms compared: 3484682
  • DQMHistoTests: Total failures: 1255
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3483407
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 45 files compared)
  • Checked 202 log files, 172 edm output root files, 46 DQM output files
  • TriggerResults: found differences in 1 / 44 workflows

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 7
  • DQMHistoTests: Total histograms compared: 53058
  • DQMHistoTests: Total failures: 54
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 53004
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 6 files compared)
  • Checked 24 log files, 30 edm output root files, 7 DQM output files
  • TriggerResults: no differences found

@mmusich
Contributor

mmusich commented Dec 2, 2024

type bug-fix

@makortel
Contributor

makortel commented Dec 2, 2024

> @makortel what do you think ?

Looks reasonable for a (better) workaround.

> I guess this is a common enough pattern that we should find a general solution 🤔

Agreed, how about a new GitHub issue on this topic?

(IIRC in the pixel code the approach was to allocate memory and launch kernels based on the capacity of the containers rather than the "number of elements", and the "number of elements" was used only on the device code to terminate the loops early)
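
The capacity-based approach described in the previous comment can be sketched roughly as follows. This is a host-only emulation under stated assumptions, not the actual pixel code: `scaleKernelThread` and `launchScale` are hypothetical names, the outer loop stands in for a kernel launch, and `nElements` models a counter that would live in device memory on a real backend.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Emulated device code for one "thread": the element count is read here,
// on the device side, and only bounds the strided loop.
void scaleKernelThread(int32_t threadIdx, int32_t gridSize, const int32_t* nElements,
                       float* data, float factor) {
  for (int32_t i = threadIdx; i < *nElements; i += gridSize)
    data[i] *= factor;
}

// Emulated host-side launch: the number of threads comes from the capacity
// alone, so the host never dereferences `nElements`.
void launchScale(std::vector<float>& data, const int32_t* nElements, float factor) {
  assert(nElements != nullptr);
  const int32_t capacity = static_cast<int32_t>(data.size());
  for (int32_t t = 0; t < capacity; ++t)
    scaleKernelThread(t, capacity, nElements, data.data(), factor);
}
```

In this sketch, threads with an index at or beyond the actual count exit their loop immediately, so sizing the launch from the capacity costs some idle threads but avoids a device-to-host synchronisation just to learn the count.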

@jfernan2
Contributor

jfernan2 commented Dec 3, 2024

+1

@fwyzard
Contributor Author

fwyzard commented Dec 6, 2024

See #46887 for an alternative (and hopefully better) approach.

@fwyzard
Contributor Author

fwyzard commented Dec 9, 2024

Here is the impact of this PR on the HCAL+PF reconstruction, measured on a machine with 2× AMD Bergamo CPUs and 4× NVIDIA L4 GPUs.

baseline

Running 4 times over 20500 events with 16 jobs, each with 32 threads, 24 streams, and 1 GPUs
  7717.3 ±   0.2 ev/s (20000 events, 96.5% overlap),   7708.2 ±   0.2 ev/s (⩾ 17570 events, overlap-only)
  7731.3 ±   0.2 ev/s (20000 events, 96.6% overlap),   7726.6 ±   0.2 ev/s (⩾ 17680 events, overlap-only)
  7744.3 ±   0.2 ev/s (20000 events, 97.2% overlap),   7738.5 ±   0.2 ev/s (⩾ 17880 events, overlap-only)
  7737.4 ±   0.2 ev/s (20000 events, 96.9% overlap),   7730.5 ±   0.2 ev/s (⩾ 17690 events, overlap-only)
 --------------------
  7732.6 ±  11.5 ev/s,   7725.9 ±  12.8 ev/s (⩾ 17570 events, overlap-only)

with this PR

Running 4 times over 20500 events with 16 jobs, each with 32 threads, 24 streams, and 1 GPUs
  8543.5 ±   0.2 ev/s (20000 events, 98.6% overlap),   8541.6 ±   0.2 ev/s (⩾ 19300 events, overlap-only)
  8533.4 ±   0.1 ev/s (20000 events, 98.0% overlap),   8532.1 ±   0.1 ev/s (⩾ 19070 events, overlap-only)
  8546.9 ±   0.1 ev/s (20000 events, 98.4% overlap),   8545.7 ±   0.1 ev/s (⩾ 19240 events, overlap-only)
  8538.7 ±   0.1 ev/s (20000 events, 98.5% overlap),   8538.3 ±   0.1 ev/s (⩾ 19250 events, overlap-only)
 --------------------
  8540.6 ±   5.9 ev/s,   8539.4 ±   5.8 ev/s (⩾ 19070 events, overlap-only)

i.e. a 10% speed-up (from about 7730 ev/s to about 8540 ev/s).
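
For reference, the quoted figure follows from the two mean throughputs reported above. The helper below is only an illustrative sketch with a hypothetical name (`relativeSpeedUp`) and is not part of the benchmarking tool.

```cpp
#include <cassert>
#include <cmath>

// Relative throughput gain of `updated` over `baseline`, as a fraction
// (0.10 means 10% more events per second).
double relativeSpeedUp(double baseline, double updated) {
  assert(baseline > 0.0);
  return updated / baseline - 1.0;
}
```

With the mean values from the two runs, `relativeSpeedUp(7732.6, 8540.6)` evaluates to roughly 0.104, consistent with the 10% quoted.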

@fwyzard
Contributor Author

fwyzard commented Dec 9, 2024

Closing in favour of #46887.

@fwyzard fwyzard closed this Dec 9, 2024
@fwyzard fwyzard deleted the fix_PFClusterSoAProducer_input_collection branch January 21, 2025 16:28
5 participants