Removing CUDA/gpu from Pixel code configs and dropping all CUDA wfs #46853

Open
wants to merge 3 commits into
base: master

AdrianoDee
Contributor

@AdrianoDee AdrianoDee commented Dec 3, 2024

PR description:

This PR proposes:

  • the removal of all the CUDA modules from pixel-related configs and of all the CUDA Patatrack wfs;
  • the removal of pixelNtupletFit_cff modifier.

A subsequent step would be to remove the gpu modifier, but since this involves code from many parties, I prefer to have it separated.
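Since dropping a modifier touches configs owned by several groups, it can help to list every file that still names it before deleting anything. The following is a minimal sketch of such a scan. The helper `find_modifier_references`, the file paths, and the sample file contents are all hypothetical and only illustrate the idea; they are not taken from this PR's diff.

```python
import re


def find_modifier_references(files, modifier_names):
    """Return {path: [(line_number, stripped_line), ...]} for lines naming a modifier.

    `files` maps a path to the text of that file. Names are matched as whole
    words, so `gpu` does not match inside `gpu_cff`.
    """
    alternatives = "|".join(re.escape(name) for name in modifier_names)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b")
    hits = {}
    for path, text in files.items():
        matches = [
            (number, line.strip())
            for number, line in enumerate(text.splitlines(), start=1)
            if pattern.search(line)
        ]
        if matches:
            hits[path] = matches
    return hits


# Hypothetical file contents, for illustration only.
sample_files = {
    "PackageA/python/example_cff.py": (
        "import FWCore.ParameterSet.Config as cms\n"
        "from Configuration.ProcessModifiers.pixelNtupletFit_cff import pixelNtupletFit\n"
        "pixelNtupletFit.toModify(someModule, someParameter = True)\n"
    ),
    "PackageB/python/other_cff.py": "import FWCore.ParameterSet.Config as cms\n",
}

references = find_modifier_references(sample_files, ["pixelNtupletFit", "gpu"])
for path, lines in references.items():
    for number, text in lines:
        print(f"{path}:{number}: {text}")
```

In a real checkout the `files` mapping would be filled by reading the `*.py` configs of the packages listed by cms-bot; that step is left out here to keep the sketch self-contained.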

@cmsbuild
Contributor

cmsbuild commented Dec 3, 2024

cms-bot internal usage

@AdrianoDee
Contributor Author

AdrianoDee commented Dec 3, 2024

Well, apparently these changes already involve many parties, so maybe I'll push the removal of the gpu modifier here as well.

@AdrianoDee
Contributor Author

enable gpu

@AdrianoDee
Contributor Author

please test

@cmsbuild
Contributor

cmsbuild commented Dec 3, 2024

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-46853/42877

@cmsbuild
Contributor

cmsbuild commented Dec 3, 2024

A new Pull Request was created by @AdrianoDee for master.

It involves the following packages:

  • Configuration/ProcessModifiers (operations)
  • Configuration/PyReleaseValidation (pdmv, upgrade)
  • EventFilter/SiPixelRawToDigi (reconstruction)
  • RecoHI/HiTracking (reconstruction)
  • RecoLocalTracker/SiPixelClusterizer (reconstruction)
  • RecoLocalTracker/SiPixelRecHits (reconstruction)
  • RecoTracker/PixelTrackFitting (reconstruction)
  • RecoVertex/BeamSpotProducer (reconstruction, alca)
  • RecoVertex/Configuration (reconstruction)

@AdrianoDee, @Moanwar, @antoniovilela, @atpathak, @consuegs, @davidlange6, @DickyChant, @fabiocos, @jfernan2, @mandrenguyen, @miquork, @perrotta, @rappoccio, @srimanob, @subirsarkar can you please review it and eventually sign? Thanks.
@GiacomoSguazzoni, @Martin-Grunewald, @VinInn, @VourMa, @dgulhan, @dkotlins, @fabiocos, @felicepantaleo, @ferencek, @francescobrivio, @gpetruc, @jazzitup, @kurtejung, @makortel, @mandrenguyen, @martinamalberti, @missirol, @mmusich, @mroguljic, @mtosi, @rovere, @rsreds, @slomeo, @threus, @tocheng, @tsusa, @tvami, @yenjie, @yetkinyilmaz, @yuanchao this is something you requested to watch as well.
@antoniovilela, @mandrenguyen, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@jfernan2
Contributor

jfernan2 commented Dec 3, 2024

assign heterogeneous

@cmsbuild
Contributor

cmsbuild commented Dec 3, 2024

New categories assigned: heterogeneous

@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Contributor

cmsbuild commented Dec 3, 2024

-1

Failed Tests: UnitTests RelVals RelVals-GPU RelVals-INPUT AddOn
Size: This PR adds an extra 56KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-693680/43213/summary.html
COMMIT: 5d626de
CMSSW: CMSSW_15_0_X_2024-12-03-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/46853/43213/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found 34 errors in the following unit tests:

---> test TestDQMOnlineClient-es_dqm_sourceclient had ERRORS
---> test TestDQMOnlineClient-beamhlt_dqm_sourceclient had ERRORS
---> test TestDQMOnlineClient-csc_dqm_sourceclient had ERRORS
and more ...

RelVals

  • 135.4: 135.4_ZEEFS_13/step1_ZEEFS_13.log
  • 1306.0: 1306.0_SingleMuPt1_UP15/step1_SingleMuPt1_UP15.log
  • 7.3: 7.3_CosmicsSPLoose2018/step1_CosmicsSPLoose2018.log
Expand to see more relval errors ...

RelVals-GPU

  • 12834.423: 12834.423_TTbar_14TeV+2024_Patatrack_HCALOnlyGPUandAlpaka_Validation/step1_TTbar_14TeV+2024_Patatrack_HCALOnlyGPUandAlpaka_Validation.log
  • 12834.422: 12834.422_TTbar_14TeV+2024_Patatrack_HCALOnlyAlpaka_Validation/step1_TTbar_14TeV+2024_Patatrack_HCALOnlyAlpaka_Validation.log
  • 12834.406: 12834.406_TTbar_14TeV+2024_Patatrack_PixelOnlyTripletsAlpaka/step1_TTbar_14TeV+2024_Patatrack_PixelOnlyTripletsAlpaka.log
Expand to see more relval errors ...

RelVals-INPUT

  • 159.01: 159.01_HydjetQ_reminiaodPbPb2022_INPUT/step2_HydjetQ_reminiaodPbPb2022_INPUT.log
  • 136.875: 136.875_RunDoubleMuon2018C/step2_RunDoubleMuon2018C.log
  • 2500.225: 2500.225_jmeNANOrePuppimc140X/step2_jmeNANOrePuppimc140X.log
Expand to see more relval errors ...

AddOn Tests

UNKNOWN
UNKNOWN
UNKNOWN
Expand to see more addon errors ...
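
The RelVals entries in the report above each contain a log path that looks like `<workflow>_<name>/step<N>_<name>.log`. As a minimal sketch, assuming that layout (inferred only from the entries listed here, not from any cms-bot documentation), such a path could be split into workflow number, workflow name, and failing step. `parse_relval_log` is a hypothetical helper.

```python
import re

# Assumed layout: "<workflow>_<name>/step<N>_<name>.log"
RELVAL_LOG = re.compile(
    r"^(?P<workflow>\d+(?:\.\d+)?)_(?P<name>[^/]+)/step(?P<step>\d+)_.+\.log$"
)


def parse_relval_log(path):
    """Split a relval log path into its parts, or return None if it does not fit."""
    match = RELVAL_LOG.match(path)
    if match is None:
        return None
    return {
        # Kept as a string so that "159.01" is not confused with "159.1".
        "workflow": match.group("workflow"),
        "name": match.group("name"),
        "step": int(match.group("step")),
    }


# Paths copied from the report above.
failed_logs = [
    "135.4_ZEEFS_13/step1_ZEEFS_13.log",
    "159.01_HydjetQ_reminiaodPbPb2022_INPUT/step2_HydjetQ_reminiaodPbPb2022_INPUT.log",
    "136.875_RunDoubleMuon2018C/step2_RunDoubleMuon2018C.log",
]

for log in failed_logs:
    info = parse_relval_log(log)
    print(f"workflow {info['workflow']} ({info['name']}) failed at step {info['step']}")
```

Grouping the parsed entries by `step` would show whether failures cluster in the first step, as in the RelVals list, or in a later one, as in the RelVals-INPUT list.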

@cmsbuild
Contributor

+1

Size: This PR adds an extra 28KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-693680/43502/summary.html
COMMIT: 40d0ebc
CMSSW: CMSSW_15_0_X_2024-12-17-1100/el8_amd64_gcc12
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/46853/43502/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially removed 108 lines from the logs
  • Reco comparison results: 132 differences found in the comparisons
  • DQMHistoTests: Total files compared: 47
  • DQMHistoTests: Total histograms compared: 3623082
  • DQMHistoTests: Total failures: 2686
  • DQMHistoTests: Total nulls: 11
  • DQMHistoTests: Total successes: 3620365
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: -0.81 KiB (46 files compared)
  • DQMHistoSizes: changed ( 12434.7,... ): -0.012 KiB MessageLogger/Errors
  • DQMHistoSizes: changed ( 12434.7,... ): -0.012 KiB MessageLogger/Warnings
  • DQMHistoSizes: changed ( 24834.911,... ): -0.008 KiB MessageLogger/Errors
  • DQMHistoSizes: changed ( 24834.911,... ): -0.008 KiB MessageLogger/Warnings
  • DQMHistoSizes: changed ( 4.22,... ): -0.004 KiB MessageLogger/Errors
  • DQMHistoSizes: changed ( 4.22,... ): -0.004 KiB MessageLogger/Warnings
  • Checked 206 log files, 177 edm output root files, 47 DQM output files
  • TriggerResults: no differences found
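
The DQMHistoTests counts in this summary can be cross-checked with simple arithmetic. The sketch below copies the numbers from the list above and checks that failures, nulls, successes, and skipped add up to the total number of histograms compared. That these four categories partition the total is an observation about these particular numbers, not a documented property of the comparison tool.

```python
# Counts copied from the DQMHistoTests lines of the comparison summary.
dqm_counts = {
    "compared": 3623082,
    "failures": 2686,
    "nulls": 11,
    "successes": 3620365,
    "skipped": 20,
}

# Sum of the four outcome categories.
accounted = sum(
    dqm_counts[key] for key in ("failures", "nulls", "successes", "skipped")
)

# Fraction of compared histograms that were reported as failures.
failure_fraction = dqm_counts["failures"] / dqm_counts["compared"]

print(f"accounted for: {accounted} of {dqm_counts['compared']}")
print(f"failure fraction: {failure_fraction:.4%}")
```

For these values the categories sum exactly to 3623082, and the failures amount to roughly 0.07% of the compared histograms.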

GPU Comparison Summary

Summary:

@AdrianoDee
Contributor Author

+pdmv

@Moanwar
Contributor

Moanwar commented Dec 17, 2024

+Upgrade

@jfernan2
Contributor

+1

@antoniovagnerini

Since this PR affects the DQM clients, from DQM side we need to carry out the usual P5 tests. However, the playback machines are down due to the programmed YETs, so we will be able to test only after the Christmas break.

@fwyzard
Contributor

fwyzard commented Dec 18, 2024

> the playback machines are down due to the programmed YETs, so we will be able to test only after the Christmas break

On the other hand, blocking the development of other groups for weeks does not seem like a good approach :-/

@fwyzard
Contributor

fwyzard commented Dec 18, 2024

@antoniovagnerini maybe off topic, but which machines would you need to be on to run the tests?

@mmusich
Contributor

mmusich commented Dec 18, 2024

> Since this PR affects the DQM clients, from DQM side we need to carry out the usual P5 tests. However, the playback machines are down due to the programmed YETs, so we will be able to test only after the Christmas break.

For this particular case, where the changes are technical, I think almost everything that could be tested got somehow tested already by the unit tests of the package (the actual running of the clients on collisions data and the python compilation for all the other run keys, see multiple failures in the history of the tests above).
Do you have in mind any other possible failure scenarios that the unit tests are unable to probe?
If that's the case, it would be good to update the testing suite to be able to decouple the cmssw work from testing at P5 (while that's still certainly necessary for other reasons) -- that incidentally would also relieve the DQM Docs of some work.

@antoniovagnerini

> > the playback machines are down due to the programmed YETs, so we will be able to test only after the Christmas break
>
> On the other hand, blocking the development of other groups for weeks does not seem like a good approach :-/

I agree, but unfortunately it is not up to us (the DQM core team), as the end-of-year shutdown of the DQM playback machines is managed directly by the system admins. In particular, the DQM BU/FU machines needed are the builder unit dqmrubu-c2a06-03-01 and the filter units dqmfu-c2b01-45-01 and dqmfu-c2b02-45-01. For further details, see the DQM mirror webpage for client monitoring: https://cmsweb.cern.ch/dqm/dqm-square/?db=playback

@antoniovagnerini

> > Since this PR affects the DQM clients, from DQM side we need to carry out the usual P5 tests. However, the playback machines are down due to the programmed YETs, so we will be able to test only after the Christmas break.
>
> For this particular case, where the changes are technical, I think almost everything that could be tested got somehow tested already by the unit tests of the package (the actual running of the clients on collisions data and the python compilation for all the other run keys, see multiple failures in the history of the tests above). Do you have in mind any other possible failure scenarios that the unit tests are unable to probe? If that's the case, it would be good to update the testing suite to be able to decouple the cmssw work from testing at P5 (while that's still certainly necessary for other reasons) -- that incidentally would also relieve the DQM Docs of some work.

I also think that for this particular instance of the technical PR most failure modes should be covered by the unit tests. On the other hand, this is a good suggestion as we might want to migrate the testing as much as possible to CMSSW from the P5 tests, so we will open an issue to discuss this in a separate thread.

@fwyzard
Contributor

fwyzard commented Dec 18, 2024

The machines

  • dqmrubu-c2a06-03-01
  • dqmfu-c2b01-45-01
  • dqmfu-c2b02-45-01

are now powered on, with the agreement of the sysadmins.

Can you run any tests you need to in order to sign the PR ?

Also, let me know if they should be powered off at the end of the week, or kept on until January 7th.

@antoniovagnerini

> The machines
>
>   • dqmrubu-c2a06-03-01
>   • dqmfu-c2b01-45-01
>   • dqmfu-c2b02-45-01
>
> are now powered on, with the agreement of the sysadmins.
>
> Can you run any tests you need to in order to sign the PR ?
>
> Also, let me know if they should be powered off at the end of the week, or kept on until January 7th.

Thank you Andrea, we will carry out the tests today. At this point, we would profit from having them kept on until January 7th, should we need to test other PRs.

@antoniovagnerini

+1

  • P5 tests successful

@AdrianoDee
Contributor Author

Hi @cms-sw/alca-l2 any comments on this? Thanks!
