add GPU RelVals using 2023 HLT menu #41354

missirol · 2023-04-17T07:14:59Z

PR description:

This PR is an attempt to add GPU RelVals making use of the 2023 HLT menu. The goal is to have wfs that run the latest HLT pp menu for 2023 on machines with a GPU.

Workflows are added for both MC and data (using data from 2022), trying to follow the structure of the existing GPU RelVals.

The next step would be to change the default GPU wfs in PR tests (here) to use 2023 ones.

PR validation:

Some of the added workflows pass locally.

If this PR is a backport, please specify the original PR and why you need to backport that PR. If this PR will be backported, please specify to which release cycle the backport is meant for:

If approved, it should be backported to CMSSW_13_0_X.

cmsbuild · 2023-04-17T07:23:29Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-41354/35194

This PR adds an extra 60KB to the repository
There are other open Pull requests which might conflict with changes you have proposed:
- File Configuration/PyReleaseValidation/python/relval_gpu.py modified in PR(s): Guard for No Digis for Phase2 and Patatrack NuGun #41255

cmsbuild · 2023-04-17T07:23:53Z

A new Pull Request was created by @missirol (Marino Missiroli) for master.

It involves the following packages:

Configuration/PyReleaseValidation (pdmv, upgrade)

@bbilin, @cmsbuild, @AdrianoDee, @srimanob, @kskovpen, @sunilUIET can you please review it and eventually sign? Thanks.
@makortel, @Martin-Grunewald, @fabiocos, @slomeo, @kpedro88 this is something you requested to watch as well.
@perrotta, @dpiparo, @rappoccio you are the release manager for this.

cms-bot commands are listed here

missirol · 2023-04-17T07:24:12Z

test parameters:

enable = gpu
workflows_gpu = 12450.502,12450.503,12450.504,12450.506,12450.507,12450.508,12434.502,12434.503,12434.504,12434.506,12434.507,12434.508,12434.512,12434.513,12434.514,12434.522,12434.523,12434.524,12434.582,12434.583,12434.586,12434.587,12434.592,12434.593,12434.596,12434.597,140.065502,140.065512,140.065522
workflows = 140.065501,140.065511,140.065521
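The IDs requested above share a three-digit suffix that appears to select the CPU/GPU variant (for example 501 for the CPU data workflow and 502 for its GPU counterpart). The sketch below groups the requested IDs by that suffix. It assumes that the last three digits after the decimal point encode the variant; this reading is inferred from the lists themselves and is not taken from the relval code.

```python
from collections import defaultdict

# Workflow IDs copied from the test parameters above.
WORKFLOWS_GPU = (
    "12450.502,12450.503,12450.504,12450.506,12450.507,12450.508,"
    "12434.502,12434.503,12434.504,12434.506,12434.507,12434.508,"
    "12434.512,12434.513,12434.514,12434.522,12434.523,12434.524,"
    "12434.582,12434.583,12434.586,12434.587,12434.592,12434.593,"
    "12434.596,12434.597,140.065502,140.065512,140.065522"
)
WORKFLOWS_CPU = "140.065501,140.065511,140.065521"


def split_wf(wf_id: str) -> tuple[str, str]:
    """Split a workflow ID into (base, suffix).

    Assumption: the last three digits after the decimal point are the
    variant suffix, e.g. '12434.502' -> ('12434', '502') and
    '140.065502' -> ('140.065', '502').
    """
    integer, frac = wf_id.split(".")
    base_frac, suffix = frac[:-3], frac[-3:]
    base = f"{integer}.{base_frac}" if base_frac else integer
    return base, suffix


def group_by_suffix(csv: str) -> dict[str, list[str]]:
    """Group the base IDs of a comma-separated workflow list by suffix."""
    groups: dict[str, list[str]] = defaultdict(list)
    for wf_id in csv.split(","):
        base, suffix = split_wf(wf_id.strip())
        groups[suffix].append(base)
    return dict(groups)


if __name__ == "__main__":
    for suffix, bases in sorted(group_by_suffix(WORKFLOWS_GPU).items()):
        print(suffix, bases)
```

Grouping by suffix makes it easy to see which variants exist for both the MC (12434, 12450) and the data (140.065) bases.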

missirol · 2023-04-17T07:24:44Z

@fwyzard, could you please review this PR? (I don't know these wfs well)

fwyzard · 2023-04-17T22:15:14Z

Configuration/PyReleaseValidation/python/relval_gpu.py

+#           Patatrack ECAL-only:                RunJetMET2022D on GPU (optional)
+#           Patatrack HCAL-only:                RunJetMET2022D on GPU (optional)
+
+workflows[140.065502] = ['Run3-2023_JetMET2022D_RecoPixelOnlyGPU',['RunJetMET2022D','HLTDR3_2023','RECODR3_reHLT_Patatrack_PixelOnlyGPU','HARVESTRUN3_pixelTrackingOnly']]


I would use triplets by default for 2023, so 140.065506?
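The GPU entry quoted here (140.065502) and its CPU counterpart (140.065501, added in relval_standard.py) differ only in the reconstruction step. The following minimal sketch lists the positions where two workflow step lists differ. It uses a plain Python dict as a stand-in for the `workflows` object of the relval files; that substitution is an assumption made for illustration, and the real object may be a different type.

```python
# Plain dict standing in for the `workflows` object of relval_standard.py /
# relval_gpu.py (assumption for this sketch only). Entries copied from the PR.
workflows = {}
workflows[140.065501] = ['Run3-2023_JetMET2022D_RecoPixelOnlyCPU',
                         ['RunJetMET2022D', 'HLTDR3_2023',
                          'RECODR3_reHLT_Patatrack_PixelOnlyCPU',
                          'HARVESTRUN3_pixelTrackingOnly']]
workflows[140.065502] = ['Run3-2023_JetMET2022D_RecoPixelOnlyGPU',
                         ['RunJetMET2022D', 'HLTDR3_2023',
                          'RECODR3_reHLT_Patatrack_PixelOnlyGPU',
                          'HARVESTRUN3_pixelTrackingOnly']]


def step_diff(wf_a, wf_b):
    """Return (index, step_a, step_b) for every position where the step lists differ."""
    steps_a, steps_b = workflows[wf_a][1], workflows[wf_b][1]
    if len(steps_a) != len(steps_b):
        raise ValueError("workflows have a different number of steps")
    return [(i, a, b) for i, (a, b) in enumerate(zip(steps_a, steps_b)) if a != b]


print(step_diff(140.065501, 140.065502))
# [(2, 'RECODR3_reHLT_Patatrack_PixelOnlyCPU', 'RECODR3_reHLT_Patatrack_PixelOnlyGPU')]
```

Input, HLT and harvesting steps are shared, so only the RECO step (index 2) is reported.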

fwyzard · 2023-04-17T22:15:43Z

Configuration/PyReleaseValidation/python/relval_standard.py

@@ -494,6 +494,11 @@
 workflows[140.068] = ['',['RunTau2022D','HLTDR3_2023','RECONANORUN3_reHLT','HARVESTRUN3']]
 workflows[140.069] = ['',['RunMuonEG2022D','HLTDR3_2023','RECONANORUN3_reHLT','HARVESTRUN3']]

+### run3-2023 (2022 data) - Pixel-only, ECAL-only and HCAL-only
+workflows[140.065501] = ['Run3-2023_JetMET2022D_RecoPixelOnlyCPU',['RunJetMET2022D','HLTDR3_2023','RECODR3_reHLT_Patatrack_PixelOnlyCPU','HARVESTRUN3_pixelTrackingOnly']]


I would use triplets by default for 2023, so 140.065505?
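The two review suggestions amount to a suffix substitution: the pixel-only CPU workflow .501 becomes the triplets CPU workflow .505, and the pixel-only GPU workflow .502 becomes the triplets GPU workflow .506. The sketch below encodes only those two mappings, taken from the comments above. It assumes that workflow IDs are handled as strings whose last three digits are the variant suffix, and it deliberately refuses any other suffix rather than guessing.

```python
# Suffix mapping taken from the two review comments:
#   pixel-only CPU .501 -> triplets CPU .505
#   pixel-only GPU .502 -> triplets GPU .506
TRIPLETS_SUFFIX = {"501": "505", "502": "506"}


def to_triplets(wf_id: str) -> str:
    """Return the triplets counterpart of a pixel-only workflow ID string.

    Assumes the last three characters of wf_id are the variant suffix.
    """
    head, suffix = wf_id[:-3], wf_id[-3:]
    try:
        return head + TRIPLETS_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"no triplets mapping known for {wf_id}") from None


print(to_triplets("140.065502"))  # 140.065506
print(to_triplets("140.065501"))  # 140.065505
```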

fwyzard · 2023-04-17T22:17:17Z

please test

fwyzard · 2023-04-17T22:18:20Z

From looking at the diff, the changes seem OK.
I think the best way to be sure is to run the new workflows and check that they are running the correct modules.

cmsbuild · 2023-04-18T00:53:22Z

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-1a45e3/32011/summary.html
COMMIT: f8f5e37
CMSSW: CMSSW_13_1_X_2023-04-17-1100/el8_amd64_gcc11
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/41354/32011/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

You potentially removed 9 lines from the logs
Reco comparison results: 18 differences found in the comparisons
DQMHistoTests: Total files compared: 48
DQMHistoTests: Total histograms compared: 3459609
DQMHistoTests: Total failures: 12
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 3459575
DQMHistoTests: Total skipped: 22
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
Checked 207 log files, 159 edm output root files, 48 DQM output files
TriggerResults: no differences found

GPU Comparison Summary

Summary:

No significant changes to the logs found
Reco comparison results: 0 differences found in the comparisons
DQMHistoTests: Total files compared: 4
DQMHistoTests: Total histograms compared: 19862
DQMHistoTests: Total failures: 9
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 19853
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
Checked 12 log files, 9 edm output root files, 4 DQM output files
TriggerResults: no differences found

fwyzard · 2023-04-18T07:13:16Z

From the test results we can see that the GPU workflow did run on a GPU.

cmsbuild · 2023-04-18T09:38:52Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-41354/35214

This PR adds an extra 60KB to the repository
There are other open Pull requests which might conflict with changes you have proposed:
- File Configuration/PyReleaseValidation/python/relval_gpu.py modified in PR(s): Guard for No Digis for Phase2 and Patatrack NuGun #41255

cmsbuild · 2023-04-18T18:49:27Z

Pull request #41354 was updated. @bbilin, @cmsbuild, @AdrianoDee, @srimanob, @kskovpen, @sunilUIET can you please check and sign again.

missirol · 2023-04-18T18:52:23Z

please test

The latest push contains a minor update to upgradeWorkflowComponents.py, done to align this PR with its backport (#41371). This should be the final version of the PR.

cmsbuild · 2023-04-18T21:35:23Z

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-1a45e3/32026/summary.html
COMMIT: 1031d03
CMSSW: CMSSW_13_1_X_2023-04-18-1100/el8_amd64_gcc11
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/41354/32026/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

You potentially removed 31 lines from the logs
Reco comparison results: 14 differences found in the comparisons
DQMHistoTests: Total files compared: 48
DQMHistoTests: Total histograms compared: 3459877
DQMHistoTests: Total failures: 9
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 3459846
DQMHistoTests: Total skipped: 22
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
Checked 207 log files, 159 edm output root files, 48 DQM output files
TriggerResults: no differences found

GPU Comparison Summary

Summary:

No significant changes to the logs found
Reco comparison results: 24 differences found in the comparisons
DQMHistoTests: Total files compared: 4
DQMHistoTests: Total histograms compared: 19870
DQMHistoTests: Total failures: 1005
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 18865
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
Checked 12 log files, 9 edm output root files, 4 DQM output files
TriggerResults: no differences found
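The DQM histogram counts in the two test reports above can be cross-checked: in every summary, successes + failures + nulls + skipped add up to the total number of histograms compared. The sketch below encodes that check and computes the failure rate, which makes the jump in GPU failures between the first and second run (9 vs 1005) explicit. It assumes each histogram falls into exactly one category; the class name is illustrative and not part of the cms-bot tooling.

```python
from dataclasses import dataclass


@dataclass
class DQMHistoSummary:
    total: int
    failures: int
    nulls: int
    successes: int
    skipped: int

    def is_consistent(self) -> bool:
        # Assumes every compared histogram is counted in exactly one category.
        return self.successes + self.failures + self.nulls + self.skipped == self.total

    def failure_rate(self) -> float:
        return self.failures / self.total


# Numbers copied from the two test reports (commits f8f5e37 and 1031d03).
summaries = {
    "CPU, first run": DQMHistoSummary(3459609, 12, 0, 3459575, 22),
    "GPU, first run": DQMHistoSummary(19862, 9, 0, 19853, 0),
    "CPU, second run": DQMHistoSummary(3459877, 9, 0, 3459846, 22),
    "GPU, second run": DQMHistoSummary(19870, 1005, 0, 18865, 0),
}

for name, s in summaries.items():
    print(f"{name}: consistent={s.is_consistent()}, failure rate={s.failure_rate():.4%}")
```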

fwyzard · 2023-04-19T09:08:54Z

I've run the data workflows from the latest version of this PR and compared the TrigReports of the GPU and CPU versions; the differences are as expected:

- the CPU versions use the @cpu modules;
- the GPU versions use the @cuda modules.

Looks good to me.
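The check described above can be sketched in Python. The sketch assumes the module labels have already been extracted from the TrigReport (that parsing step is not shown) and that the switched modules carry the `@cpu`/`@cuda` suffixes mentioned in the comment. The module names used in the usage example are hypothetical.

```python
from collections import defaultdict


def switch_cases(module_labels):
    """Map each base label to the set of '@case' suffixes seen for it.

    Assumes labels of the form 'base@case'; labels without '@' are ignored.
    """
    cases = defaultdict(set)
    for label in module_labels:
        base, sep, case = label.partition("@")
        if sep:
            cases[base].add(case)
    return dict(cases)


def check_variant(module_labels, expected_case):
    """Return the sorted base labels whose cases differ from {expected_case}."""
    return sorted(base for base, found in switch_cases(module_labels).items()
                  if found != {expected_case})


# Hypothetical module labels, for illustration only.
cpu_labels = ["plainModule", "moduleA@cpu", "moduleB@cpu"]
gpu_labels = ["plainModule", "moduleA@cuda", "moduleB@cuda"]
print(check_variant(cpu_labels, "cpu"), check_variant(gpu_labels, "cuda"))  # [] []
```

An empty result for both variants corresponds to the "differences are as expected" outcome: no switched module ran the wrong branch.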

missirol · 2023-04-19T09:18:44Z

@fwyzard, thanks for checking (and for reviewing the PR).

@cms-sw/pdmv-l2 @cms-sw/upgrade-l2, could you please review this PR and its backport (#41371)?

sunilUIET · 2023-04-19T10:05:27Z

+pdmv

missirol · 2023-04-21T05:59:45Z

@AdrianoDee @srimanob, could you please review this PR and its backport (#41371)?

srimanob · 2023-04-21T06:07:29Z

+Upgrade

cmsbuild · 2023-04-21T06:07:56Z

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

perrotta · 2023-04-21T11:25:02Z

+1

missirol · 2023-04-22T10:21:11Z

After merging this PR, six of the new workflows [1] failed in the IB CMSSW_13_1_GPU_X_2023-04-21-2300.

The IB error is the same for all 6 wfs ("DAS-Err"). Clicking on the "1" next to DAS-Err, one sees a DAS query [2] that returns 0 results.

Those workflows passed during the PR tests (e.g. here). I'm no expert, but the difference seems to be that:

- in PR tests, the workflow runs the GEN+SIM steps (step 1 of the workflow) [this worked during PR tests];
- in IBs, the workflow queries the GEN-SIM file(s) from DAS [this failed in the latest IB].

I think the issue is that the GEN-SIM sample [3] does not exist [4].

@cms-sw/pdmv-l2, could/should [3] be produced (if it wasn't already)?

@cms-sw/orp-l2, should I disable [1] in 13_1_X (new PR) and #41371 (backport), or do we wait for PdmV's reply?

[1] 12450.502,12450.503,12450.504,12450.506,12450.507,12450.508

[2] dasgoclient --limit 0 --query 'file dataset=/RelValZMM_14/CMSSW_12_5_0_pre4-124X_mcRun3_2023_realistic_v11_BS2022-v1/GEN-SIM site=T2_CH_CERN'

[3] /RelValZMM_14/CMSSW_12_5_0_pre4-124X_mcRun3_2023_realistic_v11_BS2022-v1/GEN-SIM

[4] Looking [3] up on the DAS webpage returns a 'dummy' entry with

Dataset size: 0 (0.0) Number of blocks: 0 Number of events: 0 Number of files: 0

I see the same 'dummy' entry if I query a made-up name (dataset=/DontExist/ForSure/GEN-SIM), i.e. a sample that most likely never existed. Searching for "dataset=/RelVal*/*124X_mcRun3_2023_*_BS2022*/GEN-SIM" confirms that [3] does not exist.
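The DAS check in footnote [2] can be scripted so that a missing input sample is detected before the workflow is run. The sketch below rebuilds the `dasgoclient` command from footnote [2], using the same `--limit 0 --query` flags, and reports whether the query returned any file. Running it assumes `dasgoclient` is on PATH and a valid grid proxy is available; both are assumptions about the local environment. The function names are illustrative.

```python
import shlex
import subprocess


def das_file_query(dataset: str, site: str) -> list[str]:
    """Build the dasgoclient command used in footnote [2]."""
    query = f"file dataset={dataset} site={site}"
    return ["dasgoclient", "--limit", "0", "--query", query]


def dataset_has_files(dataset: str, site: str) -> bool:
    """Run the query and report whether DAS returned at least one file.

    Assumes dasgoclient is installed and a grid proxy is available.
    """
    result = subprocess.run(das_file_query(dataset, site),
                            capture_output=True, text=True, check=True)
    return any(line.strip() for line in result.stdout.splitlines())


if __name__ == "__main__":
    cmd = das_file_query(
        "/RelValZMM_14/CMSSW_12_5_0_pre4-124X_mcRun3_2023_realistic_v11_BS2022-v1/GEN-SIM",
        "T2_CH_CERN")
    print(shlex.join(cmd))  # same command line as footnote [2]
```

For the sample in [3], `dataset_has_files` would return False, matching the empty DAS result that caused the IB failures.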

missirol · 2023-04-22T14:38:54Z

#41386 removes the 6 problematic wfs from the list of GPU RelVals.

cmsbuild added this to the CMSSW_13_1_X milestone Apr 17, 2023

cmsbuild added pending-signatures tests-pending orp-pending pdmv-pending upgrade-pending code-checks-pending labels Apr 17, 2023

cmsbuild added code-checks-approved and removed code-checks-pending labels Apr 17, 2023

fwyzard reviewed Apr 17, 2023


cmsbuild added tests-started and removed tests-pending labels Apr 17, 2023

cmsbuild added tests-approved and removed tests-started labels Apr 18, 2023

missirol force-pushed the devel_wfGPU2023 branch from f8f5e37 to 4458799 Compare April 18, 2023 09:33

cmsbuild added tests-pending code-checks-pending and removed tests-approved code-checks-approved labels Apr 18, 2023

cmsbuild removed the code-checks-pending label Apr 18, 2023

cmsbuild added code-checks-approved and removed code-checks-pending labels Apr 18, 2023

cmsbuild added tests-started and removed tests-pending labels Apr 18, 2023

cmsbuild added tests-approved and removed tests-started labels Apr 18, 2023

cmsbuild added pdmv-approved and removed pdmv-pending labels Apr 19, 2023

cmsbuild added fully-signed upgrade-approved and removed pending-signatures upgrade-pending labels Apr 21, 2023

cmsbuild added orp-approved and removed orp-pending labels Apr 21, 2023

cmsbuild merged commit 5b697b0 into cms-sw:master Apr 21, 2023

missirol deleted the devel_wfGPU2023 branch April 22, 2023 09:17

missirol mentioned this pull request Apr 22, 2023 in Remove 12450.* wfs (2023, ZMuMu) from GPU RelVals #41386 (Closed)

missirol mentioned this pull request May 2, 2023 in Change default wfs for PR tests on GPU in CMSSW_13_X_Y cms-sw/cms-bot#1976 (Closed)
