add GPU RelVals using 2023 HLT menu #41354

Merged
merged 1 commit into from
Apr 21, 2023


missirol
Contributor

PR description:

This PR is an attempt to add GPU RelVals making use of the 2023 HLT menu. The goal is to have wfs that run the latest HLT pp menu for 2023 on machines with a GPU.

Workflows are added for both MC and data (using data from 2022), trying to follow the structure of the existing GPU RelVals.

The next step would be to change the default GPU wfs in PR tests (here) to use 2023 ones.

PR validation:

Some of the added workflows pass locally.

If this PR is a backport, please specify the original PR and why you need to backport that PR. If this PR will be backported, please specify to which release cycle the backport is meant for:

If approved, it should be backported to CMSSW_13_0_X.

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-41354/35194

@cmsbuild
Contributor

A new Pull Request was created by @missirol (Marino Missiroli) for master.

It involves the following packages:

  • Configuration/PyReleaseValidation (pdmv, upgrade)

@bbilin, @cmsbuild, @AdrianoDee, @srimanob, @kskovpen, @sunilUIET can you please review it and eventually sign? Thanks.
@makortel, @Martin-Grunewald, @fabiocos, @slomeo, @kpedro88 this is something you requested to watch as well.
@perrotta, @dpiparo, @rappoccio you are the release manager for this.

cms-bot commands are listed here

@missirol
Contributor Author

test parameters:

  • enable = gpu
  • workflows_gpu = 12450.502,12450.503,12450.504,12450.506,12450.507,12450.508,12434.502,12434.503,12434.504,12434.506,12434.507,12434.508,12434.512,12434.513,12434.514,12434.522,12434.523,12434.524,12434.582,12434.583,12434.586,12434.587,12434.592,12434.593,12434.596,12434.597,140.065502,140.065512,140.065522
  • workflows = 140.065501,140.065511,140.065521
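As a rough cross-check of the lists above, the following sketch parses the test parameters and verifies that each CPU workflow in `workflows` has a GPU counterpart in `workflows_gpu`. The helper names are hypothetical, and the numbering convention (a CPU ID ending in 1 has a GPU sibling ending in 2) is an assumption inferred from 140.065501 (CPU) and 140.065502 (GPU) in the diff; it is not something cms-bot itself enforces.

```python
# Test parameters copied from the comment above. IDs are kept as strings
# so that values such as 140.065502 are never altered by float rounding.
TEST_PARAMETERS = """
enable = gpu
workflows_gpu = 12450.502,12450.503,12450.504,12450.506,12450.507,12450.508,12434.502,12434.503,12434.504,12434.506,12434.507,12434.508,12434.512,12434.513,12434.514,12434.522,12434.523,12434.524,12434.582,12434.583,12434.586,12434.587,12434.592,12434.593,12434.596,12434.597,140.065502,140.065512,140.065522
workflows = 140.065501,140.065511,140.065521
"""


def parse_test_parameters(text):
    """Parse 'key = value' lines into a dict of comma-split string lists."""
    params = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition("=")
        params[key.strip()] = [item.strip() for item in value.split(",")]
    return params


def gpu_sibling(cpu_workflow):
    """Assumed convention: replace the trailing '1' of a CPU ID with '2'."""
    if not cpu_workflow.endswith("1"):
        raise ValueError(f"unexpected CPU workflow ID: {cpu_workflow}")
    return cpu_workflow[:-1] + "2"


params = parse_test_parameters(TEST_PARAMETERS)
missing = [wf for wf in params["workflows"]
           if gpu_sibling(wf) not in params["workflows_gpu"]]
print(len(params["workflows_gpu"]), "GPU workflows; missing siblings:", missing)
```

Keeping the IDs as strings is a deliberate choice for this check: the comparison is purely textual, so no floating-point representation issue can hide a mismatch.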

@missirol
Contributor Author

@fwyzard, could you please review this PR? (I don't know these wfs well)

# Patatrack ECAL-only: RunJetMET2022D on GPU (optional)
# Patatrack HCAL-only: RunJetMET2022D on GPU (optional)

workflows[140.065502] = ['Run3-2023_JetMET2022D_RecoPixelOnlyGPU',['RunJetMET2022D','HLTDR3_2023','RECODR3_reHLT_Patatrack_PixelOnlyGPU','HARVESTRUN3_pixelTrackingOnly']]
Contributor

I would use triplets by default for 2023, so 140.065506 ?

@@ -494,6 +494,11 @@
workflows[140.068] = ['',['RunTau2022D','HLTDR3_2023','RECONANORUN3_reHLT','HARVESTRUN3']]
workflows[140.069] = ['',['RunMuonEG2022D','HLTDR3_2023','RECONANORUN3_reHLT','HARVESTRUN3']]

### run3-2023 (2022 data) - Pixel-only, ECAL-only and HCAL-only
workflows[140.065501] = ['Run3-2023_JetMET2022D_RecoPixelOnlyCPU',['RunJetMET2022D','HLTDR3_2023','RECODR3_reHLT_Patatrack_PixelOnlyCPU','HARVESTRUN3_pixelTrackingOnly']]
Contributor

I would use triplets by default for 2023, so 140.065505 ?
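
For readers less familiar with these entries, the two workflows quoted in the review snippets each consist of a label followed by a list of step names. The sketch below copies both entries into a plain dict (a stand-in for the real workflow container, which is an assumption here) and lists the steps in which the CPU and GPU variants differ. The function name `differing_steps` is hypothetical.

```python
# Plain dict standing in for the workflow container used in the PR (assumption).
# The two entries are copied verbatim from the diff quoted above.
workflows = {
    140.065501: ['Run3-2023_JetMET2022D_RecoPixelOnlyCPU',
                 ['RunJetMET2022D', 'HLTDR3_2023',
                  'RECODR3_reHLT_Patatrack_PixelOnlyCPU',
                  'HARVESTRUN3_pixelTrackingOnly']],
    140.065502: ['Run3-2023_JetMET2022D_RecoPixelOnlyGPU',
                 ['RunJetMET2022D', 'HLTDR3_2023',
                  'RECODR3_reHLT_Patatrack_PixelOnlyGPU',
                  'HARVESTRUN3_pixelTrackingOnly']],
}


def differing_steps(wf_a, wf_b):
    """Return (step_a, step_b) pairs at positions where the step names differ."""
    steps_a = workflows[wf_a][1]
    steps_b = workflows[wf_b][1]
    return [(a, b) for a, b in zip(steps_a, steps_b) if a != b]


print(differing_steps(140.065501, 140.065502))
# [('RECODR3_reHLT_Patatrack_PixelOnlyCPU', 'RECODR3_reHLT_Patatrack_PixelOnlyGPU')]
```

Only the reconstruction step differs between the two; the input data, HLT and harvesting steps are shared.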

@fwyzard
Contributor

fwyzard commented Apr 17, 2023

please test

@fwyzard
Contributor

fwyzard commented Apr 17, 2023

From looking at the diff, the changes seem OK.
I think the best way to be sure is to run the new workflows and check that they are running the correct modules.

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-1a45e3/32011/summary.html
COMMIT: f8f5e37
CMSSW: CMSSW_13_1_X_2023-04-17-1100/el8_amd64_gcc11
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/41354/32011/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially removed 9 lines from the logs
  • Reco comparison results: 18 differences found in the comparisons
  • DQMHistoTests: Total files compared: 48
  • DQMHistoTests: Total histograms compared: 3459609
  • DQMHistoTests: Total failures: 12
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3459575
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
  • Checked 207 log files, 159 edm output root files, 48 DQM output files
  • TriggerResults: no differences found

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 4
  • DQMHistoTests: Total histograms compared: 19862
  • DQMHistoTests: Total failures: 9
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 19853
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
  • Checked 12 log files, 9 edm output root files, 4 DQM output files
  • TriggerResults: no differences found

@fwyzard
Contributor

fwyzard commented Apr 18, 2023

From the test results we can see that the gpu workflow did run on a gpu.

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-41354/35214

@cmsbuild
Contributor

Pull request #41354 was updated. @bbilin, @cmsbuild, @AdrianoDee, @srimanob, @kskovpen, @sunilUIET can you please check and sign again.

@missirol
Contributor Author

please test

The latest push contains a minor update to upgradeWorkflowComponents.py, done to align this PR with its backport (#41371). This should be the final version of the PR.

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-1a45e3/32026/summary.html
COMMIT: 1031d03
CMSSW: CMSSW_13_1_X_2023-04-18-1100/el8_amd64_gcc11
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/41354/32026/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially removed 31 lines from the logs
  • Reco comparison results: 14 differences found in the comparisons
  • DQMHistoTests: Total files compared: 48
  • DQMHistoTests: Total histograms compared: 3459877
  • DQMHistoTests: Total failures: 9
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3459846
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
  • Checked 207 log files, 159 edm output root files, 48 DQM output files
  • TriggerResults: no differences found

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 24 differences found in the comparisons
  • DQMHistoTests: Total files compared: 4
  • DQMHistoTests: Total histograms compared: 19870
  • DQMHistoTests: Total failures: 1005
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 18865
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
  • Checked 12 log files, 9 edm output root files, 4 DQM output files
  • TriggerResults: no differences found
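
To put the two GPU comparison summaries in this thread side by side, the sketch below recomputes the DQMHistoTests failure fraction from the reported counts and checks that failures, successes and skipped histograms add up to the reported total. The labels "first test" and "second test" are only descriptive; the sketch quantifies the numbers and makes no claim about the cause of the difference.

```python
# Counts copied from the two "GPU Comparison Summary" blocks in this thread:
# (failures, successes, skipped, total histograms compared).
gpu_summaries = {
    "first test (COMMIT f8f5e37)": (9, 19853, 0, 19862),
    "second test (COMMIT 1031d03)": (1005, 18865, 0, 19870),
}

fractions = {}
for label, (failures, successes, skipped, total) in gpu_summaries.items():
    # Consistency check: the three categories should account for every histogram.
    assert failures + successes + skipped == total, label
    fractions[label] = failures / total
    print(f"{label}: {failures}/{total} failures = {fractions[label]:.2%}")
```

With these inputs the failure fraction goes from about 0.05% to about 5.06% of the compared histograms.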

@fwyzard
Contributor

fwyzard commented Apr 19, 2023

I've run the data workflows from the latest version of this PR and compared the TrigReport for the GPU vs CPU versions; the differences are as expected:

  • the CPU versions use the @cpu modules,
  • the GPU versions use the @cuda modules.

Looks good to me.

@missirol
Contributor Author

@fwyzard , thanks for checking (and for reviewing the PR).

@cms-sw/pdmv-l2 @cms-sw/upgrade-l2 , could you please review this PR and its backport (#41371) ?

@sunilUIET
Contributor

+pdmv

@missirol
Contributor Author

@AdrianoDee @srimanob , could you please review this PR and its backport (#41371) ?

@srimanob
Contributor

+Upgrade

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@perrotta
Contributor

+1

@cmsbuild cmsbuild merged commit 5b697b0 into cms-sw:master Apr 21, 2023
@missirol missirol deleted the devel_wfGPU2023 branch April 22, 2023 09:17
@missirol
Contributor Author

After merging this PR, six of the new workflows [1] failed in the IB CMSSW_13_1_GPU_X_2023-04-21-2300.

The IB error is the same for all 6 wfs ("DAS-Err"). Clicking on the "1" next to DAS-Err, one sees a DAS query [2] that returns 0 results.

Those workflows passed during the PR tests (e.g. here). I'm no expert, but it looks like

  • in PR tests, the workflow runs the GEN+SIM steps (step 1 of the workflow) [this worked during PR tests];
  • in IBs, the workflow queries the GEN-SIM file(s) from DAS [this failed in the latest IB].

I think the issue is that the GEN-SIM sample [3] does not exist [4].

@cms-sw/pdmv-l2, could/should [3] be produced? (if it wasn't already)

@cms-sw/orp-l2, should I disable [1] in 13_1_X (new PR) and #41371 (backport), or do we wait for PdmV's reply?


[1] 12450.502,12450.503,12450.504,12450.506,12450.507,12450.508

[2] dasgoclient --limit 0 --query 'file dataset=/RelValZMM_14/CMSSW_12_5_0_pre4-124X_mcRun3_2023_realistic_v11_BS2022-v1/GEN-SIM site=T2_CH_CERN'

[3] /RelValZMM_14/CMSSW_12_5_0_pre4-124X_mcRun3_2023_realistic_v11_BS2022-v1/GEN-SIM

[4] Looking [3] up on the DAS webpage returns a 'dummy' entry with

Dataset size: 0 (0.0) Number of blocks: 0 Number of events: 0 Number of files: 0

I see the same 'dummy' entry if I put a random string (dataset=/DontExist/ForSure/GEN-SIM), meaning a sample that most likely never existed. Looking for "dataset=/RelVal*/*124X_mcRun3_2023_*_BS2022*/GEN-SIM" confirms that [3] does not exist.
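
The check in [2] can be scripted roughly as follows. This is a minimal sketch: the helper names `das_file_query` and `list_files` are hypothetical, only the `--limit 0 --query` options shown in [2] are used, and actually running the query assumes a machine where `dasgoclient` is installed and valid grid credentials are available.

```python
import shutil
import subprocess


def das_file_query(dataset, site=None):
    """Build the dasgoclient command from [2] as an argument list."""
    query = f"file dataset={dataset}"
    if site:
        query += f" site={site}"
    return ["dasgoclient", "--limit", "0", "--query", query]


def list_files(dataset, site=None):
    """Return the files reported by DAS, or None if dasgoclient is unavailable."""
    if shutil.which("dasgoclient") is None:
        return None
    result = subprocess.run(das_file_query(dataset, site),
                            capture_output=True, text=True, check=False)
    return [line for line in result.stdout.splitlines() if line.strip()]


# Dataset name copied from [3].
DATASET = "/RelValZMM_14/CMSSW_12_5_0_pre4-124X_mcRun3_2023_realistic_v11_BS2022-v1/GEN-SIM"
command = das_file_query(DATASET, site="T2_CH_CERN")
print(command)
```

Calling `list_files(DATASET, site="T2_CH_CERN")` on a suitable machine would be expected to return an empty list if the sample does not exist, matching the 0 results described above.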

@missirol
Contributor Author

#41386 removes the 6 problematic wfs from the list of GPU RelVals.
