Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add GPU workflow to runTheMatrix #35263

Merged
merged 12 commits into from
Sep 21, 2021

Conversation

srimanob
Copy link
Contributor

@srimanob srimanob commented Sep 14, 2021

PR description:

The original PR/discussion is #33057 and dmwm/WMCore#10388 from the WMCore side.
Information on GPU support from WMCore side: https://github.com/dmwm/WMCore/wiki/GPU-Support

This PR is to add GPU workflow to runTheMatrix.
In addition, this PR migrates from optparse to argparse, with code clean up.

What to be done with this PR:
(Done)

  • How can I get the CUDARuntime for each release, to be the default value? @makortel @fwyzard @smuzaffar (Done)
  • Is GPUParams as expected from WMCore? I just use jsons.dump(the dictionary) as suggested. @amaltaro (Confirmed)

(Pending)

Thanks.

FYI @dpiparo @davidlange6 @justinasr

PR validation:

Using
runTheMatrix.py --what upgrade -l 11650.502 -t 8 --requires-gpu required --wm init
or
runTheMatrix.py --what upgrade -l 11650.502 -t 8 --gpu --wm init
you will get

'GPUParams': '{"GPUMemoryMB": 8000, "CUDACapabilities": ["6.0", '
                        '"6.1", "6.2", "7.0", "7.2", "7.5"], "CUDARuntime": '
                        '"11.2"}',
'RequiresGPU': 'required',

if this PR is a backport please specify the original PR and why you need to backport that PR:

This PR is not a backport.

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-35263/25249

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @srimanob (Phat Srimanobhas) for master.

It involves the following packages:

  • Configuration/PyReleaseValidation (pdmv, upgrade)

@jordan-martins, @chayanit, @bbilin, @wajidalikhan, @cmsbuild, @AdrianoDee, @srimanob, @kskovpen can you please review it and eventually sign? Thanks.
@makortel, @Martin-Grunewald, @fabiocos, @slomeo, @kpedro88 this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@srimanob
Copy link
Contributor Author

@cmsbuild please test

@fwyzard
Copy link
Contributor

fwyzard commented Sep 14, 2021

* How can I get the `CUDARuntime` for each release, to be the default value?

Good question.

One option is to read it from the version of the CUDA library bundled with CMSSW, e.g.:

ls $(scram tool tag cuda LIBDIR)/libcudart.so.*.*.* | sed -e 's/.*libcudart.so.//'

Another option is to add it as variable to SCRAM.

Another good question is: what version should we use here ? 11.2.152 or 11.2 ?

Copy link
Contributor

@fwyzard fwyzard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd make all options lowercase and use a dash to separate words, like --gpu-version.

Configuration/PyReleaseValidation/scripts/runTheMatrix.py Outdated Show resolved Hide resolved
@@ -305,6 +334,41 @@ def runSelected(opt):
default=None,
action='store')

parser.add_option('--RequiresGPU',
help='if GPU is reuired or not: forbidden (default, CPU-only), optional, required. For relvals, the GPU option will be turned off for optional.',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reuired -> required
Also, why the special treatment for relvals ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For relvals, I assume that we always control what relvals should run on, w/ or w/o (specific) GPU. So optional may not fit with the relvals. But this can be changed, no solid reason here.

dest='CUDADriverVersion',
default='')

parser.add_option('--CUDARuntimeVersion',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll double check, but I don't think this should ever be set on the client side ?

Configuration/PyReleaseValidation/scripts/runTheMatrix.py Outdated Show resolved Hide resolved
@amaltaro
Copy link

Besides the GPUMemoryMB comment that I made along the code, all the rest looks good to me, Phat.
I think a plus to this PR would be if the optional GPUParams would be skipped in case they are not defined (with a value different than its default). Providing them with the default empty string should be supported though.

@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b0c6cc/18586/summary.html
COMMIT: 7d8f062
CMSSW: CMSSW_12_1_X_2021-09-13-2300/slc7_amd64_gcc900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/35263/18586/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

The workflows 140.53 have different files in step1_dasquery.log than the ones found in the baseline. You may want to check and retrigger the tests if necessary. You can check it in the "files" directory in the results of the comparisons

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 1299 differences found in the comparisons
  • DQMHistoTests: Total files compared: 39
  • DQMHistoTests: Total histograms compared: 3001001
  • DQMHistoTests: Total failures: 3671
  • DQMHistoTests: Total nulls: 19
  • DQMHistoTests: Total successes: 2997289
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: -45.703 KiB( 38 files compared)
  • DQMHistoSizes: changed ( 140.53 ): -44.531 KiB Hcal/DigiRunHarvesting
  • DQMHistoSizes: changed ( 140.53 ): -1.172 KiB RPC/DCSInfo
  • Checked 165 log files, 37 edm output root files, 39 DQM output files
  • TriggerResults: no differences found

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-35263/25256

@cmsbuild
Copy link
Contributor

Pull request #35263 was updated. @jordan-martins, @makortel, @chayanit, @bbilin, @wajidalikhan, @cmsbuild, @AdrianoDee, @srimanob, @kskovpen, @fwyzard can you please check and sign again.

@srimanob
Copy link
Contributor Author

@cmsbuild please test

@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b0c6cc/18634/summary.html
COMMIT: de6a246
CMSSW: CMSSW_12_1_X_2021-09-14-1100/slc7_amd64_gcc900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/35263/18634/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 2 differences found in the comparisons
  • DQMHistoTests: Total files compared: 39
  • DQMHistoTests: Total histograms compared: 3000833
  • DQMHistoTests: Total failures: 6
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3000805
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 38 files compared)
  • Checked 165 log files, 37 edm output root files, 39 DQM output files
  • TriggerResults: no differences found

@srimanob
Copy link
Contributor Author

@fwyzard @kpedro88
Do you have additional comments to this PR? Thanks.

Maybe we can merge, and open the next PR to improve runThematrix.

@fwyzard
Copy link
Contributor

fwyzard commented Sep 17, 2021

Maybe we can merge, and open the next PR to improve runThematrix.

sure

@fwyzard
Copy link
Contributor

fwyzard commented Sep 17, 2021

+heterogeneous

@srimanob
Copy link
Contributor Author

+Upgrade

@qliphy
Copy link
Contributor

qliphy commented Sep 21, 2021

@cms-sw/pdmv-l2 Do you have any comment?

@kskovpen
Copy link
Contributor

+1

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@qliphy
Copy link
Contributor

qliphy commented Sep 21, 2021

+1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants