Pilot PR for the GPU attributes in workflow injected by runTheMatrix #33057

Closed

Conversation

srimanob commented Mar 3, 2021

PR description:

Backport of #33538
(But this is the original PR, with the discussion of how the workflow will look.)

This PR is to converge on what the workflow with GPU attributes will look like. This follows https://docs.google.com/document/d/150k_VBbja1EK9HlxhXs544T0uhenbad8dnMadyNKlpg/edit?usp=sharing
https://docs.google.com/document/d/1shJAEaPDIWF0S3odHm3SSMERhvlTozyTKP8cgfFcOto/edit?usp=sharing
and on WM side:
dmwm/WMCore#10388

Default attributes when GPU is required are:

'GPUParams': {'CUDACapabilities': ['7.5'],
              'CUDADriverVersion': '',
              'CUDARuntime': '11.2',
              'CUDARuntimeVersion': '',
              'GPUMemory': '8000',
              'GPUName': ''},
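
A minimal sketch of how these defaults could end up attached to a task (illustrative only; the helper name buildGPUParams is hypothetical, while the option and attribute names are the ones used in this PR):

def buildGPUParams(opt):
    # Hypothetical helper: return the GPUParams block only when a GPU is required.
    if getattr(opt, 'RequiresGPU', None) != 'required':
        return None
    return {'CUDACapabilities': opt.CUDACapabilities.split(','),  # e.g. ['7.5']
            'CUDADriverVersion': '',
            'CUDARuntime': opt.CUDARuntime,                       # e.g. '11.2'
            'CUDARuntimeVersion': '',
            'GPUMemory': '8000',
            'GPUName': ''}

Each task in the workflow then carries 'RequiresGPU' and 'GPUParams' keys, as shown in the dump under PR validation below.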

PR validation:

Please ignore the workflow name I use; we can use anything we want. This is only to test the output.

runTheMatrix.py --what upgrade -l 11650.502 --RequiresGPU required --wm init

gives me the following workflow (*). However, uploading does not succeed yet, as we need to update WMCore to accept the new attributes.

(*)

Only viewing request 11650.502
{'AcquisitionEra': 'CMSSW_11_3_X_2021-04-26-2300',
 'CMSSWVersion': 'CMSSW_11_3_X_2021-04-26-2300',
 'Campaign': 'CMSSW_11_3_X_2021-04-26-2300',
 'ConfigCacheUrl': 'https://cmsweb.cern.ch/couchdb',
 'DQMConfigCacheID': 1041,
 'DQMUploadUrl': 'https://cmsweb.cern.ch/dqm/relval',
 'DbsUrl': 'https://cmsweb-prod.cern.ch/dbs/prod/global/DBSReader',
 'EnableHarvesting': 'True',
 'GlobalTag': u'113X_mcRun3_2021_realistic_v10',
 'Group': 'ppd',
 'Memory': 3000,
 'Multicore': 1,
 'PrepID': 'CMSSW_11_3_X_2021-04-26-2300__1619513550-ZMM_14',
 'ProcessingString': u'113X_mcRun3_2021_realistic_v10',
 'ProcessingVersion': 1,
 'RequestPriority': 500000,
 'RequestString': 'RVCMSSW_11_3_X_2021-04-26-2300ZMM_14',
 'RequestType': 'TaskChain',
 'Requestor': 'srimanob',
 'ScramArch': 'slc7_amd64_gcc900',
 'SizePerEvent': 1234,
 'SubRequestType': 'RelVal',
 'Task1': {'AcquisitionEra': 'CMSSW_11_3_X_2021-04-26-2300',
           'ConfigCacheID': 1043,
           'EventStreams': 0,
           'EventsPerJob': 100,
           'EventsPerLumi': 100,
           'GPUParams': None,
           'GlobalTag': u'113X_mcRun3_2021_realistic_v10',
           'KeepOutput': True,
           'Memory': 3000,
           'Multicore': 1,
           'PrimaryDataset': 'RelValZMM_14',
           'ProcessingString': u'113X_mcRun3_2021_realistic_v10',
           'RequestNumEvents': 18000,
           'RequiresGPU': None,
           'Seeding': 'AutomaticSeeding',
           'SplittingAlgo': 'EventBased',
           'TaskName': 'ZMM_14TeV_TuneCP5_2021_GenSim'},
 'Task2': {'AcquisitionEra': 'CMSSW_11_3_X_2021-04-26-2300',
           'ConfigCacheID': 1044,
           'EventStreams': 0,
           'GPUParams': None,
           'GlobalTag': u'113X_mcRun3_2021_realistic_v10',
           'InputFromOutputModule': u'FEVTDEBUGoutput',
           'InputTask': 'ZMM_14TeV_TuneCP5_2021_GenSim',
           'KeepOutput': True,
           'LumisPerJob': 10,
           'Memory': 3000,
           'Multicore': 1,
           'ProcessingString': u'113X_mcRun3_2021_realistic_v10',
           'RequiresGPU': None,
           'SplittingAlgo': 'LumiBased',
           'TaskName': 'Digi_2021'},
 'Task3': {'AcquisitionEra': 'CMSSW_11_3_X_2021-04-26-2300',
           'ConfigCacheID': 1042,
           'EventStreams': 0,
           'GPUParams': {'CUDACapabilities': ['7.5'],
                         'CUDADriverVersion': '',
                         'CUDARuntime': '11.2',
                         'CUDARuntimeVersion': '',
                         'GPUMemory': '8000',
                         'GPUName': ''},
           'GlobalTag': u'113X_mcRun3_2021_realistic_v10',
           'InputFromOutputModule': u'FEVTDEBUGHLToutput',
           'InputTask': 'Digi_2021',
           'KeepOutput': True,
           'LumisPerJob': 10,
           'Memory': 3000,
           'Multicore': 1,
           'ProcessingString': u'113X_mcRun3_2021_realistic_v10',
           'RequiresGPU': 'required',
           'SplittingAlgo': 'LumiBased',
           'TaskName': 'Reco_Patatrack_PixelOnlyGPU_2021'},
 'TaskChain': 3,
 'TimePerEvent': 10}

If this PR is a backport, please specify the original PR and why you need to backport that PR:

This is a backport of #33538

cmsbuild commented Mar 3, 2021

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-33057/21363

  • This PR adds an extra 24KB to the repository

cmsbuild commented Mar 3, 2021

A new Pull Request was created by @srimanob (Phat Srimanobhas) for master.

It involves the following packages:

Configuration/PyReleaseValidation

@jordan-martins, @chayanit, @wajidalikhan, @kpedro88, @cmsbuild, @srimanob can you please review it and eventually sign? Thanks.
@makortel, @Martin-Grunewald, @fabiocos, @slomeo this is something you requested to watch as well.
@silviodonato, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

srimanob commented Mar 3, 2021

hold

cmsbuild commented Mar 3, 2021

Pull request has been put on hold by @srimanob
They need to issue an unhold command to remove the hold state or L1 can unhold it for all

cmsbuild added the hold label Mar 3, 2021
srimanob commented Mar 4, 2021

Please test

cmsbuild commented Mar 4, 2021

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-bcae87/13260/summary.html
COMMIT: 613e547
CMSSW: CMSSW_11_3_X_2021-03-03-1500/slc7_amd64_gcc900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/33057/13260/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 7 differences found in the comparisons
  • DQMHistoTests: Total files compared: 37
  • DQMHistoTests: Total histograms compared: 2750983
  • DQMHistoTests: Total failures: 12
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 2750948
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.004 KiB( 36 files compared)
  • DQMHistoSizes: changed ( 312.0 ): 0.004 KiB MessageLogger/Warnings
  • Checked 156 log files, 37 edm output root files, 37 DQM output files

fwyzard commented Mar 4, 2021

Wouldn't it be more flexible to have a generic option to pass requirements to WM?

Something like

runTheMatrix.py --what upgrade -l 23234.0 -b 'HelloGPU' --label 'HelloGPU' --wm force --wmAttributes 'gpuClass = server, gpuRuntime = cuda, gpuRuntimeVersion >= 11.2, gpuDriverVersion >= 460.32.03, gpuMemory >= 8'

or

runTheMatrix.py --what upgrade -l 23234.0 -b 'HelloGPU' --label 'HelloGPU' --wm force --wmAttributes 'gpuClass = server AND gpuRuntime = cuda AND gpuRuntimeVersion >= 11.2 AND gpuDriverVersion >= 460.32.03 AND gpuMemory >= 8'

?

This would avoid having to hard-code the same list of attributes in WM and in the runTheMatrix command line syntax, and would allow the latter to use any attributes known to WM.

If the syntax is supported in WM, it would also allow a request like

runTheMatrix.py --what upgrade -l 23234.0 -b 'HelloGPU' --label 'HelloGPU' --wm force --wmAttributes '(gpuRuntime = cuda) AND ((gpuRuntimeVersion >= 11.2) OR (gpuDriverVersion >= 450.80.02) OR (gpuClass == server and gpuDriverVersion > 418.40.04))'

which could hardly be specified using command line options.
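
As a rough sketch, such a free-form option could be parsed on the runTheMatrix.py side along these lines, assuming the simple comma-separated "key op value" form of the first example (the --wmAttributes flag and the parseWMAttributes helper are only proposals here, not existing code):

import re

def parseWMAttributes(spec):
    # Split e.g. "gpuClass = server, gpuRuntime = cuda, gpuMemory >= 8" into
    # (key, operator, value) constraints to be forwarded to WM as-is.
    constraints = []
    for clause in spec.split(','):
        match = re.match(r'\s*(\w+)\s*(>=|<=|==|=|>|<)\s*(\S+)\s*$', clause)
        if match:
            constraints.append(match.groups())
    return constraints

print(parseWMAttributes('gpuClass = server, gpuRuntime = cuda, gpuRuntimeVersion >= 11.2'))
# [('gpuClass', '=', 'server'), ('gpuRuntime', '=', 'cuda'), ('gpuRuntimeVersion', '>=', '11.2')]

The boolean AND/OR form of the later examples would instead need a small expression parser, or could simply be passed through verbatim if WM accepts such a syntax.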

@davidlange6

Can one not derive a number of these requirements from the software environment itself?

srimanob commented Mar 4, 2021

I think if we run in production mode without resource constraints, then it should be derived from the software environment. However, if we would like to run on a very specific resource, e.g. for validation purposes, we should have a way to specify it. Otherwise we need to communicate and assign manually every time we would like something specific.

Regarding the software environment, I assume that if the job lands on a machine with CPU+GPU, we should allow a CPU-only workflow to run. I am not sure whether that can be controlled, since the cmsDriver configuration is the same. Or perhaps we don't need this option.

davidlange6 commented Mar 4, 2021 via email

fwyzard commented Mar 4, 2021

Can one not derive a number of these requirements from the software environment itself?

Not currently (the CUDA scram tool does not have all this information), but we could add it there (for example) or somewhere else that is relevant.

fwyzard commented Mar 4, 2021

But it's a valid point, and it made me think of something else: instead of trying to specify all possible combinations of CUDA runtime version, driver version, GPU type, etc., can we make the "server" side advertise something like a "CUDA supported version"?

This does assume that some of the information is CMS-specific, so it could be either advertised by the site, or interpreted by our middleware.

For example, let's say we have sites A, B and C, with this hardware and software:

  • site A: Tesla P100 cards, CUDA 9.2, drivers 396.26
  • site B: GeForce 2080 cards, CUDA 10.2, drivers 440.33.01
  • site C: Tesla V100 cards, CUDA 11.0, drivers 450.51.05

From what I understand of the CUDA compatibility guide (https://docs.nvidia.com/deploy/cuda-compatibility/):

  • site A supports CUDA 9.2 (and older) runtime out of the box; the drivers are too old to support the compatibility drivers, so that's all one can use; it could support all recent CUDA versions if it were updated to at least CUDA 10.1 and the 418.39 drivers;
  • site B supports CUDA 10.2 (and older) runtime; the gaming cards do not support the compatibility drivers, so that's all one can use;
  • site C supports CUDA 11.0 (and older) runtime out of the box, and newer (up to and including 11.2.x) via the compatibility drivers.

CMSSW 11.x bundles the current version of the CUDA runtime and compatibility drivers, so on a datacenter class GPU it should be able to run as long as the system drivers are >= 418.39 (but, according to the documentation, not on a GeForce card).

So those three sites would support

site   max CUDA version   with compatibility drivers
A      9.2                9.2
B      10.2               10.2
C      11.0               11.2

If the values that are used to match the jobs to the sites are CMS-specific, the easiest would be for the advertisement to take into account that CMSSW does ship with compatibility drivers, and advertise the last column.

If the values that are used to match the jobs to the sites are generic to all experiments, then the sites should advertise only the second column; it could be the CMS middleware that takes into account the GPU type and drivers version and builds the information in the last column.
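
As a rough illustration of that last step (a sketch only: the thresholds come from the compatibility-guide discussion above, while the function name and tuple encoding are hypothetical):

def maxSupportedCUDA(gpu_class, system_cuda, driver_version):
    # The system runtime is always usable; the compatibility drivers bundled with
    # CMSSW 11.x (up to CUDA 11.2.x) only help on datacenter-class GPUs with
    # system drivers >= 418.39, per the compatibility guide cited above.
    if gpu_class == 'server' and driver_version >= (418, 39):
        return max(system_cuda, (11, 2))
    return system_cuda

print(maxSupportedCUDA('server', (9, 2), (396, 26)))    # site A -> (9, 2)
print(maxSupportedCUDA('gaming', (10, 2), (440, 33)))   # site B -> (10, 2)
print(maxSupportedCUDA('server', (11, 0), (450, 51)))   # site C -> (11, 2)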

Either way, it probably makes more sense to put that information on the WM side rather than in runTheMatrix.py and in the definition of every job.

What do people think?

davidlange6 commented Mar 4, 2021 via email

rappoccio commented Mar 4, 2021

Hi, Folks,

(BTW this conversation is happening both in this email thread, and on the PR, and there are some people not in common, so this is getting complicated to track. I guess I will respond in both places.)

From the PPD side, it will be abundantly easier if we have sensible defaults (i.e. we set everything to "None") unless something else is explicitly requested by expert users, as is currently done in Phat's PR. If we have MC request managers, etc., putting in extremely complicated workflow definitions, it will be a recipe for disaster. Highly nontrivial "magic incantations" like this (*) are a guarantee that it will be broken ;).

Can we find a solution such that there is some set of defaults coded somewhere? Maybe we can specify configurations that actually exist somewhere? Like

MyFavoriteSite:

 'GPUClass': 'server',
 'GPUDriverVersion': '460.32.03',
 'GPUMemory': '8',
 'GPURuntime': 'cuda',
 'GPURuntimeVersion': '11.2',

MyLeastFavoriteSite:

 'GPUClass': 'server',
 'GPUDriverVersion': '456.32.03',
 'GPUMemory': '8',
 'GPURuntime': 'vidia',
 'GPURuntimeVersion': '11.46',

Then we have a single option like --gpuConfig MyLeastFavoriteSite

etc?

Cheers,
Sal

(*)

 '(gpuRuntime = cuda) AND ((gpuRuntimeVersion >= 11.2) OR (gpuDriverVersion >= 450.80.02) OR (gpuClass == server and gpuDriverVersion > 418.40.04))'
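
A quick sketch of what such a named-preset lookup could look like in runTheMatrix.py (the preset names, the GPU_CONFIGS table and the --gpuConfig option are all illustrative; none of them exist in this PR):

from optparse import OptionParser

# Named GPU configurations an expert could maintain centrally.
GPU_CONFIGS = {
    'MyFavoriteSite': {'GPUClass': 'server',
                       'GPUDriverVersion': '460.32.03',
                       'GPUMemory': '8',
                       'GPURuntime': 'cuda',
                       'GPURuntimeVersion': '11.2'},
}

parser = OptionParser()
parser.add_option('--gpuConfig',
                  help='name of a predefined GPU configuration: ' + ', '.join(sorted(GPU_CONFIGS)),
                  dest='gpuConfig',
                  default=None)

# later, when building the workflow dictionary:
# if opt.gpuConfig:
#     task['GPUParams'] = GPU_CONFIGS[opt.gpuConfig]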

davidlange6 commented Mar 4, 2021 via email

@cmsbuild

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-bcae87/14267/summary.html
COMMIT: 54954c4
CMSSW: CMSSW_11_3_X_2021-04-15-2300/slc7_amd64_gcc900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/33057/14267/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 38
  • DQMHistoTests: Total histograms compared: 2864426
  • DQMHistoTests: Total failures: 7
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2864397
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 37 files compared)
  • Checked 160 log files, 37 edm output root files, 38 DQM output files
  • TriggerResults: no differences found

@cmsbuild

Pull request #33057 was updated. @jordan-martins, @chayanit, @wajidalikhan, @kpedro88, @cmsbuild, @srimanob can you please check and sign again.

@cmsbuild

Pull request #33057 was updated. @jordan-martins, @chayanit, @wajidalikhan, @kpedro88, @cmsbuild, @srimanob can you please check and sign again.

@srimanob

Please test

@cmsbuild

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-bcae87/14608/summary.html
COMMIT: bac14f9
CMSSW: CMSSW_11_3_X_2021-04-26-2300/slc7_amd64_gcc900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/33057/14608/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 8 differences found in the comparisons
  • DQMHistoTests: Total files compared: 38
  • DQMHistoTests: Total histograms compared: 2877046
  • DQMHistoTests: Total failures: 12
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 2877011
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: -0.004 KiB( 37 files compared)
  • DQMHistoSizes: changed ( 312.0 ): -0.004 KiB MessageLogger/Warnings
  • Checked 160 log files, 37 edm output root files, 38 DQM output files
  • TriggerResults: no differences found

                  dest='CUDADriverVersion',
                  default='')

parser.add_option('--CUDARuntimeVersion',
Contributor

What's the difference between this and --CUDARuntime above?

Contributor Author

CUDARuntimeVersion is the version of the runtime installed on the machine.
CUDARuntime is matched against the node's CUDACompatibleRuntimes. I follow what is described in dmwm/WMCore#10393. Should we change it to match the node's CUDACompatibleRuntimes? @amaltaro

Contributor Author

I added the discussion in dmwm/WMCore#10388 (comment).

Contributor

I see now that the difference is explained pretty well in #33057 (comment). Maybe a clarification of the difference in the help would be sufficient.
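
For instance, the two help strings could spell the distinction out along these lines (the wording is only a suggestion, not taken from the PR):

parser.add_option('--CUDARuntimeVersion',
                  help='CUDA runtime version installed on the worker node.',
                  dest='CUDARuntimeVersion',
                  default='')

parser.add_option('--CUDARuntime',
                  help='CUDA runtime the job requires; matched against the CUDACompatibleRuntimes advertised by the node. Default = 11.2 (for RequiresGPU = required).',
                  dest='CUDARuntime',
                  default='11.2')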

parser.add_option('--CUDACapabilities',
                  help='to specify CUDA capabilities. Default = 7.5 (for RequiresGPU = required).',
                  dest='CUDACapabilities',
                  default='7.5')
Contributor

Why default only to 7.5? I would think to default to all compute capabilities supported by the release. Which then raises two questions (that go somewhat beyond this PR though):

  • At this point we'd really need one source for the supported compute capabilities, because it is needed also in cudaIsEnabled. An environment variable in cuda-toolfile? (not really my favorite but would be easy)

  • How to deal with different SCRAM_ARCHs supporting different sets of CUDA compute capabilities? E.g. our ARM build does not seem to support Pascal (6.x) while x86 and PPC do (cuda-flags.file).

Maybe also mention in the help that the value can be comma-separated.

Contributor Author

I will edit the help. However, for the default value, please suggest one (or we can pick them up from somewhere automatically).

Contributor

The 7.5 might be good enough to get started; I suppose it depends mostly on what kind of hardware we are going to run on in the very near future. For a longer-term solution I opened issue #33542.

Contributor

The current default should be 6.0,6.1,6.2,7.0,7.2,7.5.
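
Concretely, something along these lines (only a sketch of the suggested change; the option itself already exists in this PR, while the wider default and the list handling are the suggestion):

parser.add_option('--CUDACapabilities',
                  help='comma-separated list of CUDA capabilities. Default = 6.0,6.1,6.2,7.0,7.2,7.5 (for RequiresGPU = required).',
                  dest='CUDACapabilities',
                  default='6.0,6.1,6.2,7.0,7.2,7.5')

# and later, when filling GPUParams:
# 'CUDACapabilities': opt.CUDACapabilities.split(','),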

help='Coma separated list of workflow to be shown or ran. Possible keys are also '+str(predefinedSet.keys())+'. and wild card like muon, or mc',
dest='testList',
default=None
help='Coma separated list of workflow to be shown or ran. Possible keys are also '+str(predefinedSet.keys())+'. and wild card like muon, or mc',
Contributor

While you're at it

Suggested change
help='Coma separated list of workflow to be shown or ran. Possible keys are also '+str(predefinedSet.keys())+'. and wild card like muon, or mc',
help='Comma separated list of workflow to be shown or ran. Possible keys are also '+str(predefinedSet.keys())+'. and wild card like muon, or mc',

Contributor Author

Fixed, there were a few of them. We should not be in a coma anymore :)

parser.add_option('--CUDARuntime',
                  help='to specify CUDA runtime. Default = 11.2 (for RequiresGPU = required).',
                  dest='CUDARuntime',
                  default='11.2')
Contributor

Should this also default to whatever the release uses?

Contributor Author

Yeah, that can be done. I just put the default one here to make sure that it will not be an empty field when GPU is required.

Contributor

Something like

scram tool info cuda | sed -n -e's/^Version *: *\([[:digit:]]\+\.[[:digit:]]\+\)\.[[:digit:]]\+/\1/p'

Though maybe we should add an environment variable in cuda.spec for this?
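
A possible sketch of picking the default up programmatically from the release (assuming scram is available in the environment and prints a "Version : x.y.z" line, as the sed command above relies on; the helper name is hypothetical):

import re
import subprocess

def defaultCUDARuntime(fallback='11.2'):
    # Query the scram CUDA tool and keep only the major.minor part of its version.
    try:
        out = subprocess.check_output(['scram', 'tool', 'info', 'cuda']).decode()
    except (OSError, subprocess.CalledProcessError):
        return fallback
    match = re.search(r'^Version\s*:\s*(\d+\.\d+)', out, re.MULTILINE)
    return match.group(1) if match else fallback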

@cmsbuild

Pull request #33057 was updated. @jordan-martins, @chayanit, @wajidalikhan, @kpedro88, @cmsbuild, @srimanob can you please check and sign again.

@srimanob

Closing; I will remake this in master after converging on the workflow.
