Pilot PR for the GPU attributes in workflow injected by runTheMatrix #33057
Conversation
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-33057/21363
A new Pull Request was created by @srimanob (Phat Srimanobhas) for master. It involves the following packages: Configuration/PyReleaseValidation. @jordan-martins, @chayanit, @wajidalikhan, @kpedro88, @cmsbuild, @srimanob can you please review it and eventually sign? Thanks. cms-bot commands are listed here
hold
Pull request has been put on hold by @srimanob
Please test
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-bcae87/13260/summary.html Comparison Summary
Wouldn't it be more flexible to have a generic option to pass requirements to WM? Something like

runTheMatrix.py --what upgrade -l 23234.0 -b 'HelloGPU' --label 'HelloGPU' --wm force --wmAttributes 'gpuClass = server, gpuRuntime = cuda, gpuRuntimeVersion >= 11.2, gpuDriverVersion >= 460.32.03, gpuMemory >= 8'

or

runTheMatrix.py --what upgrade -l 23234.0 -b 'HelloGPU' --label 'HelloGPU' --wm force --wmAttributes 'gpuClass = server AND gpuRuntime = cuda AND gpuRuntimeVersion >= 11.2 AND gpuDriverVersion >= 460.32.03 AND gpuMemory >= 8'?

This would avoid having to hard-code the same list of attributes both in WM and in the runTheMatrix command line syntax, and would allow the latter to use any attributes known to WM. If the syntax is supported in WM, it would also allow a request like

runTheMatrix.py --what upgrade -l 23234.0 -b 'HelloGPU' --label 'HelloGPU' --wm force --wmAttributes '(gpuRuntime = cuda) AND ((gpuRuntimeVersion >= 11.2) OR (gpuDriverVersion >= 450.80.02) OR (gpuClass == server AND gpuDriverVersion > 418.40.04))'

which could hardly be specified using individual command line options.
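As a rough illustration of the proposal above, a `--wmAttributes` string could be split into (attribute, operator, value) triples before being handed to WM. This is a hypothetical sketch: the option name and the parsing rules follow the examples in this comment, not any existing runTheMatrix.py or WMCore interface.

```python
# Hypothetical sketch: split a --wmAttributes spec such as
#   'gpuClass = server, gpuRuntime = cuda, gpuMemory >= 8'
# into (key, operator, value) triples. Attribute names and operators
# are taken from the examples in the discussion, not from WMCore.
import re

_ATTR_RE = re.compile(r'\s*(\w+)\s*(>=|<=|==|!=|>|<|=)\s*([^,]+?)\s*(?:,|$)')

def parse_wm_attributes(spec):
    """Parse a comma-separated attribute spec into (key, op, value) triples."""
    triples = _ATTR_RE.findall(spec)
    if not triples:
        raise ValueError("no attribute requirements found in %r" % spec)
    return triples
```

For example, `parse_wm_attributes('gpuRuntime = cuda, gpuMemory >= 8')` yields `[('gpuRuntime', '=', 'cuda'), ('gpuMemory', '>=', '8')]`. Supporting the boolean `AND`/`OR` syntax from the last example would need a real expression parser on the WM side.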
Can one not derive a number of these requirements from the software environment itself?

I think if we run in production mode without resource constraints, then they should be derived from the software environment. However, if we would like to run on a very specific resource, e.g. for validation purposes, we should have a way to specify it; otherwise we need to communicate and assign manually every time we would like something specific. Regarding the software environment, I assume that if a job lands on a machine with CPU+GPU, we should still allow a CPU-only workflow to run. I'm not sure if that can be controlled, as cmsDriver is the same. Or maybe we don't need this option.
I specifically mean things like the minimum CUDA runtime version (e.g., once a CUDA device is required by the config options).
Not currently (the CUDA scram tool does not have all this information), but we could add it there (for example) or somewhere else that is relevant.
But it's a valid point, and it made me think of something else: instead of trying to specify all possible combinations of CUDA runtime version, driver version, GPU type, etc., can we make the "server" side advertise something like a "CUDA supported version"? This does assume that some of the information is CMS-specific, so it could be either advertised by the site, or interpreted by our middleware.

For example, let's say we have sites A, B and C, with this hardware and software:
• site A: Tesla P100 cards, CUDA 9.2, drivers 396.26
• site B: GeForce 2080 cards, CUDA 10.2, drivers 440.33.01
• site C: Tesla V100 cards, CUDA 11.0, drivers 450.51.05

From what I understand of the CUDA compatibility guide (https://docs.nvidia.com/deploy/cuda-compatibility/):
• site A supports the CUDA 9.2 (and older) runtime out of the box; the drivers are too old to support the compatibility drivers, so that's all one can use; it could support all recent CUDA versions if it were updated to at least CUDA 10.1 and the 418.39 drivers;
• site B supports the CUDA 10.2 (and older) runtime; the gaming cards do not support the compatibility drivers, so that's all one can use;
• site C supports the CUDA 11.0 (and older) runtime out of the box, and newer (up to and including 11.2.x) via the compatibility drivers.

CMSSW 11.x bundles the current version of the CUDA runtime and compatibility drivers, so on a datacenter-class GPU it should be able to run as long as the system drivers are >= 418.39 (but, according to the documentation, not on a GeForce card). So those three sites would support:

site   max CUDA version   with compatibility drivers
A      9.2                9.2
B      10.2               10.2
C      11.0               11.2

If the values that are used to match the jobs to the sites are CMS-specific, the easiest would be for the advertisement to take into account that CMSSW does ship with compatibility drivers, and to advertise the last column. If the values are generic to all experiments, then the sites should advertise only the second column; it could be the CMS middleware that takes into account the GPU type and driver version and builds the information in the last column. Either way, it probably makes more sense to put that information on the WM side rather than in runTheMatrix.py and in the definition of every job. What do people think?
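The matching logic sketched in the comment above can be made concrete. The sketch below is illustrative only: it assumes the release bundles CUDA 11.2 plus the datacenter compatibility drivers, which per the compatibility guide require drivers >= 418.39 and do not work on GeForce cards; the thresholds are copied from the discussion, not from an authoritative source.

```python
# Illustrative sketch: derive the newest CUDA runtime usable at a site from
# its GPU class, driver-native CUDA version, and driver version. Assumes the
# release bundles CUDA 11.2 and the compatibility drivers (datacenter GPUs
# with drivers >= 418.39 only), as stated in the discussion.
BUNDLED_CUDA = (11, 2)
MIN_COMPAT_DRIVER = (418, 39)

def max_usable_cuda(gpu_class, native_cuda, driver):
    """Return the newest CUDA runtime version usable at a site.

    gpu_class:   'server' (datacenter) or 'gaming' (GeForce)
    native_cuda: CUDA version supported by the installed driver, e.g. (10, 2)
    driver:      installed driver version, e.g. (440, 33, 1)
    """
    if gpu_class == 'server' and driver >= MIN_COMPAT_DRIVER:
        # the bundled compatibility drivers unlock the bundled runtime
        return max(native_cuda, BUNDLED_CUDA)
    return native_cuda
```

Applied to the three example sites, this reproduces the last column of the table: site A (server, CUDA 9.2, drivers 396.26) stays at 9.2, site B (gaming, CUDA 10.2) stays at 10.2, and site C (server, CUDA 11.0, drivers 450.51.05) reaches 11.2.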
I agree. CMSSW / the job submitter is the one that knows what is actually used in the build and any special requirements on GPU type. So as you suggest, a job could specify what CUDA version it needs, that compatibility drivers are included, and job-specific GPU type/memory requirements.

I looked around a bit at attributes in use. CERN batch currently has options like

regexp("V100", TARGET.CUDADeviceName)
TARGET.CUDACapability =?= 7.5

And via HTCondor it looks like sites will advertise attributes including DriverVersion, RuntimeVersion, DeviceName, Capability, etc. (which must get prefixed with CUDA for NVIDIA GPUs to match the HTCondor attributes I found in the CERN docs).
(https://research.cs.wisc.edu/htcondor/manual/v8.5.1/12_Appendix_A.html)
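To show how those CUDA-prefixed machine attributes could be combined, here is a small sketch that assembles an HTCondor `Requirements` string. The attribute names follow the HTCondor GPU-discovery convention cited above; the helper function itself is hypothetical, not part of any CMS tool.

```python
# Sketch: build an HTCondor Requirements expression from the CUDA-prefixed
# machine attributes mentioned above (CUDADeviceName, CUDACapability,
# CUDARuntimeVersion). Illustrative only; not an existing CMS helper.

def gpu_requirements(device_regexp=None, min_capability=None,
                     min_runtime=None):
    """Assemble a ClassAd Requirements string for GPU jobs."""
    clauses = []
    if device_regexp is not None:
        clauses.append('regexp("%s", TARGET.CUDADeviceName)' % device_regexp)
    if min_capability is not None:
        clauses.append('TARGET.CUDACapability >= %s' % min_capability)
    if min_runtime is not None:
        clauses.append('TARGET.CUDARuntimeVersion >= %s' % min_runtime)
    # no constraints: match any machine
    return ' && '.join(clauses) or 'True'
```

For example, `gpu_requirements(device_regexp='V100', min_capability=7.0)` returns `regexp("V100", TARGET.CUDADeviceName) && TARGET.CUDACapability >= 7.0`, mirroring the CERN batch examples above.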
Hi, Folks,

(BTW this conversation is happening both in this email thread and on the PR, and there are some people not in common, so this is getting complicated to track. I guess I will respond in both places.)

From the PPD side, it will be abundantly easier if we have sensible defaults (i.e. we set everything to "None") unless something is explicitly requested by expert users, as is currently done in Phat's PR. If we have MC request managers, etc., putting in extremely complicated workflow definitions, it will be a recipe for disaster. Highly nontrivial "magic incantations" like this (*) are a guarantee it will be broken ;).

Can we find a solution such that there is some set of defaults coded somewhere? Maybe we can specify configurations that actually exist somewhere? Like

MyFavoriteSite:
    'GPUClass': 'server',
    'GPUDriverVersion': '460.32.03',
    'GPUMemory': '8',
    'GPURuntime': 'cuda',
    'GPURuntimeVersion': '11.2',
MyLeastFavoriteSite:
    'GPUClass': 'server',
    'GPUDriverVersion': '456.32.03',
    'GPUMemory': '8',
    'GPURuntime': 'vidia',
    'GPURuntimeVersion': '11.46',

Then we have a single option like --gpuConfig MyLeastFavoriteSite etc.?

Cheers,
Sal

(*) '(gpuRuntime = cuda) AND ((gpuRuntimeVersion >= 11.2) OR (gpuDriverVersion >= 450.80.02) OR (gpuClass == server and gpuDriverVersion > 418.40.04))'

Best to avoid private threads on stuff that affects a bunch of people…
But I think the proposals of this thread are in the spirit of users needing to specify what is specific to their workflow (which they know and others won't).
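The `--gpuConfig` preset idea above can be sketched as a simple lookup table. The preset name and values are copied from the example in the comment; wiring the option into runTheMatrix.py is left out and is purely hypothetical.

```python
# Minimal sketch of the proposed --gpuConfig presets: request managers pick
# a label instead of spelling out every attribute. Preset contents are
# copied from the example above; the option wiring is hypothetical.

GPU_PRESETS = {
    'MyFavoriteSite': {
        'GPUClass': 'server',
        'GPUDriverVersion': '460.32.03',
        'GPUMemory': '8',
        'GPURuntime': 'cuda',
        'GPURuntimeVersion': '11.2',
    },
}

def resolve_gpu_config(name):
    """Look up a preset by name, failing loudly on unknown labels."""
    try:
        return GPU_PRESETS[name]
    except KeyError:
        raise ValueError('unknown GPU preset %r; known presets: %s'
                         % (name, sorted(GPU_PRESETS)))
```

Failing loudly on an unknown label is deliberate: it avoids silently falling back to defaults when a request manager mistypes a preset name.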
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-bcae87/14267/summary.html Comparison Summary
The branch was force-pushed from 54954c4 to 13545c7.
Pull request #33057 was updated. @jordan-martins, @chayanit, @wajidalikhan, @kpedro88, @cmsbuild, @srimanob can you please check and sign again.
Please test
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-bcae87/14608/summary.html Comparison Summary
dest='CUDADriverVersion',
default='')
parser.add_option('--CUDARuntimeVersion',
What's the difference between this and --CUDARuntime above?
CUDARuntimeVersion is the version of the runtime installed on the machine. CUDARuntime matches against the node's CUDACompatibleRuntimes; I followed what is described in dmwm/WMCore#10393. Should we rename it to match the node's CUDACompatibleRuntimes? @amaltaro
I added the discussion in dmwm/WMCore#10388 (comment).
I see now the difference is explained pretty well in #33057 (comment). Maybe a clarification of the difference in the help would be sufficient.
parser.add_option('--CUDACapabilities',
                  help='to specify CUDA capabilities. Default = 7.5 (for RequiresGPU = required).',
                  dest='CUDACapabilities',
                  default='7.5')
Why default only to 7.5? I would think to default to all compute capabilities supported by the release. Which then raises two questions (that go somewhat beyond this PR though):
1. At this point we'd really need one source of truth for the supported compute capabilities, because it is needed also in cudaIsEnabled. An environment variable in the cuda-toolfile? (not really my favorite, but it would be easy)
2. How to deal with different SCRAM_ARCHs supporting different sets of CUDA compute capabilities? E.g. our ARM build does not seem to support Pascal (6.x) while x86 and PPC do (cuda-flags.file).
Maybe also mention in the help that the value can be comma-separated.
I will edit the help. However, for the default value, please suggest one (or we can pick them up from somewhere automatically).
The 7.5 might be good enough to get started; I suppose it depends mostly on what kind of hardware we are going to run on in the very near future. For a longer-term solution I opened issue #33542.
The current default should be 6.0,6.1,6.2,7.0,7.2,7.5.
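A small sketch of how the comma-separated `--CUDACapabilities` value could be handled with the fuller default suggested above. This is illustrative option handling, not the code in the PR.

```python
# Sketch: parse the comma-separated --CUDACapabilities value, falling back
# to the fuller default suggested in the review. Illustrative only.
DEFAULT_CAPABILITIES = '6.0,6.1,6.2,7.0,7.2,7.5'

def parse_capabilities(value=None):
    """Split a comma-separated capabilities string into a list of strings."""
    return (value or DEFAULT_CAPABILITIES).split(',')
```

So `parse_capabilities()` yields all six capabilities, while `parse_capabilities('7.5')` narrows the request to Turing-class devices only.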
help='Coma separated list of workflow to be shown or ran. Possible keys are also '+str(predefinedSet.keys())+'. and wild card like muon, or mc',
dest='testList',
default=None
While you're at it:
- help='Coma separated list of workflow to be shown or ran. Possible keys are also '+str(predefinedSet.keys())+'. and wild card like muon, or mc',
+ help='Comma separated list of workflow to be shown or ran. Possible keys are also '+str(predefinedSet.keys())+'. and wild card like muon, or mc',
Fixed, we had a few. We should not be in a coma anymore :)
parser.add_option('--CUDARuntime',
                  help='to specify CUDA runtime. Default = 11.2 (for RequiresGPU = required).',
                  dest='CUDARuntime',
                  default='11.2')
Should this also default to whatever the release uses?
Yeah, that can be done. I just put a default here to make sure that it will not be an empty field when a GPU is required.
Something like

scram tool info cuda | sed -n -e's/^Version *: *\([[:digit:]]\+\.[[:digit:]]\+\)\.[[:digit:]]\+/\1/p'

Though maybe we should add an environment variable in cuda.spec for this?
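The shell pipeline above could equivalently be done from Python inside runTheMatrix.py. This sketch assumes the output of `scram tool info cuda` contains a `Version : X.Y.Z` line (as the sed expression above does) and returns None when scram is unavailable, e.g. outside a CMSSW environment.

```python
# Python equivalent of the shell pipeline above: ask scram for the CUDA
# tool version and keep only the major.minor part. Assumes a
# 'Version : X.Y.Z' line in 'scram tool info cuda' output; returns None
# when scram is not available (e.g. outside a CMSSW environment).
import re
import subprocess

def cuda_runtime_from_scram():
    try:
        out = subprocess.run(['scram', 'tool', 'info', 'cuda'],
                             capture_output=True, text=True,
                             check=True).stdout
    except (OSError, subprocess.CalledProcessError):
        return None
    m = re.search(r'^Version\s*:\s*(\d+\.\d+)', out, re.MULTILINE)
    return m.group(1) if m else None
```

An environment variable set by cuda.spec, as suggested, would avoid spawning a subprocess at all; this is only a fallback-style sketch.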
Pull request #33057 was updated. @jordan-martins, @chayanit, @wajidalikhan, @kpedro88, @cmsbuild, @srimanob can you please check and sign again.
Closing; will remake in master after converging on the workflow.
PR description:
Backport of #33538
(But this is the original PR, with discussions on how the workflow will look.)
This PR is to converge on how the workflow with GPU attributes will look. It follows https://docs.google.com/document/d/150k_VBbja1EK9HlxhXs544T0uhenbad8dnMadyNKlpg/edit?usp=sharing
https://docs.google.com/document/d/1shJAEaPDIWF0S3odHm3SSMERhvlTozyTKP8cgfFcOto/edit?usp=sharing
and on WM side:
dmwm/WMCore#10388
Default attributes when GPU is required are:
PR validation:
Please ignore the workflow name I use; we can use anything we want. This is to test the output only.
runTheMatrix.py --what upgrade -l 11650.502 --RequiresGPU required --wm init
gives me the following workflow (*). However, uploading is not successful yet, as we need to update WMCore to accept the new attributes.
(*)
If this PR is a backport, please specify the original PR and why you need to backport that PR:
This is a backport of #33538