
How to provide supported CUDA compute capabilities and runtime version where needed? #33542

Open

makortel opened this issue Apr 27, 2021 · 18 comments

@makortel (Contributor)

Currently cudaIsEnabled uses a hardcoded condition for the CUDA compute capability check.

#33057 adds a need for runTheMatrix to know the list of supported compute capabilities as well as the CUDA runtime version (the two leading parts, e.g. 11.2, if I understood correctly).

At this point we should look into having a single source for this information that would scale to adding other pieces of information, and also to other technologies beyond CUDA.

@makortel (Contributor, Author)

assign core,heterogeneous,pdmv

@cmsbuild (Contributor)

New categories assigned: heterogeneous,core,pdmv

@Dr15Jones,@smuzaffar,@jordan-martins,@fwyzard,@chayanit,@wajidalikhan,@makortel,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild (Contributor)

A new Issue was created by @makortel Matti Kortelainen.

@Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel (Contributor, Author)

I can think of adding environment variables for these in the CUDA toolfile, but is that the best solution?
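For concreteness, a minimal sketch of how a Python consumer could read such variables; the variable names here are hypothetical, nothing exports them today:

```python
import os

# Hypothetical variable names (nothing in the CUDA toolfile sets these
# today); values as cuda.spec / the toolfile might define them.
supported = os.environ.get("CUDA_SUPPORTED_COMPUTE_CAPABILITIES", "").split()
runtime_version = os.environ.get("CUDA_RUNTIME_VERSION", "")  # e.g. "11.2"
# supported would then be e.g. ["6.0", "6.1", "7.0", "7.5"]
```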

@smuzaffar (Contributor) commented Apr 27, 2021

One other way could be that cuda.spec generates a Python script which one can import to get the CUDA version and supported compute capability information.
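A sketch of what such a generated module might contain; the module and attribute names are made up for illustration:

```python
# cuda_info.py -- hypothetical module that cuda.spec could generate and
# install on PYTHONPATH; all names and values here are illustrative.
cuda_version = "11.2.1"
cuda_runtime_version = "11.2"  # the two leading parts of the version
supported_compute_capabilities = ["6.0", "6.1", "7.0", "7.5"]
```

A consumer would then just do `from cuda_info import supported_compute_capabilities`.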

@makortel (Contributor, Author)

(to repeat also here) One question for the runTheMatrix use case is that different CPU platforms may support different sets of CUDA compute capabilities (e.g. currently ARM does not support Pascal 6.x, whereas x86 and PPC do). When we start to use multiple CPU architectures, I'd imagine job submission being done from some/any CPU node (e.g. x86 on lxplus).

How to have the correct CUDA compute capability list for ARM jobs? Would the job submission specify the target SCRAM_ARCH explicitly? (in which case we could use arch-specific defaults)

@makortel (Contributor, Author)

cudaIsEnabled is a C++ program, so how about e.g. a JSON file?

@smuzaffar (Contributor)

JSON is also possible, but then we still need an environment variable to point to it. Can we convert cudaIsEnabled into Python?
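As a sketch of the JSON variant, assuming a hypothetical CUDA_INFO_JSON variable pointing at a file written by cuda.spec:

```python
import json
import os

# CUDA_INFO_JSON is a hypothetical variable; the file it points at could
# contain e.g.
# {"runtime_version": "11.2", "compute_capabilities": ["6.0", "7.0", "7.5"]}
with open(os.environ["CUDA_INFO_JSON"]) as f:
    cuda_info = json.load(f)
supported = cuda_info["compute_capabilities"]
```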

@makortel (Contributor, Author)

I suppose it could be reimplemented in Python by interpreting the output of cudaComputeCapabilities, which would still communicate with CUDA via the C(++) API.

@makortel (Contributor, Author) commented Apr 27, 2021

Or we remove cudaIsEnabled altogether and change https://github.com/cms-sw/cmssw/blob/master/HeterogeneousCore/CUDACore/python/SwitchProducerCUDA.py#L9 to call cudaComputeCapabilities directly, if it would be OK to import that Python fragment in the CMSSW configuration.

@fwyzard (Contributor) commented Apr 27, 2021

Note that the compute capabilities that we have in CMSDIST's cuda-flags.file are those we compile for, but they are not an exhaustive list of those we can run on.

Currently (on x86 and power) we build for 6.0, 7.0, 7.5.
Most likely the code can then run on any 6.x and 7.x GPU (and fail on any 5.x and 8.x GPUs).
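That rule can be written down as a small predicate (this ignores PTX JIT forward compatibility, which could also let 7.5 PTX run on 8.x hardware):

```python
# Build targets as in cuda-flags.file; a cubin built for X.Y runs on devices
# with the same major version and minor >= Y, so 6.0/7.0/7.5 covers any
# 6.x and 7.x GPU but no 5.x or 8.x one.
BUILD_TARGETS = [(6, 0), (7, 0), (7, 5)]

def can_run_on(major, minor):
    return any(bmajor == major and bminor <= minor
               for bmajor, bminor in BUILD_TARGETS)

assert can_run_on(6, 1) and can_run_on(7, 5)
assert not can_run_on(5, 2) and not can_run_on(8, 0)
```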

@fwyzard (Contributor) commented Apr 27, 2021

I'm confused about what you mean by using cudaComputeCapabilities?

cudaComputeCapabilities prints the GPUs available on the current system, along with their compute capabilities, but it doesn't know anything about which capabilities CMSSW does or does not support.

@makortel (Contributor, Author)

> Note that the compute capabilities that we have in CMSDIST's cuda-flags.file are those we compile for, but they are not an exhaustive list of those we can run on.

Right, so we'd need to provide an explicit list of the compute capabilities on which the software can run (or that's what I understood #33057 to need).

> I'm confused about what you mean by using cudaComputeCapabilities?

In a Python program (be it a separate script or part of the configuration system), execute cudaComputeCapabilities, parse the compute capabilities of the devices from its output, and compare them to the contents of the Python fragment generated by cuda.spec.

(I think this is still heavily on the "what we could do" side rather than the "what we should do" side.)
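A rough sketch of what that could look like; the exact output format of cudaComputeCapabilities (one "&lt;index&gt; &lt;major&gt;.&lt;minor&gt; &lt;name&gt;" line per device) and the "at least one supported device" policy are assumptions here:

```python
import subprocess

def cuda_is_enabled(supported):
    # supported would come from the cuda.spec-generated fragment,
    # e.g. ["6.0", "6.1", "7.0", "7.5"]
    try:
        proc = subprocess.run(["cudaComputeCapabilities"],
                              capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return False  # no CUDA runtime, no devices, or the call failed
    # parse the second column (the capability) of each device line
    capabilities = [line.split()[1]
                    for line in proc.stdout.splitlines() if line.strip()]
    return any(cc in supported for cc in capabilities)
```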

@fwyzard (Contributor) commented Apr 27, 2021

> > I'm confused about what you mean by using cudaComputeCapabilities?
>
> In a Python program (be it a separate script or part of the configuration system), execute cudaComputeCapabilities, parse the compute capabilities of the devices from its output, and compare them to the contents of the Python fragment generated by cuda.spec.
>
> (I think this is still heavily on the "what we could do" side rather than the "what we should do" side.)

Ah, I see, you mean to decide whether "CUDA is available" locally, independently of the job description/matching.

@makortel (Contributor, Author)

> > > I'm confused about what you mean by using cudaComputeCapabilities?
> >
> > In a Python program (be it a separate script or part of the configuration system), execute cudaComputeCapabilities, parse the compute capabilities of the devices from its output, and compare them to the contents of the Python fragment generated by cuda.spec.
> >
> > (I think this is still heavily on the "what we could do" side rather than the "what we should do" side.)
>
> Ah, I see, you mean to decide whether "CUDA is available" locally, independently of the job description/matching.

Right, this was for the cudaIsEnabled / SwitchProducerCUDA purpose.

@fwyzard (Contributor) commented Apr 28, 2021

OK. Then I'd suggest we autodetect which GPUs are usable based on whether we are actually able to use them: #33561.
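For illustration, the same "try to actually use it" idea approximated in Python with ctypes against the CUDA runtime; the real check in #33561 is done in C++ inside CMSSW:

```python
import ctypes

def usable_cuda_devices():
    try:
        cudart = ctypes.CDLL("libcudart.so")
    except OSError:
        return []  # no CUDA runtime available at all
    count = ctypes.c_int(0)
    if cudart.cudaGetDeviceCount(ctypes.byref(count)) != 0:  # 0 == cudaSuccess
        return []
    usable = []
    for device in range(count.value):
        # cudaFree(0) forces context creation, i.e. actually tries to use
        # the device rather than just enumerating it
        if cudart.cudaSetDevice(device) == 0 and cudart.cudaFree(None) == 0:
            usable.append(device)
    return usable
```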

@makortel (Contributor, Author)

I agree, #33561 provides a much better way for cudaIsEnabled (thanks!). That leaves only runTheMatrix needing the list of compute capabilities we can run on, and the runtime version.

@fwyzard (Contributor) commented Apr 28, 2021

If we merge cms-sw/cmsdist#6851, then we will only need to give a "minimum compute capability".

Maybe we can add that value to cuda-flags.file and export it via an environment variable, or via `scram tool tag cuda SOMETHING`?
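With a single minimum, the check collapses to a plain tuple comparison; the environment variable name below is hypothetical:

```python
import os

# Hypothetical: suppose cuda-flags.file exported
# CUDA_MINIMUM_COMPUTE_CAPABILITY="6.0"
minimum = tuple(int(x) for x in
                os.environ.get("CUDA_MINIMUM_COMPUTE_CAPABILITY", "6.0").split("."))

def is_supported(major, minor):
    return (major, minor) >= minimum  # e.g. (7, 5) >= (6, 0) -> True
```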
